Summary of sklearn machine learning knowledge points (with practical code illustrations)
A summary of practical machine learning knowledge points based on sklearn, with practical code and result diagrams written by the author; useful for learning, interview review, and further study.
Edited at 2022-03-20 14:40:39
machine learning
Package imports 1
DictVectorizer
CountVectorizer
Chinese
jieba.cut
jieba word segmentation
Segment the text with jieba, then feed the space-joined result to CountVectorizer (see the sketch below)
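A minimal sketch of this workflow, assuming the jieba package is installed; the two sample sentences are made up for illustration, and get_feature_names_out requires sklearn 1.0 or later.

import jieba
from sklearn.feature_extraction.text import CountVectorizer

docs = ["人生苦短，我用Python", "机器学习需要大量的数据"]
# jieba.cut returns a generator of tokens; join them with spaces so that
# CountVectorizer can split on whitespace, as it does for English text
segmented = [" ".join(jieba.cut(doc)) for doc in docs]

cv = CountVectorizer()
counts = cv.fit_transform(segmented)   # sparse document-term count matrix
print(cv.get_feature_names_out())      # vocabulary learned from the corpus
print(counts.toarray())                # word counts per document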
TF-IDF
The main idea of TF-IDF
If a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good power to distinguish between categories and is suitable for classification.
tf (term frequency): the number of times the word appears in the document
idf (inverse document frequency): idf = log(total number of documents in the corpus / (number of documents containing the word + 1))
tf-idf = tf * idf, representing how important the word is
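A minimal TfidfVectorizer sketch; the three example sentences are arbitrary and only illustrate that frequent-but-common words get lower weights.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["life is short, i use python",
          "machine learning needs lots of data",
          "python is popular for machine learning"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)      # each entry is the tf-idf weight of a word in a document
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))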
Normalized
Characteristics: transforms the original data by mapping each feature into a fixed range (default [0, 1])
𝑋′ = (𝑥 − min) / (max − min); to map into a general interval [mi, mx]: 𝑋″ = 𝑋′ × (mx − mi) + mi
Purpose: gradient descent converges faster, so the optimal solution is found sooner and the model trains faster
Disadvantage: easily affected by outliers (extreme values)
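A minimal MinMaxScaler sketch under the default feature_range of (0, 1); the numbers are made up.

from sklearn.preprocessing import MinMaxScaler

data = [[90, 2, 10], [60, 4, 15], [75, 3, 13]]
mm = MinMaxScaler(feature_range=(0, 1))
print(mm.fit_transform(data))   # every column is mapped into [0, 1]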
standardization
Characteristics: transforms the original data to a distribution with mean 0 and standard deviation 1 (standard normal distribution)
𝑋′= (𝑥−mean)/𝜎
Acts on each column, mean is the mean, and 𝜎 is the standard deviation.
Here std denotes the variance, so 𝜎 = √std
With a reasonably large amount of data, a small number of outliers has little effect on the mean, so the variance also changes little; standardization is therefore less sensitive to outliers than normalization
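A minimal StandardScaler sketch on made-up numbers; the prints just confirm that every column ends up with mean 0 and standard deviation 1.

from sklearn.preprocessing import StandardScaler

data = [[1.0, -1.0, 3.0], [2.0, 4.0, 2.0], [4.0, 6.0, -1.0]]
ss = StandardScaler()
scaled = ss.fit_transform(data)
print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled.std(axis=0))    # approximately 1 for every column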
Missing value handling
Missing values can be filled in by the mean or median of each row or column.
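A minimal sketch with SimpleImputer (sklearn 0.20+), which imputes column-wise; strategy can be "mean" or "median". The small array is made up.

import numpy as np
from sklearn.impute import SimpleImputer

data = [[1, 2], [np.nan, 3], [7, 6]]
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imp.fit_transform(data))   # the nan in column 0 becomes (1 + 7) / 2 = 4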
Feature selection
Filter:VarianceThreshold
Remove low variance features
from sklearn.feature_selection import VarianceThreshold
var = VarianceThreshold(threshold=0.2)  # drop features whose variance is below the 0.2 threshold
data = var.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])  # the constant columns (first and last) are removed
Embedded: regularization, decision tree, neural network
Wrapper (wrapper method)
The wrapper method repeatedly selects feature subsets from the initial feature set, trains a learner on each subset, and evaluates the subsets by the learner's performance until the best subset is found.
PCA principal component analysis
Purpose: data dimensionality compression, reducing the dimensionality (complexity) of the original data as much as possible while losing as little information as possible
Function: It can reduce the number of features in regression analysis or cluster analysis.
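A minimal PCA sketch: passing a float to n_components keeps enough components to retain that fraction of the variance, while an integer sets the target dimension directly. The data is made up.

from sklearn.decomposition import PCA

data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
pca = PCA(n_components=0.95)           # keep about 95% of the information
reduced = pca.fit_transform(data)
print(reduced.shape)                    # fewer columns than the original data
print(pca.explained_variance_ratio_)    # variance retained by each component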
Package imports 2
fit_transform
Calling fit_transform on the test set means only the test set's mean and variance are used during standardization.
Calling only transform on the test set (after fitting on the training set) means the training set's mean and variance are used during standardization, which is the correct approach.
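A minimal sketch of this convention, using the built-in iris data as a stand-in: fit the scaler on the training split only, then reuse its statistics for the test split.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)   # learn mean and std from the training set
x_test = std.transform(x_test)         # reuse the training statistics; do NOT fit again on the test set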
KNN-facebook project
Knowledge points
If the k value of KNN is too small, it is easy to overfit.
k value selection problem
If it is too large, the model is too simple and prone to underfitting.
If it is too small, the model is too complex and easy to overfit.
Approximation error and estimation error:
The approximation error is the training error on the training set
The estimation error is the test error on the test set
Hands-on practice
Data format: row_id x y accuracy time place_id
x, y are coordinates; time is in seconds (e.g. 23234 s) counted from January 1, 1970; place_id is the location ID
The goal is to predict which place (place_id) a user will check into, based on x and y
Data preprocessing
Build KNN model
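A hedged sketch of the pipeline above. It assumes a local train.csv with the columns listed earlier (row_id, x, y, accuracy, time, place_id); the coordinate filter and k=5 are illustrative choices, not the project's tuned values.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("train.csv")
data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")   # shrink the area to keep the example fast

X = data[["x", "y", "accuracy", "time"]]
y = data["place_id"]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
print("accuracy:", knn.score(x_test, y_test))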
Advantages and Disadvantages
Advantages
Simple and effective
Retraining is cheap
Works well for samples whose class domains overlap (cross-over samples)
Suitable for automatic classification of large samples
Disadvantages
lazy learning
The output is not very interpretable
Performs poorly on imbalanced samples
Too many of one category and too few of others
Model selection and tuning
Cross-validation
Cross-validate the training set
grid search
Hyperparameter search
Hands-on practice
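A minimal sketch of cross-validated grid search over the KNN hyperparameter k, using the built-in iris data as a stand-in for the project data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
gs = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)   # 5-fold cross-validation on the training set
gs.fit(x_train, y_train)

print("best k:", gs.best_params_)
print("best CV score:", gs.best_score_)
print("test score:", gs.score(x_test, y_test))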
Evaluation metrics for classification models
confusion matrix
Precision: among the samples predicted as positive, the proportion that are truly positive ("how accurate the positives are"). Precision = TP / (TP + FP)
Recall: among the samples that are actually positive, the proportion predicted as positive ("how completely the positives are found"; the ability to pick out positive samples). Recall = TP / (TP + FN)
F1-score: the harmonic mean of precision and recall, F1 = 2PR / (P + R); reflects the robustness of the model
TPR, FPR, TNR, FNR, ROC curve, AUC value
Accuracy is the proportion of all samples that are classified correctly: (TP + TN) / (TP + TN + FP + FN)
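A minimal sketch of these metrics on made-up labels; passing hard 0/1 predictions to roc_auc_score is only for illustration (normally you would pass predicted probabilities or scores).

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))                 # rows: true class, columns: predicted class
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))          # TP / (TP + FN)
print("f1:", f1_score(y_true, y_pred))
print("auc:", roc_auc_score(y_true, y_pred))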
Naive Bayes classification algorithm
formula
Example
Laplace smoothing
Without smoothing, the estimated probability of the "entertainment" class comes out as 0, which is unreasonable
formula
α is a specified coefficient, usually 1; m is the number of distinct feature words counted in the training documents.
Advantages and Disadvantages
Advantages
The Naive Bayes model originated from classical mathematical theory and has stable classification efficiency.
It is not very sensitive to missing data and the algorithm is relatively simple. It is often used for text classification.
High classification accuracy and fast speed
Disadvantages
It requires knowing the prior probability P(F1, F2, … | C), so when the assumed prior model does not match reality the predictions can be poor: for example, if the training articles are collected badly (e.g. they include cheating articles stuffed with certain words), the results will be distorted.
Hands-on practice
Data preprocessing
Model prediction and evaluation
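A hedged sketch of Naive Bayes text classification. The 20 newsgroups corpus (downloaded by sklearn on first use) stands in for the project's data; alpha=1.0 is the Laplace smoothing coefficient mentioned above.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

news = fetch_20newsgroups(subset="all")
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=0)

tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(x_train)
x_test = tfidf.transform(x_test)       # reuse the vocabulary learned from the training set

mlt = MultinomialNB(alpha=1.0)         # alpha is the Laplace smoothing coefficient
mlt.fit(x_train, y_train)
print("accuracy:", mlt.score(x_test, y_test))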
Decision tree classification algorithm
information entropy
Decision tree splits are chosen based on information gain.
ID3 case
Common algorithms
ID3
Criterion: maximize the information gain
An extreme case to build intuition: if a feature takes only a single value, splitting on it adds nothing, its information gain is zero, and the feature can be dropped
Disadvantage: when the branch entropies are similar, a feature with two values weights each branch by 1/2 while a feature with three values weights each branch by 1/3, so the feature with more values gets the larger gain. ID3 therefore prefers features with many distinct values.
C4.5
Criterion: maximize the information gain ratio
CART
Classification tree: minimize the Gini index
Example: whether a loan will default
Data
Split on home ownership
Split on marital status
Split on annual income
Continue splitting on the remaining attributes
Final decision tree
Summary of common decision tree types
Advantages and Disadvantages:
Advantages
1. Easy to understand and interpret; the tree can be visualized. 2. Requires little data preparation, whereas other techniques usually need normalization or standardization
Disadvantages
1. A fully grown tree is too complex and easily overfits. 2. Decision trees can be unstable: small changes in the data may produce a completely different tree.
ways to improve
cart pruning
pre-pruning
(1) Set a minimum number of samples per node, e.g. 10: if a node contains fewer than 10 samples, it is not split further.
(2) Specify the height or depth of the tree, e.g. a maximum depth of 4;
(3) If a node's entropy falls below a specified threshold, it is not split further.
post-pruning
Prune the fully grown, overfitted decision tree after it has been generated, to obtain a simplified version of the tree.
Titanic Survival Prediction Project
Handle missing values and split data
Convert the text features into vectors, then train the model and predict (a hedged sketch follows)
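A hedged sketch of this pipeline. It assumes a local titanic.csv with at least the columns pclass, age, sex, and survived; max_depth=5 is an illustrative pre-pruning choice.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

titan = pd.read_csv("titanic.csv")
X = titan[["pclass", "age", "sex"]].copy()
y = titan["survived"]
X["age"] = X["age"].fillna(X["age"].mean())            # fill missing ages with the column mean

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

dv = DictVectorizer(sparse=False)                      # one-hot encodes the string-valued features
x_train = dv.fit_transform(x_train.to_dict(orient="records"))
x_test = dv.transform(x_test.to_dict(orient="records"))

dt = DecisionTreeClassifier(max_depth=5)               # pre-pruning via maximum depth
dt.fit(x_train, y_train)
print("accuracy:", dt.score(x_test, y_test))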
Ensemble learning method-random forest
A random forest is a classifier that contains multiple decision trees, and its output category is determined by the mode of the category output by the individual trees.
Key steps in building a random forest (N is the number of training samples, M the number of features): 1) draw one sample at a time with replacement, repeating N times (duplicate samples may occur); 2) randomly select m features with m << M, and build a decision tree
Construction process
1. Why randomly sample the training set? If no random sampling were done and every tree used the same training set, the trained trees would all give exactly the same classification results. 2. Why sample with replacement? Without replacement, the trees' training samples would be disjoint, so each tree would be "biased" and "one-sided", i.e. the trained trees would differ wildly; since the final classification of a random forest depends on the votes of multiple trees (weak classifiers), the trees should see overlapping but different samples rather than completely separate data.
Advantages
1. Excellent accuracy among current algorithms. 2. Runs efficiently on large data sets. 3. Can handle input samples with high-dimensional features without dimensionality reduction. 4. Can evaluate the importance of each feature in a classification problem. 5. Still gives good results when some values are missing.
Actual code
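A minimal random forest sketch on the built-in iris data; n_estimators and max_depth are the usual parameters to tune (e.g. via GridSearchCV), and feature_importances_ illustrates advantage 4 above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, n_jobs=-1, random_state=0)
rf.fit(x_train, y_train)
print("accuracy:", rf.score(x_test, y_test))
print("feature importances:", rf.feature_importances_)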
Package imports 3
Regression Algorithm-Linear Regression Analysis
Definition: linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable; the prediction is a linear combination whose model parameters are called regression coefficients
formula
loss function
Visual diagram
Solution method: How to find W in the model to minimize the loss? (The purpose is to find the W value corresponding to the minimum loss, this is the key point)
normal equation
Derivation process (𝑋 is the feature matrix, 𝑦 is the target vector): w = (XᵀX)⁻¹Xᵀy
Disadvantages: 1. When there are many features, solving is too slow (the matrix inversion is expensive). 2. XᵀX is sometimes not invertible, in which case the normal equation cannot be solved.
Hands-on: Boston house price regression prediction
Data preprocessing
lr = LinearRegression(); fit and predict (a hedged sketch follows)
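A hedged sketch of the regression pipeline. The built-in California housing data (downloaded by sklearn on first use) stands in for the Boston data, since load_boston was removed from recent sklearn releases.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

lr = LinearRegression()      # closed-form least-squares solution, no gradient descent
lr.fit(x_train, y_train)
print("weights:", lr.coef_)
print("MSE:", mean_squared_error(y_test, lr.predict(x_test)))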
gradient descent
Gradient descent update rule: w ← w − α · ∂loss/∂w, where α is the learning rate
learning rate
is a hyperparameter, adjust it to get the minimum loss
descent process
Hands-on: Boston house price regression prediction
sgd = SGDRegressor(eta0=0.008) prediction
SGDRegressor has many parameters; only the common ones are listed here: penalty (the regularization term, L1 or L2), learning_rate (the learning-rate schedule), and alpha (the regularization strength).
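A minimal SGDRegressor sketch using the parameters just listed; synthetic data from make_regression stands in for the housing data, and eta0=0.008 mirrors the value used above.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)   # scaling matters a lot for gradient descent
x_test = std.transform(x_test)

sgd = SGDRegressor(penalty="l2", alpha=0.0001, learning_rate="invscaling", eta0=0.008)
sgd.fit(x_train, y_train)
print("MSE:", mean_squared_error(y_test, sgd.predict(x_test)))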
Some knowledge points
SGD stands for Stochastic Gradient Descent: the model is updated while the step size (i.e. the learning rate) decreases according to a schedule.
The regularizer is a penalty added to the loss function.
L1 regularization produces sparse weights: it effectively performs automatic feature selection by setting the weights of useless features to exactly 0, which helps prevent overfitting.
The main role of L2 is to prevent overfitting: it pushes the parameters (in particular the coefficients of higher-order terms) toward smaller values; the closer the higher-order coefficients are to 0, the simpler and smoother the model, which prevents overfitting.
Regularization strength: large → parameters approach 0 and the higher-order terms vanish; small → the parameters change little (the weights of higher-order terms are barely shrunk)
gradient descent method
Full Gradient Descent Algorithm (FG)
Calculate the errors of all samples in the training set, sum them up and take the average as the objective function. Batch gradient descent is slow because we need to compute all gradients on the entire dataset when performing each update. At the same time, batch gradient descent cannot handle datasets that exceed the memory capacity limit.
Stochastic Gradient Descent Algorithm (SG)
In each round the objective function is no longer the error over all samples but the error of a single sample: only the gradient for one sample is computed to update the weights, then the next sample is taken and the process repeats until the loss stops decreasing or falls below some tolerable threshold. The process is simple and efficient, and the noise it introduces often helps the iterations avoid converging to a poor local optimum.
Ridge Regression
Ridge regression is a regularized version of linear regression, that is, adding regular terms to the cost function of the original linear regression (ie, linear regression with l2 regularization)
formula
Actual code
Lasso Regression
Lasso regression is linear regression with L1 regularization
formula
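A minimal sketch comparing Ridge (L2) and Lasso (L1) on synthetic data; alpha is the regularization strength in both estimators, and the weight counts illustrate the sparsity point made earlier about L1.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("non-zero Ridge weights:", (ridge.coef_ != 0).sum())   # L2 shrinks weights but rarely zeros them
print("non-zero Lasso weights:", (lasso.coef_ != 0).sum())   # L1 drives useless weights to exactly 0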
How to choose the right machine learning algorithm
Cause of underfitting: too few features of the data are learned. Solution: increase the number of data features.
Causes and solutions to overfitting
Causes: there are too many original features and some of them are noisy; the model is too complex because it tries to account for every individual data point
Solutions: perform feature selection and remove highly correlated features (hard to do); use cross-validation (so all the data gets used for training); apply regularization
Classification Algorithm - Logistic Regression
By itself it only solves binary classification; multi-class problems are handled by repeatedly performing binary classification.
activation function
sigmoid function
Function formula (z is the result of regression)
Output: probability value in the interval [0,1], default 0.5 as the threshold
cost loss function
Calculation process 1
Derivative of cost loss function with respect to w
The derivation process
Gradient descent to find the optimal w
Gradient descent formula
Gradient descent process gradually obtains the optimal dividing line
Practical combat - logistic regression for binary classification for cancer prediction
Data preprocessing
Model prediction
result
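A hedged sketch of the binary cancer-prediction pipeline, using sklearn's built-in breast cancer dataset as a stand-in for the project's data; C is the inverse of the regularization strength.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

lg = LogisticRegression(C=1.0)     # C is the inverse of the regularization strength
lg.fit(x_train, y_train)
print(classification_report(y_test, lg.predict(x_test), target_names=["malignant", "benign"]))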
discriminative and generative models
Unsupervised learning - cluster analysis
k-means
Basic principles of algorithm
Kmeans performance evaluation metrics
Silhouette coefficient
Silhouette coefficient explanation
Define sc_i = (b_i − a_i) / max(a_i, b_i), where a_i is the average distance from sample i to the other samples in its own cluster and b_i is the average distance to the samples of the nearest other cluster. 1. If sc_i < 0, the within-cluster distance a_i is greater than the distance to the nearest other cluster: the clustering is poor. 2. The larger sc_i is, the smaller a_i is relative to the nearest other cluster: the clustering is good. 3. The silhouette coefficient lies in [−1, 1]; the closer to 1, the better the cohesion and separation.
Hands-on: Taobao user cluster analysis
Read tables, merge tables
Make a crosstab of user ID and product ID
PCA principal component analysis dimensionality reduction
clustering model
Clustering results
Silhouette coefficient calculation
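A minimal sketch of k-means plus the silhouette coefficient; synthetic blobs stand in for the PCA-reduced user/product cross-tab, and n_clusters=4 is an illustrative choice.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)
print("silhouette:", silhouette_score(X, labels))   # closer to 1 means better cohesion and separation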
Outlier detection method
Draw box plot
principle
Z-score
principle
DBSCAN
All data points are defined as core points (Core Points), border points (Border Points) or noise points, and then clustered
Isolation Forest
Isolating an outlier requires fewer splits than isolating a normal point, i.e. outliers have lower isolation numbers (shorter isolation paths) than non-outliers. A data point is therefore flagged as an outlier if its isolation number is below a threshold.
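A minimal IsolationForest sketch on made-up data; fit_predict returns -1 for outliers and 1 for normal points, and contamination is the assumed fraction of outliers.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6, high=8, size=(5, 2))
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.05, random_state=0)
pred = iso.fit_predict(X)
print("points flagged as outliers:", (pred == -1).sum())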
Ensemble learning
Definition: Unifying the results of base classifiers into a final decision
Classification
Boosting (serial)
The prediction of the next base classifier depends on the output of the previous base classifier
The Boosting method uses a serial method to train base classifiers, and there are dependencies between each base classifier. Its basic idea is to stack base classifiers layer by layer. During training, each layer gives higher weight to the samples that were misclassified by the previous layer's base classifier. During testing, the final result is obtained based on the weighting of the results of each layer of classifiers.
Bagging (parallel)
There is no strong dependence between the base classifiers, so they can be trained in parallel; random forest, whose base classifiers are decision trees, is a typical example. To make the base classifiers as independent as possible, the training set is divided into several subsets (when the number of training samples is small, the subsets may overlap). It resembles a collective decision: each individual learns separately, and what they learn may be the same, different, or partially overlapping; because the individuals differ, their judgments will not be fully consistent. At decision time each individual judges independently, and the final collective decision is made by voting.
Understand the differences between Boosting and Bagging methods from the perspective of eliminating the bias and variance of the base classifier
The error of the base classifier is the sum of the bias and variance errors. The bias is mainly due to systematic errors caused by the limited expressive ability of the classifier, which is manifested in the non-convergence of the training error. The variance is due to the classifier being too sensitive to the sample distribution, resulting in overfitting when the number of training samples is small.
Bias
Bias refers to the deviation between the average output of the trained model and the output of the real model. The error caused by bias is usually reflected in the training error.
variance
Variance refers to the variance of the outputs of all models trained on all sampled training sets of size m; it usually arises when the model is too complex relative to the number of training samples m. The error caused by variance shows up as the gap between the test error and the training error. Low-variance predictions are tightly clustered.
Target-shooting analogy
Suppose a shot is the model making a prediction on a sample. Hitting the bull's-eye position means the prediction is accurate, and the further it deviates from the bull's-eye, the greater the prediction error.
In the upper left corner, the shooting results are accurate and concentrated, indicating that the bias and variance of the model are very small; Although the center of the shooting results in the upper right picture is around the bullseye, the distribution is relatively scattered, indicating that the model has a small deviation but a large variance; The lower left figure shows that the model variance is small and the deviation is large; The picture on the lower right shows that the model has a large variance and a large deviation.
The relationship between generalization error, bias, variance, and model complexity
The Boosting method reduces the bias of the integrated classifier by gradually focusing on the samples that were misclassified by the base classifier.
The Bagging method adopts a divide-and-conquer strategy to reduce the variance of the integrated classifier by sampling training samples multiple times, training multiple different models separately, and then synthesizing them.
Bagging diagram
Model 1, Model 2, and Model 3 are all trained using a subset of the training set. Viewed individually, their decision boundaries are very tortuous and tend to overfit. The decision boundary of the integrated model (shown by the red line) is smoother than that of each independent model. This is due to the integrated weighted voting method, which reduces the variance.
Basic steps of ensemble learning
(1) Find a base classifier whose errors are independent of each other. (2) Train the base classifier. (3) Merge the results of the base classifiers. There are two methods of merging base classifiers: voting and stacking.
Example
Adaboost
An ID3 decision tree is chosen as the base classifier because the tree model has a simple structure and is easy to randomize, so it is the most commonly used choice
Correctly classified samples have their weights reduced, while misclassified samples have their weights increased or kept unchanged. In the final model fusion, the base classifiers are also weighted according to their error rates: classifiers with low error rates get a larger say in the vote
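A minimal AdaBoost sketch on synthetic two-moons data; sklearn's default base learner is a depth-1 decision tree (a stump), and n_estimators=200 is an illustrative choice.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, random_state=0)   # default base learner: a depth-1 decision tree
ada.fit(x_train, y_train)
print("accuracy:", ada.score(x_test, y_test))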
GBDT gradient boosting decision tree
main idea
Train a new weak classifier on the negative gradient of the model's loss function, then add the trained weak classifier to the existing model in an additive (cumulative) way (i.e. train on the residuals)
Example
A video website needs to predict each user's age; features include visit duration, time of day, types of videos watched, etc. Suppose user A's real age is 25, but the first decision tree predicts 22, a difference of 3 years, i.e. the residual is 3. In the second tree we use 3 as A's target to learn. If the second tree can place A in a leaf predicting 3, adding the two trees' results recovers A's true age; if the second tree instead predicts 5, A's residual becomes −2, so A's target in the third tree is −2, and learning continues. Finally all the results are added up. This use of residuals to keep learning is what the "Gradient Boosted" in GBDT means.
XGBoost
The original GBDT builds each new decision tree from the negative gradient of the empirical loss and only prunes after the tree is built, whereas XGBoost adds regularization terms during the tree-construction phase. Compared with GBDT, XGBoost also includes many engineering optimizations in its implementation.
Commonly used base classifiers
decision tree
There are mainly three reasons. (1) Decision trees can more easily integrate the weight of samples into the training process. (2) The expression ability and generalization ability of the decision tree can be compromised by adjusting the number of layers of the tree. (3) The perturbation of data samples has a greater impact on the decision tree, so the decision tree base classifier generated by different subsample sets is more random. Such an "unstable learner" is more suitable as a base classifier. In addition, when the decision tree node is split, a feature subset is randomly selected to find the optimal split attribute, which introduces randomness well.
neural network model
Neural network models are also relatively "unstable", and randomness can be introduced by adjusting the number of neurons, the connection patterns, the number of layers, the initial weights, etc.
common problem
Can the base classifier in random forest be changed from a decision tree to a linear classifier or K-nearest neighbors?
No. Random forest belongs to the bagging family of ensemble learning, whose main benefit is that the variance of the ensemble is smaller than the variance of the base classifier. The base classifier used in bagging should therefore be one that is sensitive to the sample distribution (a so-called unstable classifier) for bagging to help. Linear classifiers and K-nearest neighbors are relatively stable classifiers whose variance is not large, so bagging them brings little benefit.
What are the advantages and limitations of GBDT?
Advantages: (1) Prediction-stage computation is fast. (2) On densely distributed data sets its generalization and expressive power are very good, which is why GBDT often tops Kaggle competitions. (3) Using decision trees as weak classifiers gives GBDT good interpretability and robustness; it can automatically discover high-order relationships between features and does not require special preprocessing of the data such as normalization.
Limitations: (1) GBDT performs worse than support vector machines or neural networks on high-dimensional sparse data sets. (2) GBDT has no obvious advantage on text-classification feature problems. (3) Training is inherently serial; only local parallelism within each decision tree can be used to speed it up.
The difference between gradient boosting and gradient descent
In gradient descent, the model is represented in a parameterized form, so that the update of the model is equivalent to the update of the parameters.
In gradient boosting, the model does not need to be parameterized, but is directly defined in the function space, which greatly expands the types of models that can be used, so that different models can be combined together, such as GBDT
Why ensemble learning models can improve accuracy
Voting calculation principle
Ensemble learning in practice
Generate data
make_moons (y has two labels 0,1)
data splitting
split
Logistic regression, SVC, and decision tree classify and predict respectively, and then vote
Ensemble learning: VotingClassifier
Hard voting and soft voting
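A minimal sketch of voting over the three classifiers named above, on make_moons data; soft voting averages predicted probabilities, so SVC needs probability=True, while voting="hard" takes a majority vote on the predicted labels.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("svc", SVC(probability=True)),            # soft voting needs predict_proba
                ("dt", DecisionTreeClassifier(random_state=42))],
    voting="soft")                                          # use voting="hard" for a majority vote
voting.fit(x_train, y_train)
print("accuracy:", voting.score(x_test, y_test))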
Use bagging with oob_score (evaluate on the out-of-bag samples a tree never drew) and n_jobs to set the number of cores (n_jobs=-1 trains on all cores for better efficiency)
bootstrap_features samples a subset of the features as well, comparable to what random forest does
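A minimal bagging sketch using the oob_score and n_jobs options just mentioned; the tree count and max_samples values are illustrative.

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=500,     # 500 trees
                        max_samples=100,      # each tree sees 100 bootstrap samples
                        bootstrap=True,       # sampling with replacement
                        oob_score=True,       # evaluate on the samples each tree never saw
                        n_jobs=-1,            # train on all CPU cores
                        random_state=42)
bag.fit(X, y)
print("out-of-bag score:", bag.oob_score_)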
Extra-Trees (extremely randomized trees)
The decision trees split nodes using random features and random thresholds. This extra randomness suppresses overfitting, curbing variance at the cost of higher bias, and also gives faster training.
Boosting (serial)
AdaBoost GBDT