Summary of sklearn machine learning knowledge points (with practical code illustrations)
A summary of practical machine learning knowledge points based on sklearn, with practical code and result diagrams written by the author; useful for learning, interview review, and further study.
Edited at 2022-03-20 14:40:39
machine learning
Package imports 1
DictVectorizer
CountVectorizer
Chinese
jieba.cut
jieba word segmentation
Segment the text with jieba, then feed the space-joined result to CountVectorizer (see the sketch below)
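A minimal sketch of this workflow, assuming the jieba package is installed; the two sample sentences are made up for illustration, and get_feature_names_out requires sklearn 1.0 or later.

import jieba
from sklearn.feature_extraction.text import CountVectorizer

docs = ["人生苦短，我用Python", "机器学习需要大量的数据"]
# jieba.cut returns a generator of tokens; join them with spaces so that
# CountVectorizer can split on whitespace, as it does for English text
segmented = [" ".join(jieba.cut(doc)) for doc in docs]

cv = CountVectorizer()
counts = cv.fit_transform(segmented)   # sparse document-term count matrix
print(cv.get_feature_names_out())      # vocabulary learned from the corpus
print(counts.toarray())                # word counts per document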
TF-IDF
The main idea of TF-IDF
If a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good power to distinguish between categories and is suitable for classification.
tf (term frequency): the number of times the word appears in the document
idf (inverse document frequency): idf = log(total number of documents in the corpus / (number of documents containing the word + 1))
tf-idf = tf * idf, representing how important the word is
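A minimal TfidfVectorizer sketch; the three example sentences are arbitrary and only illustrate that frequent-but-common words get lower weights.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["life is short, i use python",
          "machine learning needs lots of data",
          "python is popular for machine learning"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)      # each entry is the tf-idf weight of a word in a document
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))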
Normalized
Characteristics: transforms the original data by mapping each feature into a fixed range (default [0, 1])
𝑋′ = (𝑥 − min) / (max − min); to map into a general interval [mi, mx]: 𝑋″ = 𝑋′ × (mx − mi) + mi
Purpose: gradient descent converges faster, so the optimal solution is found sooner and the model trains faster
Disadvantage: easily affected by outliers (extreme values)
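A minimal MinMaxScaler sketch under the default feature_range of (0, 1); the numbers are made up.

from sklearn.preprocessing import MinMaxScaler

data = [[90, 2, 10], [60, 4, 15], [75, 3, 13]]
mm = MinMaxScaler(feature_range=(0, 1))
print(mm.fit_transform(data))   # every column is mapped into [0, 1]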
standardization
Characteristics: transforms the original data to a distribution with mean 0 and standard deviation 1 (standard normal distribution)
𝑋′= (𝑥−mean)/𝜎
Acts on each column, mean is the mean, and 𝜎 is the standard deviation.
Here std denotes the variance, so 𝜎 = √std
With a reasonably large amount of data, a small number of outliers has little effect on the mean, so the variance also changes little; standardization is therefore less sensitive to outliers than normalization
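A minimal StandardScaler sketch on made-up numbers; the prints just confirm that every column ends up with mean 0 and standard deviation 1.

from sklearn.preprocessing import StandardScaler

data = [[1.0, -1.0, 3.0], [2.0, 4.0, 2.0], [4.0, 6.0, -1.0]]
ss = StandardScaler()
scaled = ss.fit_transform(data)
print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled.std(axis=0))    # approximately 1 for every column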
Missing value handling
Missing values can be filled in by the mean or median of each row or column.
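A minimal sketch with SimpleImputer (sklearn 0.20+), which imputes column-wise; strategy can be "mean" or "median". The small array is made up.

import numpy as np
from sklearn.impute import SimpleImputer

data = [[1, 2], [np.nan, 3], [7, 6]]
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imp.fit_transform(data))   # the nan in column 0 becomes (1 + 7) / 2 = 4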
Feature selection
Filter:VarianceThreshold
Remove low variance features
from sklearn.feature_selection import VarianceThreshold
var = VarianceThreshold(threshold=0.2)  # drop features whose variance is below the 0.2 threshold
data = var.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])  # the constant columns (first and last) are removed
Embedded: regularization, decision tree, neural network
Wrapper (wrapper method)
The wrapper method repeatedly selects feature subsets from the initial feature set, trains a learner on each subset, and evaluates the subsets by the learner's performance until the best subset is found.
PCA principal component analysis
Purpose: data dimensionality compression, reducing the dimensionality (complexity) of the original data as much as possible while losing as little information as possible
Function: It can reduce the number of features in regression analysis or cluster analysis.
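A minimal PCA sketch: passing a float to n_components keeps enough components to retain that fraction of the variance, while an integer sets the target dimension directly. The data is made up.

from sklearn.decomposition import PCA

data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
pca = PCA(n_components=0.95)           # keep about 95% of the information
reduced = pca.fit_transform(data)
print(reduced.shape)                    # fewer columns than the original data
print(pca.explained_variance_ratio_)    # variance retained by each component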
Package imports 2
fit_transform
Calling fit_transform on the test set means only the test set's mean and variance are used during standardization.
Calling only transform on the test set (after fitting on the training set) means the training set's mean and variance are used during standardization, which is the correct approach.
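A minimal sketch of this convention, using the built-in iris data as a stand-in: fit the scaler on the training split only, then reuse its statistics for the test split.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)   # learn mean and std from the training set
x_test = std.transform(x_test)         # reuse the training statistics; do NOT fit again on the test set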
KNN-facebook project
Knowledge points
If the k value of KNN is too small, it is easy to overfit.
k value selection problem
If it is too large, the model is too simple and prone to underfitting.
If it is too small, the model is too complex and easy to overfit.
Approximation error and estimation error:
The approximation error is the training error on the training set
The estimation error is the test error on the test set
Hands-on practice
Data format: row_id x y accuracy time place_id
x, y are coordinates; time is in seconds (e.g. 23234 s) counted from January 1, 1970; place_id is the location ID
The goal is to predict which place (place_id) a user will check into, based on x and y
Data preprocessing
Build KNN model
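A hedged sketch of the pipeline above. It assumes a local train.csv with the columns listed earlier (row_id, x, y, accuracy, time, place_id); the coordinate filter and k=5 are illustrative choices, not the project's tuned values.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("train.csv")
data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")   # shrink the area to keep the example fast

X = data[["x", "y", "accuracy", "time"]]
y = data["place_id"]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
print("accuracy:", knn.score(x_test, y_test))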
Advantages and Disadvantages
Advantages
Simple and effective
Retraining is cheap
Works well for samples whose class domains overlap (cross-over samples)
Suitable for automatic classification of large samples
Disadvantages
lazy learning
The output is not very interpretable
Performs poorly on imbalanced samples
Too many of one category and too few of others
Model selection and tuning
Cross-validation
Cross-validate the training set
grid search
Hyperparameter search
Hands-on practice
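A minimal sketch of cross-validated grid search over the KNN hyperparameter k, using the built-in iris data as a stand-in for the project data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
gs = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)   # 5-fold cross-validation on the training set
gs.fit(x_train, y_train)

print("best k:", gs.best_params_)
print("best CV score:", gs.best_score_)
print("test score:", gs.score(x_test, y_test))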
Evaluation metrics for classification models
confusion matrix
Precision: among the samples predicted as positive, the proportion that are truly positive ("how accurate the positives are"). Precision = TP / (TP + FP)
Recall: among the samples that are actually positive, the proportion predicted as positive ("how completely the positives are found"; the ability to pick out positive samples). Recall = TP / (TP + FN)
F1-score: the harmonic mean of precision and recall, F1 = 2PR / (P + R); reflects the robustness of the model
TPR, FPR, TNR, FNR, ROC curve, AUC value
Accuracy is the proportion of all samples that are classified correctly: (TP + TN) / (TP + TN + FP + FN)
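A minimal sketch of these metrics on made-up labels; passing hard 0/1 predictions to roc_auc_score is only for illustration (normally you would pass predicted probabilities or scores).

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))                 # rows: true class, columns: predicted class
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))          # TP / (TP + FN)
print("f1:", f1_score(y_true, y_pred))
print("auc:", roc_auc_score(y_true, y_pred))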
Naive Bayes classification algorithm
formula
Example
Laplace smoothing
Without smoothing, the estimated probability of the "entertainment" class comes out as 0, which is unreasonable
formula
α is a specified coefficient, usually 1; m is the number of distinct feature words counted in the training documents.
Advantages and Disadvantages
Advantages
The Naive Bayes model originated from classical mathematical theory and has stable classification efficiency.
It is not very sensitive to missing data and the algorithm is relatively simple. It is often used for text classification.
High classification accuracy and fast speed
Disadvantages
It requires knowing the prior probability P(F1, F2, … | C), so when the assumed prior model does not match reality the predictions can be poor: for example, if the training articles are collected badly (e.g. they include cheating articles stuffed with certain words), the results will be distorted.
Hands-on practice
Data preprocessing
Model prediction and evaluation
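A hedged sketch of Naive Bayes text classification. The 20 newsgroups corpus (downloaded by sklearn on first use) stands in for the project's data; alpha=1.0 is the Laplace smoothing coefficient mentioned above.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

news = fetch_20newsgroups(subset="all")
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=0)

tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(x_train)
x_test = tfidf.transform(x_test)       # reuse the vocabulary learned from the training set

mlt = MultinomialNB(alpha=1.0)         # alpha is the Laplace smoothing coefficient
mlt.fit(x_train, y_train)
print("accuracy:", mlt.score(x_test, y_test))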
Decision tree classification algorithm
information entropy
Decision tree splits are chosen based on information gain.
ID3 case
Common algorithms
ID3
Criterion: maximize the information gain
An extreme case to build intuition: if a feature takes only a single value, splitting on it adds nothing, its information gain is zero, and the feature can be dropped
Disadvantage: when the branch entropies are similar, a feature with two values weights each branch by 1/2 while a feature with three values weights each branch by 1/3, so the feature with more values gets the larger gain. ID3 therefore prefers features with many distinct values.
C4.5
Criterion: maximize the information gain ratio
CART
Classification tree: minimize the Gini index
Example: whether a loan will default
Data
Split on home ownership
Split on marital status
Split on annual income
Continue splitting on the remaining attributes
Final decision tree
Summary of common decision tree types
Advantages and Disadvantages:
Advantages
1. Easy to understand and interpret; the tree can be visualized. 2. Requires little data preparation, whereas other techniques usually need normalization or standardization
Disadvantages
1. A fully grown tree is too complex and easily overfits. 2. Decision trees can be unstable: small changes in the data may produce a completely different tree.
ways to improve
cart pruning
pre-pruning
(1) Set a minimum number of samples per node, e.g. 10: if a node contains fewer than 10 samples, it is not split further.
(2) Specify the height or depth of the tree, e.g. a maximum depth of 4;
(3) If a node's entropy falls below a specified threshold, it is not split further.
post-pruning
Prune the fully grown, overfitted decision tree after it has been generated, to obtain a simplified version of the tree.
Titanic Survival Prediction Project
Handle missing values and split data
Convert the text features into vectors, then train the model and predict (a hedged sketch follows)
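A hedged sketch of this pipeline. It assumes a local titanic.csv with at least the columns pclass, age, sex, and survived; max_depth=5 is an illustrative pre-pruning choice.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

titan = pd.read_csv("titanic.csv")
X = titan[["pclass", "age", "sex"]].copy()
y = titan["survived"]
X["age"] = X["age"].fillna(X["age"].mean())            # fill missing ages with the column mean

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

dv = DictVectorizer(sparse=False)                      # one-hot encodes the string-valued features
x_train = dv.fit_transform(x_train.to_dict(orient="records"))
x_test = dv.transform(x_test.to_dict(orient="records"))

dt = DecisionTreeClassifier(max_depth=5)               # pre-pruning via maximum depth
dt.fit(x_train, y_train)
print("accuracy:", dt.score(x_test, y_test))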
Ensemble learning method-random forest
A random forest is a classifier that contains multiple decision trees, and its output category is determined by the mode of the category output by the individual trees.
Key steps in building a random forest (N is the number of training samples, M the number of features): 1) draw one sample at a time with replacement, repeating N times (duplicate samples may occur); 2) randomly select m features with m << M, and build a decision tree
Construction process
1. Why randomly sample the training set? If no random sampling were done and every tree used the same training set, the trained trees would all give exactly the same classification results. 2. Why sample with replacement? Without replacement, the trees' training samples would be disjoint, so each tree would be "biased" and "one-sided", i.e. the trained trees would differ wildly; since the final classification of a random forest depends on the votes of multiple trees (weak classifiers), the trees should see overlapping but different samples rather than completely separate data.
Advantages
1. Excellent accuracy among current algorithms. 2. Runs efficiently on large data sets. 3. Can handle input samples with high-dimensional features without dimensionality reduction. 4. Can evaluate the importance of each feature in a classification problem. 5. Still gives good results when some values are missing.
Actual code
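A minimal random forest sketch on the built-in iris data; n_estimators and max_depth are the usual parameters to tune (e.g. via GridSearchCV), and feature_importances_ illustrates advantage 4 above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, n_jobs=-1, random_state=0)
rf.fit(x_train, y_train)
print("accuracy:", rf.score(x_test, y_test))
print("feature importances:", rf.feature_importances_)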
Package imports 3
Regression Algorithm-Linear Regression Analysis
Definition: linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable; the prediction is a linear combination whose model parameters are called regression coefficients
formula
loss function
Visual diagram
Solution method: How to find W in the model to minimize the loss? (The purpose is to find the W value corresponding to the minimum loss, this is the key point)
normal equation
Derivation process (𝑋 is the feature matrix, 𝑦 is the target vector): w = (XᵀX)⁻¹Xᵀy
Disadvantages: 1. When there are many features, solving is too slow (the matrix inversion is expensive). 2. XᵀX is sometimes not invertible, in which case the normal equation cannot be solved.
Hands-on: Boston house price regression prediction
Data preprocessing
lr = LinearRegression(); fit and predict (a hedged sketch follows)
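A hedged sketch of the regression pipeline. The built-in California housing data (downloaded by sklearn on first use) stands in for the Boston data, since load_boston was removed from recent sklearn releases.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

lr = LinearRegression()      # closed-form least-squares solution, no gradient descent
lr.fit(x_train, y_train)
print("weights:", lr.coef_)
print("MSE:", mean_squared_error(y_test, lr.predict(x_test)))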
gradient descent
Gradient descent update rule: w ← w − α · ∂loss/∂w, where α is the learning rate
learning rate
is a hyperparameter, adjust it to get the minimum loss
descent process
Hands-on: Boston house price regression prediction
sgd = SGDRegressor(eta0=0.008) prediction
SGDRegressor has many parameters; only the common ones are listed here: penalty (the regularization term, L1 or L2), learning_rate (the learning-rate schedule), and alpha (the regularization strength).
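A minimal SGDRegressor sketch using the parameters just listed; synthetic data from make_regression stands in for the housing data, and eta0=0.008 mirrors the value used above.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)   # scaling matters a lot for gradient descent
x_test = std.transform(x_test)

sgd = SGDRegressor(penalty="l2", alpha=0.0001, learning_rate="invscaling", eta0=0.008)
sgd.fit(x_train, y_train)
print("MSE:", mean_squared_error(y_test, sgd.predict(x_test)))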
Some knowledge points
SGD stands for Stochastic Gradient Descent: the model is updated while the step size (i.e. the learning rate) decreases according to a schedule.
The regularizer is a penalty added to the loss function.
L1 regularization produces sparse weights: it effectively performs automatic feature selection by setting the weights of useless features to exactly 0, which helps prevent overfitting.
The main role of L2 is to prevent overfitting: it pushes the parameters (in particular the coefficients of higher-order terms) toward smaller values; the closer the higher-order coefficients are to 0, the simpler and smoother the model, which prevents overfitting.
Regularization strength: large → parameters approach 0 and the higher-order terms vanish; small → the parameters change little (the weights of higher-order terms are barely shrunk)
gradient descent method
Full Gradient Descent Algorithm (FG)
Calculate the errors of all samples in the training set, sum them up and take the average as the objective function. Batch gradient descent is slow because we need to compute all gradients on the entire dataset when performing each update. At the same time, batch gradient descent cannot handle datasets that exceed the memory capacity limit.
Stochastic Gradient Descent Algorithm (SG)
In each round the objective function is no longer the error over all samples but the error of a single sample: only the gradient for one sample is computed to update the weights, then the next sample is taken and the process repeats until the loss stops decreasing or falls below some tolerable threshold. The process is simple and efficient, and the noise it introduces often helps the iterations avoid converging to a poor local optimum.
Ridge Regression
Ridge regression is a regularized version of linear regression, that is, adding regular terms to the cost function of the original linear regression (ie, linear regression with l2 regularization)
formula
Actual code
Lasso Regression
Lasso regression is linear regression with L1 regularization
formula
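A minimal sketch comparing Ridge (L2) and Lasso (L1) on synthetic data; alpha is the regularization strength in both estimators, and the weight counts illustrate the sparsity point made earlier about L1.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("non-zero Ridge weights:", (ridge.coef_ != 0).sum())   # L2 shrinks weights but rarely zeros them
print("non-zero Lasso weights:", (lasso.coef_ != 0).sum())   # L1 drives useless weights to exactly 0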
How to choose the right machine learning algorithm
Cause of underfitting: too few features of the data are learned. Solution: increase the number of data features.
Causes and solutions to overfitting
Causes: there are too many original features and some of them are noisy; the model is too complex because it tries to account for every individual data point
Solutions: perform feature selection and remove highly correlated features (hard to do); use cross-validation (so all the data gets used for training); apply regularization
Classification Algorithm - Logistic Regression
By itself it only solves binary classification; multi-class problems are handled by repeatedly performing binary classification.
activation function
sigmoid function
Function formula (z is the result of regression)
Output: probability value in the interval [0,1], default 0.5 as the threshold
cost loss function
Calculation process 1
Derivative of cost loss function with respect to w
The derivation process
Gradient descent to find the optimal w
Gradient descent formula
Gradient descent process gradually obtains the optimal dividing line
Practical combat - logistic regression for binary classification for cancer prediction
Data preprocessing
Model prediction
result
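A hedged sketch of the binary cancer-prediction pipeline, using sklearn's built-in breast cancer dataset as a stand-in for the project's data; C is the inverse of the regularization strength.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

lg = LogisticRegression(C=1.0)     # C is the inverse of the regularization strength
lg.fit(x_train, y_train)
print(classification_report(y_test, lg.predict(x_test), target_names=["malignant", "benign"]))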
discriminative and generative models
Unsupervised learning - cluster analysis
k-means
Basic principles of algorithm
Kmeans performance evaluation metrics
Silhouette coefficient
Silhouette coefficient explanation
Define sc_i = (b_i − a_i) / max(a_i, b_i), where a_i is the average distance from sample i to the other samples in its own cluster and b_i is the average distance to the samples of the nearest other cluster. 1. If sc_i < 0, the within-cluster distance a_i is greater than the distance to the nearest other cluster: the clustering is poor. 2. The larger sc_i is, the smaller a_i is relative to the nearest other cluster: the clustering is good. 3. The silhouette coefficient lies in [−1, 1]; the closer to 1, the better the cohesion and separation.
Hands-on: Taobao user cluster analysis
Read tables, merge tables
Make a crosstab of user ID and product ID
PCA principal component analysis dimensionality reduction
clustering model
Clustering results
Silhouette coefficient calculation
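A minimal sketch of k-means plus the silhouette coefficient; synthetic blobs stand in for the PCA-reduced user/product cross-tab, and n_clusters=4 is an illustrative choice.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)
print("silhouette:", silhouette_score(X, labels))   # closer to 1 means better cohesion and separation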
Outlier detection method
Draw box plot
principle
Z-score
principle
DBSCAN
All data points are defined as core points (Core Points), border points (Border Points) or noise points, and then clustered
Isolation Forest
Isolating an outlier requires fewer splits than isolating a normal point, i.e. outliers have lower isolation numbers (shorter isolation paths) than non-outliers. A data point is therefore flagged as an outlier if its isolation number is below a threshold.
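A minimal IsolationForest sketch on made-up data; fit_predict returns -1 for outliers and 1 for normal points, and contamination is the assumed fraction of outliers.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6, high=8, size=(5, 2))
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.05, random_state=0)
pred = iso.fit_predict(X)
print("points flagged as outliers:", (pred == -1).sum())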
Ensemble learning
Definition: Unifying the results of base classifiers into a final decision
Classification
Boosting (serial)
The prediction of the next base classifier depends on the output of the previous base classifier
The Boosting method uses a serial method to train base classifiers, and there are dependencies between each base classifier. Its basic idea is to stack base classifiers layer by layer. During training, each layer gives higher weight to the samples that were misclassified by the previous layer's base classifier. During testing, the final result is obtained based on the weighting of the results of each layer of classifiers.
Bagging (parallel)
There is no strong dependence between the base classifiers, so they can be trained in parallel; random forest, whose base classifiers are decision trees, is a typical example. To make the base classifiers as independent as possible, the training set is divided into several subsets (when the number of training samples is small, the subsets may overlap). It resembles a collective decision: each individual learns separately, and what they learn may be the same, different, or partially overlapping; because the individuals differ, their judgments will not be fully consistent. At decision time each individual judges independently, and the final collective decision is made by voting.
Understand the differences between Boosting and Bagging methods from the perspective of eliminating the bias and variance of the base classifier
The error of the base classifier is the sum of the bias and variance errors. The bias is mainly due to systematic errors caused by the limited expressive ability of the classifier, which is manifested in the non-convergence of the training error. The variance is due to the classifier being too sensitive to the sample distribution, resulting in overfitting when the number of training samples is small.
Bias
Bias refers to the deviation between the average output of the trained model and the output of the real model. The error caused by bias is usually reflected in the training error.
variance
Variance refers to the variance of the outputs of all models trained on all sampled training sets of size m; it usually arises when the model is too complex relative to the number of training samples m. The error caused by variance shows up as the gap between the test error and the training error. Low-variance predictions are tightly clustered.
Target-shooting analogy
Suppose a shot is the model making a prediction on a sample. Hitting the bull's-eye position means the prediction is accurate, and the further it deviates from the bull's-eye, the greater the prediction error.
In the upper left corner, the shooting results are accurate and concentrated, indicating that the bias and variance of the model are very small; Although the center of the shooting results in the upper right picture is around the bullseye, the distribution is relatively scattered, indicating that the model has a small deviation but a large variance; The lower left figure shows that the model variance is small and the deviation is large; The picture on the lower right shows that the model has a large variance and a large deviation.
The relationship between generalization error, bias, variance, and model complexity
The Boosting method reduces the bias of the integrated classifier by gradually focusing on the samples that were misclassified by the base classifier.
The Bagging method adopts a divide-and-conquer strategy to reduce the variance of the integrated classifier by sampling training samples multiple times, training multiple different models separately, and then synthesizing them.
Bagging diagram
Model 1, Model 2, and Model 3 are all trained using a subset of the training set. Viewed individually, their decision boundaries are very tortuous and tend to overfit. The decision boundary of the integrated model (shown by the red line) is smoother than that of each independent model. This is due to the integrated weighted voting method, which reduces the variance.
Basic steps of ensemble learning
(1) Find a base classifier whose errors are independent of each other. (2) Train the base classifier. (3) Merge the results of the base classifiers. There are two methods of merging base classifiers: voting and stacking.
Example
Adaboost
An ID3 decision tree is chosen as the base classifier because the tree model has a simple structure and is easy to randomize, so it is the most commonly used choice
Correctly classified samples have their weights reduced, while misclassified samples have their weights increased or kept unchanged. In the final model fusion, the base classifiers are also weighted according to their error rates: classifiers with low error rates get a larger say in the vote
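A minimal AdaBoost sketch on synthetic two-moons data; sklearn's default base learner is a depth-1 decision tree (a stump), and n_estimators=200 is an illustrative choice.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, random_state=0)   # default base learner: a depth-1 decision tree
ada.fit(x_train, y_train)
print("accuracy:", ada.score(x_test, y_test))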
GBDT gradient boosting decision tree
main idea
Train a new weak classifier on the negative gradient of the model's loss function, then add the trained weak classifier to the existing model in an additive (cumulative) way (i.e. train on the residuals)
Example
A video website needs to predict each user's age; features include visit duration, time of day, types of videos watched, etc. Suppose user A's real age is 25, but the first decision tree predicts 22, a difference of 3 years, i.e. the residual is 3. In the second tree we use 3 as A's target to learn. If the second tree can place A in a leaf predicting 3, adding the two trees' results recovers A's true age; if the second tree instead predicts 5, A's residual becomes −2, so A's target in the third tree is −2, and learning continues. Finally all the results are added up. This use of residuals to keep learning is what the "Gradient Boosted" in GBDT means.
XGBoost
The original GBDT builds each new decision tree from the negative gradient of the empirical loss and only prunes after the tree is built, whereas XGBoost adds regularization terms during the tree-construction phase. Compared with GBDT, XGBoost also includes many engineering optimizations in its implementation.
Commonly used base classifiers
decision tree
There are mainly three reasons. (1) Decision trees can more easily integrate the weight of samples into the training process. (2) The expression ability and generalization ability of the decision tree can be compromised by adjusting the number of layers of the tree. (3) The perturbation of data samples has a greater impact on the decision tree, so the decision tree base classifier generated by different subsample sets is more random. Such an "unstable learner" is more suitable as a base classifier. In addition, when the decision tree node is split, a feature subset is randomly selected to find the optimal split attribute, which introduces randomness well.
neural network model
Neural network models are also relatively "unstable", and randomness can be introduced by adjusting the number of neurons, the connection patterns, the number of layers, the initial weights, etc.
common problem
Can the base classifier in random forest be changed from a decision tree to a linear classifier or K-nearest neighbors?
No. Random forest belongs to the bagging family of ensemble learning, whose main benefit is that the variance of the ensemble is smaller than the variance of the base classifier. The base classifier used in bagging should therefore be one that is sensitive to the sample distribution (a so-called unstable classifier) for bagging to help. Linear classifiers and K-nearest neighbors are relatively stable classifiers whose variance is not large, so bagging them brings little benefit.
What are the advantages and limitations of GBDT?
Advantages: (1) Prediction-stage computation is fast. (2) On densely distributed data sets its generalization and expressive power are very good, which is why GBDT often tops Kaggle competitions. (3) Using decision trees as weak classifiers gives GBDT good interpretability and robustness; it can automatically discover high-order relationships between features and does not require special preprocessing of the data such as normalization.
Limitations: (1) GBDT performs worse than support vector machines or neural networks on high-dimensional sparse data sets. (2) GBDT has no obvious advantage on text-classification feature problems. (3) Training is inherently serial; only local parallelism within each decision tree can be used to speed it up.
The difference between gradient boosting and gradient descent
In gradient descent, the model is represented in a parameterized form, so that the update of the model is equivalent to the update of the parameters.
In gradient boosting, the model does not need to be parameterized, but is directly defined in the function space, which greatly expands the types of models that can be used, so that different models can be combined together, such as GBDT
Why ensemble learning models can improve accuracy
Voting calculation principle
Ensemble learning in practice
Generate data
make_moons (y has two labels 0,1)
data splitting
split
Logistic regression, SVC, and decision tree classify and predict respectively, and then vote
Ensemble learning: VotingClassifier
Hard voting and soft voting
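A minimal sketch of voting over the three classifiers named above, on make_moons data; soft voting averages predicted probabilities, so SVC needs probability=True, while voting="hard" takes a majority vote on the predicted labels.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("svc", SVC(probability=True)),            # soft voting needs predict_proba
                ("dt", DecisionTreeClassifier(random_state=42))],
    voting="soft")                                          # use voting="hard" for a majority vote
voting.fit(x_train, y_train)
print("accuracy:", voting.score(x_test, y_test))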
Use bagging with oob_score (evaluate on the out-of-bag samples a tree never drew) and n_jobs to set the number of cores (n_jobs=-1 trains on all cores for better efficiency)
bootstrap_features samples a subset of the features as well, comparable to what random forest does
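A minimal bagging sketch using the oob_score and n_jobs options just mentioned; the tree count and max_samples values are illustrative.

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=500,     # 500 trees
                        max_samples=100,      # each tree sees 100 bootstrap samples
                        bootstrap=True,       # sampling with replacement
                        oob_score=True,       # evaluate on the samples each tree never saw
                        n_jobs=-1,            # train on all CPU cores
                        random_state=42)
bag.fit(X, y)
print("out-of-bag score:", bag.oob_score_)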
Extra-Trees (extremely randomized trees)
The decision trees split nodes using random features and random thresholds. This extra randomness suppresses overfitting, curbing variance at the cost of higher bias, and also gives faster training.
Boosting (serial)
AdaBoost GBDT