Machine Learning (Xigua Book), Chapter 2: Model Evaluation and Selection
The map is concise and helpful for understanding and memorizing the key points. It took a few days of hard work to sort out, and a few brain-wracking days at that... sob. Likes and saves are welcome!
Edited at 2024-03-22 23:38:51
Model evaluation and selection
(2.1) Empirical error and overfitting
Overfitting
Definition: the learner treats characteristics specific to the training samples as general properties that all potential samples will have.
Drawback: the learner learns the training samples "too well", so peculiarities of the training data are mistaken for general properties of all samples.
Ways to alleviate it: add regularization terms to the optimization objective; early stopping.
Underfitting
Definition: the learner has not yet captured the general properties of the training samples.
Ways to overcome it: decision trees: expand more branches; neural networks: increase the number of training epochs.
Model selection problem
Ideal solution: evaluate the generalization error of the candidate models and select the model with the smallest generalization error.
Practical problem: the generalization error cannot be obtained directly, and the training error is unsuitable as a selection criterion because of overfitting.
(2.2) Evaluation method
Three key questions
How do we obtain the test results?
Evaluation methods
How do we measure the performance?
Performance measures
How do we judge whether a difference is substantive?
Comparison tests
Use the test error as an approximation of the generalization error
Obtaining a test set
The test set should be "mutually exclusive" with the training set
hold-out
Directly divide the data set into two mutually exclusive sets, one used for training and the other for testing. The training/test split should preserve the data distribution as much as possible (e.g., via stratified sampling). The test set should be neither too large nor too small (e.g., 1/5 to 1/3 of the data). Perform several random splits, repeat the experiments, and report the average result (e.g., 100 random splits).
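A minimal sketch of the hold-out procedure, assuming scikit-learn is available; the iris data, the 1/5 test fraction, and the 100 random splits are illustrative choices only, not part of the book.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
scores = []
for seed in range(100):                                        # e.g. 100 random divisions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)    # stratified 4:1 split
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))                     # accuracy on this test set
print(np.mean(scores))                                         # average over the repeats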
cross validation
Use stratified sampling to divide the data set into k mutually exclusive subsets of similar size. Each time, the union of k-1 subsets serves as the training set and the remaining subset as the test set; the mean of the k test results is returned. The value of k affects the stability and fidelity of the estimate; the most commonly used value is k = 10.
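A sketch of 10-fold stratified cross-validation with scikit-learn; the data set and learner are placeholder choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)    # k = 10 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())                                               # mean of the k test results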
The bootstrap method, also known as sampling with replacement, allows the same sample to be drawn repeatedly.
The resulting training set has the same size as the original data set, but its data distribution is changed.
Bootstrapping is useful when the data set is small and it is difficult to split it effectively into training/test sets; however, since it changes the distribution of the data it may introduce estimation bias, so when the amount of data is sufficient, the hold-out and cross-validation methods are more commonly used.
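A NumPy sketch of bootstrap sampling; the data set size m is an arbitrary illustrative value, and the out-of-bag fraction is expected to approach 1/e ≈ 0.368.

import numpy as np

rng = np.random.default_rng(0)
m = 1000                                         # size of the original data set
boot_idx = rng.integers(0, m, size=m)            # draw m indices with replacement: the training set
oob_idx = np.setdiff1d(np.arange(m), boot_idx)   # samples never drawn ("out-of-bag"): the test set
print(len(oob_idx) / m)                          # roughly (1 - 1/m)^m ≈ 0.368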
"Parameter adjustment" and final model
Algorithm parameters: generally set manually, also known as "hyperparameters"
Model parameters: generally learned from the data
After the learning algorithm and its parameters have been selected, the final model should be retrained on all the data used for model selection (the training set plus the validation set). Note the difference between the validation set and the test set.
(2.3) Performance measurement
To evaluate the generalization performance of a learner, we need not only effective and feasible experimental estimation methods, but also evaluation criteria that measure the model's generalization ability; these criteria are called performance measures.
Performance measures reflect the task requirements: how good a model is depends not only on the algorithm and the data, but also on the task at hand. The most commonly used performance measure for regression tasks is the mean squared error.
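In the book's notation, for a learner f and data set D = {(x_1, y_1), ..., (x_m, y_m)}, the mean squared error is

E(f;D) = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x_i) - y_i\bigr)^2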
"Error rate" and "Precision"
For classification tasks, "error rate" and "precision" are the two most commonly used performance measures
Error rate: the proportion of incorrectly classified samples to the total number of samples
Precision: the proportion of paired samples to the total number of samples
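Written out, with \mathbb{I}(\cdot) the indicator function:

E(f;D) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\bigl(f(x_i) \neq y_i\bigr), \qquad \mathrm{acc}(f;D) = 1 - E(f;D)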
"Precision rate", "recall rate" and F1
Although "error rate" and "accuracy" are commonly used, they do not meet the needs of all tasks. Similar requirements often appear in information retrieval, web search and other applications. "Precision rate" and "recall rate" are performance measures that are more suitable for such needs.
"Precision rate" and "recall rate" are a pair of contradictory measures. Generally speaking, when the "precision rate" is high, the "recall rate" is often low, and when the query rate is high, the precision rate is often low.
By counting the combination of real markers and predicted results, a "confusion matrix" can be obtained
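For binary classification the confusion matrix has four entries, from which precision, recall, and F1 are defined: TP (true positive): truly positive, predicted positive; FN (false negative): truly positive, predicted negative; FP (false positive): truly negative, predicted positive; TN (true negative): truly negative, predicted negative.

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2PR}{P + R}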
"P-R" curve
According to the learner's predictions, the samples are sorted by how likely they are to be positive and are then predicted as positive one by one; computing the precision and recall at each point yields the "precision-recall" curve, or "P-R" curve for short.
Break-Even Point ("balance point", BEP)
The BEP is the value at which precision equals recall; it can be used to compare the classification performance of learners whose P-R curves cross.
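A sketch of computing the points of a P-R curve with scikit-learn; the breast-cancer data and the logistic-regression scorer are placeholder choices.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = load_breast_cancer(return_X_y=True)
scores = LogisticRegression(max_iter=5000).fit(X, y).decision_function(X)   # positive-class scores
precision, recall, thresholds = precision_recall_curve(y, scores)           # one (P, R) point per threshold
# The break-even point lies where precision is (approximately) equal to recall.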
ROC and AUC
ROC
Similar to the P-R curve, the samples are sorted by the learner's prediction scores and predicted as positive one by one. Plotting the "False Positive Rate" (FPR) on the horizontal axis against the "True Positive Rate" (TPR) on the vertical axis gives the "Receiver Operating Characteristic" (ROC) curve.
AUC
If one learner's ROC curve is completely "wrapped" (enclosed) by another learner's curve, the latter performs better than the former; if the curves cross, they can be compared by the area under the ROC curve, i.e., the AUC value.
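A matching sketch for the ROC curve and AUC with scikit-learn, using the same placeholder data and model as the P-R example above.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
scores = LogisticRegression(max_iter=5000).fit(X, y).decision_function(X)
fpr, tpr, thresholds = roc_curve(y, scores)   # FPR = FP/(TN+FP), TPR = TP/(TP+FN)
print(roc_auc_score(y, scores))               # AUC: area under the ROC curve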
(2.6) Reading materials
These people are so awesome
(2.5) Bias and variance
"Bias-variance decomposition" is an important tool for explaining the generalization ability of learning algorithms. Bias-variance decomposition attempts to unpack the expected generalization error rate of a learning algorithm.
Will
Deviation: Measures the degree of deviation between the expected prediction of the learning algorithm and the true result; that is, Describes the fitting ability of the learning algorithm itself; Variance: measures the change in learning performance caused by changes in the training set of the same size; that is, it depicts the impact of data disturbance; Noise: expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task; that is, it depicts the difficulty of the learning problem itself. Generalization performance is determined by the ability of the learning algorithm, the adequacy of the data, and the difficulty of the learning task itself. In order to achieve good generalization performance for a given learning task, it is necessary to make the bias small (fully fit the data) and the variance small (reduce the impact of data disturbance).
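In symbols, the decomposition of the expected generalization error reads

E(f;D) = \mathit{bias}^2(x) + \mathit{var}(x) + \varepsilon^2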
1. When training is insufficient, the learner's fitting ability is weak and perturbations of the training data are not enough to cause noticeable changes in the learner; bias then dominates the generalization error. 2. As training proceeds, the learner's fitting ability gradually strengthens, and variance gradually comes to dominate the generalization error. 3. After sufficient training, the learner's fitting ability is very strong: slight perturbations of the training data lead to significant changes in the learner, and if non-global peculiarities of the training data are learned, overfitting occurs.
(2.4) Comparison tests
Commonly used methods: statistical hypothesis testing (hypothesis test)
Hypothesis testing of single learner generalization performance
binomial test
1. Assume that the test samples are drawn independently from the population distribution.
2. Denote the generalization error rate by \epsilon. The probability that exactly m' of the m test samples are misclassified and the rest are classified correctly is P(\hat{\epsilon};\epsilon) = \binom{m}{m'}\epsilon^{m'}(1-\epsilon)^{m-m'}, where \hat{\epsilon} = m'/m is the test error rate.
3. To test the hypothesis "\epsilon \le \epsilon_0" at significance level \alpha, compute the critical error rate \bar{\epsilon} = \max \epsilon \ \text{s.t.}\ \sum_{i=\epsilon_0 m + 1}^{m} \binom{m}{i} \epsilon^{i} (1-\epsilon)^{m-i} < \alpha.
4. If the observed test error rate \hat{\epsilon} is below the critical value \bar{\epsilon}, the hypothesis "\epsilon \le \epsilon_0" cannot be rejected, i.e., the generalization error rate is at most \epsilon_0 with confidence 1-\alpha.
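A small sketch of the one-sided binomial test with SciPy; m, eps0, alpha, and the observed error count are made-up illustrative numbers.

from scipy.stats import binom

m, eps0, alpha = 30, 0.3, 0.05
crit = binom.ppf(1 - alpha, m, eps0)   # critical count: smallest k with P(errors <= k) >= 1 - alpha under Binomial(m, eps0)
m_err = 6                              # observed number of misclassified test samples
print(m_err <= crit)                   # True: observed errors are not in the upper alpha tail, so H0 is not rejected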
t-test
When multiple training/test runs are available, e.g., from repeated hold-out splits or from cross-validation, the "t-test" can be used.
Suppose k test error rates \hat{\epsilon}_1, \hat{\epsilon}_2, \ldots, \hat{\epsilon}_k are obtained, with mean \mu and variance \sigma^2. For the hypothesis "\mu = \epsilon_0" at significance level \alpha, the statistic \tau_t = \sqrt{k}(\mu - \epsilon_0)/\sigma follows a t-distribution with k-1 degrees of freedom; if \tau_t lies within the critical range [-t_{\alpha/2}, t_{\alpha/2}], the hypothesis cannot be rejected, i.e., the generalization error rate can be taken to be \epsilon_0 with confidence 1-\alpha.
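A one-sample t-test sketch with SciPy; the k = 10 error rates and eps0 below are made-up values for illustration.

import numpy as np
from scipy.stats import ttest_1samp

errors = np.array([0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13])  # k test error rates
t_stat, p_value = ttest_1samp(errors, popmean=0.12)   # H0: mean error rate equals eps0 = 0.12
print(p_value > 0.05)                                  # True: H0 cannot be rejected at alpha = 0.05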
Comparison of two learners
Cross-validation t-test (based on paired t-test)
k-fold cross-validation; 5×2 cross-validation (because the training sets of different folds overlap, the k error rates are not independent, which overstates significance; the 5×2 cross-validation procedure alleviates this)
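A cross-validated paired t-test sketch combining scikit-learn and SciPy; the two learners and the data set are placeholder choices (this is plain k-fold pairing, not the 5×2 variant).

from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)        # the same folds for both learners
acc_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
t_stat, p_value = ttest_rel(acc_a, acc_b)   # paired t-test on the per-fold differences
print(p_value < 0.05)                       # True would indicate a significant difference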
McNemar test (based on contingency table, chi-square test)
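The McNemar statistic, with e_{01} and e_{10} counting the test samples on which exactly one of the two learners is wrong (one each way), is

\tau_{\chi^2} = \frac{(|e_{01} - e_{10}| - 1)^2}{e_{01} + e_{10}}

which follows a chi-square distribution with 1 degree of freedom under the hypothesis that the two learners perform the same.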
Comparison of multiple learners
Friedman test + Nemenyi post-hoc test
Friedman test (based on the algorithms' average ranks, using an F test; determines whether all algorithms perform the same)
Nemenyi post-hoc test (based on the average ranks; further determines which pairs of algorithms differ significantly)
Significance test
When analyzing and comparing results, you cannot draw conclusions merely because two numbers differ; you must perform a statistical analysis and test whether the difference is significant.
In statistics, a significance test is a form of "statistical hypothesis testing". The name itself points out that the prerequisite of a significance test is a statistical hypothesis; in other words, no hypothesis, no test.
Premise: before using a test, be clear about what the hypothesis is. In plain terms: first make a hypothesis about the data, then use the test to check whether that hypothesis holds.
The hypothesis to be tested is called the null hypothesis, denoted H0; the corresponding (opposite) hypothesis is called the alternative hypothesis, denoted H1.
Type I error
H0 is true, but the test's conclusion advises you to reject H0. The probability of a Type I error is usually denoted \alpha (called the significance level; common choices are \alpha = 0.05 and \alpha = 0.01, meaning the error rate of the test's conclusion is kept below 5% or 1%. In statistics, events with a probability below 5% are usually treated as small-probability events, practically not expected to occur in a single trial).
Type II error
H0 is false, but the test's conclusion advises you to accept H0. The probability of a Type II error is usually denoted \beta.
When the data differ significantly, it means the data being compared do not come from the same population, but from two different populations.
The parameter tuning process is similar to model selection: first generate several candidate models (e.g., by choosing a range and a step size for each parameter), then select among them using an evaluation method. How well the parameters are tuned often has a decisive impact on the final performance.
10 times 10-fold cross-validation; leave-one-out method: if the data set D contains m samples and k = m, cross-validation reduces to the leave-one-out method.