MindMap Gallery traditional neural network
A review of key points about traditional neural networks for machine learning, including nonlinear activation functions, the concept of the gradient, the concept of linear regression, its application scenarios and limitations, and the structure of neural networks.
Edited at 2022-11-23 09:35:21
traditional neural network
nonlinear activation function
sigmoid
advantage
Compresses input feature values from a wide range into the interval between 0 and 1, so the amplitude of the data stays bounded as it passes through a deep network
Closest to biological neurons in a physical sense
Because its output lies between 0 and 1, it suits models whose output is a predicted probability
shortcoming
When the input is very large or very small, the function saturates: the output barely changes, so the gradient is close to 0.
Gradients may vanish early in training, which slows convergence
The exponential operation is relatively expensive to compute
The output is not zero-mean, so neurons in the next layer receive a non-zero-mean signal from the previous layer as input. As the network deepens, the distribution of the original data drifts.
tanh
advantage
Its output is zero-mean, which fixes the non-zero-mean output of the sigmoid function
The derivative of tanh ranges from 0 to 1, compared with 0 to 0.25 for sigmoid, which alleviates the vanishing gradient problem to some extent.
Near the origin, tanh is close to y = x. When the input activation values are small, the layer reduces almost to a direct matrix operation, and training is relatively easy.
shortcoming
Like sigmoid, it saturates for inputs of large magnitude, so the vanishing gradient problem still exists
Its two equivalent forms, 2*sigmoid(2x)-1 and (exp(x)-exp(-x))/(exp(x)+exp(-x)), show that the cost of exponential operations also remains
ReLU
advantage
Compared with sigmoid and tanh, ReLU does not saturate for positive inputs, which largely avoids the vanishing gradient problem there and makes deep networks trainable.
It is very fast to compute: it only needs to check whether the input is greater than 0
Its convergence is much faster than that of sigmoid and tanh
ReLU sets the output of some neurons to 0, which makes the network sparse and reduces the interdependence between parameters, alleviating overfitting to some extent
shortcoming
The output of ReLU is not zero-mean.
The dead ReLU problem: some neurons may never be activated, so their parameters are never updated. The main causes are poor parameter initialization and a learning rate that is too large.
For positive inputs the derivative is 1, so the gradient does not vanish along the chain rule, but its magnitude then depends entirely on the product of the weights, which may lead to exploding gradients
Leaky ReLU
advantage
To address the dead ReLU problem, Leaky ReLU gives negative inputs a very small slope. This removes the zero gradient for negative inputs and so largely alleviates the dead ReLU problem
The function is f(x) = x for x > 0 and f(x) = αx otherwise, so its output ranges from negative infinity to positive infinity, extending the range of ReLU. α is generally set to a small value such as 0.01
shortcoming
In theory this function should work better than ReLU, but extensive practice has shown that its benefit is inconsistent, so it is not widely used.
Because different functions apply on different intervals, it cannot provide a consistent relationship between predictions for positive and negative inputs.
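The four activation functions above and their derivatives can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function names and the default alpha=0.01 are choices made here for the example.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centred squashing function with outputs in (-1, 1).
    return np.tanh(x)

def relu(x):
    # Passes positive inputs through and zeroes the rest.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs keep a small slope alpha.
    return np.where(x > 0, x, alpha * x)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, reached at x = 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # at most 1, reached at x = 0

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-10.0, 0.0, 10.0])
# Saturation: sigmoid and tanh gradients are tiny at the extremes,
# while ReLU keeps a gradient of 1 for positive inputs.
print(sigmoid_grad(x), tanh_grad(x), relu_grad(x), leaky_relu_grad(x))
```

Evaluating the gradients at a large positive or negative input makes the saturation of sigmoid and tanh visible next to the constant slope of ReLU.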
The concept of gradient
The gradient is a vector. At a given point, the directional derivative of a function reaches its maximum along the direction of the gradient, that is, the function changes fastest along that direction, and the maximum rate of change equals the magnitude (modulus) of the gradient.
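A small numerical check can make this definition concrete. The sketch below assumes an example function f(x, y) = x^2 + 3y^2 (chosen here only for illustration) and estimates derivatives with central differences; the helper names are hypothetical.

```python
import numpy as np

def f(p):
    # Example function chosen for illustration: f(x, y) = x^2 + 3*y^2.
    x, y = p
    return x ** 2 + 3.0 * y ** 2

def numerical_gradient(func, p, h=1e-5):
    # Central-difference estimate of the gradient of func at point p.
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2.0 * h)
    return grad

def directional_derivative(func, p, direction, h=1e-5):
    # Rate of change of func at p along the (normalized) direction.
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    p = np.asarray(p, dtype=float)
    return (func(p + h * u) - func(p - h * u)) / (2.0 * h)

point = np.array([1.0, 1.0])
grad = numerical_gradient(f, point)                       # analytically (2, 6)
along_grad = directional_derivative(f, point, grad)       # equals |grad|
along_x = directional_derivative(f, point, [1.0, 0.0])    # smaller than |grad|
print(grad, along_grad, along_x)
```

The rate of change along the gradient direction matches the gradient's magnitude, and any other direction gives a smaller value.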
The concept of linear regression
Uses a linear relationship to describe the mapping from input to output
Linear regression application scenarios
Network analysis, risk analysis, stock price prediction, weather forecast
Limitations of linear regression
Linear regression describes linearly distributed data clearly, but is weak at describing nonlinearly distributed data.
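The limitation can be seen with a least-squares fit. The sketch below is an illustration under assumed toy data (a straight line and a parabola, both invented here), using numpy.linalg.lstsq; fit_line and mse are hypothetical helper names.

```python
import numpy as np

def fit_line(x, y):
    # Least-squares fit of y ≈ w * x + b.
    A = np.column_stack([x, np.ones_like(x)])
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, b

def mse(x, y, w, b):
    # Mean squared error of the fitted line on (x, y).
    return float(np.mean((w * x + b - y) ** 2))

x = np.linspace(-2.0, 2.0, 21)

# Linearly distributed data: the line recovers the true slope and intercept.
w_lin, b_lin = fit_line(x, 2.0 * x + 1.0)

# Nonlinearly distributed data: the best line is flat and fits poorly.
w_sq, b_sq = fit_line(x, x ** 2)

print(w_lin, b_lin, mse(x, 2.0 * x + 1.0, w_lin, b_lin))
print(w_sq, b_sq, mse(x, x ** 2, w_sq, b_sq))
```

On the linear data the error is essentially zero, while on the parabola the best line still leaves a large error.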
The structure of neural network
input layer
activation value
middle layer
output layer
Weight: measures how strongly a neuron is connected to a neuron in the input layer. The stronger the connection, the larger the value.
Activation value: the activation value of the output layer is computed from the input layer. In the simplest case, the activation values of the input layer are multiplied by the weights.
Bias (offset): this parameter can be ignored for now
“Parallel” and “Series” Connection of Neurons
Here, m is the width of the n-th layer of the neural network, and n is the depth of the current network.
From the first layer to the final output, the value of each neuron is determined by the neuron values of the previous layer, the neuron parameters W and b, and the excitation (activation) function. The equation of the k-th neuron in the n-th layer can be expressed by the formula:
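One standard way to write this equation is given below. The symbols are assumptions made for this sketch: f is the excitation function, a_i^{(n-1)} is the value of the i-th neuron in the previous layer, and m_{n-1} is the width of that layer; only W and b are named in the text.

```latex
a_k^{(n)} = f\!\left( \sum_{i=1}^{m_{n-1}} W_{k,i}^{(n)} \, a_i^{(n-1)} + b_k^{(n)} \right)
```

Stacking the neurons of one layer side by side gives the "parallel" connection (width), and feeding each layer's output into the next gives the "series" connection (depth).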
Loss function-Loss
One of the most important factors affecting deep learning performance. It is the direct guidance that the outside world gives to the training of a neural network model
An appropriate loss function can ensure the convergence of the deep learning model
Designing an appropriate loss function is one of the main topics of research work
Softmax function definition and its benefits
normalized exponential function
Convert prediction results to non-negative numbers
The first step of softmax applies the exponential function to the model's predictions, which guarantees that the resulting probabilities are non-negative.
The sum of the probabilities of various predicted outcomes is equal to 1
Each transformed result is then divided by the sum of all transformed results, which can be read as its share of the total. This yields approximate probabilities.
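The two steps above (exponentiate, then normalize) can be sketched as follows. Subtracting the maximum score before exponentiating is an extra numerical-stability step added here; it does not change the result.

```python
import numpy as np

def softmax(scores):
    scores = np.asarray(scores, dtype=float)
    # Added for numerical stability (an assumption of this sketch):
    # shifting by the max leaves the output unchanged but avoids overflow.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)          # step 1: non-negative values
    return exps / np.sum(exps)      # step 2: values sum to 1

probs = softmax([2.0, 1.0, -1.0])
print(probs, probs.sum())
```

The largest score keeps the largest probability, and even negative scores map to small positive probabilities.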
Definition of Cross entropy function and its benefits
Why it can be used as a loss function
Cross entropy can be used as a loss function in neural networks (machine learning): p is the distribution of the true labels, q is the label distribution predicted by the trained model, and the cross entropy measures how similar p and q are.
Another benefit of cross entropy as a loss function is that, with a sigmoid output, gradient descent avoids the slowdown in learning that the mean squared error loss suffers when the sigmoid saturates, because the size of the update is controlled directly by the output error.
Treat p(i) as the true probability distribution and q(i) as the predicted probability distribution. If cross entropy is the loss function, minimizing it makes q(i) gradually approach p(i), which achieves the goal of fitting.
Also used for regression problems whose targets lie in the interval [0, 1], and for generation tasks
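Both points about cross entropy can be checked numerically. The sketch below assumes the usual definition H(p, q) = -sum p(i) log q(i), and for the sigmoid comparison it assumes a single sigmoid output with a squared-error loss 0.5*(a - y)^2 versus binary cross entropy; function names and the example numbers are hypothetical.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_i p(i) * log q(i); q is clipped to avoid log(0).
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(-np.sum(p * np.log(q)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_wrt_z(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)^2: contains the factor a * (1 - a),
    # which is close to 0 when the sigmoid saturates.
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def grad_ce_wrt_z(z, y):
    # d/dz of binary cross entropy with a sigmoid output: just the error a - y.
    return sigmoid(z) - y

p = np.array([0.0, 1.0, 0.0])          # true label distribution (one-hot)
good = np.array([0.1, 0.8, 0.1])       # prediction close to p
bad = np.array([0.6, 0.2, 0.2])        # prediction far from p
print(cross_entropy(p, good), cross_entropy(p, bad))

# A badly wrong, saturated prediction: z = -8 while the target is 1.
print(grad_mse_wrt_z(-8.0, 1.0), grad_ce_wrt_z(-8.0, 1.0))
```

The closer prediction has the lower cross entropy, and in the saturated case the cross entropy gradient stays close to the full error while the squared-error gradient nearly vanishes.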
Custom loss
Emphasize a particular attribute
Single out certain predicted values, or assign them weights of different sizes
Merge multiple losses
Multi-objective training tasks: set a reasonable way to combine the losses (various operations)
neural network fusion
The losses of different neural networks are combined, and the joint loss is used to train and guide the networks.
learning rate
The larger the value, the faster the convergence
The smaller the value, the higher the convergence accuracy
How to choose an appropriate learning rate
Fixed
A fixed learning rate is the simplest configuration and requires only one parameter.
The learning rate stays unchanged throughout optimization. This strategy is rarely used, because as training approaches the global optimum the learning rate should shrink to avoid stepping over it.
step
Reduce the learning rate by a uniform factor, for example to 0.1 times its previous value at each reduction.
This is a very common learning rate schedule. Each reduction multiplies the learning rate by a fixed factor, so the change is discontinuous. It is simple to use and usually works well.
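A step schedule can be written as a one-line function. In this sketch the names step_lr, step_size and gamma, and the choice to reduce every 10 epochs, are assumptions made for illustration; only the factor 0.1 comes from the text.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    # Multiply the learning rate by gamma once every step_size epochs.
    return base_lr * gamma ** (epoch // step_size)

for epoch in (0, 9, 10, 25):
    print(epoch, step_lr(0.1, epoch))
```

A fixed schedule is the special case gamma = 1, where the function returns base_lr for every epoch.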
Adagrad
adaptive learning rate
In AdaGrad, r accumulates the squared gradients, so as iteration continues r grows larger and larger and the effective learning rate grows smaller and smaller. In general, AdaGrad therefore encourages convergence at first and gradually shifts to penalizing it, so progress becomes slower and slower.
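The shrinking learning rate can be observed directly. The sketch below assumes the common form of the update, w <- w - lr * g / (sqrt(r) + eps) with r <- r + g^2, applied to a toy objective f(w) = w^2 chosen here for illustration; the hyperparameter values are arbitrary.

```python
import numpy as np

def adagrad_step(w, grad, r, lr=0.1, eps=1e-8):
    # r accumulates every squared gradient seen so far (never decays).
    r = r + grad ** 2
    w = w - lr * grad / (np.sqrt(r) + eps)
    return w, r

w = np.array([5.0])
r = np.zeros_like(w)
effective_lrs = []
for _ in range(20):
    grad = 2.0 * w                      # gradient of the toy objective w^2
    w, r = adagrad_step(w, grad, r)
    effective_lrs.append(float(0.1 / (np.sqrt(r[0]) + 1e-8)))

print(effective_lrs[0], effective_lrs[-1], w)
```

Because r only grows, the effective learning rate in the list decreases at every iteration.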
RMSprop
Unlike AdaGrad, RMSProp does not bluntly accumulate all squared gradients; it adds a decay coefficient that controls how much historical information is kept.
Put simply, after a global learning rate is set, at each step it is divided, parameter by parameter, by the square root of the decay-weighted sum of squared historical gradients, so each parameter gets its own learning rate.
The effect is greater progress along flatter directions of the parameter space (where the sum of squared historical gradients is smaller, so the learning rate is reduced less) and smoother movement along steep directions, which speeds up training
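A minimal sketch of this update follows, assuming the common form r <- decay * r + (1 - decay) * g^2 and w <- w - lr * g / (sqrt(r) + eps). The two-component gradient (one flat direction, one steep direction) and all hyperparameter values are invented for the example.

```python
import numpy as np

def rmsprop_step(w, grad, r, lr=0.01, decay=0.9, eps=1e-8):
    # r is a decaying average of squared gradients, not a running total.
    r = decay * r + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(r) + eps)
    return w, r

w0 = np.zeros(2)
grad = np.array([0.01, 10.0])           # flat direction, steep direction

# One step: both parameters move by about the same amount,
# even though their gradients differ by a factor of 1000.
w1, r1 = rmsprop_step(w0, grad, np.zeros(2))
step_sizes = np.abs(w1 - w0)

# Many steps with a constant gradient: r levels off near grad**2
# instead of growing without bound as in AdaGrad.
w, r = w0, np.zeros(2)
for _ in range(200):
    w, r = rmsprop_step(w, grad, r)

print(step_sizes, r)
```

Plain gradient descent with the same learning rate would move 1000 times further along the steep direction than along the flat one; here the per-parameter scaling evens the two out.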
momentum
Keep moving along the optimization direction already found. There is no need to search for the direction again, only to fine-tune it.
What is the difference between using momentum and directly increasing the learning rate?
Momentum changes the direction of the update, not just its size, so the search is more accurate.
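The difference from simply raising the learning rate shows up when gradients zig-zag. The sketch below assumes the classical momentum update v <- beta * v - lr * g, w <- w + v, and a made-up gradient sequence whose first component is steady while the second flips sign every step.

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.1, beta=0.9):
    # The velocity v keeps a decaying memory of past update directions.
    v = beta * v - lr * grad
    w = w + v
    return w, v

w = np.zeros(2)
v = np.zeros(2)
for t in range(10):
    # Component 0: consistent gradient. Component 1: oscillating gradient.
    grad = np.array([1.0, 1.0 if t % 2 == 0 else -1.0])
    w, v = momentum_step(w, grad, v)

print(w, v)
```

The velocity builds up along the consistent direction and largely cancels along the oscillating one, whereas a larger learning rate would amplify both components equally.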
overfitting
Overfitting is also called over-learning. Its intuitive symptom is that the algorithm performs well on the training set but poorly on the test set, that is, it generalizes poorly.
Overfitting arises because the training data contains sampling error, and a complex model fits that sampling error during training. Sampling error is the deviation between the sampled set and the overall data set.
The model itself is so complex that it fits the noise in the training set. In this case, choose a simpler model or prune the model
The training samples are too few or unrepresentative. In this case, increase the number or the diversity of samples
Noise in the training samples interferes, and the model fits it. In this case, remove the noisy data or switch to a model that is less sensitive to noise.
solution
Dropout
The difference between Dropout and Pooling
During forward propagation, each neuron's activation is switched off with a certain probability p. This makes the model generalize better, because it cannot rely too heavily on particular local features.
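A forward pass with dropout can be sketched as below. Rescaling the surviving activations by 1 / (1 - p) ("inverted dropout") is an assumption of this sketch rather than something stated above; it keeps the expected activation unchanged so that nothing needs adjusting at test time. The function name and the training flag are hypothetical.

```python
import numpy as np

def dropout_forward(activations, p, rng, training=True):
    # At test time (or with p == 0) activations pass through unchanged.
    if not training or p == 0.0:
        return activations
    # Each activation is dropped independently with probability p.
    keep_mask = rng.random(activations.shape) >= p
    # Inverted-dropout rescaling (an assumption of this sketch).
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(1000)
out = dropout_forward(a, p=0.5, rng=rng)
print((out == 0).mean(), out.mean())
```

Roughly half of the activations are zeroed, and the mean of the output stays close to the mean of the input.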
Regularization
What effect does Regularization have on the parameter w?
What is weight decay, and how is it related to Regularization?
The purpose of L2 regularization is to decay the weights toward smaller values, which reduces overfitting to some extent; this is why L2 regularization is also called weight decay.
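The link between the two names is visible in a single gradient step. The sketch assumes plain gradient descent and a penalty of the form 0.5 * lam * sum(w^2), whose gradient is lam * w; lam, lr and the function names are illustrative choices.

```python
import numpy as np

def l2_penalty(w, lam):
    # Penalty added to the loss: 0.5 * lam * ||w||^2.
    return 0.5 * lam * float(np.sum(w ** 2))

def sgd_step_l2(w, grad_loss, lr=0.1, lam=0.01):
    # The gradient of the penalty is lam * w, so the update equals
    # (1 - lr * lam) * w - lr * grad_loss: the weights decay each step.
    return w - lr * (grad_loss + lam * w)

w = np.array([1.0, -2.0])
# With a zero data gradient, the step only shrinks the weights.
w_new = sgd_step_l2(w, np.zeros(2))
print(w_new, l2_penalty(w, 0.01))
```

Every step multiplies the weights by a factor slightly below 1 before applying the data gradient, which is the "decay" in weight decay.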
Fine-tuning
Most parameters do not need to be updated, so the number of parameters actually trained is greatly reduced.
Freeze part of the convolutional layers of the pre-trained model (usually most of the convolutional layers close to the input, since these layers retain a lot of low-level information), or even freeze no layers at all, and train the remaining convolutional layers (usually those close to the output) and the fully connected layers.
The principle of fine-tuning is to take a known network structure and its known parameters, replace the output layer with our own, and fine-tune the parameters of the last few layers before it. This makes effective use of the strong generalization ability of deep neural networks and removes the need to design a complex model and train it at length, so fine-tuning is a good choice when the amount of data is insufficient.
significance
Stand on the shoulders of giants: a model trained by predecessors is very likely to be stronger than one you build from scratch, so there is no need to reinvent the wheel.
Training cost can be very low: if you use transfer learning by exporting feature vectors, the later training is cheap, puts no pressure on the CPU, and can be done without a deep learning machine.
Suitable for small data sets: when the data set itself is small (thousands of images), it is unrealistic to train a large neural network with tens of millions of parameters from scratch, because the larger the model, the more data it needs, and overfitting cannot be avoided. If you still want the strong feature extraction capabilities of a large neural network, you can only rely on transfer learning.
Transfer learning model
Transfer learning, as the name suggests, transfers the parameters of a trained model (a pre-trained model) to a new model to help train it. Since most data and tasks are related, transfer learning lets us share the learned model parameters (which can also be understood as the knowledge the model has learned) with the new model in some way. This speeds up and improves the new model's learning, so it does not need to learn from scratch as most networks do.