traditional neural network

Review some knowledge points of traditional neural networks for machine learning, including nonlinear activation functions, the concept of gradient, the concept of linear regression, linear regression application scenarios and limitations, the structure of neural networks, etc.

Edited at 2022-11-23 09:35:21

PlotWizard

traditional neural network

nonlinear activation function

sigmoid

advantage

Compress input feature values in a wide range to between 0 and 1, so that the data amplitude can be maintained without major changes in deep networks

Closest to biological neurons in a physical sense

Because its output lies between 0 and 1, this function is suitable for models that output predicted probabilities

shortcoming

When the input is very large or very small, the output is basically constant, that is, the change is very small, which causes the gradient to be close to 0.

Gradients may disappear prematurely, resulting in slower convergence

Exponential operations are relatively time-consuming

The output is not 0-mean, which causes the neurons in the next layer to get the non-0-mean signal output by the previous layer as input. As the network deepens, the distribution trend of the original data will change.
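The saturation behaviour described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `sigmoid` and `sigmoid_grad` are illustrative choices, not part of the original mind map.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks towards 0 as |x| grows (saturation).
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The output always stays strictly between 0 and 1, so its mean cannot be 0, which matches the last shortcoming listed above.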

tanh

advantage

Solves the problem that the output of the Sigmoid function above is not zero-mean

The derivative of the Tanh function ranges from 0 to 1, which is better than the 0 to 0.25 of the sigmoid function, which alleviates the problem of vanishing gradients to a certain extent.

The Tanh function is close to y = x near the origin. When the input activation values are small, it behaves almost linearly, so matrix operations can be applied directly and training is relatively easy.

shortcoming

Similar to the Sigmoid function, the vanishing gradient problem still exists

Observe its two equivalent expressions, namely 2*sigmoid(2x)-1 and (exp(x)-exp(-x))/(exp(x)+exp(-x)). It can be seen that the cost of exponential operations still exists
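As a quick check that the two expressions of Tanh agree, and that its derivative reaches 1 at the origin, here is a small sketch. The helper names are hypothetical; `math.tanh` from the standard library is used as the reference.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x: float) -> float:
    """First form: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_via_exp(x: float) -> float:
    """Second form: (exp(x) - exp(-x)) / (exp(x) + exp(-x))."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, which lies in (0, 1]."""
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_via_sigmoid(x), tanh_via_exp(x), math.tanh(x))
```

Both forms still call `exp`, which is why the cost of exponential operations remains.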

ReLU

advantage

Compared with the Sigmoid and Tanh functions, the ReLU function does not saturate when the input is positive, which alleviates the vanishing gradient problem and makes deep networks trainable.

The calculation is very fast: you only need to check whether the input is greater than 0

The convergence speed is much faster than that of the Sigmoid and Tanh functions

ReLU sets the output of some neurons to 0, which not only makes the network sparse but also reduces the interdependence between parameters, alleviating overfitting to a certain extent

shortcoming

The output of the ReLU function is not zero-mean.

There is a Dead ReLU Problem: some neurons may never be activated, so the corresponding parameters are never updated. The main causes are poor parameter initialization and a learning rate that is too large

When the input is positive, the derivative is 1, so the gradient does not vanish along the chain of derivatives during backpropagation. However, the magnitude of the gradient then depends entirely on the product of the weights, which may lead to the exploding gradient problem

Leaky ReLU

advantage

In response to the Dead ReLU Problem of the ReLU function, the Leaky ReLU function gives the input a very small slope α when the input is negative. This removes the zero gradient for negative inputs and largely alleviates the Dead ReLU Problem

The output of this function ranges from negative infinity to positive infinity, that is, Leaky ReLU expands the range of the ReLU function. The slope α is generally set to a small value, such as 0.01

shortcoming

Theoretically, this function has better effects than the Relu function, but a large amount of practice has proved that its effect is unstable, so there are not many applications of this function in practice.

Because different functions are applied on the positive and negative intervals, it cannot provide consistent relationship predictions for positive and negative input values.
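The difference between ReLU and Leaky ReLU for negative inputs can be sketched as follows. The function names are illustrative, and `alpha=0.01` follows the typical value mentioned above.

```python
def relu(x: float) -> float:
    """ReLU: passes positive inputs through, outputs 0 otherwise."""
    return x if x > 0 else 0.0

def relu_grad(x: float) -> float:
    """Gradient is 1 for positive inputs and 0 otherwise (source of dead neurons)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: negative inputs are scaled by a small slope alpha."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient is 1 for positive inputs and alpha otherwise, never exactly 0."""
    return 1.0 if x > 0 else alpha

for x in (-3.0, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

A neuron whose input stays negative receives a zero gradient under ReLU, but a small nonzero gradient under Leaky ReLU, so its parameters can still be updated.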

The concept of gradient

The gradient is a vector: the directional derivative of a function at a given point reaches its maximum along the direction of the gradient. That is, at this point the function changes fastest along the gradient direction, and the maximum rate of change equals the modulus (magnitude) of the gradient.
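To make the definition concrete, the gradient can be approximated with central differences. This is a minimal sketch; the example function f(x, y) = x² + 3y² and the helper name `numerical_gradient` are assumptions chosen for illustration.

```python
import math

def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad

def f(p):
    x, y = p
    return x ** 2 + 3.0 * y ** 2

# The analytic gradient is (2x, 6y), so at (1, 2) it is (2, 12).
g = numerical_gradient(f, [1.0, 2.0])
# The modulus of the gradient is the largest rate of change at that point.
magnitude = math.sqrt(sum(c * c for c in g))
print(g, magnitude)
```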

The concept of linear regression

Uses a linear relationship to describe the mapping from input to output

Linear regression application scenarios

Network analysis, risk analysis, stock price prediction, weather forecast

Limitations of linear regression

Linear regression can clearly describe the segmentation of linearly distributed data, but is weak in describing nonlinearly distributed data.
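Both the strength and the limitation can be seen with a one-variable least-squares fit. The sketch below uses the closed-form solution for a line y = w·x + b; the function name `fit_line` and the toy data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b with a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

# Linearly distributed data (y = 2x + 1) is recovered exactly.
w_lin, b_lin = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Nonlinear data (y = x^2) is fitted by a flat line that misses the shape.
w_sq, b_sq = fit_line([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])
print(w_lin, b_lin, w_sq, b_sq)
```

On the quadratic data the best line has zero slope, which is the kind of nonlinear relationship that the activation functions above let a neural network capture.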

The structure of neural network

input layer

activation value

middle layer

output layer

Weight: indicates how closely a neuron is connected to a neuron in the input layer. The closer the connection, the greater the value.

Activation value: The activation value of the output layer is computed from the previous layer. In the simplest case, it is the activation value of the input layer multiplied by the weight.

Offset: Don’t worry about this parameter for now

“Parallel” and “Series” Connection of Neurons

Here, m represents the width of the nth layer of neural network, and n is the depth of the current neural network.

From the first layer of neural network to the final output, the value of each neuron is determined by the neuron value of the previous layer, the neuron parameters W, b and the excitation function. The equation of the k-th neuron in the n-th layer can be expressed by the formula:
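The formula itself does not appear in this outline. A common form consistent with the description is a_k(n) = f(Σ_j W_kj(n) · a_j(n-1) + b_k(n)), where f is the excitation function. The sketch below implements that assumed form. The indexing convention `weights[k][j]`, the function names, and the choice of sigmoid as f are all assumptions.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(prev_activations, weights, biases, activation=sigmoid):
    """Compute one layer: neuron k gets f(sum_j W[k][j] * a_prev[j] + b[k])."""
    return [
        activation(sum(w * a for w, a in zip(row, prev_activations)) + b)
        for row, b in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Chain layers "in series"; each layer holds its neurons "in parallel"."""
    a = inputs
    for weights, biases in layers:
        a = layer_forward(a, weights, biases)
    return a

# A tiny 2-2-1 network with hand-picked parameters.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer of width 2
    ([[1.0, 1.0]], [-1.0]),                  # output layer of width 1
]
output = network_forward([1.0, 2.0], layers)
print(output)
```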

Loss function-Loss

One of the most important factors affecting deep learning performance. It is the direct guidance from the external world for training the neural network model

An appropriate loss function can ensure the convergence of the deep learning model

Designing an appropriate loss function is one of the main contents of research work

Softmax function definition and its benefits

normalized exponential function

Convert prediction results to non-negative numbers

The first step of softmax is to apply the exponential function to the model's prediction results, which ensures that the probabilities are non-negative.

The sum of the probabilities of various predicted outcomes is equal to 1

The method is to divide the converted results by the sum of all converted results, which can be understood as the percentage of the converted results in the total. This gives approximate probabilities.
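The two steps above (exponentiate, then divide by the sum) can be sketched as follows. Subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result; it is not mentioned in the original.

```python
import math

def softmax(logits):
    """Exponentiate each score, then normalize so the results sum to 1."""
    m = max(logits)  # shift for numerical stability; the output is unchanged
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Larger scores receive larger probabilities, and negative scores still map to positive probabilities.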

Definition of Cross entropy function and its benefits

Why it can be used as a loss function

Cross entropy can be used as a loss function in neural networks (machine learning). p represents the distribution of real labels, and q is the predicted label distribution of the trained model. The cross entropy loss function can measure the similarity between p and q.

Another benefit of cross entropy as a loss function is that, combined with a sigmoid output, it avoids the learning slowdown that the mean square error loss suffers during gradient descent, because the size of the update is controlled directly by the output error.

Consider p(i) as the real probability distribution and q(i) as the predicted probability distribution. If we use cross entropy as the loss function, minimizing it makes q(i) gradually approach p(i), which achieves the purpose of fitting.

Regression problems with targets in the [0, 1] interval, and generation tasks
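A minimal sketch of cross entropy between a true distribution p and a predicted distribution q, followed by a comparison of the gradients that reach a sigmoid output under cross entropy and under mean square error. The function names and the clipping constant `eps` are assumptions; the gradient formulas are the standard ones for a single sigmoid output a = sigmoid(z) with target y.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p(i) * log q(i); q is clipped to avoid log(0)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def ce_grad_wrt_z(z: float, y: float) -> float:
    """Binary cross entropy with sigmoid output: dL/dz = a - y."""
    return sigmoid(z) - y

def mse_grad_wrt_z(z: float, y: float) -> float:
    """Loss 0.5 * (a - y)^2 with sigmoid output: dL/dz = (a - y) * a * (1 - a)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

p = [0.0, 1.0, 0.0]
print(cross_entropy(p, [0.05, 0.9, 0.05]))  # good prediction, small loss
print(cross_entropy(p, [0.6, 0.2, 0.2]))    # poor prediction, larger loss

# A badly wrong, saturated output (z = -5 while the target is 1):
print(ce_grad_wrt_z(-5.0, 1.0), mse_grad_wrt_z(-5.0, 1.0))
```

With cross entropy the gradient is proportional to the output error, while with mean square error it is further multiplied by the sigmoid derivative, which is tiny when the output is saturated.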

custom loss

Emphasize a certain attribute

Take out certain predicted values individually or assign parameters of different sizes

Merge multiple losses

Multi-objective training tasks, setting reasonable loss combination methods (various operations)

neural network fusion

Different neural network losses are combined, and the common loss is used to train and guide the network.

learning rate

The larger the value, the faster the convergence speed.

The smaller the value, the higher the convergence accuracy.

How to choose an appropriate learning rate

Fixed

Fixed, that is, fixed learning rate, is the simplest configuration and requires only one parameter.

The learning rate remains unchanged during the entire optimization process. This is a very rarely used strategy, because as it approaches the global optimal point, the learning rate should become smaller and smaller to avoid skipping the optimal point.

step

Use a uniform reduction method, for example, each reduction is 0.1 times the original value.

This is a very commonly used learning rate iteration strategy. Each time the learning rate is reduced to a certain multiple of the original, it is a discontinuous transformation. It is simple to use and usually has good results.
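A step schedule can be sketched in a few lines. The parameter names `step_size` (how many epochs between reductions) and `gamma` (the multiple applied at each reduction) are assumptions chosen for illustration, with `gamma=0.1` matching the example above.

```python
def step_decay(base_lr: float, epoch: int, step_size: int = 10, gamma: float = 0.1) -> float:
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

# The rate stays constant within a step, then drops discontinuously.
for epoch in (0, 9, 10, 19, 20):
    print(epoch, step_decay(0.1, epoch))
```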

Adagrad

adaptive learning rate

It can be seen from the AdaGrad algorithm that as the algorithm continues to iterate, the accumulated squared gradient r becomes larger and larger, and the overall learning rate becomes smaller and smaller. Therefore, generally speaking, the AdaGrad algorithm starts with incentive convergence, then slowly turns into penalty convergence, and its speed becomes slower and slower.
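The shrinking learning rate can be observed in a small sketch of the AdaGrad update, where r accumulates squared gradients per parameter. The function name, the default values, and the test objective f(x) = x² are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, r, lr=0.1, eps=1e-8):
    """One AdaGrad update: r accumulates g^2, and each step is scaled by 1/sqrt(r)."""
    new_r = [ri + g * g for ri, g in zip(r, grads)]
    new_params = [
        p - lr * g / (math.sqrt(ri) + eps)
        for p, g, ri in zip(params, grads, new_r)
    ]
    return new_params, new_r

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
x, r = [5.0], [0.0]
effective_lrs = []
for _ in range(5):
    grads = [2.0 * x[0]]
    x, r = adagrad_step(x, grads, r)
    effective_lrs.append(0.1 / (math.sqrt(r[0]) + 1e-8))
print(x, effective_lrs)
```

Because r only grows, the effective learning rate in `effective_lrs` decreases at every iteration.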

RMSprop

The RMSProp algorithm does not bluntly accumulate squared gradients like the AdaGrad algorithm, but adds a decay coefficient to control how much historical information is retained.

To put it simply, after the global learning rate is set, at each step the global learning rate is divided, parameter by parameter, by the square root of the decayed sum of squared historical gradients, so that each parameter has a different learning rate.

The effect is that greater progress is made in the flatter directions of the parameter space (because they are flatter, the sum of squared historical gradients is smaller, so the learning rate is reduced less), while steep directions are smoothed out, thereby speeding up training
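A minimal sketch of the RMSProp update, assuming a decay coefficient of 0.9 and illustrative function and parameter names. The example uses one steep direction (gradient 10) and one flat direction (gradient 0.1) to show the per-parameter scaling.

```python
import math

def rmsprop_step(params, grads, r, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update: r is a decayed average of g^2 instead of a plain sum."""
    new_r = [decay * ri + (1.0 - decay) * g * g for ri, g in zip(r, grads)]
    new_params = [
        p - lr * g / (math.sqrt(ri) + eps)
        for p, g, ri in zip(params, grads, new_r)
    ]
    return new_params, new_r

params, r = [0.0, 0.0], [0.0, 0.0]
grads = [10.0, 0.1]  # steep direction, flat direction
new_params, r = rmsprop_step(params, grads, r)
step_steep = abs(new_params[0] - params[0])
step_flat = abs(new_params[1] - params[1])
print(step_steep, step_flat)
```

Although the raw gradients differ by a factor of 100, the two steps are almost the same size: the steep direction is damped and the flat direction makes relatively greater progress.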

momentum

Go along the optimization direction that has been obtained. There is no need to re-find the direction, just fine-tuning.

What is the difference between using momentum and directly increasing the learning rate?

The direction is different and the search is more accurate.
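One common formulation of momentum keeps a velocity that accumulates past gradients, so the update continues along the direction already found. This is a sketch under that assumption; the coefficient `beta=0.9` and the names used are illustrative.

```python
def momentum_step(params, grads, velocity, lr=0.1, beta=0.9):
    """Classical momentum: v = beta * v - lr * g, then p = p + v."""
    new_velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# With a constant gradient of 1, the steps grow: 0.1, 0.19, 0.271.
x, v = [0.0], [0.0]
for _ in range(3):
    x, v = momentum_step(x, [1.0], v)
print(x, v)
```

Plain gradient descent with the same learning rate would have moved only 0.3 in three steps. Unlike simply raising the learning rate, the velocity averages recent gradients, so components that keep changing sign cancel out while consistent components build up.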

overfitting

Over-fitting is also called over-learning. Its intuitive manifestation is that the algorithm performs well on the training set, but does not perform well on the test set, resulting in poor generalization performance.

Overfitting is caused by the fact that the training data contains sampling errors during the model parameter fitting process, and the complex model also fits the sampling errors during training. The so-called sampling error refers to the deviation between the sample set obtained by sampling and the overall data set.

The model itself is so complex that it fits the noise in the training sample set. At this time, you need to choose a simpler model or prune the model

The training samples are too few or lack representativeness. At this time, it is necessary to increase the number of samples or increase the diversity of samples

The interference of training sample noise causes the model to fit these noises. In this case, it is necessary to eliminate the noisy data or switch to a model that is not sensitive to noise.

solution

Dropout

The difference between Dropout and Pooling

During forward propagation, we let the activation value of a certain neuron stop working with a certain probability p, which can make the model more generalizable because it will not rely too much on certain local features.
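The forward pass described above can be sketched as follows. Here p is the probability of dropping a neuron, as in the text. Rescaling the surviving activations by 1/(1-p) ("inverted dropout") is an added assumption that keeps the expected activation unchanged; the original does not mention it. The function name and the `rng` parameter are illustrative.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Zero each activation with probability p (requires p < 1); scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded generator so the run is repeatable
out = dropout([1.0] * 10, p=0.5, rng=rng)
print(out)
```

Each forward pass drops a different random subset, so the network cannot rely too heavily on any single neuron.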

Regularization

What effect does Regularization have on the parameter w?

What is weight decay, and how is it related to Regularization?

The purpose of L2 regularization is to decay the weights toward smaller values and reduce model overfitting to a certain extent, so L2 regularization is also called weight decay.
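The effect on the parameter w can be seen in a single gradient step. The sketch assumes the penalty (λ/2)·‖w‖² is added to the loss, whose gradient is λ·w; the function name and the values of `lr` and `weight_decay` are illustrative.

```python
def sgd_step_l2(weights, grads, lr=0.1, weight_decay=0.01):
    """SGD step on loss + (weight_decay / 2) * ||w||^2; the penalty adds weight_decay * w."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

# Even with a zero data gradient, each weight shrinks by a factor (1 - lr * weight_decay).
w = sgd_step_l2([2.0, -4.0], [0.0, 0.0])
print(w)
```

Every step multiplies the weights by a factor slightly below 1 before applying the data gradient, which is why the technique is called weight decay.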

Fine-tuning

Most parameters do not need to be updated, and the actual parameters are greatly reduced.

Freeze part of the convolutional layers of the pre-trained model (usually the majority of the convolutional layers close to the input, since these layers retain a lot of low-level information), or even freeze no layers at all, and train the remaining convolutional layers (usually the ones close to the output) and the fully connected layers.

The principle of fine-tuning is to take a known network structure and known network parameters, replace the output layer with our own, and fine-tune the parameters of several layers before the last layer. This effectively utilizes the powerful generalization capabilities of deep neural networks and eliminates the need to design complex models and run time-consuming training, so fine-tuning is a more suitable choice when the amount of data is insufficient.

significance

Stand on the shoulders of giants: There is a high probability that the model trained by predecessors will be stronger than the model you build from scratch. There is no need to reinvent the wheel.

The training cost can be very low: If you use the method of deriving feature vectors for transfer learning, the later training cost is very low, there is no pressure on the CPU, and it can be done without a deep learning machine.

Suitable for small data sets: For situations where the data set itself is small (thousands of images), it is unrealistic to train a large neural network with tens of millions of parameters from scratch, because the larger the model, the greater the data volume requirements, and overfitting cannot be avoided. At this time, if you still want to use the super feature extraction capabilities of large neural networks, you can only rely on transfer learning.

transfer learning

Transfer learning, as the name suggests, transfers the parameters of a trained model (pre-trained model) to a new model to help the new model train. Considering that most data or tasks are related, through transfer learning we can share the learned model parameters (which can also be understood as the knowledge learned by the model) with the new model in some way, speeding up and optimizing the model's learning so that it does not need to learn from scratch as most networks do.