MindMap Gallery: Fully Connected Neural Network
A detailed introduction to fully connected neural networks for classification. Fully connected neural networks cascade multiple transformations to achieve an input-to-output mapping. They are composed of an input layer, an output layer, and multiple hidden layers.
Edited at 2023-07-27 22:52:26
fully connected neural network
definition
Fully connected neural networks cascade multiple transformations to achieve input-to-output mapping.
Two-layer fully connected network
Comparison
linear classifier
W can be regarded as a template, and the number of templates is determined by the number of categories.
Fully connected
W1 can also be regarded as a template
W2 combines the matching results of multiple templates to achieve the final category scoring
nonlinear
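As a concrete illustration of the two-layer network described above, here is a minimal NumPy sketch; the layer sizes, the ReLU nonlinearity, and the 0.01 weight scale are illustrative assumptions, not values taken from the outline.

```python
import numpy as np

# Minimal sketch of a two-layer fully connected classifier:
# f(x) = W2 * max(0, W1 x + b1) + b2
rng = np.random.default_rng(0)
D, H, C = 3072, 100, 10          # input dim, hidden units, number of classes (illustrative)

W1, b1 = 0.01 * rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = 0.01 * rng.standard_normal((C, H)), np.zeros(C)

x = rng.standard_normal(D)       # one input vector
h = np.maximum(0, W1 @ x + b1)   # hidden layer: template matching + nonlinearity
scores = W2 @ h + b2             # output layer: combines template responses into class scores
print(scores.shape)              # (10,)
```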
composition
An input layer, an output layer and multiple hidden layers
activation function
Commonly used activation functions
Sigmoid
ReLU
Tanh
Leaky ReLU
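A minimal NumPy sketch of the four activation functions listed above; the Leaky ReLU slope alpha=0.01 is a common choice assumed here.

```python
import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # zeroes out negative inputs
    return np.maximum(0, x)

def tanh(x):
    # squashes inputs into (-1, 1)
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    # keeps a small slope for negative inputs
    return np.where(x > 0, x, alpha * x)
```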
Network structure design
The greater the number of neurons, the more complex the decision boundary, and the stronger the classification ability on the training set.
The complexity of the neural network model should match the difficulty of the classification task: the harder the task, the deeper and wider the network should be, while guarding against overfitting.
Softmax and cross-entropy loss
softmax
Normalize the output results
Convert output results into probabilities
Cross-entropy loss
Measures the difference between the predicted distribution and the true distribution (one-hot encoding); related to KL divergence
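A small NumPy sketch of softmax followed by cross-entropy against a one-hot target; the example scores and the max-subtraction stability trick are illustrative.

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability, then normalize into probabilities
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(probs, target_index):
    # negative log-probability of the true class (one-hot target)
    return -np.log(probs[target_index])

scores = np.array([3.2, 5.1, -1.7])
probs = softmax(scores)
loss = cross_entropy(probs, target_index=0)   # true class assumed to be class 0
```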
optimization
Computational graph
step
Any complex function can be expressed in the form of a computational graph
In the computational graph, each gate unit receives some inputs and can compute two things:
The output value of the gate
The local gradient of its output with respect to each input
Using the chain rule, the gate unit multiplies the gradient flowing back from above by these local gradients to obtain the gradient of the network's output with respect to each of its inputs.
Common gate units
Addition gate
Multiplication gate
Copy gate
Max gate
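A toy computational graph, f = (x + y) * z, showing how an addition gate and a multiplication gate compute their outputs in the forward pass and apply the chain rule in the backward pass; the input values are arbitrary.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass: each gate computes its output.
q = x + y          # addition gate
f = q * z          # multiplication gate

# Backward pass: each gate multiplies the gradient from above by its local gradient.
df_df = 1.0
df_dq = z * df_df          # multiplication gate: local gradient wrt q is z
df_dz = q * df_df          # multiplication gate: local gradient wrt z is q
df_dx = 1.0 * df_dq        # addition gate distributes the incoming gradient unchanged
df_dy = 1.0 * df_dq
print(df_dx, df_dy, df_dz) # -4.0 -4.0 3.0
```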
Problems
Gradient vanishing
Due to the multiplicative property of the chain rule
Gradient explosion
Due to the multiplicative property of the chain rule
Solutions
Use an appropriate activation function
Momentum method
Reduces the step size in the oscillating direction
Advantages
Can break out of saddle points in high-dimensional spaces
Can break out of local optima and saddle points
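A minimal sketch of an SGD-with-momentum update; the learning rate and the momentum coefficient mu=0.9 are typical values assumed here, not specified in the outline.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=1e-2, mu=0.9):
    # the velocity accumulates past gradients, damping the oscillating direction
    v = mu * v - lr * grad
    w = w + v
    return w, v

w = np.zeros(5)
v = np.zeros_like(w)
grad = np.ones_like(w)          # placeholder gradient
w, v = sgd_momentum_step(w, grad, v)
```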
adaptive gradient method
Reduce the step size in the oscillation direction and increase the step size in the flat direction.
A large squared gradient magnitude indicates an oscillating direction
A small squared gradient magnitude indicates a flat direction
RMSProp method
ADAM
A combination of the momentum method and the adaptive gradient method, but bias correction is needed to avoid overly slow updates during the cold start.
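A sketch of one Adam update combining the momentum (first-moment) and adaptive (second-moment) terms with bias correction for the cold start; the hyperparameter values shown are the common defaults, assumed here rather than taken from the outline.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # momentum (first-moment) term
    v = beta2 * v + (1 - beta2) * grad**2         # adaptive (squared-gradient) term
    m_hat = m / (1 - beta1**t)                    # bias correction for the cold start
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(5); m = np.zeros_like(w); v = np.zeros_like(w)
w, m, v = adam_step(w, np.ones_like(w), m, v, t=1)
```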
Summary
SGD with momentum usually gives the best results, but requires manual tuning
Adam is easy to use, but may not reach the very best optimum
Weight initialization
all-zero initialization
Not effective: all neurons start out identical and receive identical updates, so they never learn different features
random initialization
Use Gaussian distribution
In deep networks there is a high probability that the gradients and the information flow will vanish.
Xavier initialization
The variance of the activation values of neurons in each layer is basically the same.
summary
A good initialization method prevents information from vanishing during forward propagation and also mitigates gradient vanishing during backpropagation.
When using hyperbolic tangent or Sigmoid as the activation function, the Xavier initialization method is recommended.
When using ReLU or Leaky ReLU as the activation function, the He initialization method is recommended.
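A sketch of Xavier and He initialization for a fully connected layer of shape (fan_out, fan_in); the sqrt(1/fan_in) and sqrt(2/fan_in) scalings shown are one common variant of each scheme, and the layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # keeps activation variance roughly constant for tanh / Sigmoid layers
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(1.0 / fan_in)

def he_init(fan_in, fan_out):
    # extra factor of 2 compensates for ReLU zeroing half of the inputs
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

W_tanh = xavier_init(512, 256)   # use with tanh / Sigmoid
W_relu = he_init(512, 256)       # use with ReLU / Leaky ReLU
```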
batch normalization
Also called the BN layer
method
Adjust the distribution of the activations so that each layer's inputs and outputs keep a consistent distribution
Normalize the output y over each mini-batch: subtract the batch mean and divide by the batch standard deviation
The mean and variance of the resulting distribution are then set by learnable parameters, determined by their contribution to classification
benefit
Alleviates signal vanishing in the forward pass and gradient vanishing in the backward pass
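A minimal sketch of the batch-normalization forward pass at training time, with gamma and beta as the learnable scale and shift parameters; the epsilon value and batch shape are illustrative.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                       # per-feature mean over the mini-batch
    var = x.var(axis=0)                         # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)     # normalize: zero mean, unit variance
    return gamma * x_hat + beta                 # restore a learnable mean and variance

x = np.random.default_rng(0).standard_normal((32, 100))   # batch of 32, 100 features
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
```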
Overfitting and underfitting
overfitting
When the error keeps decreasing on the training set but begins to rise on the validation set, the model starts to overfit.
The selected model contains too many parameters, so it predicts known data well but unknown data poorly.
Usually the training data is memorized rather than the underlying features being learned.
solution
Get more training data
Limit the amount of information the model is allowed to store, or constrain what it is allowed to store (regularization)
Adjust model size
Constrain the model weights (weight regularization)
Random deactivation (dropout)
Deactivate hidden-layer neurons with a certain probability
Implementation
During training, applying dropout to a layer means randomly discarding some of its outputs; the dropped neurons behave as if they had been removed from the network.
Dropout rate
The proportion of features set to 0, usually in the range 0.2-0.5
Can be regarded as an ensemble of many small networks
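A sketch of inverted dropout at training time; scaling the surviving activations by 1/(1-p) is one common way to keep the expected output unchanged at test time, and the example shapes are illustrative.

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True):
    if not training:
        return x                                  # no dropout at test time
    # zero out each unit with probability p, rescale survivors by 1/(1-p)
    mask = (np.random.default_rng().random(x.shape) >= p) / (1.0 - p)
    return x * mask

h = np.ones((4, 8))                               # pretend hidden-layer activations
h_dropped = dropout_forward(h, p=0.5)
```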
Underfitting
The model's capacity is too weak to learn the patterns in the data well.
Usually the model is too simple
Hyperparameter tuning
learning rate
Much too large
Fails to converge
Slightly too large
Oscillates near the minimum and cannot reach the optimum
Too small
Converges slowly
Moderate
Converges quickly with good results
optimization
grid search method
Each hyperparameter takes several values, and these hyperparameters are combined to form multiple sets of hyperparameters.
Evaluate model performance for each set of hyperparameters on the validation set
Select the set of values used by the best-performing model as the final hyperparameter values.
Random search method
Randomly select points in the parameter space, each point corresponds to a set of hyperparameters
Evaluate model performance for each set of hyperparameters on the validation set
Select the set of values used by the model with the best performance as the final hyperparameter values.
Generally, random sampling is done in log space.
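A sketch of random hyperparameter search with the learning rate and regularization strength sampled in log space; evaluate(), the sampling ranges, and the number of trials are placeholders for an actual training-and-validation run.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(lr, reg):
    # placeholder: train the model with (lr, reg) and return validation accuracy
    return -((np.log10(lr) + 3) ** 2) - ((np.log10(reg) + 4) ** 2)

best = None
for _ in range(20):
    lr = 10 ** rng.uniform(-6, -1)        # sample the learning rate in log space
    reg = 10 ** rng.uniform(-6, -2)       # sample the regularization strength in log space
    score = evaluate(lr, reg)
    if best is None or score > best[0]:
        best = (score, lr, reg)

print("best hyperparameters:", best[1:])
```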