MindMap Gallery: Fully Connected Neural Network
A detailed introduction to fully connected neural networks for classification. Fully connected neural networks cascade multiple transformations to achieve an input-to-output mapping. They are composed of an input layer, an output layer, and multiple hidden layers.
Edited at 2023-07-27 22:52:26
fully connected neural network
definition
Fully connected neural networks cascade multiple transformations to achieve input-to-output mapping.
Two-layer fully connected network
Comparison
linear classifier
W can be regarded as a template, and the number of templates is determined by the number of categories.
Fully connected
W1 can also be regarded as a template
W2 combines the matching results of multiple templates to achieve the final category scoring
nonlinear
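As a concrete illustration of the two-layer network described above, here is a minimal NumPy sketch; the layer sizes, the ReLU nonlinearity, and the 0.01 weight scale are illustrative assumptions, not values taken from the outline.

```python
import numpy as np

# Minimal sketch of a two-layer fully connected classifier:
# f(x) = W2 * max(0, W1 x + b1) + b2
rng = np.random.default_rng(0)
D, H, C = 3072, 100, 10          # input dim, hidden units, number of classes (illustrative)

W1, b1 = 0.01 * rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = 0.01 * rng.standard_normal((C, H)), np.zeros(C)

x = rng.standard_normal(D)       # one input vector
h = np.maximum(0, W1 @ x + b1)   # hidden layer: template matching + nonlinearity
scores = W2 @ h + b2             # output layer: combines template responses into class scores
print(scores.shape)              # (10,)
```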
composition
An input layer, an output layer and multiple hidden layers
activation function
Commonly used activation functions
Sigmoid
ReLU
Tanh
Leaky ReLU
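A minimal NumPy sketch of the four activation functions listed above; the Leaky ReLU slope alpha=0.01 is a common choice assumed here.

```python
import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # zeroes out negative inputs
    return np.maximum(0, x)

def tanh(x):
    # squashes inputs into (-1, 1)
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    # keeps a small slope for negative inputs
    return np.where(x > 0, x, alpha * x)
```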
Network structure design
The greater the number of neurons, the more complex the decision boundary, and the stronger the classification ability on the training set.
The complexity of the neural network model should match the difficulty of the classification task: the harder the task, the deeper and wider the network should be, while guarding against overfitting.
Softmax and cross-entropy loss
softmax
Normalize the output results
Convert output results into probabilities
Cross-entropy loss
Measures the difference between the predicted distribution and the true distribution (one-hot encoding); related to KL divergence
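A small NumPy sketch of softmax followed by cross-entropy against a one-hot target; the example scores and the max-subtraction stability trick are illustrative.

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability, then normalize into probabilities
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(probs, target_index):
    # negative log-probability of the true class (one-hot target)
    return -np.log(probs[target_index])

scores = np.array([3.2, 5.1, -1.7])
probs = softmax(scores)
loss = cross_entropy(probs, target_index=0)   # true class assumed to be class 0
```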
optimization
Computational graph
step
Any complex function can be expressed in the form of a computational graph
In the computational graph, each gate unit receives some inputs and can compute two things:
The output value of the gate
The local gradient of its output with respect to each input
Using the chain rule, the gate unit multiplies the gradient flowing back from above by these local gradients to obtain the gradient of the network's output with respect to each of its inputs.
Common gate units
Addition gate
Multiplication gate
Copy gate
Max gate
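A toy computational graph, f = (x + y) * z, showing how an addition gate and a multiplication gate compute their outputs in the forward pass and apply the chain rule in the backward pass; the input values are arbitrary.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass: each gate computes its output.
q = x + y          # addition gate
f = q * z          # multiplication gate

# Backward pass: each gate multiplies the gradient from above by its local gradient.
df_df = 1.0
df_dq = z * df_df          # multiplication gate: local gradient wrt q is z
df_dz = q * df_df          # multiplication gate: local gradient wrt z is q
df_dx = 1.0 * df_dq        # addition gate distributes the incoming gradient unchanged
df_dy = 1.0 * df_dq
print(df_dx, df_dy, df_dz) # -4.0 -4.0 3.0
```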
Problems
Gradient vanishing
Due to the multiplicative property of the chain rule
Gradient explosion
Due to the multiplicative property of the chain rule
Solutions
Use an appropriate activation function
Momentum method
Reduces the step size in the oscillating direction
Advantages
Can break out of saddle points in high-dimensional spaces
Can break out of local optima and saddle points
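A minimal sketch of an SGD-with-momentum update; the learning rate and the momentum coefficient mu=0.9 are typical values assumed here, not specified in the outline.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=1e-2, mu=0.9):
    # the velocity accumulates past gradients, damping the oscillating direction
    v = mu * v - lr * grad
    w = w + v
    return w, v

w = np.zeros(5)
v = np.zeros_like(w)
grad = np.ones_like(w)          # placeholder gradient
w, v = sgd_momentum_step(w, grad, v)
```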
adaptive gradient method
Reduce the step size in the oscillation direction and increase the step size in the flat direction.
A large squared gradient magnitude indicates an oscillating direction
A small squared gradient magnitude indicates a flat direction
RMSProp method
ADAM
A combination of the momentum method and the adaptive gradient method, but bias correction is needed to avoid overly slow updates during the cold start.
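A sketch of one Adam update combining the momentum (first-moment) and adaptive (second-moment) terms with bias correction for the cold start; the hyperparameter values shown are the common defaults, assumed here rather than taken from the outline.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # momentum (first-moment) term
    v = beta2 * v + (1 - beta2) * grad**2         # adaptive (squared-gradient) term
    m_hat = m / (1 - beta1**t)                    # bias correction for the cold start
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(5); m = np.zeros_like(w); v = np.zeros_like(w)
w, m, v = adam_step(w, np.ones_like(w), m, v, t=1)
```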
Summary
SGD with momentum usually gives the best results, but requires manual tuning
Adam is easy to use, but may not reach the very best optimum
Weight initialization
all-zero initialization
Not effective: all neurons start out identical and receive identical updates, so they never learn different features
random initialization
Use Gaussian distribution
In deep networks there is a high probability that the gradients and the information flow will vanish.
Xavier initialization
The variance of the activation values of neurons in each layer is basically the same.
summary
A good initialization method prevents information from vanishing during forward propagation and also mitigates gradient vanishing during backpropagation.
When using hyperbolic tangent or Sigmoid as the activation function, the Xavier initialization method is recommended.
When using ReLU or Leaky ReLU as the activation function, the He initialization method is recommended.
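A sketch of Xavier and He initialization for a fully connected layer of shape (fan_out, fan_in); the sqrt(1/fan_in) and sqrt(2/fan_in) scalings shown are one common variant of each scheme, and the layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # keeps activation variance roughly constant for tanh / Sigmoid layers
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(1.0 / fan_in)

def he_init(fan_in, fan_out):
    # extra factor of 2 compensates for ReLU zeroing half of the inputs
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

W_tanh = xavier_init(512, 256)   # use with tanh / Sigmoid
W_relu = he_init(512, 256)       # use with ReLU / Leaky ReLU
```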
batch normalization
Also called the BN layer
method
Adjust the distribution of the activations so that each layer's inputs and outputs keep a consistent distribution
Normalize the output y over each mini-batch: subtract the batch mean and divide by the batch standard deviation
The mean and variance of the resulting distribution are then set by learnable parameters, determined by their contribution to classification
benefit
Alleviates signal vanishing in the forward pass and gradient vanishing in the backward pass
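A minimal sketch of the batch-normalization forward pass at training time, with gamma and beta as the learnable scale and shift parameters; the epsilon value and batch shape are illustrative.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                       # per-feature mean over the mini-batch
    var = x.var(axis=0)                         # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)     # normalize: zero mean, unit variance
    return gamma * x_hat + beta                 # restore a learnable mean and variance

x = np.random.default_rng(0).standard_normal((32, 100))   # batch of 32, 100 features
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
```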
Overfitting and underfitting
overfitting
When the error keeps decreasing on the training set but begins to rise on the validation set, the model starts to overfit.
The selected model contains too many parameters, so it predicts known data well but unknown data poorly.
Usually the training data is memorized rather than the underlying features being learned.
solution
Get more training data
Limit the amount of information the model is allowed to store, or constrain what it is allowed to store (regularization)
Adjust model size
Constrain the model weights (weight regularization)
Random deactivation (dropout)
Deactivate hidden-layer neurons with a certain probability
Implementation
During training, applying dropout to a layer means randomly discarding some of its outputs; the dropped neurons behave as if they had been removed from the network.
Dropout rate
The proportion of features set to 0, usually in the range 0.2-0.5
Can be regarded as an ensemble of many small networks
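A sketch of inverted dropout at training time; scaling the surviving activations by 1/(1-p) is one common way to keep the expected output unchanged at test time, and the example shapes are illustrative.

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True):
    if not training:
        return x                                  # no dropout at test time
    # zero out each unit with probability p, rescale survivors by 1/(1-p)
    mask = (np.random.default_rng().random(x.shape) >= p) / (1.0 - p)
    return x * mask

h = np.ones((4, 8))                               # pretend hidden-layer activations
h_dropped = dropout_forward(h, p=0.5)
```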
Underfitting
The model's capacity is too weak to learn the patterns in the data well.
Usually the model is too simple
Hyperparameter tuning
learning rate
Much too large
Fails to converge
Slightly too large
Oscillates near the minimum and cannot reach the optimum
Too small
Converges slowly
Moderate
Converges quickly with good results
optimization
grid search method
Each hyperparameter takes several values, and these hyperparameters are combined to form multiple sets of hyperparameters.
Evaluate model performance for each set of hyperparameters on the validation set
Select the set of values used by the best-performing model as the final hyperparameter values.
Random search method
Randomly select points in the parameter space, each point corresponds to a set of hyperparameters
Evaluate model performance for each set of hyperparameters on the validation set
Select the set of values used by the model with the best performance as the final hyperparameter values.
Generally, random sampling is done in log space.
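A sketch of random hyperparameter search with the learning rate and regularization strength sampled in log space; evaluate(), the sampling ranges, and the number of trials are placeholders for an actual training-and-validation run.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(lr, reg):
    # placeholder: train the model with (lr, reg) and return validation accuracy
    return -((np.log10(lr) + 3) ** 2) - ((np.log10(reg) + 4) ** 2)

best = None
for _ in range(20):
    lr = 10 ** rng.uniform(-6, -1)        # sample the learning rate in log space
    reg = 10 ** rng.uniform(-6, -2)       # sample the regularization strength in log space
    score = evaluate(lr, reg)
    if best is None or score > best[0]:
        best = (score, lr, reg)

print("best hyperparameters:", best[1:])
```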