MindMap Gallery Transformer and sequence-to-sequence (Seq2Seq)
A more detailed explanation of the Transformer and Seq2Seq. Note that it only covers the NLP Transformer, not the Vision Transformer (ViT).
Edited at 2023-07-30 10:32:52
Transformer and Sequence-to-Sequence (Seq2Seq)
sequence to sequence
Sequence-to-sequence (or Seq2Seq) is a neural network that converts a given sequence of elements (such as a sequence of words in a sentence) into another sequence
Seq2Seq models are particularly good at translation, i.e. converting a sequence of words in one language into a different sequence of words in another language
A very basic choice is to use one long short-term memory (LSTM) network for the encoder and another for the decoder
LSTM module
A recurrent network module (keep this in mind)
It processes the elements of a sequence in order, building up a meaning for the sequence while remembering the parts it deems important and forgetting the parts it does not
For example, sentences are order-dependent: the order of words is crucial to understanding them, so an LSTM is a natural choice for this type of data
The encoder takes the input sequence and maps it into a higher dimensional space (n-dimensional vectors). This abstract vector is fed into the decoder, which converts it into an output sequence. The output sequence can be another language, notation, a copy of the input, etc.
Note that "I am very happy" and "I am very happy" are both sequences. Now seq2seq needs to be converted into Chinese and English.
To put it more vividly, after the encoder encodes the sentence "I am very happy" to 1000110110, the decoder can convert this string of numbers into "I am very happy"
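A minimal sketch of such an LSTM encoder-decoder in PyTorch (the vocabulary sizes, dimensions and random token ids below are illustrative assumptions, not part of the mind map):

```python
# Minimal LSTM-based Seq2Seq sketch: an encoder LSTM compresses the source
# sentence into a state, and a decoder LSTM generates the target from it.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder maps the source sequence into its final (h, c) state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder starts from that state and emits one logit vector per target token.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)                 # (batch, tgt_len, tgt_vocab)

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 7))             # tokenized source sentences
tgt = torch.randint(0, 1000, (2, 5))             # target tokens (teacher forcing)
print(model(src, tgt).shape)                     # torch.Size([2, 5, 1000])
```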
Attention (attention mechanism)
The attention mechanism looks at the input sequence and decides at each step which other parts of the sequence are important (similar to how humans read)
For example, when reading this article you focus on the word you are currently reading, but your brain still retains the important keywords from the text to provide context
For example, "The price of a shirt is 9 yuan and 25". You will think that "shirt" and "9 yuan and 25" are more important. This is the result of paying attention to several keywords in this sentence.
For a given sequence, the attention mechanism knows which words are the key information in the sentence, then the translation becomes very simple. Even if a small part of the information is ignored, the general meaning will be the same.
Summary
Every time the LSTM (encoder) reads an input, the attention mechanism considers several other inputs at the same time and assigns them different weights depending on the situation, deciding which inputs matter
The decoder then takes as input both the encoded sentence and the weights provided by the attention mechanism
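A rough sketch of the idea in NumPy: dot-product attention scores each encoder state against the current decoder state and turns the scores into weights (all shapes and values below are made up for illustration):

```python
# Dot-product attention: weight the encoder states by their relevance
# to the current decoder state, then take the weighted sum as context.
import numpy as np

def attention(query, keys, values):
    # query: (d,) current decoder state; keys/values: (seq_len, d) encoder states
    scores = keys @ query                       # one score per input position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()           # softmax: "how important is each word?"
    context = weights @ values                  # weighted sum of the encoder states
    return context, weights

enc_states = np.random.randn(4, 8)              # 4 input words, hidden size 8
dec_state = np.random.randn(8)
context, weights = attention(dec_state, enc_states, enc_states)
print(weights)                                  # weights sum to 1 over the 4 input words
```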
Transformer
The paper "Attention Is All You Need" describes Transformer
Compared with previous work, it does not use any recurrent network; the attention mechanism alone is enough to improve results on translation tasks
Explanation
In the architecture figure, the encoder is on the left and the decoder is on the right
Both are composed of stackable modules
Inputs first pass through the pink "Input Embedding" box, which splits the sequence into tokens (phrases) and maps each token to a number
For example, "I am a handsome guy" splits into three tokens ("I" / "am" / "handsome guy" in the original example); the tokens map to 0, 1, 2, so the input fed into the network is "0 1 2"
The white Tai Chi (yin-yang) symbol in the figure is the positional embedding: the input tokens arrive in a definite order, so each token must also be given its position information
For example, for "I am a handsome guy", the three tokens enter the network as "0 1 2", and a position code is assigned in order as well, e.g. "001 002 003", which carries the position information of the sequence
Multi-Head Attention module (the highlight)
Expanded in the figure above
The most important principle is here. Simply put, the input sequence is fed through several attention heads in parallel and their outputs are combined (concatenated and projected in the paper). It is like several people voting on the same question: "three cobblers together match one Zhuge Liang" (many heads are better than one). The original Transformer uses eight heads
The specific mathematical principles need to be learned by yourself and will not be expanded upon here.
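As a rough sketch of those mathematics: the core computation is scaled dot-product attention run in several heads whose outputs are concatenated (the projection matrices below are random stand-ins for the learned ones, and the final output projection from the paper is omitted):

```python
# Multi-head attention sketch: each head runs scaled dot-product attention
# with its own Q/K/V projections; the head outputs are concatenated.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # softmax over the keys
    return w @ V                                     # weighted sum of the values

seq_len, d_model, n_heads = 3, 512, 8
d_head = d_model // n_heads
x = np.random.randn(seq_len, d_model)

heads = []
for _ in range(n_heads):
    # Each head has its own learned Q/K/V projections (random stand-ins here).
    Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
    heads.append(scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv))

out = np.concatenate(heads, axis=-1)                 # heads are concatenated
print(out.shape)                                     # (3, 512)
```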
Feed Forward (feed-forward network)
This is an ordinary network structure, such as fully connected or convolutional layers; it has been covered before, so it is not described in detail here
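A minimal sketch of this position-wise feed-forward sublayer in PyTorch: two linear layers with a ReLU in between, applied to every position independently (512 and 2048 are the sizes from the original paper):

```python
# Position-wise feed-forward network: the same two-layer MLP is applied
# to the vector at every sequence position.
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(1, 3, d_model)      # (batch, seq_len, d_model)
y = ffn(x)                          # same shape out: (1, 3, 512)
print(y.shape)
```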