A summary of attention-related knowledge (seq2seq, Encoder-Decoder, etc.)

Study notes, in the spirit of "revisiting the old to learn the new". Revised on 2023-03-25.

Models to study in detail

  • TCN
  • attention mechanism
  • dynamic graph attention networks

Attention

Origin: attention lets the model gather more detailed information about the target it needs to focus on, while suppressing other, useless information.


  • Q: query
  • K: key
  • V: value

A video I watched explained it like this: the key is what the query is matched against to compute a similarity score, and the value is the actual content (the "worth") being retrieved.

So attention is measured as similarity × value.

The softmax turns the similarity scores into weights.

Dividing by √d_k is explained in the paper: it is there to deal with a gradient problem (large dot products would push the softmax into regions with extremely small gradients).
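
Putting these pieces together, the scaled dot-product attention from "Attention Is All You Need" can be written as follows (standard formula, added here for reference):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
$$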


Attention Mechanism

The limitations of RNNs

Before the Transformer, the most widely used approach to neural machine translation was the RNN-based Encoder-Decoder model.


The Encoder-Decoder model was certainly successful; it was used very widely before 2018 and is quite capable. But RNNs have an inherent flaw: any RNN suffers from vanishing gradients. The core reason is the recurrence, which applies the same weight matrix over and over; if that matrix's largest eigenvalue is less than 1, the gradients are bound to vanish. Later models such as LSTM and GRU can only alleviate this problem, not eliminate it.
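
To make the eigenvalue argument concrete, here is a toy numerical sketch (my own illustration, not from the original note): repeatedly applying the same weight matrix whose spectral norm is below 1, as backpropagation through a recurrence does, shrinks the gradient geometrically.

import torch

torch.manual_seed(0)
hidden_size = 64
W = torch.randn(hidden_size, hidden_size)
W = 0.9 * W / torch.linalg.matrix_norm(W, ord=2)    # force the spectral norm to 0.9 < 1

grad = torch.randn(hidden_size)
for t in range(100):
    grad = W.T @ grad                                # one backward step through the recurrence
    if (t + 1) % 20 == 0:
        print(f"step {t + 1:3d}: grad norm = {grad.norm():.2e}")
# The norm decays roughly like 0.9**t, which is the vanishing-gradient effect.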


Why the Transformer is better than RNNs

The Transformer abandons the traditional CNN and RNN altogether; the whole network is built purely from attention. The authors chose attention because computation in an RNN (or LSTM, GRU, etc.) is constrained to be sequential: such models can only process the sequence from left to right or from right to left. This brings two problems:

  1. The computation at time step t depends on the result at time step t−1, which limits the model's ability to run in parallel.
  2. Information is lost along the way during sequential computation. Gated structures such as the LSTM alleviate the long-range dependency problem to some extent, but for very long-range dependencies the LSTM is still powerless.

The Transformer solves both problems:

  1. It uses attention, which reduces the distance between any two positions in the sequence to a constant;
  2. It is not a sequential structure like an RNN, so it parallelizes far better and fits today's GPU hardware (a small sketch of this contrast follows below).
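
A small sketch of that contrast (my own illustration, using plain torch ops): an RNN has to loop over the T time steps one by one, while self-attention scores every pair of positions with a single matrix multiplication.

import torch

torch.manual_seed(0)
T, d = 16, 32                         # sequence length, feature size
x = torch.randn(T, d)                 # one sequence of T token vectors

# RNN-style processing: an unavoidable Python loop over time steps.
Wh, Wx = torch.randn(d, d), torch.randn(d, d)
h = torch.zeros(d)
for t in range(T):                    # step t depends on step t-1
    h = torch.tanh(Wh @ h + Wx @ x[t])

# Self-attention: all T x T interaction scores in one shot, no loop.
scores = x @ x.T / d ** 0.5           # (T, T) pairwise scores
weights = torch.softmax(scores, dim=-1)
out = weights @ x                     # every position attends to every other
print(out.shape)                      # torch.Size([16, 32])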

The Transformer model architecture

The Transformer was proposed in the paper "Attention Is All You Need": use attention and drop the RNN. The title says it all: the attention mechanism alone, with no RNN at all, is enough to solve many problems. The Transformer uses no RNN, and it works very well.

The overall structure of the Transformer is shown in the figure below. Broadly speaking it still resembles the Encoder-Decoder model: the left side is the Encoder and the right side is the Decoder.

[Figure: the overall Transformer architecture]

Encoder: the input is the word embeddings plus the positional encoding, which then enter a repeated block that is stacked N times (N layers). Each layer consists of an attention sub-layer and a fully connected sub-layer, plus some extra processing such as skip connections and a normalization layer. The model itself is actually quite simple.
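
As a sanity check of this description, PyTorch's built-in nn.TransformerEncoderLayer bundles exactly these pieces (self-attention, feed-forward, residual connections, LayerNorm); a minimal sketch, assuming a model width of 512 and 8 heads:

import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6

# One encoder layer = self-attention + feed-forward, each wrapped with
# a residual (skip) connection and layer normalization.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=2048,
    dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# (batch, seq_len, d_model): word embeddings to which a positional
# encoding would normally be added before this call.
x = torch.randn(10, 20, d_model)
memory = encoder(x)
print(memory.shape)   # torch.Size([10, 20, 512])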

Decoder: the first input is the prefix; after that, each step takes the embedding of the previous output, adds the positional encoding, and enters a block that is also repeated many times. Each block has three parts: the first is an attention (self-attention) sub-layer, the second is cross-attention rather than self-attention, and the third is a fully connected sub-layer; skip connections and normalization are used here as well.

Output: the final output goes through a Linear (fully connected) layer and then a softmax to make the prediction.
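
A corresponding sketch of the decoder side plus the output head, again using PyTorch's built-ins (the vocabulary size of 1000 is just an arbitrary number for illustration):

import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 512, 8, 6, 1000   # vocab_size is arbitrary here

# One decoder layer = masked self-attention + cross-attention over the encoder
# output + feed-forward, each with a residual connection and LayerNorm.
decoder_layer = nn.TransformerDecoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=2048,
    dropout=0.1, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

batch, src_len, tgt_len = 10, 20, 15
memory = torch.randn(batch, src_len, d_model)    # encoder output
tgt = torch.randn(batch, tgt_len, d_model)       # embedded output prefix + positional encoding

# Causal mask: True marks (future) positions that may NOT be attended to.
tgt_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

hidden = decoder(tgt, memory, tgt_mask=tgt_mask)

# Final Linear + softmax to predict the next token.
logits = nn.Linear(d_model, vocab_size)(hidden)
probs = torch.softmax(logits, dim=-1)
print(probs.shape)    # torch.Size([10, 15, 1000])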

Here is another, simpler way to explain the Transformer's network structure.

[Figure: a simplified view of the Transformer's data flow]

Note that in the figure above, the Decoder's first input is the prefix of the output.

self-attention

The difference between attention and self-attention

Take the Encoder-Decoder framework as an example: the input Source and the output Target are different. For English-Chinese machine translation, Source is the English sentence and Target is the corresponding translated Chinese sentence. Attention happens between the Query elements of the Target and all the elements of the Source.

Self-attention does not refer to attention between Target and Source, but to attention among the elements inside the Source (or inside the Target). It can also be seen as attention in the special case where Target = Source.

The concrete computation is exactly the same in both cases; only the objects being attended to change.
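
A minimal sketch of that distinction (my own illustration): the same scaled dot-product routine is used in both cases, and only what plays the role of query / key / value changes.

import torch

def scaled_dot_attention(q, k, v):
    # q: (n_q, d), k: (n_k, d), v: (n_k, d)
    scores = q @ k.T / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d = 64
source = torch.randn(20, d)   # e.g. encoder states of the English sentence
target = torch.randn(15, d)   # e.g. decoder states of the Chinese sentence

# Encoder-decoder ("cross") attention: queries from Target, keys/values from Source.
cross = scaled_dot_attention(target, source, source)

# Self-attention: the special case Target = Source.
self_att = scaled_dot_attention(source, source, source)

print(cross.shape, self_att.shape)   # torch.Size([15, 64]) torch.Size([20, 64])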

[Figures: slides on self-attention from Hung-yi Lee's lecture video]

multi-headed Attention

Multi-head attention runs several attention operations in parallel, each with its own learned projections of Q, K and V; the per-head outputs are concatenated and projected once more to give the final result.
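
Besides the hand-written version in the code section below, PyTorch also ships a ready-made nn.MultiheadAttention; a minimal usage sketch:

import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(10, 20, embed_dim)           # (batch, seq_len, embed_dim)
# Self-attention: query, key and value are all the same tensor.
out, attn_weights = mha(x, x, x)
print(out.shape, attn_weights.shape)         # (10, 20, 512) and (10, 20, 20)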

Some reflections

  • Always be willing to share your ideas and let others comment on them; never close yourself off.

PyTorch implementation

https://wmathor.com/index.php/archives/1451/

Explanation of the code

What attention, self-attention, multi-head attention and the transformer are, introduced through code (Bilibili video)

attention

import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, attention_type='dot', hidden_size=256):
        super(Attention, self).__init__()
        self.attention_type = attention_type
        self.hidden_size = hidden_size

        # Linear layer to transform the query (decoder hidden state)
        self.query = nn.Linear(hidden_size, hidden_size, bias=False)
        # Linear layer to transform the key (encoder hidden state)
        self.key = nn.Linear(hidden_size, hidden_size, bias=False)
        # Linear layer to transform the value (encoder hidden state)
        self.value = nn.Linear(hidden_size, hidden_size, bias=False)

        # Softmax layer to compute the attention weights
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, query, keys, values):
        # Transform the query
        query = self.query(query).unsqueeze(1)
        # Transform the keys
        keys = self.key(keys)
        # Transform the values
        values = self.value(values)

        # Compute the attention weights
        if self.attention_type == 'dot':
            # dot product attention
            attention_weights = torch.bmm(query, keys.transpose(1, 2))
        elif self.attention_type == 'cosine':
            # cosine similarity attention
            query = query / query.norm(dim=-1, keepdim=True)
            keys = keys / keys.norm(dim=-1, keepdim=True)
            attention_weights = torch.bmm(query, keys.transpose(1, 2))
        else:
            raise ValueError(f"Invalid attention type: {self.attention_type}")

        # Normalize the attention weights
        attention_weights = self.softmax(attention_weights)

        # Apply the attention weights to the values to obtain the attended output
        attended_output = torch.bmm(attention_weights, values)

        return attended_output, attention_weights

# To use this attention module, pass it the query (decoder hidden state),
# keys (encoder hidden states) and values (encoder hidden states) as input,
# and it will return the attended output and the attention weights.
# For example:

# Define the attention module
attention = Attention(attention_type='dot', hidden_size=256)

# Inputs to the attention module
batch_size = 10
hidden_size = 256
sequence_length = 12
query = torch.randn(batch_size, hidden_size)
keys = torch.randn(batch_size, sequence_length, hidden_size)
values = torch.randn(batch_size, sequence_length, hidden_size)

# Compute the attended output and attention weights
attended_output, attention_weights = attention(query, keys, values)
print(attended_output.shape)     # torch.Size([10, 1, 256])
print(attention_weights.shape)   # torch.Size([10, 1, 12])

multi-Head attention

## How to build multi-head attention using PyTorch?
# Multi-head attention is an extension of the attention mechanism that
# allows the model to attend to multiple different parts of the input simultaneously.
# It uses multiple attention heads, each of which attends to a different part of the
# input and produces its own attended output. These attended outputs are then
# concatenated and transformed to obtain the final attended output.
# Here is an example of how you can implement multi-head attention using PyTorch:


import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, input_dim, output_dim):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.input_dim = input_dim
        self.output_dim = output_dim

        # One query/key/value projection per head
        self.query_projections = nn.ModuleList([nn.Linear(input_dim, output_dim) for _ in range(num_heads)])
        self.key_projections = nn.ModuleList([nn.Linear(input_dim, output_dim) for _ in range(num_heads)])
        self.value_projections = nn.ModuleList([nn.Linear(input_dim, output_dim) for _ in range(num_heads)])
        # Projects the concatenated head outputs back to output_dim
        self.output_projection = nn.Linear(num_heads * output_dim, output_dim)

    def forward(self, query, key, value, mask=None):
        outputs = []
        for i in range(self.num_heads):
            query_projection = self.query_projections[i](query)
            key_projection = self.key_projections[i](key)
            value_projection = self.value_projections[i](value)

            # Dot-product scores; note they are not scaled by sqrt(d_k) as in the paper
            dot_product = torch.matmul(query_projection, key_projection.transpose(1, 2))
            if mask is not None:
                # Positions where mask == 0 are blocked from attention
                dot_product = dot_product.masked_fill(mask == 0, -1e9)
            attention_weights = torch.softmax(dot_product, dim=-1)
            output = torch.matmul(attention_weights, value_projection)
            outputs.append(output)

        # Concatenate the per-head outputs and project them to the final output
        concatenated_outputs = torch.cat(outputs, dim=-1)
        final_output = self.output_projection(concatenated_outputs)
        return final_output

# Define the multi-head attention module
attention = MultiHeadAttention(num_heads=8, input_dim=512, output_dim=64)

# Define the input tensors: (batch, seq_len, input_dim)
query = torch.randn(32, 16, 512)
key = torch.randn(32, 16, 512)
value = torch.randn(32, 16, 512)
# mask == 1 means "may attend"; all ones here, so nothing is masked out
mask = torch.ones(32, 16, 16)

# Apply the attention module to the input tensors
output = attention(query, key, value, mask=mask)
print(output.shape)   # torch.Size([32, 16, 64])

transformer

## How to build a transformer using PyTorch?
# A transformer is a deep learning model that is designed to process sequential input data using self-attention mechanisms.
# It consists of an encoder and a decoder, both of which are composed of multiple layers of self-attention and feedforward neural networks.
# Here is an example of how you can implement a (simplified) transformer using PyTorch:

import torch
import torch.nn as nn

# Same multi-head attention module as in the previous example
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, input_dim, output_dim):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.input_dim = input_dim
        self.output_dim = output_dim

        self.query_projections = nn.ModuleList([nn.Linear(input_dim, output_dim) for _ in range(num_heads)])
        self.key_projections = nn.ModuleList([nn.Linear(input_dim, output_dim) for _ in range(num_heads)])
        self.value_projections = nn.ModuleList([nn.Linear(input_dim, output_dim) for _ in range(num_heads)])
        self.output_projection = nn.Linear(num_heads * output_dim, output_dim)

    def forward(self, query, key, value, mask=None):
        outputs = []
        for i in range(self.num_heads):
            query_projection = self.query_projections[i](query)
            key_projection = self.key_projections[i](key)
            value_projection = self.value_projections[i](value)

            dot_product = torch.matmul(query_projection, key_projection.transpose(1, 2))
            if mask is not None:
                dot_product = dot_product.masked_fill(mask == 0, -1e9)
            attention_weights = torch.softmax(dot_product, dim=-1)
            output = torch.matmul(attention_weights, value_projection)
            outputs.append(output)

        concatenated_outputs = torch.cat(outputs, dim=-1)
        final_output = self.output_projection(concatenated_outputs)
        return final_output

class TransformerLayer(nn.Module):
    def __init__(self, hidden_size, num_heads, dropout):
        super(TransformerLayer, self).__init__()
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.dropout = dropout

        # Multi-head attention module
        self.attention = MultiHeadAttention(num_heads=num_heads, input_dim=hidden_size, output_dim=hidden_size)
        # Dropout and residual connection after the attention module
        self.attention_dropout = nn.Dropout(dropout)
        self.attention_norm = nn.LayerNorm(hidden_size)

        # Feedforward neural network
        self.feedforward = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.ReLU(),
            nn.Linear(4 * hidden_size, hidden_size)
        )
        # Dropout and residual connection after the feedforward network
        self.feedforward_dropout = nn.Dropout(dropout)
        self.feedforward_norm = nn.LayerNorm(hidden_size)

    def forward(self, input, mask):
        # Multi-head self-attention
        attention_output = self.attention(input, input, input, mask)
        attention_output = self.attention_dropout(attention_output)
        # Add the residual connection and apply layer normalization
        attention_output = self.attention_norm(input + attention_output)

        # Feedforward neural network
        feedforward_output = self.feedforward(attention_output)
        feedforward_output = self.feedforward_dropout(feedforward_output)
        # Add the residual connection and apply layer normalization
        feedforward_output = self.feedforward_norm(attention_output + feedforward_output)

        return feedforward_output

# Note: this is a simplified illustration. The "decoder" below is just another stack of
# self-attention layers applied to the encoder output; it has no masked self-attention
# over a target sequence, no cross-attention, and no embeddings or positional encodings.
class Transformer(nn.Module):
    def __init__(self, num_layers=6, hidden_size=512, num_heads=8, dropout=0.1):
        super(Transformer, self).__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.dropout = dropout

        # Encoder layers
        self.encoder_layers = nn.ModuleList([TransformerLayer(hidden_size, num_heads, dropout) for _ in range(num_layers)])
        self.encoder_norm = nn.LayerNorm(hidden_size)

        # Decoder layers
        self.decoder_layers = nn.ModuleList([TransformerLayer(hidden_size, num_heads, dropout) for _ in range(num_layers)])
        self.decoder_norm = nn.LayerNorm(hidden_size)

        # Output projection layer
        self.output_projection = nn.Linear(hidden_size, hidden_size)

    def forward(self, input, mask):
        # Pass the input through the encoder layers
        for layer in self.encoder_layers:
            input = layer(input, mask)

        # Apply the encoder layer normalization
        input = self.encoder_norm(input)

        # Pass the encoded input through the decoder layers
        for layer in self.decoder_layers:
            input = layer(input, mask)

        # Apply the decoder layer normalization
        input = self.decoder_norm(input)

        # Apply the output projection layer
        output = self.output_projection(input)

        return output

# To use this transformer model, pass it the input data and a mask indicating
# which positions in the input should be ignored, and it will return the output of the transformer.
# For example:

# Define the transformer model
transformer = Transformer(num_layers=6, hidden_size=512, num_heads=8, dropout=0.1)

# Input data and mask
batch_size = 10
sequence_length = 16
hidden_size = 512
input = torch.randn(batch_size, sequence_length, hidden_size)
# Random boolean mask, just for demonstration; in practice this would be a padding or causal mask
mask = torch.rand(batch_size, sequence_length, sequence_length) > 0.5

# Compute the transformer output
output = transformer(input, mask)
print(output.shape)   # torch.Size([10, 16, 512])

References

https://medium.com/%E6%99%82%E7%A9%BA%E5%9C%96%E6%8A%80%E8%A1%93%E8%88%87%E6%87%89%E7%94%A8/%E9%96%B1%E8%AE%80%E7%AD%86%E8%A8%98-graph-wavenet-for-deep-spatial-temporal-graph-modeling-14ec94033af3

https://www.youtube.com/watch?v=9_1cEtimRNE (to watch when there is time)

https://zhuanlan.zhihu.com/p/186987950 (dynamic graph attention mechanism)

https://blog.floydhub.com/attention-mechanism/

http://xtf615.com/2019/01/06/attention/

https://mp.weixin.qq.com/s/lUqpCae3TPkZlgT7gUatpg

https://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder