A Brief Look at LSTM (GRU) in pytorch

Introduction

This post covers building RNN networks in pytorch, focusing on how to use LSTM and GRU.

Having recently moved from tensorflow to pytorch, I found that the forward pass of their RNN modules differs in a few ways, in particular around pack_padded_sequence and pad_packed_sequence.

Background

Tensorflow BILSTM

Taking a BiLSTM as the example, in tensorflow we usually do the following:

  • First define a forward cell

    self.fw_cell = tf.nn.rnn_cell.LSTMCell(num_units=cell_size)

  • Then define a backward cell

    self.bw_cell = tf.nn.rnn_cell.LSTMCell(num_units=cell_size)

The forward pass then looks like this:

def call(self, inputs, seq_length, training):
    # embedding layer
    embedded_words = self.embeddings(inputs)
    # RNN layer
    outputs, final_state = tf.nn.bidirectional_dynamic_rnn(
        self.fw_cell,
        self.bw_cell,
        inputs=embedded_words,
        # the original sentence lengths, shape (batch_size,)
        sequence_length=seq_length,
        dtype=tf.float32,
        time_major=False)
    # concatenate the forward and backward results
    # before the concat, outputs is a pair (output_fw, output_bw),
    # each of shape (batch_size, time_steps, cell_size)
    outputs = tf.concat(outputs, axis=2)
    # since dynamic_rnn is used, the padded steps are not computed
    # (their outputs are zeros), so take the output at the last time step
    final_output = outputs[:, -1, :]
    logits = self.Dense(final_output)
    return logits

As you can see, tensorflow's dynamic LSTM has two characteristics:

  • In the forward pass, ignoring the word-vector dimension, the input is 2-dimensional: (batch_size, time_steps)
  • The padded positions do not take part in the computation; they are truncated automatically

With this in mind, let's see what a BiLSTM looks like in pytorch.

Pytorch BILSTM

Defining the LSTM

The usual way looks like this:

  • First define an LSTM

    self.rnn_cell = nn.LSTM(input_size=word_embedding_dimension,
                            hidden_size=hidden_size,
                            num_layers=num_layer,
                            batch_first=True,
                            bidirectional=bi_flag)

    What about a bidirectional one? Just change one parameter:

    bidirectional=True
  • What about multiple layers? num_layers takes an integer count of stacked layers, for example:

    num_layers=2

    (A GRU is defined in exactly the same way; see the sketch right after this list.)
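
The GRU in the title is constructed the same way; the snippet below mirrors the full code at the end of the post (the only difference at call time is that nn.GRU carries a single hidden state and returns (output, h_n), with no cell state):

self.rnn_cell = nn.GRU(input_size=word_embedding_dimension,
                       hidden_size=hidden_size,
                       num_layers=num_layer,
                       batch_first=True,
                       bidirectional=bi_flag)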

As you can see, pytorch's LSTM bundles everything into a single module; there is no need to define separate forward and backward cells (although you could still build one out of two unidirectional LSTMs). Also, besides the number of hidden units (cell_size), we have to pass in the dimension of the word vectors. Why is that? Because nn.LSTM creates its input-to-hidden weight matrices when the module is constructed, it needs to know the input feature size (input_size) up front, whereas tensorflow's LSTMCell only builds its kernel on the first call and infers the size from the input.

That is still not the crucial part, though. The crucial part is how pytorch handles padding. Since inputs normally arrive as batches, padding is unavoidable for variable-length text, but when the LSTM runs we do not want the padded positions to take part in the computation (they are not real text; including them adds unnecessary noise and wasted computation). tensorflow solves this nicely with its dynamic LSTM; pytorch has its own mechanism, shown below.

pack_padded_sequence

The input we get from the embedding layer has shape (batch_size, time_steps, word_embedding_dimension). We do not feed it to the LSTM directly but pre-process it first, and this is where the first important function, pack_padded_sequence, comes in. You can think of it as a truncation function: it cuts away the padded positions. On top of that, the truncated input is flattened into shape (sum of the true lengths, word_embedding_dimension); look closely, it has one dimension fewer. Here is an example:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

# build a tensor by hand: 3 sequences of lengths 10, 5 and 3, padded with 0
x = torch.FloatTensor([[[1],[2],[3],[4],[5],[6],[7],[8],[8],[9]],[[1],[2],[3],[4],[5],[0],[0],[0],[0],[0]],[[5],[4],[6],[0],[0],[0],[0],[0],[0],[0]]])
print("x shape is:", x.shape)
# the true lengths of x
length = torch.LongTensor([10, 5, 3])
x_packed = pack_padded_sequence(x, length, batch_first=True)
print(x_packed)

The result is:

x shape is: torch.Size([3, 10, 1])
PackedSequence(data=tensor([[1.],
[1.],
[5.],
[2.],
[2.],
[4.],
[3.],
[3.],
[6.],
[4.],
[4.],
[5.],
[5.],
[6.],
[7.],
[8.],
[8.],
[9.]]), batch_sizes=tensor([3, 3, 3, 2, 2, 1, 1, 1, 1, 1]))

A few points worth noting:

  • The input tensor must already be sorted by length in descending order! This is required, otherwise pack_padded_sequence raises an error (newer pytorch versions can also accept unsorted input via enforce_sorted=False). There are generally two ways to sort: use torch.sort, or sort inside each batch with torchtext while building batches, see my previous post; a small sketch follows this list.
  • After packing, the input changes in three ways: it drops from 3 dimensions to 2, here (18, 1); the padding zeros are gone; and the sequences are concatenated together, which is exactly where the dropped dimension went.
  • The result of packing is a PackedSequence (a named tuple): the first field, data, holds the values, concatenated time step by time step across the batch; the second field, batch_sizes, is not our original batch_size, it records how many sequences are still active at each time step, which is what later allows the output to be restored.
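
Here is a minimal sketch of the torch.sort route mentioned in the first point (the tensors are toy values, not the author's data): sort the lengths in descending order, reorder the batch with the returned indices, and keep the inverse permutation in case the original order has to be restored after the RNN.

import torch

lengths = torch.LongTensor([3, 10, 5])        # unsorted true lengths
batch = torch.randn(3, 10, 8)                 # padded batch: (batch, time_steps, embedding_dim)

sorted_lengths, sort_idx = torch.sort(lengths, descending=True)
sorted_batch = batch[sort_idx]                # reorder the batch to match the sorted lengths
# ... pack, run the RNN, pad ...
_, unsort_idx = torch.sort(sort_idx)          # inverse permutation
restored = sorted_batch[unsort_idx]           # back to the original order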

Good, the padding problem is solved. Now we feed the packed input into the LSTM; the forward pass looks like this:

def forward(self, inputs, length):
    """forward pass"""
    embeddings = self.embedding(inputs)  # (batch_size, time_steps, embedding_dim)
    # drop the padding elements
    # embeddings_packed.data: (sum(length), embedding_dim)
    embeddings_packed = pack_padded_sequence(embeddings, length, batch_first=True)
    # h_0 / c_0 are the initial states (zeros if omitted); see the full code below
    output, (h_n, c_n) = self.rnn_cell(embeddings_packed, (h_0, c_0))
    # padded_output: (batch_size, time_steps, hidden_size * bi_num)
    # h_n | c_n: (num_layer * bi_num, batch_size, hidden_size)
    padded_output, length = pad_packed_sequence(output, batch_first=True)
    # take the last valid output of each sequence (the padded positions are 0)
    last_output = padded_output[torch.arange(padded_output.size(0)), length - 1]

A quick note on the parameters:

  • batch_first: bool. If True, the input has shape (batch_size, time_steps, word_embedding_dimension); if False, (time_steps, batch_size, word_embedding_dimension)
  • h_0: the initial hidden state (see the sketch after this list for its shape)
  • c_0: the initial cell state
  • output: the output, a PackedSequence; an example follows below
  • h_n: the final hidden state
  • c_n: the final cell state
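
For reference, a minimal sketch of how h_0 and c_0 can be built for a single-layer bidirectional LSTM. The shapes follow the comment in the forward pass above; the concrete numbers (batch of 3, hidden size 8) simply match the toy example used below:

import torch

num_layer, bi_num = 1, 2           # one layer, bidirectional
batch_size, hidden_size = 3, 8     # values of the toy example below

# both states have shape (num_layer * bi_num, batch_size, hidden_size)
h_0 = torch.zeros(num_layer * bi_num, batch_size, hidden_size)
c_0 = torch.zeros(num_layer * bi_num, batch_size, hidden_size)
# pass them as rnn_cell(x_packed, (h_0, c_0)); if omitted, zeros are used by default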

Using the tensor we built above again, let's look at the output:

import torch.nn as nn

birnn = nn.LSTM(input_size=1, hidden_size=8, bidirectional=True)
output, (h_n, c_n) = birnn(x_packed)
print('output:\n', output)
print('h_n shape:\n',h_n.shape)
print('c_n shape:\n',c_n.shape)

The result is:

output:
PackedSequence(data=tensor([[-0.0087, 0.0716, 0.0069, -0.0040, -0.1375, 0.0404, 0.0757, 0.0291,
0.0619, 0.0213, 0.1517, 0.0241, 0.2986, -0.2594, -0.1432, -0.1742],
[-0.0087, 0.0716, 0.0069, -0.0040, -0.1375, 0.0404, 0.0757, 0.0291,
0.0486, 0.0195, 0.1454, 0.0138, 0.2596, -0.2467, -0.1491, -0.1596],
[-0.1019, 0.1606, -0.0482, -0.0519, -0.1932, 0.2632, -0.0423, 0.0774,
0.0434, 0.0142, 0.0915, -0.0455, 0.2970, -0.6665, -0.0578, -0.0521],
[-0.0895, 0.1504, -0.0069, -0.0137, -0.2248, 0.1365, 0.1002, 0.0918,
0.0730, 0.0247, 0.1526, 0.0160, 0.3559, -0.4301, -0.1375, -0.1429],
[-0.0895, 0.1504, -0.0069, -0.0137, -0.2248, 0.1365, 0.1002, 0.0918,
0.0517, 0.0197, 0.1431, -0.0005, 0.2929, -0.4016, -0.1401, -0.1218],
[-0.1855, 0.2300, -0.0625, -0.0508, -0.2488, 0.2741, -0.0638, 0.1441,
0.0239, 0.0137, 0.0786, -0.0524, 0.2311, -0.5685, -0.0673, -0.0485],
[-0.1621, 0.2254, -0.0318, -0.0240, -0.2693, 0.2207, 0.0843, 0.1487,
0.0789, 0.0259, 0.1354, 0.0050, 0.3851, -0.5676, -0.1110, -0.1117],
[-0.1621, 0.2254, -0.0318, -0.0240, -0.2693, 0.2207, 0.0843, 0.1487,
0.0439, 0.0172, 0.1198, -0.0221, 0.2856, -0.5105, -0.1089, -0.0852],
[-0.1346, 0.3093, -0.0775, -0.0678, -0.2616, 0.2946, -0.1167, 0.1699,
-0.0280, 0.0067, 0.0333, -0.0924, 0.1308, -0.5332, -0.0237, -0.0201],
[-0.1886, 0.2883, -0.0563, -0.0352, -0.2872, 0.2700, 0.0431, 0.1887,
0.0803, 0.0254, 0.1112, -0.0077, 0.3929, -0.6681, -0.0804, -0.0828],
[-0.1886, 0.2883, -0.0563, -0.0352, -0.2872, 0.2700, 0.0431, 0.1887,
0.0194, 0.0129, 0.0852, -0.0521, 0.2364, -0.5524, -0.0720, -0.0538],
[-0.1767, 0.3363, -0.0719, -0.0475, -0.2858, 0.2911, -0.0109, 0.2141,
0.0779, 0.0234, 0.0866, -0.0215, 0.3831, -0.7393, -0.0539, -0.0580],
[-0.1767, 0.3363, -0.0719, -0.0475, -0.2858, 0.2911, -0.0109, 0.2141,
-0.0274, 0.0076, 0.0448, -0.0817, 0.1408, -0.4683, -0.0382, -0.0285],
[-0.1470, 0.3710, -0.0771, -0.0605, -0.2693, 0.2962, -0.0684, 0.2291,
0.0724, 0.0204, 0.0651, -0.0356, 0.3568, -0.7885, -0.0341, -0.0385],
[-0.1142, 0.3956, -0.0744, -0.0727, -0.2434, 0.2935, -0.1226, 0.2366,
0.0636, 0.0164, 0.0477, -0.0497, 0.3136, -0.8199, -0.0208, -0.0245],
[-0.0854, 0.4126, -0.0669, -0.0829, -0.2140, 0.2871, -0.1699, 0.2389,
0.0496, 0.0118, 0.0343, -0.0641, 0.2542, -0.8307, -0.0124, -0.0151],
[-0.0847, 0.4176, -0.0620, -0.0877, -0.2053, 0.2863, -0.2063, 0.2468,
0.0281, 0.0090, 0.0254, -0.0784, 0.1827, -0.7871, -0.0100, -0.0116],
[-0.0620, 0.4284, -0.0550, -0.0931, -0.1827, 0.2785, -0.2409, 0.2438,
-0.0278, 0.0044, 0.0120, -0.1073, 0.0924, -0.6543, -0.0047, -0.0063]],
grad_fn=<CatBackward>), batch_sizes=tensor([3, 3, 3, 2, 2, 1, 1, 1, 1, 1]))

h_n shape:
torch.Size([2, 3, 8])
c_n shape:
torch.Size([2, 3, 8])

output is a PackedSequence: its first field, data, is the actual output, here with shape torch.Size([18, 16]); its second field, batch_sizes, is the same tensor we saw when packing.

h_n and c_n are 3-dimensional tensors with the shapes shown above; because the LSTM is bidirectional, the first dimension is 2.
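
As a side note, when all you need is one fixed-size vector per sentence, you do not have to unpack output at all: with packed input, h_n already holds the hidden state at each sequence's true last step. A minimal sketch continuing the toy example above (the layout follows pytorch's (num_layers * num_directions, batch, hidden_size) convention):

# h_n: (num_layers * num_directions, batch, hidden_size) -> here (2, 3, 8)
# for a single bidirectional layer, h_n[-2] is the last forward state, h_n[-1] the last backward state
sentence_repr = torch.cat([h_n[-2], h_n[-1]], dim=1)
print(sentence_repr.shape)  # torch.Size([3, 16])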

Is that the end of it? Of course not. After packing and running the LSTM we do get a result, but it is awkward to use: the sequences are glued together and the batch dimension is gone. So we restore it, and that is where pack_padded_sequence's best friend, pad_packed_sequence, comes in.

pad_packed_sequence

pad_packed_sequence can be thought of as the decompression step. As seen above, h_n and c_n are already regular tensors; they were never packed, so there is nothing to pad there. We only need to pad output.

padded_output, length = pad_packed_sequence(output, batch_first=True)

print('padded_output\n', padded_output)
print('padded_output shape\n', padded_output.shape)
print('length\n',length)

The result is:

padded_output
tensor([[[-0.0087, 0.0716, 0.0069, -0.0040, -0.1375, 0.0404, 0.0757,
0.0291, 0.0619, 0.0213, 0.1517, 0.0241, 0.2986, -0.2594,
-0.1432, -0.1742],
[-0.0895, 0.1504, -0.0069, -0.0137, -0.2248, 0.1365, 0.1002,
0.0918, 0.0730, 0.0247, 0.1526, 0.0160, 0.3559, -0.4301,
-0.1375, -0.1429],
[-0.1621, 0.2254, -0.0318, -0.0240, -0.2693, 0.2207, 0.0843,
0.1487, 0.0789, 0.0259, 0.1354, 0.0050, 0.3851, -0.5676,
-0.1110, -0.1117],
[-0.1886, 0.2883, -0.0563, -0.0352, -0.2872, 0.2700, 0.0431,
0.1887, 0.0803, 0.0254, 0.1112, -0.0077, 0.3929, -0.6681,
-0.0804, -0.0828],
[-0.1767, 0.3363, -0.0719, -0.0475, -0.2858, 0.2911, -0.0109,
0.2141, 0.0779, 0.0234, 0.0866, -0.0215, 0.3831, -0.7393,
-0.0539, -0.0580],
[-0.1470, 0.3710, -0.0771, -0.0605, -0.2693, 0.2962, -0.0684,
0.2291, 0.0724, 0.0204, 0.0651, -0.0356, 0.3568, -0.7885,
-0.0341, -0.0385],
[-0.1142, 0.3956, -0.0744, -0.0727, -0.2434, 0.2935, -0.1226,
0.2366, 0.0636, 0.0164, 0.0477, -0.0497, 0.3136, -0.8199,
-0.0208, -0.0245],
[-0.0854, 0.4126, -0.0669, -0.0829, -0.2140, 0.2871, -0.1699,
0.2389, 0.0496, 0.0118, 0.0343, -0.0641, 0.2542, -0.8307,
-0.0124, -0.0151],
[-0.0847, 0.4176, -0.0620, -0.0877, -0.2053, 0.2863, -0.2063,
0.2468, 0.0281, 0.0090, 0.0254, -0.0784, 0.1827, -0.7871,
-0.0100, -0.0116],
[-0.0620, 0.4284, -0.0550, -0.0931, -0.1827, 0.2785, -0.2409,
0.2438, -0.0278, 0.0044, 0.0120, -0.1073, 0.0924, -0.6543,
-0.0047, -0.0063]],

[[-0.0087, 0.0716, 0.0069, -0.0040, -0.1375, 0.0404, 0.0757,
0.0291, 0.0486, 0.0195, 0.1454, 0.0138, 0.2596, -0.2467,
-0.1491, -0.1596],
[-0.0895, 0.1504, -0.0069, -0.0137, -0.2248, 0.1365, 0.1002,
0.0918, 0.0517, 0.0197, 0.1431, -0.0005, 0.2929, -0.4016,
-0.1401, -0.1218],
[-0.1621, 0.2254, -0.0318, -0.0240, -0.2693, 0.2207, 0.0843,
0.1487, 0.0439, 0.0172, 0.1198, -0.0221, 0.2856, -0.5105,
-0.1089, -0.0852],
[-0.1886, 0.2883, -0.0563, -0.0352, -0.2872, 0.2700, 0.0431,
0.1887, 0.0194, 0.0129, 0.0852, -0.0521, 0.2364, -0.5524,
-0.0720, -0.0538],
[-0.1767, 0.3363, -0.0719, -0.0475, -0.2858, 0.2911, -0.0109,
0.2141, -0.0274, 0.0076, 0.0448, -0.0817, 0.1408, -0.4683,
-0.0382, -0.0285],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000]],

[[-0.1019, 0.1606, -0.0482, -0.0519, -0.1932, 0.2632, -0.0423,
0.0774, 0.0434, 0.0142, 0.0915, -0.0455, 0.2970, -0.6665,
-0.0578, -0.0521],
[-0.1855, 0.2300, -0.0625, -0.0508, -0.2488, 0.2741, -0.0638,
0.1441, 0.0239, 0.0137, 0.0786, -0.0524, 0.2311, -0.5685,
-0.0673, -0.0485],
[-0.1346, 0.3093, -0.0775, -0.0678, -0.2616, 0.2946, -0.1167,
0.1699, -0.0280, 0.0067, 0.0333, -0.0924, 0.1308, -0.5332,
-0.0237, -0.0201],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000]]], grad_fn=<TransposeBackward0>)
padded_output shape
torch.Size([3, 10, 16])
length
tensor([10, 5, 3])

And we are back to the original layout. The 16 is hidden_size * 2, because the LSTM is bidirectional.

Notice that the positions that did not take part in the computation and were padded back in are all zeros!
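
Because the padded positions are zeros, the last valid output of each sequence should be picked out with the returned lengths rather than by simply taking index -1. A small sketch continuing the example above, which mirrors what the full code at the end of the post does:

# padded_output: (3, 10, 16), length: tensor([10, 5, 3])
batch_size = padded_output.size(0)
# index each sequence at its own last valid time step
last_output = padded_output[torch.arange(batch_size), length - 1]
print(last_output.shape)  # torch.Size([3, 16])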

Complete code

tensorflow

import os
import time
import numpy as np
import tensorflow as tf
from util.embedding_util import get_embedding
from util.plot_util import loss_acc_plot
from util.lr_util import lr_update
import config.lstm_config as config


class BILSTM(tf.keras.Model):
    def __init__(self, cell_size,
                 checkpoint_dir,
                 num_classes,
                 model_type,
                 vocab_size,
                 word2id,
                 embedding_dim,
                 keep_prob):
        super().__init__()

        self.checkpoint_dir = checkpoint_dir
        self.history = {}
        self.keep_prob = keep_prob

        # embedding layer
        weights = get_embedding(model_type=model_type,
                                word2id=word2id,
                                embedding_dim=embedding_dim)
        if model_type == 'static':
            self.embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=weights, trainable=False)
        elif model_type == 'non-static':
            self.embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=weights, trainable=True)
        elif model_type == 'rand':
            self.embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=weights, trainable=True)
        elif model_type == 'multichannel':
            pass
        else:
            raise ValueError('unknown model type')

        # BILSTM layer
        self.fw_cell = tf.nn.rnn_cell.DropoutWrapper(
            tf.nn.rnn_cell.LSTMCell(num_units=cell_size), output_keep_prob=0.7)
        self.bw_cell = tf.nn.rnn_cell.DropoutWrapper(
            tf.nn.rnn_cell.LSTMCell(num_units=cell_size), output_keep_prob=0.7)

        self.Dense = tf.layers.Dense(units=num_classes, activation=None)

    def call(self, inputs, seq_length, training):
        embedded_words = self.embeddings(inputs)
        outputs, final_state = tf.nn.bidirectional_dynamic_rnn(
            self.fw_cell,
            self.bw_cell,
            inputs=embedded_words,
            sequence_length=seq_length,
            dtype=tf.float32,
            time_major=False)
        outputs = tf.concat(outputs, axis=2)
        # take the output at the last time step
        final_output = outputs[:, -1, :]
        logits = self.Dense(final_output)
        return logits

    def loss_fn(self, inputs, target, seq_length, training):
        preds = self.call(inputs, seq_length, training)
        # L2 regularization
        loss_L2 = tf.add_n([tf.nn.l2_loss(v)
                            for v in self.trainable_variables
                            if 'bias' not in v.name]) * 0.001
        loss = tf.losses.sparse_softmax_cross_entropy(labels=target, logits=preds)
        loss = loss + loss_L2
        return loss

    def grads_fn(self, inputs, target, seq_length, training):
        with tf.GradientTape() as tape:
            loss = self.loss_fn(inputs, target, seq_length, training)
        return tape.gradient(loss, self.variables)

    def save_model(self, model):
        """Function to save the trained model."""
        checkpoint = tf.train.Checkpoint(model=model)
        checkpoint_prefix = os.path.join(self.checkpoint_dir, 'ckpt')
        checkpoint.save(file_prefix=checkpoint_prefix)

    def restore_model(self):
        # Run the model once to initialize variables
        dummy_input = tf.constant(tf.zeros((1, 1)))
        dummy_length = tf.constant(1, shape=(1,))
        self(dummy_input, dummy_length, False)
        # Restore the variables of the model
        saver = tf.contrib.eager.Saver(self.variables)
        saver.restore(tf.train.latest_checkpoint(self.checkpoint_dir))

    def get_accuracy(self, inputs, target, seq_length, training):
        y = self.call(inputs, seq_length, training)
        y_pred = tf.argmax(y, axis=1)
        correct = tf.where(tf.equal(y_pred, target)).numpy().shape[0]
        total = target.numpy().shape[0]
        return correct / total

    def fit(self, training_data, eval_data, pbar, num_epochs=100,
            early_stopping_rounds=5, verbose=1, train_from_scratch=True):
        """Train the model."""
        if train_from_scratch is False:
            self.restore_model()

        # Initialize best loss. This variable will store the lowest loss on the
        # eval dataset.
        best_loss = 2018

        # Initialize lists to track the mean loss of train and eval
        train_loss = []
        eval_loss = []
        train_accuracy = []
        eval_accuracy = []

        # Initialize dictionary to store the loss history
        self.history['train_loss'] = []
        self.history['eval_loss'] = []
        self.history['train_accuracy'] = []
        self.history['eval_accuracy'] = []

        count = early_stopping_rounds

        # Begin training
        for i in range(num_epochs):
            # Re-create the optimizer at the start of every epoch, deciding
            # whether to apply learning-rate decay
            learning_rate = lr_update(i + 1, mode=config.lr_mode)
            optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

            # Training with gradient descent
            start = time.time()
            for index, (sequence, label, seq_length) in enumerate(training_data):
                # On CPU the types must be cast, otherwise it raises:
                # "Could not find valid device"
                sequence = tf.cast(sequence, dtype=tf.float32)
                label = tf.cast(label, dtype=tf.int64)
                grads = self.grads_fn(sequence, label, seq_length, training=True)
                optimizer.apply_gradients(zip(grads, self.variables))
                pbar.show(index, use_time=time.time() - start)

            # Compute the loss on the training data after one epoch
            for sequence, label, seq_length in training_data:
                sequence = tf.cast(sequence, dtype=tf.float32)
                label = tf.cast(label, dtype=tf.int64)
                train_los = self.loss_fn(sequence, label, seq_length, training=False)
                train_acc = self.get_accuracy(sequence, label, seq_length, training=False)
                train_loss.append(train_los)
                train_accuracy.append(train_acc)
            self.history['train_loss'].append(np.mean(train_loss))
            self.history['train_accuracy'].append(np.mean(train_accuracy))

            # Compute the loss on the eval data after one epoch
            for sequence, label, seq_length in eval_data:
                sequence = tf.cast(sequence, dtype=tf.float32)
                label = tf.cast(label, dtype=tf.int64)
                eval_los = self.loss_fn(sequence, label, seq_length, training=False)
                eval_acc = self.get_accuracy(sequence, label, seq_length, training=False)
                eval_loss.append(eval_los)
                eval_accuracy.append(eval_acc)
            self.history['eval_loss'].append(np.mean(eval_loss))
            self.history['eval_accuracy'].append(np.mean(eval_accuracy))

            # Print train and eval losses
            if (i == 0) | ((i + 1) % verbose == 0):
                print('Epoch %d - train_loss: %4f - eval_loss: %4f - train_acc:%4f - eval_acc:%4f'
                      % (i + 1,
                         self.history['train_loss'][-1],
                         self.history['eval_loss'][-1],
                         self.history['train_accuracy'][-1],
                         self.history['eval_accuracy'][-1]))

            # Check for early stopping
            if self.history['eval_loss'][-1] < best_loss:
                best_loss = self.history['eval_loss'][-1]
                count = early_stopping_rounds
            else:
                count -= 1
                if count == 0:
                    break
        # Plot the loss/accuracy curves
        loss_acc_plot(history=self.history)

pytorch

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
from sklearn.metrics import f1_score
import numpy as np

import config.config as config
from util.embedding_util import get_embedding

torch.manual_seed(2018)
torch.cuda.manual_seed(2018)
torch.cuda.manual_seed_all(2018)
np.random.seed(2018)

os.environ["CUDA_VISIBLE_DEVICES"] = "1"


class RNN(nn.Module):
    def __init__(self, vocab_size,
                 word_embedding_dimension,
                 hidden_size, bi_flag,
                 num_layer,
                 labels,
                 cell_type,
                 dropout,
                 checkpoint_dir):
        super(RNN, self).__init__()
        self.labels = labels
        self.num_label = len(labels)
        self.num_layer = num_layer
        self.hidden_size = hidden_size
        self.dropout = dropout
        self.checkpoint_dir = checkpoint_dir

        if torch.cuda.is_available():
            self.device = torch.device("cuda")

        self.embedding = nn.Embedding(vocab_size, word_embedding_dimension)
        for p in self.embedding.parameters():
            p.requires_grad = False
        self.embedding.weight.data.copy_(torch.from_numpy(get_embedding(vocab_size, word_embedding_dimension)))

        if cell_type == "LSTM":
            self.rnn_cell = nn.LSTM(input_size=word_embedding_dimension,
                                    hidden_size=hidden_size,
                                    num_layers=num_layer,
                                    batch_first=True,
                                    dropout=dropout,
                                    bidirectional=bi_flag)
        elif cell_type == "GRU":
            # note: nn.GRU takes/returns a single hidden state (no cell state);
            # forward() below assumes the LSTM interface
            self.rnn_cell = nn.GRU(input_size=word_embedding_dimension,
                                   hidden_size=hidden_size,
                                   num_layers=num_layer,
                                   batch_first=True,
                                   dropout=dropout,
                                   bidirectional=bi_flag)
        else:
            raise TypeError("RNN: Unknown rnn cell type")

        # bidirectional or not
        self.bi_num = 2 if bi_flag else 1

        self.linear = nn.Linear(hidden_size * self.bi_num, self.num_label)

    def forward(self, inputs, length):
        batch_size = inputs.shape[0]
        # initial hidden state h and cell state c, zeros by default
        h_0 = torch.zeros(self.num_layer * self.bi_num, batch_size, self.hidden_size).float()
        c_0 = torch.zeros(self.num_layer * self.bi_num, batch_size, self.hidden_size).float()

        embeddings = self.embedding(inputs)  # (batch_size, time_steps, embedding_dim)
        # drop the padding elements
        # embeddings_packed.data: (sum(length), embedding_dim)
        embeddings_packed = pack_padded_sequence(embeddings, length, batch_first=True)
        output, (h_n, c_n) = self.rnn_cell(embeddings_packed, (h_0, c_0))
        # padded_output: (batch_size, time_steps, hidden_size * bi_num)
        # h_n | c_n: (num_layer * bi_num, batch_size, hidden_size)
        padded_output, _ = pad_packed_sequence(output, batch_first=True)
        # take the last valid output of each sequence as the final output
        # (positions beyond the true length are 0)
        last_output = padded_output[torch.LongTensor(range(batch_size)), length - 1]
        last_output = F.dropout(last_output, p=self.dropout, training=self.training)
        output = self.linear(last_output)
        return output

    def load(self):
        self.load_state_dict(torch.load(self.checkpoint_dir))

    def save(self):
        torch.save(self.state_dict(), self.checkpoint_dir)

    def evaluate(self, y_pred, y_true):
        _, y_pred = torch.max(y_pred.data, 1)
        if config.use_cuda:
            y_true = y_true.cpu().numpy()
            y_pred = y_pred.cpu().numpy()
        else:
            y_true = y_true.numpy()
            y_pred = y_pred.numpy()
        f1 = f1_score(y_true, y_pred, labels=self.labels, average="macro")
        correct = np.sum((y_true == y_pred).astype(int))
        acc = correct / y_pred.shape[0]
        return (acc, f1)
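
To close, a hypothetical smoke test of the RNN class above. Every size here is made up, and it assumes the author's config and util.embedding_util modules are importable (get_embedding must return a (vocab_size, word_embedding_dimension) numpy array):

# hypothetical smoke test; all sizes are illustrative values, not from the original post
model = RNN(vocab_size=5000, word_embedding_dimension=100,
            hidden_size=128, bi_flag=True, num_layer=1,
            labels=[0, 1], cell_type="LSTM", dropout=0.5,
            checkpoint_dir="checkpoints/rnn.pt")

# a padded batch of 3 token-id sequences, already sorted by length (7, 5, 2)
inputs = torch.zeros(3, 7, dtype=torch.long)
inputs[0, :7] = torch.randint(1, 5000, (7,))
inputs[1, :5] = torch.randint(1, 5000, (5,))
inputs[2, :2] = torch.randint(1, 5000, (2,))
length = torch.LongTensor([7, 5, 2])

logits = model(inputs, length)    # (3, len(labels))
print(logits.shape)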

Good night, everyone~