An application built on the SMP2018 Chinese human-machine dialogue model. Put simply, it classifies a user's query into an intent domain. Sample results:
|   | query | label |
|---|-------|-------|
| 0 | 今天东莞天气如何 | weather |
| 1 | 从观音桥到重庆市图书馆怎么走 | map |
| 2 | 鸭蛋怎么腌? | cookbook |
| 3 | 怎么治疗牛皮癣 | health |
| 4 | 唠什么 | chat |
## User intent domain classification

In a deployed human-machine dialogue system, a user may express many kinds of intent, each triggering a different domain of the system: task-oriented vertical domains (such as querying flights, hotels, or buses), knowledge-based question answering, and chit-chat. A key task of such a system is therefore to classify the user's input into the correct domain, since only then can it return the right kind of response.
## Category description

The categories comprise two top-level classes, chit-chat and vertical domains, with the vertical domains further divided into 30 subdomains (31 labels in total).

Evaluation Task 1 only considers domain classification of user intent in single-turn dialogues; classifying the overall intent of multi-turn dialogues is out of scope. The label set is:

```python
类别 = ['website', 'tvchannel', 'lottery', 'chat', 'match', 'datetime',
       'weather', 'bus', 'novel', 'video', 'riddle',
       'calc', 'telephone', 'health', 'contacts', 'epg', 'app', 'music',
       'cookbook', 'stock', 'map', 'message', 'poetry', 'cinemas', 'news',
       'flight', 'translation', 'train', 'schedule', 'radio', 'email']
```
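A quick sanity check on the label set (a one-line sketch using the 类别 list above):

```python
assert len(类别) == 31  # 'chat' plus 30 vertical domains
```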
## Getting started

```python
from app import query_2_label
```

```
Using TensorFlow backend.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.945 seconds.
Prefix dict has been built succesfully.
```

Calling `query_2_label` on a query returns the predicted label:

```
'chat'
```
Run the code below to query interactively; enter `0` to stop:

```python
while True:
    your_query_sentence = input()
    if your_query_sentence == '0':  # enter 0 to stop
        break
    print('-' * 10)
    label = query_2_label(your_query_sentence)
    print('predict label:\t', label)
    print('-' * 10)
```
```
今天东莞天气如何
----------
predict label: datetime
----------
怎么治疗感冒?
----------
predict label: health
----------
你好?
----------
predict label: chat
----------
```
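For reference, `query_2_label` presumably wraps the same preprocessing and model built in the experiment below. A minimal sketch under that assumption (the function name and parameters here are hypothetical; the actual `app` module may differ):

```python
import numpy as np
import jieba
from keras.preprocessing.sequence import pad_sequences

def query_2_label_sketch(query, model, tokenizer, label_tokenizer, max_len=26):
    """Hypothetical re-implementation: segment, index, pad, predict, decode."""
    cut_query = [list(jieba.cut(query))]
    x = pad_sequences(tokenizer.texts_to_sequences(cut_query), max_len)
    class_id = int(np.argmax(model.predict(x), axis=1)[0])
    # class ids are the label tokenizer's 1-based indices shifted down by 1
    return label_tokenizer.index_word[class_id + 1]
```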
---

## The experiment

Below is a complete experiment on the SMP2018 Chinese Human-Computer Dialogue Technology Evaluation (ECDT). The baseline model trained here reaches roughly top-3 level on the evaluation leaderboard.

The experiment walks through a general approach to processing natural-language text data.

You are encouraged to modify this file to get better results, for example by changing the following hyperparameters:
```python
embedding_word_dims = 32
batch_size = 30
epochs = 20
```
Other ways this experiment could be improved:

- use a different word segmentation tool in preprocessing
- combine character vectors with word vectors
- use pre-trained word vectors (a sketch follows this list)
- change the model architecture
- change the model hyperparameters
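For instance, pre-trained vectors could be loaded into the Embedding layer's weight matrix. A minimal sketch, assuming a hypothetical word2vec-format text file `zh_vectors.txt` and the `tokenizer` fitted later in this experiment:

```python
import numpy as np

def build_embedding_matrix(tokenizer, dims=32, path='zh_vectors.txt'):
    """Fill rows of a (vocab + 1, dims) matrix with pre-trained vectors.

    Row 0 stays zero because Keras Tokenizer indices start at 1;
    words missing from the vector file keep their random rows.
    """
    vocab_size = len(tokenizer.index_word) + 1
    matrix = np.random.normal(scale=0.1, size=(vocab_size, dims))
    matrix[0] = 0.0
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vector = parts[0], parts[1:]
            if len(vector) == dims and word in tokenizer.word_index:
                matrix[tokenizer.word_index[word]] = np.asarray(vector, dtype='float32')
    return matrix

# Hypothetical usage: pass the matrix to the Embedding layer via
# Embedding(input_dim=vocab_size, output_dim=dims, weights=[embedding_matrix], ...)
```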
## Import dependencies

```python
import numpy as np
import pandas as pd
import collections
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.utils import to_categorical, plot_model
from keras.callbacks import TensorBoard, Callback
from sklearn.metrics import classification_report
import requests
import time
import os
```
```
Using TensorFlow backend.
```
## Helper functions

```python
from keras import backend as K

def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        return true_positives / (possible_positives + K.epsilon())

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        return true_positives / (predicted_positives + K.epsilon())

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))
```
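A quick sanity check of the metric (a sketch, assuming the old graph-mode `keras.backend` API with `K.variable`/`K.eval`):

```python
import numpy as np

y_true = K.variable(np.array([[0., 1., 0.], [1., 0., 0.]]))
y_pred = K.variable(np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]))
print(K.eval(f1(y_true, y_pred)))  # both predictions round to the true class, so ~1.0
```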
```python
def get_customization_time():
    time_tuple = time.localtime(time.time())
    customization_time = "{}_{}_{}_{}_{}_{}".format(
        time_tuple[0], time_tuple[1], time_tuple[2],
        time_tuple[3], time_tuple[4], time_tuple[5])
    return customization_time
```
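This helper produces timestamps such as `2018_10_1_12_30_5`; presumably it is meant for naming TensorBoard log directories (the `TensorBoard` callback is imported above). A hypothetical usage:

```python
# Hypothetical: give each training run its own timestamped log directory
tensorboard_callback = TensorBoard(log_dir='./logs/run_{}'.format(get_customization_time()))
# model.fit(..., callbacks=[tensorboard_callback])
```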
## Prepare the data

```python
raw_train_data_url = "https://worksheets.codalab.org/rest/bundles/0x0161fd2fb40d4dd48541c2643d04b0b8/contents/blob/"
raw_test_data_url = "https://worksheets.codalab.org/rest/bundles/0x1f96bc12222641209ad057e762910252/contents/blob/"

if (not os.path.exists('./data/train.json')) or (not os.path.exists('./data/dev.json')):
    raw_train = requests.get(raw_train_data_url)
    raw_test = requests.get(raw_test_data_url)
    if not os.path.exists('./data'):
        os.makedirs('./data')
    with open("./data/train.json", "wb") as code:
        code.write(raw_train.content)
    with open("./data/dev.json", "wb") as code:
        code.write(raw_test.content)
```
```python
def get_json_data(path):
    """Load the JSON file into a DataFrame with one row per example."""
    data_df = pd.read_json(path)
    data_df = data_df.transpose()
    data_df = data_df[['query', 'label']]
    return data_df
```
```python
train_data_df = get_json_data(path="data/train.json")
test_data_df = get_json_data(path="data/dev.json")
```
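Judging from the `transpose()` in the loader, the raw JSON presumably keys each example by its id. A small check (the printed example matches row 0 of the table below; exact key order may differ):

```python
import json

with open('data/train.json', encoding='utf-8') as f:
    raw = json.load(f)
print(raw['0'])  # e.g. {'query': '今天东莞天气如何', 'label': 'weather'}
```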
The first rows of the training DataFrame:

|   | query | label |
|---|-------|-------|
| 0 | 今天东莞天气如何 | weather |
| 1 | 从观音桥到重庆市图书馆怎么走 | map |
| 2 | 鸭蛋怎么腌? | cookbook |
| 3 | 怎么治疗牛皮癣 | health |
| 4 | 唠什么 | chat |
## Jieba word segmentation

An example; below, jieba is used to segment the raw queries:

```python
seg_list = jieba.cut("他来到了网易杭研大厦")
print(list(seg_list))
```
```
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 1.022 seconds.
Prefix dict has been built succesfully.
['他', '来到', '了', '网易', '杭研', '大厦']
```
## Serialization

```python
def use_jieba_cut(a_sentence):
    return list(jieba.cut(a_sentence))

train_data_df['cut_query'] = train_data_df['query'].apply(use_jieba_cut)
test_data_df['cut_query'] = test_data_df['query'].apply(use_jieba_cut)
```
The first ten rows of the training data after segmentation:

|   | query | label | cut_query |
|---|-------|-------|-----------|
| 0 | 今天东莞天气如何 | weather | [今天, 东莞, 天气, 如何] |
| 1 | 从观音桥到重庆市图书馆怎么走 | map | [从, 观音桥, 到, 重庆市, 图书馆, 怎么, 走] |
| 2 | 鸭蛋怎么腌? | cookbook | [鸭蛋, 怎么, 腌, ?] |
| 3 | 怎么治疗牛皮癣 | health | [怎么, 治疗, 牛皮癣] |
| 4 | 唠什么 | chat | [唠, 什么] |
| 5 | 阳澄湖大闸蟹的做法。 | cookbook | [阳澄湖, 大闸蟹, 的, 做法, 。] |
| 6 | 昆山大润发在哪里 | map | [昆山, 大润发, 在, 哪里] |
| 7 | 红烧肉怎么做?嗯? | cookbook | [红烧肉, 怎么, 做, ?, 嗯, ?] |
| 8 | 南京到厦门的火车票 | train | [南京, 到, 厦门, 的, 火车票] |
| 9 | 6的平方 | calc | [6, 的, 平方] |
## Feature processing

```python
tokenizer = Tokenizer()  # word-level tokenizer over the segmented queries
tokenizer.fit_on_texts(train_data_df['cut_query'])
```
```python
max_features = len(tokenizer.index_word)
len(tokenizer.index_word)
```
```
2883
```
```python
x_train = tokenizer.texts_to_sequences(train_data_df['cut_query'])
x_test = tokenizer.texts_to_sequences(test_data_df['cut_query'])
```

```python
max_cut_query_lenth = 26
```

```python
x_train = pad_sequences(x_train, max_cut_query_lenth)
x_test = pad_sequences(x_test, max_cut_query_lenth)
```
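To see what these two steps do, here is a self-contained toy example (indices depend on word frequency, most frequent first):

```python
toy = Tokenizer()
toy.fit_on_texts([['今天', '天气'], ['天气', '如何']])
print(toy.texts_to_sequences([['今天', '天气', '如何']]))  # [[2, 1, 3]]: '天气' is most frequent
print(pad_sequences([[2, 1, 3]], 5))                        # [[0 0 2 1 3]]: zero-padded on the left
```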
`x_train` and `x_test` now have shapes:

```
(2299, 26)
(770, 26)
```
## Label processing

```python
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(train_data_df['label'])
```

```python
label_numbers = len(label_tokenizer.word_counts)
NUM_CLASSES = len(label_tokenizer.word_counts)
```

```python
label_tokenizer.word_counts
```
```
OrderedDict([('weather', 66),
('map', 68),
('cookbook', 269),
('health', 55),
('chat', 455),
('train', 70),
('calc', 24),
('translation', 61),
('music', 66),
('tvchannel', 71),
('poetry', 102),
('telephone', 63),
('stock', 71),
('radio', 24),
('contacts', 30),
('lottery', 24),
('website', 54),
('video', 182),
('news', 58),
('bus', 24),
('app', 53),
('flight', 62),
('epg', 107),
('message', 63),
('match', 24),
('schedule', 29),
('novel', 24),
('riddle', 34),
('email', 24),
('datetime', 18),
('cinemas', 24)])
```
```python
y_train = label_tokenizer.texts_to_sequences(train_data_df['label'])
```

The first few entries:

```
[[10], [9], [2], [17], [1], [2], [9], [2], [8], [23]]
```

Keras `Tokenizer` indices start at 1, so shift each label down by one to get 0-based class ids:

```python
y_train = [[y[0] - 1] for y in y_train]
```

```
[[9], [8], [1], [16], [0], [1], [8], [1], [7], [22]]
```
```python
y_train = to_categorical(y_train, label_numbers)
y_train.shape
```

```
(2299, 31)
```
```python
y_test = label_tokenizer.texts_to_sequences(test_data_df['label'])
y_test = [y[0] - 1 for y in y_test]
y_test = to_categorical(y_test, label_numbers)
y_test.shape
```

```
(770, 31)
```

A single one-hot label row looks like this:

```
array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)
```
## Model design

```python
def create_SMP2018_lstm_model(max_features, max_cut_query_lenth, label_numbers):
    model = Sequential()
    model.add(Embedding(input_dim=max_features + 1,  # +1 because index 0 is reserved for padding
                        output_dim=32,
                        input_length=max_cut_query_lenth))
    model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(label_numbers, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=[f1])
    plot_model(model, to_file='SMP2018_lstm_model.png', show_shapes=True)
    return model
```
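For reference, the tensor shapes through the model, derived from the hyperparameters above:

```python
# (batch, 26) integer word indices
#   -> Embedding : (batch, 26, 32)
#   -> LSTM      : (batch, 64)   (last hidden state only)
#   -> Dense     : (batch, 31)   softmax over the domains
```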
## Model training

```python
if 'max_features' not in dir():
    max_features = 2888
    print('max_features not found, using default value:\t{}'.format(max_features))
if 'max_cut_query_lenth' not in dir():
    max_cut_query_lenth = 26
    print('max_cut_query_lenth not found, using default value:\t{}'.format(max_cut_query_lenth))
if 'label_numbers' not in dir():
    label_numbers = 31
    print('label_numbers not found, using default value:\t{}'.format(label_numbers))
```
```python
model = create_SMP2018_lstm_model(max_features, max_cut_query_lenth, label_numbers)
```

```python
batch_size = 20
epochs = 30
```
```python
print(x_train.shape, y_train.shape)
```

```
(2299, 26) (2299, 31)
```

```python
print(x_test.shape, y_test.shape)
```

```
(770, 26) (770, 31)
```
```python
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs)
```

```
Train...
Epoch 1/30
2299/2299 [==============================] - 16s 7ms/step - loss: 3.0916 - f1: 0.0000e+00
Epoch 2/30
2299/2299 [==============================] - 14s 6ms/step - loss: 2.6594 - f1: 0.1409
Epoch 3/30
2299/2299 [==============================] - 13s 6ms/step - loss: 2.0817 - f1: 0.4055
Epoch 4/30
2299/2299 [==============================] - 14s 6ms/step - loss: 1.6032 - f1: 0.4689
Epoch 5/30
2299/2299 [==============================] - 14s 6ms/step - loss: 1.1318 - f1: 0.6176
Epoch 6/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.8090 - f1: 0.7399
Epoch 7/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.5704 - f1: 0.8298
Epoch 8/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.4051 - f1: 0.8879
Epoch 9/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.3002 - f1: 0.9280
Epoch 10/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.2317 - f1: 0.9467
Epoch 11/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.1755 - f1: 0.9678
Epoch 12/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.1391 - f1: 0.9758
Epoch 13/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.1131 - f1: 0.9800
Epoch 14/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0883 - f1: 0.9861
Epoch 15/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0725 - f1: 0.9894
Epoch 16/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0615 - f1: 0.9929
Epoch 17/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0507 - f1: 0.9945
Epoch 18/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0455 - f1: 0.9963
Epoch 19/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0398 - f1: 0.9960
Epoch 20/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0313 - f1: 0.9978
Epoch 21/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0266 - f1: 0.9984
Epoch 22/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0279 - f1: 0.9965
Epoch 23/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0250 - f1: 0.9976
Epoch 24/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0219 - f1: 0.9982
Epoch 25/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0195 - f1: 0.9982
Epoch 26/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0179 - f1: 0.9989
Epoch 27/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0177 - f1: 0.9974
Epoch 28/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0139 - f1: 0.9987
Epoch 29/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0139 - f1: 0.9989
Epoch 30/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0129 - f1: 0.9987
<keras.callbacks.History at 0x7f84e87c5f28>
```
## Model evaluation

```python
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size,
                       verbose=1)
print('Test score:', score[0])
print('Test f1:', score[1])
```

```
770/770 [==============================] - 1s 1ms/step
Test score: 0.6803552009068526
Test f1: 0.8464262740952628
```
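To reuse the trained model in an application like the one at the top of this post, it could be persisted and reloaded (a sketch; the file name is hypothetical, and the custom `f1` metric must be supplied when loading):

```python
model.save('SMP2018_lstm_model.h5')  # hypothetical file name

# Later, e.g. inside the app module:
from keras.models import load_model
model = load_model('SMP2018_lstm_model.h5', custom_objects={'f1': f1})
```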
```python
y_hat_test = model.predict(x_test)
```

The predictions have shape:

```
(770, 31)
```
Convert the one-hot tensors to their corresponding integer class ids:

```python
y_pred = np.argmax(y_hat_test, axis=1).tolist()
y_true = np.argmax(y_test, axis=1).tolist()
```
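A tiny self-contained illustration of this conversion:

```python
print(np.argmax([[0, 0, 1], [1, 0, 0]], axis=1).tolist())  # [2, 0]
```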
View the per-class precision, recall, and F1 scores for the multi-class task:

```python
print(classification_report(y_true, y_pred))
```

```
precision recall f1-score support
0 0.78 0.93 0.85 154
1 0.92 0.97 0.95 89
2 0.67 0.62 0.64 60
3 0.83 0.83 0.83 36
4 0.79 1.00 0.88 34
5 0.83 0.65 0.73 23
6 1.00 0.83 0.91 24
7 1.00 1.00 1.00 24
8 0.68 0.65 0.67 23
9 0.90 0.86 0.88 22
10 0.85 0.50 0.63 22
11 0.88 1.00 0.93 21
12 1.00 0.90 0.95 21
13 0.91 0.95 0.93 21
14 1.00 0.95 0.98 21
15 0.79 0.95 0.86 20
16 0.90 0.47 0.62 19
17 0.79 0.61 0.69 18
18 0.63 0.67 0.65 18
19 0.90 0.82 0.86 11
20 1.00 0.70 0.82 10
21 1.00 0.67 0.80 9
22 1.00 0.88 0.93 8
23 1.00 0.62 0.77 8
24 1.00 1.00 1.00 8
25 1.00 0.88 0.93 8
26 0.88 0.88 0.88 8
27 0.86 0.75 0.80 8
28 1.00 1.00 1.00 8
29 0.75 0.75 0.75 8
30 0.75 1.00 0.86 6
micro avg 0.84 0.84 0.84 770
macro avg 0.88 0.82 0.84 770
weighted avg 0.85 0.84 0.84 770
```
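The integer class ids can be mapped back to label names via the label tokenizer (a sketch, assuming a Keras version whose `Tokenizer` exposes `index_word`):

```python
# Class id k corresponds to tokenizer index k + 1 (labels were shifted down by 1 above)
target_names = [label_tokenizer.index_word[i + 1] for i in range(label_numbers)]
print(classification_report(y_true, y_pred, target_names=target_names))
```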