
SMP2018 Chinese Human-Computer Dialogue Technology Evaluation (ECDT)

Click here to view the complete SMP2018 code and walkthrough on GitHub

An application built on the SMP2018 Chinese human-computer dialogue model.

Simply put, it classifies a user's query into a domain. Example results:

query label
0 今天东莞天气如何 weather
1 从观音桥到重庆市图书馆怎么走 map
2 鸭蛋怎么腌? cookbook
3 怎么治疗牛皮癣 health
4 唠什么 chat

User Intent Domain Classification

  In a human-computer dialogue system, a user's input may carry many different intents, each of which triggers a different domain in the system, including task-oriented vertical domains (such as querying flights, hotels, or bus routes), knowledge-based question answering, and chit-chat. A key task of a dialogue system is therefore to correctly classify the user's input into the corresponding domain so that a correct response can be returned.

Description of the categories

  • Two top-level categories, chit-chat and vertical, where the vertical category is further divided into 30 vertical domains.
  • Task 1 of this evaluation only considers domain classification of user intent in single-turn dialogues; domain classification of the overall intent of multi-turn dialogues is out of scope.
    categories = ['website', 'tvchannel', 'lottery', 'chat', 'match',
        'datetime', 'weather', 'bus', 'novel', 'video', 'riddle',
        'calc', 'telephone', 'health', 'contacts', 'epg', 'app', 'music',
        'cookbook', 'stock', 'map', 'message', 'poetry', 'cinemas', 'news',
        'flight', 'translation', 'train', 'schedule', 'radio', 'email']
    

Getting Started

from app import query_2_label
Using TensorFlow backend.
query_2_label('我喜欢你')
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.945 seconds.
Prefix dict has been built succesfully.





'chat'
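query_2_label is imported from the companion app module; its exact implementation lives in the GitHub repository linked above, but it presumably wraps the same pipeline built in the experiment below. A hypothetical sketch (every name here is assumed to come from that experiment; this is not the actual implementation):

# Hypothetical sketch of what app.query_2_label might do, assuming the fitted
# tokenizer, trained model, label_tokenizer and max_cut_query_lenth from the
# experiment below are available.
def query_2_label_sketch(query_sentence):
    tokens = list(jieba.cut(query_sentence))                  # segment the query with jieba
    seq = tokenizer.texts_to_sequences([tokens])              # map words to integer ids
    seq = pad_sequences(seq, max_cut_query_lenth)             # pad to the fixed length
    class_id = int(np.argmax(model.predict(seq), axis=1)[0])  # most probable class id
    id_to_label = {idx - 1: label for label, idx in label_tokenizer.word_index.items()}
    return id_to_label[class_id]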

Run the code below to query interactively; enter 0 to end the session.

while True:
    your_query_sentence = input()
    print('-'*10)
    label = query_2_label(your_query_sentence)
    print('predict label:\t', label)
    print('-'*10)
    if your_query_sentence=='0':
        break
 今天东莞天气如何


----------
predict label:     datetime
----------


 怎么治疗感冒?


----------
predict label:     health
----------


 你好?


----------
predict label:     chat
----------

Divider between the application demo and the experiment section


  1. Below is a complete experiment for the SMP2018 Chinese Human-Computer Dialogue Technology Evaluation (ECDT); the baseline model trained here reaches roughly top-three performance on the evaluation leaderboard.
  2. The experiment demonstrates a general workflow for processing natural-language text data.
  3. You are encouraged to modify this file to get better results, for example by changing the following hyperparameters:
# dimensionality of the word embeddings
embedding_word_dims = 32
# batch size
batch_size = 30
# number of training epochs
epochs = 20

Examples of possible improvements to this experiment

  1. Use a different word segmentation tool in the preprocessing stage
  2. Combine character vectors with word vectors
  3. Use pre-trained word vectors (see the sketch after this list)
  4. Change the model architecture
  5. Change the model hyperparameters
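As an illustration of item 3, here is a minimal, hypothetical sketch of initializing the Embedding layer with pre-trained vectors. word_vectors is an assumed dict mapping each word to a NumPy vector of size embedding_word_dims (for example loaded from a word2vec or GloVe file); tokenizer, max_features and max_cut_query_lenth are the objects fitted later in this experiment.

# Hypothetical sketch: build an embedding matrix from pre-trained vectors
# (word_vectors is an assumed dict: word -> np.ndarray of length embedding_word_dims).
embedding_matrix = np.zeros((max_features + 1, embedding_word_dims))
for word, index in tokenizer.word_index.items():
    if word in word_vectors:
        embedding_matrix[index] = word_vectors[word]

# Pass the matrix as initial weights; trainable=False keeps the vectors fixed:
# model.add(Embedding(input_dim=max_features + 1, output_dim=embedding_word_dims,
#                     weights=[embedding_matrix], input_length=max_cut_query_lenth,
#                     trainable=False))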

Import dependencies

import numpy as np
import pandas as pd
import collections
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.utils import to_categorical,plot_model
from keras.callbacks import TensorBoard, Callback

from sklearn.metrics import classification_report

import requests

import time

import os
Using TensorFlow backend.

Helper functions

from keras import backend as K

# Function that computes a (batch-wise) F1 score
def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
# Return a timestamp string in a custom format
def get_customization_time():
    # e.g. '2018_10_10_18_11_45' (year_month_day_hour_minute_second)
    time_tuple = time.localtime(time.time())
    customization_time = "{}_{}_{}_{}_{}_{}".format(time_tuple[0], time_tuple[1], time_tuple[2], time_tuple[3], time_tuple[4], time_tuple[5])
    return customization_time

Prepare the data

Download the official SMP2018 data

raw_train_data_url = "https://worksheets.codalab.org/rest/bundles/0x0161fd2fb40d4dd48541c2643d04b0b8/contents/blob/"
raw_test_data_url = "https://worksheets.codalab.org/rest/bundles/0x1f96bc12222641209ad057e762910252/contents/blob/"

# Download the SMP2018 data if it is not already present
if (not os.path.exists('./data/train.json')) or (not os.path.exists('./data/dev.json')):
    raw_train = requests.get(raw_train_data_url)
    raw_test = requests.get(raw_test_data_url)
    if not os.path.exists('./data'):
        os.makedirs('./data')
    with open("./data/train.json", "wb") as code:
        code.write(raw_train.content)
    with open("./data/dev.json", "wb") as code:
        code.write(raw_test.content)
def get_json_data(path):
    # read the data
    data_df = pd.read_json(path)
    # swap rows and columns
    data_df = data_df.transpose()
    # reorder the columns
    data_df = data_df[['query', 'label']]
    return data_df
train_data_df = get_json_data(path="data/train.json")

test_data_df = get_json_data(path="data/dev.json")
train_data_df.head()
query label
0 今天东莞天气如何 weather
1 从观音桥到重庆市图书馆怎么走 map
2 鸭蛋怎么腌? cookbook
3 怎么治疗牛皮癣 health
4 唠什么 chat

Example of jieba word segmentation; below, jieba is used to segment the raw data

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(list(seg_list))
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 1.022 seconds.
Prefix dict has been built succesfully.


['他', '来到', '了', '网易', '杭研', '大厦']

Word segmentation

def use_jieba_cut(a_sentence):
    return list(jieba.cut(a_sentence))

train_data_df['cut_query'] = train_data_df['query'].apply(use_jieba_cut)
test_data_df['cut_query'] = test_data_df['query'].apply(use_jieba_cut)
train_data_df.head(10)
query label cut_query
0 今天东莞天气如何 weather [今天, 东莞, 天气, 如何]
1 从观音桥到重庆市图书馆怎么走 map [从, 观音桥, 到, 重庆市, 图书馆, 怎么, 走]
2 鸭蛋怎么腌? cookbook [鸭蛋, 怎么, 腌, ?]
3 怎么治疗牛皮癣 health [怎么, 治疗, 牛皮癣]
4 唠什么 chat [唠, 什么]
5 阳澄湖大闸蟹的做法。 cookbook [阳澄湖, 大闸蟹, 的, 做法, 。]
6 昆山大润发在哪里 map [昆山, 大润发, 在, 哪里]
7 红烧肉怎么做?嗯? cookbook [红烧肉, 怎么, 做, ?, 嗯, ?]
8 南京到厦门的火车票 train [南京, 到, 厦门, 的, 火车票]
9 6的平方 calc [6, 的, 平方]

Process the features

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data_df['cut_query'])
max_features = len(tokenizer.index_word)

len(tokenizer.index_word)
2883
x_train = tokenizer.texts_to_sequences(train_data_df['cut_query'])

x_test = tokenizer.texts_to_sequences(test_data_df['cut_query'])
max_cut_query_lenth = 26
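The padding length 26 is presumably the length of the longest segmented query in the training set; a quick sketch to check that assumption:

# Sketch: confirm that the hard-coded 26 matches the longest segmented training query
# (an assumption about where the value comes from).
print(train_data_df['cut_query'].apply(len).max())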
x_train = pad_sequences(x_train, max_cut_query_lenth)

x_test = pad_sequences(x_test, max_cut_query_lenth)
x_train.shape
(2299, 26)
x_test.shape
(770, 26)

Process the labels

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(train_data_df['label'])
label_numbers = len(label_tokenizer.word_counts)
NUM_CLASSES = len(label_tokenizer.word_counts)
label_tokenizer.word_counts
OrderedDict([('weather', 66),
             ('map', 68),
             ('cookbook', 269),
             ('health', 55),
             ('chat', 455),
             ('train', 70),
             ('calc', 24),
             ('translation', 61),
             ('music', 66),
             ('tvchannel', 71),
             ('poetry', 102),
             ('telephone', 63),
             ('stock', 71),
             ('radio', 24),
             ('contacts', 30),
             ('lottery', 24),
             ('website', 54),
             ('video', 182),
             ('news', 58),
             ('bus', 24),
             ('app', 53),
             ('flight', 62),
             ('epg', 107),
             ('message', 63),
             ('match', 24),
             ('schedule', 29),
             ('novel', 24),
             ('riddle', 34),
             ('email', 24),
             ('datetime', 18),
             ('cinemas', 24)])
y_train = label_tokenizer.texts_to_sequences(train_data_df['label'])
y_train[:10]
[[10], [9], [2], [17], [1], [2], [9], [2], [8], [23]]
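Keras Tokenizer indices start at 1, so the next cell shifts every label id down by 1 to make the classes 0-based before one-hot encoding. A sketch to inspect the resulting id-to-label mapping (id_to_label is just an illustrative name, not part of the original code):

# Sketch: show which 0-based integer each label name maps to after the -1 shift.
id_to_label = {idx - 1: label for label, idx in label_tokenizer.word_index.items()}
print(sorted(id_to_label.items())[:5])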
y_train = [[y[0]-1] for y in y_train]
y_train[:10]
[[9], [8], [1], [16], [0], [1], [8], [1], [7], [22]]
y_train = to_categorical(y_train, label_numbers)
y_train.shape
(2299, 31)
y_test = label_tokenizer.texts_to_sequences(test_data_df['label'])
y_test = [y[0]-1 for y in y_test]
y_test = to_categorical(y_test, label_numbers)
y_test.shape
(770, 31)
y_test[0]
array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)

Design the model

def create_SMP2018_lstm_model(max_features, max_cut_query_lenth, label_numbers):
    model = Sequential()
    model.add(Embedding(input_dim=max_features+1, output_dim=32, input_length=max_cut_query_lenth))
    model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(label_numbers, activation='softmax'))
    # try using different optimizers and different optimizer configs
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=[f1])

    plot_model(model, to_file='SMP2018_lstm_model.png', show_shapes=True)

    return model

Train the model

if 'max_features' not in dir():
    max_features = 2888
    print('max_features variable not found, using default value:\t{}'.format(max_features))
if 'max_cut_query_lenth' not in dir():
    max_cut_query_lenth = 26
    print('max_cut_query_lenth variable not found, using default value:\t{}'.format(max_cut_query_lenth))
if 'label_numbers' not in dir():
    label_numbers = 31
    print('label_numbers variable not found, using default value:\t{}'.format(label_numbers))
model = create_SMP2018_lstm_model(max_features, max_cut_query_lenth, label_numbers)
batch_size = 20
epochs = 30
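TensorBoard is imported above but never used in this run; as an optional extension, here is a minimal sketch of attaching a TensorBoard callback to the fit call below (the ./logs directory is an assumption, not part of the original experiment):

# Optional sketch: log training curves to TensorBoard.
log_dir = os.path.join('./logs', get_customization_time())
tensorboard_callback = TensorBoard(log_dir=log_dir)
# then pass callbacks=[tensorboard_callback] to model.fit below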
print(x_train.shape, y_train.shape)
(2299, 26) (2299, 31)
print(x_test.shape, y_test.shape)
(770, 26) (770, 31)
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs)
Train...
Epoch 1/30
2299/2299 [==============================] - 16s 7ms/step - loss: 3.0916 - f1: 0.0000e+00
Epoch 2/30
2299/2299 [==============================] - 14s 6ms/step - loss: 2.6594 - f1: 0.1409
Epoch 3/30
2299/2299 [==============================] - 13s 6ms/step - loss: 2.0817 - f1: 0.4055
Epoch 4/30
2299/2299 [==============================] - 14s 6ms/step - loss: 1.6032 - f1: 0.4689
Epoch 5/30
2299/2299 [==============================] - 14s 6ms/step - loss: 1.1318 - f1: 0.6176
Epoch 6/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.8090 - f1: 0.7399
Epoch 7/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.5704 - f1: 0.8298
Epoch 8/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.4051 - f1: 0.8879
Epoch 9/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.3002 - f1: 0.9280
Epoch 10/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.2317 - f1: 0.9467
Epoch 11/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.1755 - f1: 0.9678
Epoch 12/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.1391 - f1: 0.9758
Epoch 13/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.1131 - f1: 0.9800
Epoch 14/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0883 - f1: 0.9861
Epoch 15/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0725 - f1: 0.9894
Epoch 16/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0615 - f1: 0.9929
Epoch 17/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0507 - f1: 0.9945
Epoch 18/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0455 - f1: 0.9963
Epoch 19/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0398 - f1: 0.9960
Epoch 20/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0313 - f1: 0.9978
Epoch 21/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0266 - f1: 0.9984
Epoch 22/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0279 - f1: 0.9965
Epoch 23/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0250 - f1: 0.9976
Epoch 24/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0219 - f1: 0.9982
Epoch 25/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0195 - f1: 0.9982
Epoch 26/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0179 - f1: 0.9989
Epoch 27/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0177 - f1: 0.9974
Epoch 28/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0139 - f1: 0.9987
Epoch 29/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0139 - f1: 0.9989
Epoch 30/30
2299/2299 [==============================] - 14s 6ms/step - loss: 0.0129 - f1: 0.9987





<keras.callbacks.History at 0x7f84e87c5f28>

Evaluate the model

score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)

print('Test score:', score[0])
print('Test f1:', score[1])
770/770 [==============================] - 1s 1ms/step
Test score: 0.6803552009068526
Test f1: 0.8464262740952628
y_hat_test = model.predict(x_test)
print(y_hat_test.shape)
(770, 31)

Convert the one-hot tensors back to integer labels

y_pred = np.argmax(y_hat_test, axis=1).tolist()
y_true = np.argmax(y_test, axis=1).tolist()

View per-class precision, recall, and F1 scores

print(classification_report(y_true, y_pred))
              precision    recall  f1-score   support

           0       0.78      0.93      0.85       154
           1       0.92      0.97      0.95        89
           2       0.67      0.62      0.64        60
           3       0.83      0.83      0.83        36
           4       0.79      1.00      0.88        34
           5       0.83      0.65      0.73        23
           6       1.00      0.83      0.91        24
           7       1.00      1.00      1.00        24
           8       0.68      0.65      0.67        23
           9       0.90      0.86      0.88        22
          10       0.85      0.50      0.63        22
          11       0.88      1.00      0.93        21
          12       1.00      0.90      0.95        21
          13       0.91      0.95      0.93        21
          14       1.00      0.95      0.98        21
          15       0.79      0.95      0.86        20
          16       0.90      0.47      0.62        19
          17       0.79      0.61      0.69        18
          18       0.63      0.67      0.65        18
          19       0.90      0.82      0.86        11
          20       1.00      0.70      0.82        10
          21       1.00      0.67      0.80         9
          22       1.00      0.88      0.93         8
          23       1.00      0.62      0.77         8
          24       1.00      1.00      1.00         8
          25       1.00      0.88      0.93         8
          26       0.88      0.88      0.88         8
          27       0.86      0.75      0.80         8
          28       1.00      1.00      1.00         8
          29       0.75      0.75      0.75         8
          30       0.75      1.00      0.86         6

   micro avg       0.84      0.84      0.84       770
   macro avg       0.88      0.82      0.84       770
weighted avg       0.85      0.84      0.84       770
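The rows of the report are the 0-based integer class ids; a sketch that reuses the illustrative id_to_label mapping from the label-processing section to print the report with label names instead:

# Sketch: print the classification report with label names instead of class ids.
target_names = [id_to_label[i] for i in range(label_numbers)]
print(classification_report(y_true, y_pred, target_names=target_names))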