
HuggingFace-Transformers Manual

HuggingFace-Transformers Manual = Official Links + Design Structure + Usage Tutorial + Code Analysis

Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, CTRL, ...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with over 32 pretrained models covering more than 100 languages, and supports both the TensorFlow 2.0 and PyTorch deep learning frameworks.

HuggingFace-Transformers Official Links

Design Structure

The pretrained model (TFPreTrainedModel), the model configuration (PretrainedConfig), and the tokenizer (PreTrainedTokenizer) form the core of HuggingFace-Transformers; they are responsible for the model structure and weights, the model's hyperparameters, and the preprocessing of the model's inputs, respectively.
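As a minimal sketch of how these three classes fit together (it reuses the bert-base-cased checkpoint from the tutorial below; the example sentence is arbitrary):

# Config carries hyperparameters, tokenizer turns text into ids,
# model holds the architecture and pretrained weights.
from transformers import BertConfig, BertTokenizer, TFBertForSequenceClassification

config = BertConfig.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", config=config)

inputs = tokenizer.encode_plus("Transformers is great!", return_tensors="tf")
outputs = model(inputs)   # a tuple whose first element is the logits
print(outputs[0].shape)   # (1, 2)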

Usage Tutorial

Using BERT to solve the classification task on the MRPC dataset of the GLUE benchmark

Fine-tuning the model

import os
import tensorflow as tf
import tensorflow_datasets
from transformers import BertTokenizer, TFBertForSequenceClassification, BertConfig, glue_convert_examples_to_features, glue_processors
# script parameters
BATCH_SIZE = 32
EVAL_BATCH_SIZE = BATCH_SIZE * 2
EPOCHS = 3
USE_AMP = False  # set to True to train with automatic mixed precision

TASK = "mrpc"

# Map GLUE task names to their TensorFlow Datasets equivalents
if TASK == "sst-2":
    TFDS_TASK = "sst2"
elif TASK == "sts-b":
    TFDS_TASK = "stsb"
else:
    TFDS_TASK = TASK
num_labels = len(glue_processors[TASK]().get_labels())
print(num_labels)
2
# Load tokenizer and model from pretrained model/vocabulary. Specify the number of labels to classify (2+: classification, 1: regression)
config = BertConfig.from_pretrained("bert-base-cased", num_labels=num_labels)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', config=config)
# Load dataset via TensorFlow Datasets
data, info = tensorflow_datasets.load(f'glue/{TFDS_TASK}', with_info=True)
train_examples = info.splits['train'].num_examples
# MNLI expects either validation_matched or validation_mismatched
valid_examples = info.splits['validation'].num_examples
# Prepare dataset for GLUE as a tf.data.Dataset instance
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, 128, TASK)
# MNLI expects either validation_matched or validation_mismatched
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, TASK)
train_dataset = train_dataset.shuffle(128).batch(BATCH_SIZE).repeat(-1)
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
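If you want to verify what glue_convert_examples_to_features produces, here is a quick inspection sketch (not required by the rest of the tutorial): each element is a (features, label) pair, where features is a dict of input_ids, attention_mask, and token_type_ids.

for features, labels in train_dataset.take(1):
    print(features['input_ids'].shape)  # (BATCH_SIZE, 128)
    print(labels[:5])                   # integer class ids (0 = not paraphrase, 1 = paraphrase)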
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
opt = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
if USE_AMP:
    # loss scaling is currently required when using mixed precision
    opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, 'dynamic')
# Regression tasks (num_labels == 1, e.g. STS-B) use MSE; classification tasks use cross-entropy
if num_labels == 1:
    loss = tf.keras.losses.MeanSquaredError()
else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=opt, loss=loss, metrics=[metric])
# Train and evaluate using tf.keras.Model.fit()
train_steps = train_examples//BATCH_SIZE
valid_steps = valid_examples//EVAL_BATCH_SIZE
history = model.fit(train_dataset, epochs=EPOCHS, steps_per_epoch=train_steps,
                    validation_data=valid_dataset, validation_steps=valid_steps)
Train for 114 steps, validate for 6 steps
Epoch 1/3
114/114 [==============================] - 840s 7s/step - loss: 0.5896 - accuracy: 0.6883 - val_loss: 0.4735 - val_accuracy: 0.7578
Epoch 2/3
114/114 [==============================] - 763s 7s/step - loss: 0.3810 - accuracy: 0.8303 - val_loss: 0.3895 - val_accuracy: 0.8359
Epoch 3/3
114/114 [==============================] - 756s 7s/step - loss: 0.1856 - accuracy: 0.9334 - val_loss: 0.5333 - val_accuracy: 0.8125
# Save TF2 model
os.makedirs('./glue_mrpc_save/', exist_ok=True)
model.save_pretrained('./glue_mrpc_save/')
tokenizer.save_pretrained('./glue_mrpc_save/')
('./glue_mrpc_save/vocab.txt',
 './glue_mrpc_save/special_tokens_map.json',
 './glue_mrpc_save/added_tokens.json')

Loading and evaluating the model

import tensorflow as tf
import tensorflow_datasets
from transformers import BertTokenizer, TFBertForSequenceClassification, BertConfig, glue_convert_examples_to_features
# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('glue_mrpc_save')
model = TFBertForSequenceClassification.from_pretrained('glue_mrpc_save')
TFDS_TASK = "mrpc"
# Load dataset via TensorFlow Datasets
data, info = tensorflow_datasets.load(f'glue/{TFDS_TASK}', with_info=True)
INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (/home/b418a/tensorflow_datasets/glue/mrpc/0.0.2)
INFO:absl:Constructing tf.data.Dataset for split None, from /home/b418a/tensorflow_datasets/glue/mrpc/0.0.2
# MNLI expects either validation_matched or validation_mismatched
valid_examples = info.splits['validation'].num_examples
# MNLI expects either validation_matched or validation_mismatched
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, TFDS_TASK)
valid_dataset = valid_dataset.batch(64)
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
opt = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=opt, loss=loss, metrics=[metric])
model.evaluate(valid_dataset)
[0.5068637558392116, 0.8137255]
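With the metrics confirmed, here is a minimal inference sketch on a single sentence pair, reusing the tokenizer and model loaded above (the two sentences are made up for illustration):

sentence_a = "The company reported strong quarterly earnings."
sentence_b = "Quarterly earnings at the company were strong."

# Encode the pair, run the model, and take the argmax over the two classes
inputs = tokenizer.encode_plus(sentence_a, sentence_b, max_length=128, return_tensors='tf')
logits = model(inputs)[0]  # shape (1, 2): [not paraphrase, paraphrase]
pred = int(tf.argmax(logits, axis=1).numpy()[0])
print('paraphrase' if pred == 1 else 'not paraphrase')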

Code Analysis

To be completed…

For notes on class diagrams, see: "UML Class Diagrams Explained" and "UML Class Diagrams and Class Relationships Explained".
