
Core theory, tool usage, and a roundup of related papers for the pre-trained language models ULMFiT, ELMo, and BERT

Significance: why pre-trained language models matter

  1. Instead of training the model from scratch, you can use another pre-trained model as the basis and only fine-tune it to solve the specific NLP task.
  2. Using pre-trained models allows you to achieve the same or even better performance much faster and with much less labeled data.

Model: the three most important language models of 2018, ULMFiT, ELMo, and BERT

  • ULMFiT, or Universal Language Model Fine-Tuning method, is likely the first effective approach to fine-tuning the language model. The authors demonstrate the importance of several novel techniques, including discriminative fine-tuning, slanted triangular learning rate, and gradual unfreezing, for retaining previous knowledge and avoiding catastrophic forgetting during fine-tuning.
  • ELMo word representations, or Embeddings from Language Models, are generated in a way to take the entire context into consideration. In particular, they are created as a weighted sum of the internal states of a deep bi-directional language model (biLM), pre-trained on a large text corpus. Furthermore, ELMo representations are based on characters so that the network can understand even out-of-vocabulary tokens unseen in training.
  • BERT, or Bidirectional Encoder Representations from Transformers, is a new cutting-edge model that considers the context from both the left and the right sides of each word. The two key success factors are (1) masking part of input tokens to avoid cycles where words indirectly “see themselves”, and (2) pre-training a sentence relationship model. Finally, BERT is also a very big model trained on a huge word corpus.

Tool: tools for using the three major language models
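
In practice, all three models have open-source implementations: ULMFiT ships with the fastai library, ELMo is distributed through AllenNLP, and BERT is available through the Hugging Face transformers package. Below is a minimal sketch of extracting contextual BERT features with transformers; it assumes the package is installed and the public bert-base-uncased checkpoint can be downloaded, and the example sentence is only a placeholder.

```python
# A minimal sketch, assuming the Hugging Face `transformers` package and the
# public bert-base-uncased checkpoint are available.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Tokenize an example sentence and run it through the pre-trained encoder.
inputs = tokenizer("Pre-trained language models transfer well.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs[0]  # contextual embeddings, shape (1, seq_len, 768)
print(last_hidden.shape)
```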

Summary: ULMFiT, ELMo, and BERT

UNIVERSAL LANGUAGE MODEL FINE-TUNING FOR TEXT CLASSIFICATION

ORIGINAL ABSTRACT
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open source our pretrained models and code.

OUR SUMMARY
Howard and Ruder suggest using pre-trained models for solving a wide range of NLP problems. With this approach, you don’t need to train your model from scratch, but only fine-tune the original model. Their method, called Universal Language Model Fine-Tuning (ULMFiT), outperforms state-of-the-art results, reducing the error by 18-24%. Moreover, with only 100 labeled examples, ULMFiT matches the performance of models trained from scratch on 10K labeled examples.

WHAT’S THE CORE IDEA OF THIS PAPER?

  1. To address the lack of labeled data and to make NLP classification easier and less time-consuming, the researchers suggest applying transfer learning to NLP problems. Thus, instead of training the model from scratch, you can use another model that has been trained to solve a similar problem as the basis, and then fine-tune the original model to solve your specific problem.
  2. However, to be successful, this fine-tuning should take into account several important considerations:
    • Different layers should be fine-tuned to different extents as they capture different kinds of information.
    • Adapting the model’s parameters to task-specific features is more effective when the learning rate is first increased linearly and then decayed linearly, i.e., the slanted triangular learning rate schedule (see the sketch after this list).
    • Fine-tuning all layers at once is likely to result in catastrophic forgetting; thus, it would be better to gradually unfreeze the model starting from the last layer.
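
The slanted triangular learning rate and discriminative fine-tuning ideas from the list above can be sketched in a few lines. The snippet below is a minimal illustration rather than the authors' implementation; the constants (cut_frac=0.1, ratio=32, and the per-layer factor 2.6) are the values reported in the paper, while the function and variable names are ours.

```python
# A minimal sketch of the slanted triangular learning rate (STLR) schedule and
# discriminative (per-layer) learning rates from ULMFiT; names are illustrative.
import math

def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Learning rate at iteration t of T: a short linear warm-up (the 'cut'),
    then a long linear decay."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Discriminative fine-tuning: each lower layer gets a smaller learning rate,
# lr(layer l-1) = lr(layer l) / 2.6, so early layers change the least.
base_lr = 0.01
layer_lrs = [base_lr / (2.6 ** depth) for depth in range(4)]  # top layer first

print([round(stlr(t, T=1000), 5) for t in (0, 100, 500, 999)])
print([round(lr, 5) for lr in layer_lrs])
```

Gradual unfreezing then amounts to adding one more layer group to the optimizer at the start of each epoch, beginning from the last (topmost) layer.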

WHAT’S THE KEY ACHIEVEMENT?

  • Significantly outperforming state-of-the-art: reducing the error by 18-24%.
  • Much less labeled data needed: with only 100 labeled examples and 50K unlabeled, matching the performance of learning from scratch on 100x more data.

DEEP CONTEXTUALIZED WORD REPRESENTATIONS

ORIGINAL ABSTRACT
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

OUR SUMMARY
The team from the Allen Institute for Artificial Intelligence introduces a new type of deep contextualized word representation: Embeddings from Language Models (ELMo). In ELMo-enhanced models, each word is vectorized on the basis of the entire context in which it is used. Adding ELMo to existing NLP systems results in (1) relative error reductions ranging from 6% to 20%, (2) a significantly lower number of epochs required to train the models, and (3) a significantly reduced amount of training data needed to reach baseline performance.

WHAT’S THE CORE IDEA OF THIS PAPER?

  • To generate word embeddings as a weighted sum of the internal states of a deep bi-directional language model (biLM), pre-trained on a large text corpus (a minimal sketch of this weighted sum follows the list).
  • To include representations from all layers of a biLM as different layers represent different types of information.
  • To base ELMo representations on characters so that the network can use morphological clues to “understand” out-of-vocabulary tokens unseen in training.
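
The weighted sum in the first two points can be written down directly. The following is a minimal PyTorch sketch of an ELMo-style scalar mix, not the AllenNLP implementation; the layer count and tensor shapes are illustrative, and the biLM states are stubbed with random tensors.

```python
# A minimal PyTorch sketch of an ELMo-style "scalar mix": a softmax-weighted sum
# of the biLM's layer states, scaled by a learned task-specific gamma.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # s_j before softmax
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scale

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, dim) tensors, one per biLM layer
        norm_weights = torch.softmax(self.weights, dim=0)
        mixed = sum(w * h for w, h in zip(norm_weights, layer_states))
        return self.gamma * mixed

# Stub biLM states: 3 layers, batch of 2, 5 tokens, 1024-dimensional states.
states = [torch.randn(2, 5, 1024) for _ in range(3)]
elmo_vectors = ScalarMix(num_layers=3)(states)
print(elmo_vectors.shape)  # torch.Size([2, 5, 1024])
```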

WHAT’S THE KEY ACHIEVEMENT?

  • Adding ELMo to the model leads to the new state-of-the-art results, with relative error reductions ranging from 6 – 20% across such NLP tasks as question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis.
  • Enhancing the model with ELMo results in a significantly lower number of updates required to reach state-of-the-art performance. For instance, the Semantic Role Labeling (SRL) model with ELMo needs only 10 epochs to exceed the baseline maximum reached after 486 epochs of training.
  • Introducing ELMo to the model also significantly reduces the amount of training data needed to achieve the same level of performance. For example, for the SRL task, the ELMo-enhanced model needs only 1% of the training set to achieve the same performance as the baseline model with 10% of the training data.

BERT: PRE-TRAINING OF DEEP BIDIRECTIONAL TRANSFORMERS FOR LANGUAGE UNDERSTANDING

ORIGINAL ABSTRACT
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

OUR SUMMARY
A Google AI team presents a new cutting-edge model for Natural Language Processing (NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design allows the model to consider the context from both the left and the right sides of each word. While being conceptually simple, BERT obtains new state-of-the-art results on eleven NLP tasks, including question answering, named entity recognition and other tasks related to general language understanding.

WHAT’S THE CORE IDEA OF THIS PAPER?

  • Training a deep bidirectional model by randomly masking a percentage of the input tokens, thus avoiding cycles in which words can indirectly “see themselves” (see the masking sketch after this list).
  • Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERT to better understand relationships between sentences.
  • Training a very big model (24 Transformer blocks, 1024-hidden, 340M parameters) with lots of data (3.3 billion word corpus).
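
The masking procedure in the first point is simple to sketch. The snippet below follows the 15% selection rate and the 80/10/10 replacement rule described in the paper; the [MASK] id and vocabulary size are assumptions matching the bert-base-uncased WordPiece vocabulary, and the helper name and example ids are ours.

```python
# A minimal sketch of BERT-style masked-language-model input corruption.
import random

MASK_ID = 103        # assumed [MASK] id (bert-base-uncased WordPiece vocab)
VOCAB_SIZE = 30522   # assumed vocabulary size (bert-base-uncased)

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_input, labels); labels keep the original id only at
    masked positions and use -100 (the usual ignore index) elsewhere."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:
            labels.append(tok)
            r = random.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                corrupted.append(MASK_ID)
            elif r < 0.9:                 # 10%: replace with a random token
                corrupted.append(random.randrange(VOCAB_SIZE))
            else:                         # 10%: keep the original token
                corrupted.append(tok)
        else:
            labels.append(-100)           # not a prediction target
            corrupted.append(tok)
    return corrupted, labels

print(mask_tokens([2023, 2003, 1037, 7099, 6251]))  # toy WordPiece ids
```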

WHAT’S THE KEY ACHIEVEMENT?

  1. Advancing the state-of-the-art for 11 NLP tasks, including:
    • getting a GLUE score of 80.4%, a 7.6% absolute improvement over the previous best result;
    • achieving 93.2 F1 on SQuAD v1.1 and outperforming human performance by 2%.
  2. Suggesting a pre-trained model that doesn’t require any substantial architecture modifications to be applied to specific NLP tasks (a minimal fine-tuning sketch follows this list).
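
Concretely, the “one additional output layer” can be a single classification head on top of the pre-trained encoder. Below is a minimal sketch using the Hugging Face transformers package; the checkpoint name, label count, and toy example are assumptions, not part of the paper.

```python
# A minimal fine-tuning sketch: a pre-trained BERT encoder plus one new
# classification layer (assumes the `transformers` package and checkpoint).
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was surprisingly good.", return_tensors="pt")
labels = torch.tensor([1])   # toy sentiment label for a single example

outputs = model(**inputs, labels=labels)
loss = outputs[0]            # cross-entropy loss over the toy batch
loss.backward()              # gradients flow into the encoder and the new head
print(float(loss))
```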

Paper: a roundup of related papers

The ULMFiT work has been further extended and applied in the following papers:

Improving Language Understanding by Generative Pre-Training

Abstract

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

Universal Language Model Fine-Tuning with Subword Tokenization for Polish

Abstract

Universal Language Model for Fine-tuning [arXiv:1801.06146] (ULMFiT) is one of the first NLP methods for efficient inductive transfer learning. Unsupervised pretraining results in improvements on many NLP tasks for English. In this paper, we describe a new method that uses subword tokenization to adapt ULMFiT to languages with high inflection. Our approach results in a new state-of-the-art for the Polish language, taking first place in Task 3 of PolEval’18. After further training, our final model outperformed the second best model by 35%. We have open-sourced our pretrained models and code.

Universal Language Model Fine-tuning for Patent Classification

Abstract

This paper describes the methods used for the 2018 ALTA Shared Task. The task this year was to automatically classify Australian patents into their main International Patent Classification section. Our final submission used a Support Vector Machine (SVM) and Universal Language Model with Fine-tuning (ULMFiT). Our system achieved the best results in the student category.

ELMo embeddings have already been used in a number of important research papers, including:

Linguistically-Informed Self-Attention for Semantic Role Labeling

Abstract

Current state-of-the-art semantic role labeling (SRL) uses a deep neural network with no explicit linguistic features. However, prior work has shown that gold syntax trees can dramatically improve SRL decoding, suggesting the possibility of increased accuracy from explicit modeling of syntax. In this work, we present linguistically-informed self-attention (LISA): a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection and SRL. Unlike previous models which require significant pre-processing to prepare linguistic features, LISA can incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection and role labeling for all predicates. Syntax is incorporated by training one attention head to attend to syntactic parents for each token. Moreover, if a high-quality syntactic parse is already available, it can be beneficially injected at test time without re-training our SRL model. In experiments on CoNLL-2005 SRL, LISA achieves new state-of-the-art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 F1 absolute higher than the previous state-of-the-art on newswire and more than 3.5 F1 on out-of-domain data, nearly 10% reduction in error. On ConLL-2012 English SRL we also show an improvement of more than 2.5 F1. LISA also out-performs the state-of-the-art with contextually-encoded (ELMo) word representations, by nearly 1.0 F1 on news and more than 2.0 F1 on out-of-domain text.

Language Model Pre-training for Hierarchical Document Representations

Abstract

Hierarchical neural architectures are often used to capture long-distance dependencies and have been applied to many document-level tasks such as summarization, document segmentation, and sentiment analysis. However, effective usage of such a large context can be difficult to learn, especially in the case where there is limited labeled data available. Building on the recent success of language model pretraining methods for learning flat representations of text, we propose algorithms for pre-training hierarchical document representations from unlabeled data. Unlike prior work, which has focused on pre-training contextual token representations or context-independent {sentence/paragraph} representations, our hierarchical document representations include fixed-length sentence/paragraph representations which integrate contextual information from the entire documents. Experiments on document segmentation, document-level question answering, and extractive document summarization demonstrate the effectiveness of the proposed pre-training algorithms.

Deep Enhanced Representation for Implicit Discourse Relation Recognition

Abstract

Implicit discourse relation recognition is a challenging task as the relation prediction without explicit connectives in discourse parsing needs understanding of text spans and cannot be easily derived from surface features from the input sentence pairs. Thus, properly representing the text is very crucial to this task. In this paper, we propose a model augmented with different grained text representations, including character, subword, word, sentence, and sentence pair levels. The proposed deeper model is evaluated on the benchmark treebank and achieves state-of-the-art accuracy with greater than 48% in 11-way and F1 score greater than 50% in 4-way classifications for the first time according to our best knowledge.

BERT was introduced only in late 2018, but it has already served as the basis for further research advances, including:

A BERT Baseline for the Natural Questions

Abstract

This technical note describes a new baseline for the Natural Questions. Our model is based on BERT and reduces the gap between the model F1 scores reported in the original dataset paper and the human upper bound by 30% and 50% relative for the long and short answer tasks respectively. This baseline has been submitted to the official NQ leaderboard at this http URL. Code, preprocessed data and pretrained model are available at this https URL.

SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering

Abstract

Conversational question answering (CQA) is a novel QA task that requires understanding of dialogue context. Different from traditional single-turn machine reading comprehension (MRC) tasks, CQA includes passage comprehension, coreference resolution, and contextual understanding. In this paper, we propose an innovated contextualized attention-based deep neural network, SDNet, to fuse context into traditional MRC models. Our model leverages both inter-attention and self-attention to comprehend conversation context and extract relevant information from passage. Furthermore, we demonstrated a novel method to integrate the latest BERT contextual model. Empirical results show the effectiveness of our model, which sets the new state of the art result in CoQA leaderboard, outperforming the previous best model by 1.6% F1. Our ensemble model further improves the result by 2.7% F1.

Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers

Abstract

Most approaches to extracting multiple relations from a paragraph require multiple passes over the paragraph. In practice, multiple passes are computationally expensive, and this makes it difficult to scale to longer paragraphs and larger text corpora. In this work, we focus on the task of multiple relation extraction by encoding the paragraph only once (one-pass). We build our solution on the pre-trained self-attentive (Transformer) models, where we first add a structured prediction layer to handle extraction between multiple entity pairs, then enhance the paragraph embedding to capture multiple relational information associated with each entity with an entity-aware attention technique. We show that our approach is not only scalable but can also achieve state-of-the-art performance on the standard benchmark ACE 2005.

Source: WHAT EVERY NLP ENGINEER NEEDS TO KNOW ABOUT PRE-TRAINED LANGUAGE MODELS (the main content of this article is based on that post), Feb 19, 2019