SQuAD paper

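Abstract: We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%.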

QANet (To Do)

A Tensorflow implementation of QANet for machine reading comprehension
https://github.com/NLPLearn/QANet

SQuAD related articles

PaperWeekly Issue 38: a survey of SQuAD

SQuAD: Stanford's ambitions in natural language processing

SQUAD_training

Training results:
{"exact_match": 81.20151371807, "f1": 88.56178500169332}
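
To make the two metrics above concrete, here is a minimal sketch of how exact match and token-level F1 are typically computed for a single SQuAD prediction. It is a simplified restatement rather than the official evaluation script: the function names are chosen for this example, and the full evaluation additionally takes the best score over all reference answers for a question and averages over the dataset as a percentage.

```python
import collections
import re
import string


def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    punctuation = set(string.punctuation)
    s = "".join(ch for ch in s if ch not in punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match_score(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A partially correct prediction gets partial F1 credit but no exact match.
em = exact_match_score("Bernadette Soubirous", "Saint Bernadette Soubirous")
f1 = f1_score("Bernadette Soubirous", "Saint Bernadette Soubirous")
print(em, round(f1, 3))
```

This is why F1 is usually higher than exact match: a prediction that overlaps the reference without matching it exactly still contributes to F1.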

SQUAD_DATA processing walkthrough

Raw corpus

'''
A paragraph in the original SQuAD data
{'paragraphs':
[{'context': 'Architecturally, the school has a Catholic '
"character. Atop the Main Building's gold dome is "
'a golden statue of the Virgin Mary. Immediately '
'in front of the Main Building and facing it, is a '
'copper statue of Christ with arms upraised with '
'the legend "Venite Ad Me Omnes". Next to the Main '
'Building is the Basilica of the Sacred Heart. '
'Immediately behind the basilica is the Grotto, a '
'Marian place of prayer and reflection. It is a '
'replica of the grotto at Lourdes, France where '
'the Virgin Mary reputedly appeared to Saint '
'Bernadette Soubirous in 1858. At the end of the '
'main drive (and in a direct line that connects '
'through 3 statues and the Gold Dome), is a '
'simple, modern stone statue of Mary.',
'qas': [{'answers': [{'answer_start': 515,
'text': 'Saint Bernadette Soubirous'}],
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly '
'appear in 1858 in Lourdes France?'},
{'answers': [{'answer_start': 188,
'text': 'a copper statue of Christ'}],
'id': '5733be284776f4190066117f',
'question': 'What is in front of the Notre Dame Main '
'Building?'},
{'answers': [{'answer_start': 279,
'text': 'the Main Building'}],
'id': '5733be284776f41900661180',
'question': 'The Basilica of the Sacred heart at '
'Notre Dame is beside to which '
'structure?'},
{'answers': [{'answer_start': 381,
'text': 'a Marian place of prayer and '
'reflection'}],
'id': '5733be284776f41900661181',
'question': 'What is the Grotto at Notre Dame?'},
{'answers': [{'answer_start': 92,
'text': 'a golden statue of the Virgin '
'Mary'}],
'id': '5733be284776f4190066117e',
'question': 'What sits on top of the Main Building '
'at Notre Dame?'}]},
'''
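
In the raw data, `answer_start` is a character offset into `context`, not a word index. A minimal check, using only the first two sentences of the paragraph above and the last answer in its `qas` list, might look like this (the variable names are chosen for this example):

```python
# First two sentences of the context shown above.
context = (
    "Architecturally, the school has a Catholic character. Atop the Main "
    "Building's gold dome is a golden statue of the Virgin Mary."
)
answer = {"answer_start": 92, "text": "a golden statue of the Virgin Mary"}

# Slice the context by character offset and answer length.
start = answer["answer_start"]
end = start + len(answer["text"])
recovered = context[start:end]
print(recovered)
```

The slice reproduces the answer text exactly, which is the property the preprocessing code relies on when it later converts character offsets into token positions.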

Corpus after processing by the program

#INFO:tensorflow:*** Example ***
#INFO:tensorflow:unique_id: 1000000005
#INFO:tensorflow:example_index: 5
#INFO:tensorflow:doc_span_index: 0
#INFO:tensorflow:tokens: [CLS] what kind of topics began appearing more commonly in poetry and literature during the enlightenment ? [SEP] the influence of science also began appearing more commonly in poetry and literature during the enlightenment . some poetry became infused with scientific metaphor and imagery , while other poems were written directly about scientific topics . sir richard black ##more committed the newton ##ian system to verse in creation , a philosophical poem in seven books ( 1712 ) . after newton ' s death in 1727 , poems were composed in his honour for decades . james thomson ( 1700 – 1748 ) penned his " poem to the memory of newton , " which mo ##urne ##d the loss of newton , but also praised his science and legacy . [SEP]
#INFO:tensorflow:token_to_orig_map: 18:0 19:1 20:2 21:3 22:4 23:5 24:6 25:7 26:8 27:9 28:10 29:11 30:12 31:13 32:14 33:15 34:15 35:16 36:17 37:18 38:19 39:20 40:21 41:22 42:23 43:24 44:24 45:25 46:26 47:27 48:28 49:29 50:30 51:31 52:32 53:33 54:33 55:34 56:35 57:36 58:36 59:37 60:38 61:39 62:39 63:40 64:41 65:42 66:43 67:44 68:44 69:45 70:46 71:47 72:48 73:49 74:50 75:51 76:51 77:51 78:51 79:52 80:53 81:53 82:53 83:54 84:55 85:56 86:56 87:57 88:58 89:59 90:60 91:61 92:62 93:63 94:64 95:64 96:65 97:66 98:67 99:67 100:67 101:67 102:67 103:68 104:69 105:70 106:70 107:71 108:72 109:73 110:74 111:75 112:75 113:75 114:76 115:77 116:77 117:77 118:78 119:79 120:80 121:81 122:81 123:82 124:83 125:84 126:85 127:86 128:87 129:88 130:88
#INFO:tensorflow:token_is_max_context: 18:True 19:True 20:True 21:True 22:True 23:True 24:True 25:True 26:True 27:True 28:True 29:True 30:True 31:True 32:True 33:True 34:True 35:True 36:True 37:True 38:True 39:True 40:True 41:True 42:True 43:True 44:True 45:True 46:True 47:True 48:True 49:True 50:True 51:True 52:True 53:True 54:True 55:True 56:True 57:True 58:True 59:True 60:True 61:True 62:True 63:True 64:True 65:True 66:True 67:True 68:True 69:True 70:True 71:True 72:True 73:True 74:True 75:True 76:True 77:True 78:True 79:True 80:True 81:True 82:True 83:True 84:True 85:True 86:True 87:True 88:True 89:True 90:True 91:True 92:True 93:True 94:True 95:True 96:True 97:True 98:True 99:True 100:True 101:True 102:True 103:True 104:True 105:True 106:True 107:True 108:True 109:True 110:True 111:True 112:True 113:True 114:True 115:True 116:True 117:True 118:True 119:True 120:True 121:True 122:True 123:True 124:True 125:True 126:True 127:True 128:True 129:True 130:True
#INFO:tensorflow:input_ids: 101 2054 2785 1997 7832 2211 6037 2062 4141 1999 4623 1998 3906 2076 1996 16724 1029 102 1996 3747 1997 2671 2036 2211 6037 2062 4141 1999 4623 1998 3906 2076 1996 16724 1012 2070 4623 2150 29592 2007 4045 19240 1998 13425 1010 2096 2060 5878 2020 2517 3495 2055 4045 7832 1012 2909 2957 2304 5974 5462 1996 8446 2937 2291 2000 7893 1999 4325 1010 1037 9569 5961 1999 2698 2808 1006 28460 1007 1012 2044 8446 1005 1055 2331 1999 25350 1010 5878 2020 3605 1999 2010 6225 2005 5109 1012 2508 11161 1006 16601 1516 24445 1007 17430 2010 1000 5961 2000 1996 3638 1997 8446 1010 1000 2029 9587 21737 2094 1996 3279 1997 8446 1010 2021 2036 5868 2010 2671 1998 8027 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#INFO:tensorflow:start_position: 52
#INFO:tensorflow:end_position: 53
#INFO:tensorflow:answer: scientific topics
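
The log above shows the layout of one feature: `[CLS]` question `[SEP]` document `[SEP]`, followed by zero padding. The sketch below rebuilds `input_ids`, `input_mask`, and `segment_ids` for a tiny input under stated assumptions: `TOY_VOCAB` is a hypothetical eight-entry vocabulary whose ids are copied from the log, and `build_inputs` is a name introduced for this example, not a function from the BERT code.

```python
# Hypothetical toy vocabulary; the ids are the ones visible in the log above.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102, "what": 2054, "kind": 2785,
    "of": 1997, "topics": 7832, "?": 1029, "scientific": 4045, ".": 1012,
}


def build_inputs(query_tokens, doc_tokens, vocab, max_seq_length):
    """Assemble ids, mask and segment ids, then pad with zeros."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # question side is segment 0
    tokens += doc_tokens + ["[SEP]"]
    segment_ids += [1] * (len(doc_tokens) + 1)  # document side is segment 1

    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)        # 1 marks real tokens

    pad = max_seq_length - len(input_ids)
    if pad < 0:
        raise ValueError("sequence is longer than max_seq_length")
    input_ids += [0] * pad
    input_mask += [0] * pad
    segment_ids += [0] * pad
    return input_ids, input_mask, segment_ids


ids, mask, seg = build_inputs(
    ["what", "kind", "of", "topics", "?"],
    ["scientific", "topics", "."],
    TOY_VOCAB,
    max_seq_length=12,
)
print(ids)
print(mask)
print(seg)
```

As in the log, the padded tail has `input_mask` 0 and `segment_ids` 0, so only the mask distinguishes padding from question tokens.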

import tensorflow as tf
import numpy as np
import tokenization
import json
import pprint
import collections

def is_whitespace(c):
    if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
        return True
    return False

class SquadExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self,
                 qas_id,
                 question_text,
                 doc_tokens,
                 orig_answer_text=None,
                 start_position=None,
                 end_position=None):
        self.qas_id = qas_id
        self.question_text = question_text
        self.doc_tokens = doc_tokens
        self.orig_answer_text = orig_answer_text
        self.start_position = start_position
        self.end_position = end_position

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        s = ""
        s += "qas_id: %s" % (tokenization.printable_text(self.qas_id))
        s += ", question_text: %s" % (
            tokenization.printable_text(self.question_text))
        s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens))
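        # Compare against None so that a position of 0 is still printed,
        # and test end_position (not start_position) for the second field.
        if self.start_position is not None:
            s += ", start_position: %d" % (self.start_position)
        if self.end_position is not None:
            s += ", end_position: %d" % (self.end_position)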
        return s

def read_squad_examples(input_file, is_training):
    """Read a SQuAD json file into a list of SquadExample."""
    with tf.gfile.Open(input_file, "r") as reader:
        input_data = json.load(reader)["data"]

    def is_whitespace(c):
        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
            return True
        return False

    examples = []
    for entry in input_data:
        for paragraph in entry["paragraphs"]:
            paragraph_text = paragraph["context"]
            doc_tokens = []
            char_to_word_offset = []
            prev_is_whitespace = True
            for c in paragraph_text:
                if is_whitespace(c):
                    prev_is_whitespace = True
                else:
                    if prev_is_whitespace:
                        doc_tokens.append(c)
                    else:
                        doc_tokens[-1] += c
                    prev_is_whitespace = False
                char_to_word_offset.append(len(doc_tokens) - 1)

            for qa in paragraph["qas"]:
                qas_id = qa["id"]
                question_text = qa["question"]
                start_position = None
                end_position = None
                orig_answer_text = None
                if is_training:
                    if len(qa["answers"]) != 1:
                        raise ValueError(
                            "For training, each question should have exactly 1 answer.")
                    answer = qa["answers"][0]
                    orig_answer_text = answer["text"]
                    answer_offset = answer["answer_start"]
                    answer_length = len(orig_answer_text)
                    start_position = char_to_word_offset[answer_offset]
                    end_position = char_to_word_offset[answer_offset + answer_length - 1]
                    # Only add answers where the text can be exactly recovered from the
                    # document. If this CAN'T happen it's likely due to weird Unicode
                    # stuff so we will just skip the example.
                    #
                    # Note that this means for training mode, every example is NOT
                    # guaranteed to be preserved.
                    actual_text = " ".join(doc_tokens[start_position:(end_position + 1)])
                    cleaned_answer_text = " ".join(
                        tokenization.whitespace_tokenize(orig_answer_text))
                    if actual_text.find(cleaned_answer_text) == -1:
                        tf.logging.warning("Could not find answer: '%s' vs. '%s'",
                                           actual_text, cleaned_answer_text)
                        continue

                example = SquadExample(
                    qas_id=qas_id,
                    question_text=question_text,
                    doc_tokens=doc_tokens,
                    orig_answer_text=orig_answer_text,
                    start_position=start_position,
                    end_position=end_position)
                examples.append(example)
    return examples

def _check_is_max_context(doc_spans, cur_span_index, position):
    """Check if this is the 'max context' doc span for the token."""

    # Because of the sliding window approach taken to scoring documents, a single
    # token can appear in multiple documents. E.g.
    #  Doc: the man went to the store and bought a gallon of milk
    #  Span A: the man went to the
    #  Span B: to the store and bought
    #  Span C: and bought a gallon of
    #  ...
    #
    # Now the word 'bought' will have two scores from spans B and C. We only
    # want to consider the score with "maximum context", which we define as
    # the *minimum* of its left and right context (the *sum* of left and
    # right context will always be the same, of course).
    #
    # In the example the maximum context for 'bought' would be span C since
    # it has 1 left context and 3 right context, while span B has 4 left context
    # and 0 right context.
    best_score = None
    best_span_index = None
    for (span_index, doc_span) in enumerate(doc_spans):
        end = doc_span.start + doc_span.length - 1
        if position < doc_span.start:
            continue
        if position > end:
            continue
        num_left_context = position - doc_span.start
        num_right_context = end - position
        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
        if best_score is None or score > best_score:
            best_score = score
            best_span_index = span_index

    return cur_span_index == best_span_index
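
To see the "maximum context" rule in action, the sketch below reruns the example from the comment in `_check_is_max_context`. It is a compact restatement that returns the winning span index instead of a boolean; `max_context_span_index` is a name introduced for this example.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def max_context_span_index(doc_spans, position):
    """Return the index of the span giving `position` the most context."""
    best_score, best_index = None, None
    for index, span in enumerate(doc_spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue  # this span does not contain the token
        left = position - span.start
        right = end - position
        # Same scoring as above: min of both sides plus a small length bonus.
        score = min(left, right) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score, best_index = score, index
    return best_index


doc = "the man went to the store and bought a gallon of milk".split()
# Span A: tokens 0-4, span B: tokens 3-7, span C: tokens 6-10.
spans = [DocSpan(0, 5), DocSpan(3, 5), DocSpan(6, 5)]
position = doc.index("bought")
print(position, max_context_span_index(spans, position))
```

For 'bought' (position 7), span B scores min(4, 0) + 0.05 = 0.05 and span C scores min(1, 3) + 0.05 = 1.05, so span C (index 2) wins, matching the comment.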

vocab_file_path = "/home/b418/jupyter_workspace/B418_common/袁宵/model/BERT/uncased_L-12_H-768_A-12/vocab.txt"

tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file_path, do_lower_case=True)

print(tokenizer.tokenize("To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"))

['to', 'whom', 'did', 'the', 'virgin', 'mary', 'allegedly', 'appear', 'in', '1858', 'in', 'lou', '##rdes', 'france', '?']

tokenizer.tokenize("Lourdes")

['lou', '##rdes']
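
The split of "Lourdes" into 'lou' and '##rdes' comes from WordPiece, which greedily matches the longest vocabulary entry from the left and marks continuation pieces with `##`. The sketch below illustrates that greedy step only, assuming a hypothetical five-entry vocabulary; the real tokenizer also performs lowercasing, punctuation splitting, and other cleanup before this step.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"lou", "##rdes", "lo", "##u", "france"}
print(wordpiece("lourdes", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because the longest match wins, "lourdes" becomes 'lou' + '##rdes' even though the shorter pieces 'lo' and '##u' are also in the toy vocabulary.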

SQuAD_data/train-v1.1.json

input_file_path = "/home/b418/jupyter_workspace/B418_common/袁宵/data/SQuAD_data/train-v1.1.json"

with tf.gfile.Open(input_file_path, "r") as reader:
    input_data = json.load(reader)["data"]

a_input_data = input_data[0]

paragraph

for paragraph in a_input_data["paragraphs"][:2]:
    print('-'*100)
    pprint.pprint(paragraph)

----------------------------------------------------------------------------------------------------
{'context': 'Architecturally, the school has a Catholic character. Atop the '
            "Main Building's gold dome is a golden statue of the Virgin Mary. "
            'Immediately in front of the Main Building and facing it, is a '
            'copper statue of Christ with arms upraised with the legend '
            '"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
            'of the Sacred Heart. Immediately behind the basilica is the '
            'Grotto, a Marian place of prayer and reflection. It is a replica '
            'of the grotto at Lourdes, France where the Virgin Mary reputedly '
            'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
            'main drive (and in a direct line that connects through 3 statues '
            'and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
                       'text': 'Saint Bernadette Soubirous'}],
          'id': '5733be284776f41900661182',
          'question': 'To whom did the Virgin Mary allegedly appear in 1858 in '
                      'Lourdes France?'},
         {'answers': [{'answer_start': 188,
                       'text': 'a copper statue of Christ'}],
          'id': '5733be284776f4190066117f',
          'question': 'What is in front of the Notre Dame Main Building?'},
         {'answers': [{'answer_start': 279, 'text': 'the Main Building'}],
          'id': '5733be284776f41900661180',
          'question': 'The Basilica of the Sacred heart at Notre Dame is '
                      'beside to which structure?'},
         {'answers': [{'answer_start': 381,
                       'text': 'a Marian place of prayer and reflection'}],
          'id': '5733be284776f41900661181',
          'question': 'What is the Grotto at Notre Dame?'},
         {'answers': [{'answer_start': 92,
                       'text': 'a golden statue of the Virgin Mary'}],
          'id': '5733be284776f4190066117e',
          'question': 'What sits on top of the Main Building at Notre Dame?'}]}
----------------------------------------------------------------------------------------------------
{'context': "As at most other universities, Notre Dame's students run a number "
            'of news media outlets. The nine student-run outlets include three '
            'newspapers, both a radio and television station, and several '
            'magazines and journals. Begun as a one-page journal in September '
            '1876, the Scholastic magazine is issued twice monthly and claims '
            'to be the oldest continuous collegiate publication in the United '
            'States. The other magazine, The Juggler, is released twice a year '
            'and focuses on student literature and artwork. The Dome yearbook '
            'is published annually. The newspapers have varying publication '
            'interests, with The Observer published daily and mainly reporting '
            'university and other news, and staffed by students from both '
            "Notre Dame and Saint Mary's College. Unlike Scholastic and The "
            'Dome, The Observer is an independent publication and does not '
            'have a faculty advisor or any editorial oversight from the '
            'University. In 1987, when some students believed that The '
            'Observer began to show a conservative bias, a liberal newspaper, '
            'Common Sense was published. Likewise, in 2003, when other '
            'students believed that the paper showed a liberal bias, the '
            'conservative paper Irish Rover went into production. Neither '
            'paper is published as often as The Observer; however, all three '
            'are distributed to all students. Finally, in Spring 2008 an '
            'undergraduate journal for political science research, Beyond '
            'Politics, made its debut.',
 'qas': [{'answers': [{'answer_start': 248, 'text': 'September 1876'}],
          'id': '5733bf84d058e614000b61be',
          'question': 'When did the Scholastic Magazine of Notre dame begin '
                      'publishing?'},
         {'answers': [{'answer_start': 441, 'text': 'twice'}],
          'id': '5733bf84d058e614000b61bf',
          'question': "How often is Notre Dame's the Juggler published?"},
         {'answers': [{'answer_start': 598, 'text': 'The Observer'}],
          'id': '5733bf84d058e614000b61c0',
          'question': 'What is the daily student paper at Notre Dame called?'},
         {'answers': [{'answer_start': 126, 'text': 'three'}],
          'id': '5733bf84d058e614000b61bd',
          'question': 'How many student news papers are found at Notre Dame?'},
         {'answers': [{'answer_start': 908, 'text': '1987'}],
          'id': '5733bf84d058e614000b61c1',
          'question': 'In what year did the student paper Common Sense begin '
                      'publication at Notre Dame?'}]}

examples = read_squad_examples(input_file_path, True)

len(examples)

read_squad_examples

examples - doc_tokens

a_paragraph_text = a_input_data["paragraphs"][0]['context']

a_paragraph_text = '''Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart.'''

print(len(a_paragraph_text))
print(a_paragraph_text)

333
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart.

doc_tokens = []
char_to_word_offset = []
prev_is_whitespace = True
# for c in paragraph_text:
for c in a_paragraph_text:
    if is_whitespace(c):
        prev_is_whitespace = True
    else:
        if prev_is_whitespace:
            doc_tokens.append(c)
        else:
            doc_tokens[-1] += c
        prev_is_whitespace = False
    char_to_word_offset.append(len(doc_tokens) - 1)

print(len(doc_tokens))
print(doc_tokens)

59
['Architecturally,', 'the', 'school', 'has', 'a', 'Catholic', 'character.', 'Atop', 'the', 'Main', "Building's", 'gold', 'dome', 'is', 'a', 'golden', 'statue', 'of', 'the', 'Virgin', 'Mary.', 'Immediately', 'in', 'front', 'of', 'the', 'Main', 'Building', 'and', 'facing', 'it,', 'is', 'a', 'copper', 'statue', 'of', 'Christ', 'with', 'arms', 'upraised', 'with', 'the', 'legend', '"Venite', 'Ad', 'Me', 'Omnes".', 'Next', 'to', 'the', 'Main', 'Building', 'is', 'the', 'Basilica', 'of', 'the', 'Sacred', 'Heart.']

print(len(char_to_word_offset))
print(char_to_word_offset)

333
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 13, 14, 14, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 23, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 25, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 32, 32, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 34, 34, 34, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 40, 40, 40, 40, 40, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 43, 43, 44, 44, 44, 45, 45, 45, 46, 46, 46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 48, 48, 48, 49, 49, 49, 49, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 51, 51, 51, 51, 52, 52, 52, 53, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 55, 55, 55, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58, 58, 58, 58]
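
The purpose of `char_to_word_offset` is to turn a character-level `answer_start` into word-level start and end positions. The self-contained sketch below repeats the whitespace tokenization on the first two sentences of the paragraph and maps the answer 'a golden statue of the Virgin Mary' (`answer_start` 92) to word indices; the helper name `whitespace_tokenize_with_offsets` is introduced for this example.

```python
def whitespace_tokenize_with_offsets(text):
    """Split on whitespace and record, per character, its word index."""
    doc_tokens, char_to_word_offset = [], []
    prev_is_whitespace = True
    for c in text:
        if c in " \t\r\n" or ord(c) == 0x202F:
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)     # start a new word
            else:
                doc_tokens[-1] += c      # extend the current word
            prev_is_whitespace = False
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset


context = (
    "Architecturally, the school has a Catholic character. Atop the Main "
    "Building's gold dome is a golden statue of the Virgin Mary."
)
answer_text = "a golden statue of the Virgin Mary"
answer_start = 92

doc_tokens, char_to_word_offset = whitespace_tokenize_with_offsets(context)
start_position = char_to_word_offset[answer_start]
end_position = char_to_word_offset[answer_start + len(answer_text) - 1]
actual_text = " ".join(doc_tokens[start_position:end_position + 1])
print(start_position, end_position, actual_text)
```

The recovered span ends in 'Mary.' because punctuation stays attached to whitespace-delimited words, which is why `read_squad_examples` checks with `find` rather than strict equality.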

'Architecturally, the school has a Catholic character. Atop the '
"Main Building's gold dome is a golden statue of the Virgin Mary. "
'Immediately in front of the Main Building and facing it, is a '
'copper statue of Christ with arms upraised with the legend '
'"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
'of the Sacred Heart. Immediately behind the basilica is the '
'Grotto, a Marian place of prayer and reflection. It is a replica '
'of the grotto at Lourdes, France where the Virgin Mary reputedly '
'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
'main drive (and in a direct line that connects through 3 statues '
'and the Gold Dome), is a simple, modern stone statue of Mary.',

['Architecturally,', 'the', 'school', 'has', 'a', 'Catholic', 'character.', 'Atop', 'the',
'Main', "Building's", 'gold', 'dome', 'is', 'a', 'golden', 'statue', 'of', 'the', 'Virgin', 'Mary.',
'Immediately', 'in', 'front', 'of', 'the', 'Main', 'Building', 'and', 'facing', 'it,', 'is', 'a',
'copper', 'statue', 'of', 'Christ', 'with', 'arms', 'upraised', 'with', 'the', 'legend',
'"Venite', 'Ad', 'Me', 'Omnes".', 'Next', 'to', 'the', 'Main', 'Building', 'is', 'the', 'Basilica',
'of', 'the', 'Sacred', 'Heart.', 'Immediately', 'behind', 'the', 'basilica', 'is', 'the',
'Grotto,', 'a', 'Marian', 'place', 'of', 'prayer', 'and', 'reflection.', 'It', 'is', 'a', 'replica',
'of', 'the', 'grotto', 'at', 'Lourdes,', 'France', 'where', 'the', 'Virgin', 'Mary', 'reputedly',
'appeared', 'to', 'Saint', 'Bernadette', 'Soubirous', 'in', '1858.', 'At', 'the', 'end', 'of', 'the',
'main', 'drive', '(and', 'in', 'a', 'direct', 'line', 'that', 'connects', 'through', '3', 'statues',
'and', 'the', 'Gold', 'Dome),', 'is', 'a', 'simple,', 'modern', 'stone', 'statue', 'of', 'Mary.']

Inspect the examples converted from the raw corpus

a_example_doc_tokens = None
for i in range(2):
    example = examples[i]
    print(example.qas_id)
    n = 100 - len(example.question_text) - len(example.orig_answer_text)
    print(example.question_text, '-'*n, example.orig_answer_text)
    print(example.start_position, '-'*n, example.end_position)
    print(example.doc_tokens)
    print(example.doc_tokens[90], example.doc_tokens[91], example.doc_tokens[92])
    a_example_doc_tokens = example.doc_tokens
    print('\n')
    print('\n')

5733be284776f41900661182
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? --- Saint Bernadette Soubirous
90 --- 92
['Architecturally,', 'the', 'school', 'has', 'a', 'Catholic', 'character.', 'Atop', 'the', 'Main', "Building's", 'gold', 'dome', 'is', 'a', 'golden', 'statue', 'of', 'the', 'Virgin', 'Mary.', 'Immediately', 'in', 'front', 'of', 'the', 'Main', 'Building', 'and', 'facing', 'it,', 'is', 'a', 'copper', 'statue', 'of', 'Christ', 'with', 'arms', 'upraised', 'with', 'the', 'legend', '"Venite', 'Ad', 'Me', 'Omnes".', 'Next', 'to', 'the', 'Main', 'Building', 'is', 'the', 'Basilica', 'of', 'the', 'Sacred', 'Heart.', 'Immediately', 'behind', 'the', 'basilica', 'is', 'the', 'Grotto,', 'a', 'Marian', 'place', 'of', 'prayer', 'and', 'reflection.', 'It', 'is', 'a', 'replica', 'of', 'the', 'grotto', 'at', 'Lourdes,', 'France', 'where', 'the', 'Virgin', 'Mary', 'reputedly', 'appeared', 'to', 'Saint', 'Bernadette', 'Soubirous', 'in', '1858.', 'At', 'the', 'end', 'of', 'the', 'main', 'drive', '(and', 'in', 'a', 'direct', 'line', 'that', 'connects', 'through', '3', 'statues', 'and', 'the', 'Gold', 'Dome),', 'is', 'a', 'simple,', 'modern', 'stone', 'statue', 'of', 'Mary.']
Saint Bernadette Soubirous




5733be284776f4190066117f
What is in front of the Notre Dame Main Building? -------------------------- a copper statue of Christ
32 -------------------------- 36
['Architecturally,', 'the', 'school', 'has', 'a', 'Catholic', 'character.', 'Atop', 'the', 'Main', "Building's", 'gold', 'dome', 'is', 'a', 'golden', 'statue', 'of', 'the', 'Virgin', 'Mary.', 'Immediately', 'in', 'front', 'of', 'the', 'Main', 'Building', 'and', 'facing', 'it,', 'is', 'a', 'copper', 'statue', 'of', 'Christ', 'with', 'arms', 'upraised', 'with', 'the', 'legend', '"Venite', 'Ad', 'Me', 'Omnes".', 'Next', 'to', 'the', 'Main', 'Building', 'is', 'the', 'Basilica', 'of', 'the', 'Sacred', 'Heart.', 'Immediately', 'behind', 'the', 'basilica', 'is', 'the', 'Grotto,', 'a', 'Marian', 'place', 'of', 'prayer', 'and', 'reflection.', 'It', 'is', 'a', 'replica', 'of', 'the', 'grotto', 'at', 'Lourdes,', 'France', 'where', 'the', 'Virgin', 'Mary', 'reputedly', 'appeared', 'to', 'Saint', 'Bernadette', 'Soubirous', 'in', '1858.', 'At', 'the', 'end', 'of', 'the', 'main', 'drive', '(and', 'in', 'a', 'direct', 'line', 'that', 'connects', 'through', '3', 'statues', 'and', 'the', 'Gold', 'Dome),', 'is', 'a', 'simple,', 'modern', 'stone', 'statue', 'of', 'Mary.']
Saint Bernadette Soubirous

convert_examples_to_features

doc_tokens to all_doc_tokens

print(len(a_example_doc_tokens))
print(a_example_doc_tokens)

124
['Architecturally,', 'the', 'school', 'has', 'a', 'Catholic', 'character.', 'Atop', 'the', 'Main', "Building's", 'gold', 'dome', 'is', 'a', 'golden', 'statue', 'of', 'the', 'Virgin', 'Mary.', 'Immediately', 'in', 'front', 'of', 'the', 'Main', 'Building', 'and', 'facing', 'it,', 'is', 'a', 'copper', 'statue', 'of', 'Christ', 'with', 'arms', 'upraised', 'with', 'the', 'legend', '"Venite', 'Ad', 'Me', 'Omnes".', 'Next', 'to', 'the', 'Main', 'Building', 'is', 'the', 'Basilica', 'of', 'the', 'Sacred', 'Heart.', 'Immediately', 'behind', 'the', 'basilica', 'is', 'the', 'Grotto,', 'a', 'Marian', 'place', 'of', 'prayer', 'and', 'reflection.', 'It', 'is', 'a', 'replica', 'of', 'the', 'grotto', 'at', 'Lourdes,', 'France', 'where', 'the', 'Virgin', 'Mary', 'reputedly', 'appeared', 'to', 'Saint', 'Bernadette', 'Soubirous', 'in', '1858.', 'At', 'the', 'end', 'of', 'the', 'main', 'drive', '(and', 'in', 'a', 'direct', 'line', 'that', 'connects', 'through', '3', 'statues', 'and', 'the', 'Gold', 'Dome),', 'is', 'a', 'simple,', 'modern', 'stone', 'statue', 'of', 'Mary.']

tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
#for (i, token) in enumerate(example.doc_tokens):
for (i, token) in enumerate(a_example_doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    sub_tokens = tokenizer.tokenize(token)
    for sub_token in sub_tokens:
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)

print(len(orig_to_tok_index))
print(orig_to_tok_index)

124
[0, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 49, 50, 51, 52, 56, 57, 58, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 75, 76, 77, 78, 79, 80, 81, 84, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95, 96, 97, 98, 100, 101, 104, 105, 106, 107, 108, 109, 111, 112, 113, 114, 117, 121, 122, 124, 125, 126, 127, 128, 129, 130, 131, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 148, 149, 150, 152, 153, 154, 155, 156]

print(len(tok_to_orig_index))
print(tok_to_orig_index)

158
[0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 10, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 39, 39, 40, 41, 42, 43, 43, 43, 43, 44, 45, 46, 46, 46, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 58, 59, 60, 61, 62, 63, 64, 65, 65, 65, 66, 67, 68, 69, 70, 71, 72, 72, 73, 74, 75, 76, 77, 78, 79, 79, 80, 81, 81, 81, 82, 83, 84, 85, 86, 87, 87, 88, 89, 90, 91, 91, 91, 92, 92, 92, 92, 93, 94, 94, 95, 96, 97, 98, 99, 100, 101, 102, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 115, 115, 116, 117, 118, 118, 119, 120, 121, 122, 123, 123]

print(len(all_doc_tokens))
print(all_doc_tokens)

158
['architectural', '##ly', ',', 'the', 'school', 'has', 'a', 'catholic', 'character', '.', 'atop', 'the', 'main', 'building', "'", 's', 'gold', 'dome', 'is', 'a', 'golden', 'statue', 'of', 'the', 'virgin', 'mary', '.', 'immediately', 'in', 'front', 'of', 'the', 'main', 'building', 'and', 'facing', 'it', ',', 'is', 'a', 'copper', 'statue', 'of', 'christ', 'with', 'arms', 'up', '##rai', '##sed', 'with', 'the', 'legend', '"', 've', '##ni', '##te', 'ad', 'me', 'om', '##nes', '"', '.', 'next', 'to', 'the', 'main', 'building', 'is', 'the', 'basilica', 'of', 'the', 'sacred', 'heart', '.', 'immediately', 'behind', 'the', 'basilica', 'is', 'the', 'gr', '##otto', ',', 'a', 'marian', 'place', 'of', 'prayer', 'and', 'reflection', '.', 'it', 'is', 'a', 'replica', 'of', 'the', 'gr', '##otto', 'at', 'lou', '##rdes', ',', 'france', 'where', 'the', 'virgin', 'mary', 'reputed', '##ly', 'appeared', 'to', 'saint', 'bern', '##ade', '##tte', 'so', '##ub', '##iro', '##us', 'in', '1858', '.', 'at', 'the', 'end', 'of', 'the', 'main', 'drive', '(', 'and', 'in', 'a', 'direct', 'line', 'that', 'connects', 'through', '3', 'statues', 'and', 'the', 'gold', 'dome', ')', ',', 'is', 'a', 'simple', ',', 'modern', 'stone', 'statue', 'of', 'mary', '.']
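
The two index lists make it possible to move between whitespace words and WordPiece sub-tokens. The sketch below rebuilds them for a five-word slice of the paragraph and then converts a word-level answer span into a sub-token span. Two things are assumptions here: `TOY_PIECES` is a hypothetical lookup table standing in for the real tokenizer (its splits are copied from the output above), and `word_span_to_token_span` shows one common way to do the conversion, which is not part of the text above.

```python
# Hypothetical stand-in for tokenizer.tokenize; splits copied from the output.
TOY_PIECES = {
    "Saint": ["saint"],
    "Bernadette": ["bern", "##ade", "##tte"],
    "Soubirous": ["so", "##ub", "##iro", "##us"],
    "in": ["in"],
    "1858.": ["1858", "."],
}


def align(doc_tokens, tokenize):
    """Build word->first-sub-token and sub-token->word index lists."""
    orig_to_tok_index, tok_to_orig_index, all_doc_tokens = [], [], []
    for i, token in enumerate(doc_tokens):
        orig_to_tok_index.append(len(all_doc_tokens))
        for sub_token in tokenize(token):
            tok_to_orig_index.append(i)
            all_doc_tokens.append(sub_token)
    return orig_to_tok_index, tok_to_orig_index, all_doc_tokens


def word_span_to_token_span(start_word, end_word, orig_to_tok_index, num_sub_tokens):
    """Map an inclusive word span to an inclusive sub-token span."""
    tok_start = orig_to_tok_index[start_word]
    if end_word < len(orig_to_tok_index) - 1:
        # Last sub-token of end_word is just before the next word's first one.
        tok_end = orig_to_tok_index[end_word + 1] - 1
    else:
        tok_end = num_sub_tokens - 1
    return tok_start, tok_end


doc_tokens = ["Saint", "Bernadette", "Soubirous", "in", "1858."]
orig_to_tok_index, tok_to_orig_index, all_doc_tokens = align(
    doc_tokens, TOY_PIECES.__getitem__)
span = word_span_to_token_span(0, 2, orig_to_tok_index, len(all_doc_tokens))
print(orig_to_tok_index)
print(tok_to_orig_index)
print(span, all_doc_tokens[span[0]:span[1] + 1])
```

Note that `orig_to_tok_index` has one entry per word while `tok_to_orig_index` has one entry per sub-token, which matches the lengths 124 and 158 printed above.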

Using a sliding window to handle documents that are too long

# The -3 accounts for [CLS], [SEP] and [SEP].
# max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
# Illustrative values standing in for max_seq_length (100) and
# len(query_tokens) (20).
max_tokens_for_doc = 100 - 20 - 3
doc_stride = 128

# We can have documents that are longer than the maximum sequence length.
# To deal with this we use a sliding window approach, where we take chunks
# of up to our max length with a stride of `doc_stride`.
_DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
    "DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
    length = len(all_doc_tokens) - start_offset
    if length > max_tokens_for_doc:
        length = max_tokens_for_doc
    doc_spans.append(_DocSpan(start=start_offset, length=length))
    if start_offset + length == len(all_doc_tokens):
        break
    start_offset += min(length, doc_stride)
print("max_tokens_for_doc:\t",max_tokens_for_doc)
print("doc_spans:\t",doc_spans)

max_tokens_for_doc:     77
doc_spans:     [DocSpan(start=0, length=77), DocSpan(start=77, length=77), DocSpan(start=154, length=4)]
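
Because `doc_stride` (128) exceeds `max_tokens_for_doc` (77) in the run above, each window starts exactly where the previous one ended, and no token appears in two windows. The sketch below wraps the same loop in a helper to compare that setting with a smaller stride. The helper name `make_doc_spans` is hypothetical (not from the original code), and the stride of 32 is an arbitrary value chosen for illustration.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])

def make_doc_spans(num_tokens, max_tokens_for_doc, doc_stride):
    """Hypothetical helper: same sliding-window loop as above, as a function."""
    spans = []
    start = 0
    while start < num_tokens:
        length = min(num_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start, length))
        if start + length == num_tokens:
            break
        # Advance by the stride, but never by more than the window length.
        start += min(length, doc_stride)
    return spans

# Settings from the walkthrough: stride larger than the window, no overlap.
no_overlap = make_doc_spans(158, 77, 128)
# Assumed smaller stride: consecutive windows now share tokens.
overlap = make_doc_spans(158, 77, 32)

print(no_overlap)
print(overlap)
```

With the smaller stride, adjacent full windows share 45 tokens, so an answer near a window edge is more likely to appear with enough surrounding context in at least one window.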

doc_span = doc_spans[0]
doc_span

DocSpan(start=0, length=77)

print(doc_span.start)
print(doc_span.length)

0
77

Building tokens, token_to_orig_map, token_is_max_context, and segment_ids for each window

example = examples[0]
query_tokens = tokenizer.tokenize(example.question_text)
print(query_tokens)

['to', 'whom', 'did', 'the', 'virgin', 'mary', 'allegedly', 'appear', 'in', '1858', 'in', 'lou', '##rdes', 'france', '?']

for (doc_span_index, doc_span) in enumerate(doc_spans):
    tokens = []
    token_to_orig_map = {}
    token_is_max_context = {}
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in query_tokens:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    for i in range(doc_span.length):
        split_token_index = doc_span.start + i
        token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index]

        is_max_context = _check_is_max_context(doc_spans, doc_span_index,
                                               split_token_index)
        token_is_max_context[len(tokens)] = is_max_context
        tokens.append(all_doc_tokens[split_token_index])
        segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    print("tokens:",len(tokens),'\n',tokens)
    print('\n')
    print("input_ids:",len(input_ids),'\n',input_ids)
    print('\n')
    print("segment_ids:",len(segment_ids),'\n',segment_ids)
    print('\n')
    print("token_is_max_context:",len(token_is_max_context),'\n',token_is_max_context)
    print('\n')
    print("token_to_orig_maplen:",len(token_to_orig_map),'\n',token_to_orig_map)
    print('\n')
    print('-'*100)

tokens: 95
 ['[CLS]', 'to', 'whom', 'did', 'the', 'virgin', 'mary', 'allegedly', 'appear', 'in', '1858', 'in', 'lou', '##rdes', 'france', '?', '[SEP]', 'architectural', '##ly', ',', 'the', 'school', 'has', 'a', 'catholic', 'character', '.', 'atop', 'the', 'main', 'building', "'", 's', 'gold', 'dome', 'is', 'a', 'golden', 'statue', 'of', 'the', 'virgin', 'mary', '.', 'immediately', 'in', 'front', 'of', 'the', 'main', 'building', 'and', 'facing', 'it', ',', 'is', 'a', 'copper', 'statue', 'of', 'christ', 'with', 'arms', 'up', '##rai', '##sed', 'with', 'the', 'legend', '"', 've', '##ni', '##te', 'ad', 'me', 'om', '##nes', '"', '.', 'next', 'to', 'the', 'main', 'building', 'is', 'the', 'basilica', 'of', 'the', 'sacred', 'heart', '.', 'immediately', 'behind', '[SEP]']


input_ids: 95
 [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 102]


segment_ids: 95
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


token_is_max_context: 77
 {17: True, 18: True, 19: True, 20: True, 21: True, 22: True, 23: True, 24: True, 25: True, 26: True, 27: True, 28: True, 29: True, 30: True, 31: True, 32: True, 33: True, 34: True, 35: True, 36: True, 37: True, 38: True, 39: True, 40: True, 41: True, 42: True, 43: True, 44: True, 45: True, 46: True, 47: True, 48: True, 49: True, 50: True, 51: True, 52: True, 53: True, 54: True, 55: True, 56: True, 57: True, 58: True, 59: True, 60: True, 61: True, 62: True, 63: True, 64: True, 65: True, 66: True, 67: True, 68: True, 69: True, 70: True, 71: True, 72: True, 73: True, 74: True, 75: True, 76: True, 77: True, 78: True, 79: True, 80: True, 81: True, 82: True, 83: True, 84: True, 85: True, 86: True, 87: True, 88: True, 89: True, 90: True, 91: True, 92: True, 93: True}


token_to_orig_maplen: 77
 {17: 0, 18: 0, 19: 0, 20: 1, 21: 2, 22: 3, 23: 4, 24: 5, 25: 6, 26: 6, 27: 7, 28: 8, 29: 9, 30: 10, 31: 10, 32: 10, 33: 11, 34: 12, 35: 13, 36: 14, 37: 15, 38: 16, 39: 17, 40: 18, 41: 19, 42: 20, 43: 20, 44: 21, 45: 22, 46: 23, 47: 24, 48: 25, 49: 26, 50: 27, 51: 28, 52: 29, 53: 30, 54: 30, 55: 31, 56: 32, 57: 33, 58: 34, 59: 35, 60: 36, 61: 37, 62: 38, 63: 39, 64: 39, 65: 39, 66: 40, 67: 41, 68: 42, 69: 43, 70: 43, 71: 43, 72: 43, 73: 44, 74: 45, 75: 46, 76: 46, 77: 46, 78: 46, 79: 47, 80: 48, 81: 49, 82: 50, 83: 51, 84: 52, 85: 53, 86: 54, 87: 55, 88: 56, 89: 57, 90: 58, 91: 58, 92: 59, 93: 60}


----------------------------------------------------------------------------------------------------
tokens: 95
 ['[CLS]', 'to', 'whom', 'did', 'the', 'virgin', 'mary', 'allegedly', 'appear', 'in', '1858', 'in', 'lou', '##rdes', 'france', '?', '[SEP]', 'the', 'basilica', 'is', 'the', 'gr', '##otto', ',', 'a', 'marian', 'place', 'of', 'prayer', 'and', 'reflection', '.', 'it', 'is', 'a', 'replica', 'of', 'the', 'gr', '##otto', 'at', 'lou', '##rdes', ',', 'france', 'where', 'the', 'virgin', 'mary', 'reputed', '##ly', 'appeared', 'to', 'saint', 'bern', '##ade', '##tte', 'so', '##ub', '##iro', '##us', 'in', '1858', '.', 'at', 'the', 'end', 'of', 'the', 'main', 'drive', '(', 'and', 'in', 'a', 'direct', 'line', 'that', 'connects', 'through', '3', 'statues', 'and', 'the', 'gold', 'dome', ')', ',', 'is', 'a', 'simple', ',', 'modern', 'stone', '[SEP]']


input_ids: 95
 [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595, 9648, 4674, 2061, 12083, 9711, 2271, 1999, 8517, 1012, 2012, 1996, 2203, 1997, 1996, 2364, 3298, 1006, 1998, 1999, 1037, 3622, 2240, 2008, 8539, 2083, 1017, 11342, 1998, 1996, 2751, 8514, 1007, 1010, 2003, 1037, 3722, 1010, 2715, 2962, 102]


segment_ids: 95
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


token_is_max_context: 77
 {17: True, 18: True, 19: True, 20: True, 21: True, 22: True, 23: True, 24: True, 25: True, 26: True, 27: True, 28: True, 29: True, 30: True, 31: True, 32: True, 33: True, 34: True, 35: True, 36: True, 37: True, 38: True, 39: True, 40: True, 41: True, 42: True, 43: True, 44: True, 45: True, 46: True, 47: True, 48: True, 49: True, 50: True, 51: True, 52: True, 53: True, 54: True, 55: True, 56: True, 57: True, 58: True, 59: True, 60: True, 61: True, 62: True, 63: True, 64: True, 65: True, 66: True, 67: True, 68: True, 69: True, 70: True, 71: True, 72: True, 73: True, 74: True, 75: True, 76: True, 77: True, 78: True, 79: True, 80: True, 81: True, 82: True, 83: True, 84: True, 85: True, 86: True, 87: True, 88: True, 89: True, 90: True, 91: True, 92: True, 93: True}


token_to_orig_maplen: 77
 {17: 61, 18: 62, 19: 63, 20: 64, 21: 65, 22: 65, 23: 65, 24: 66, 25: 67, 26: 68, 27: 69, 28: 70, 29: 71, 30: 72, 31: 72, 32: 73, 33: 74, 34: 75, 35: 76, 36: 77, 37: 78, 38: 79, 39: 79, 40: 80, 41: 81, 42: 81, 43: 81, 44: 82, 45: 83, 46: 84, 47: 85, 48: 86, 49: 87, 50: 87, 51: 88, 52: 89, 53: 90, 54: 91, 55: 91, 56: 91, 57: 92, 58: 92, 59: 92, 60: 92, 61: 93, 62: 94, 63: 94, 64: 95, 65: 96, 66: 97, 67: 98, 68: 99, 69: 100, 70: 101, 71: 102, 72: 102, 73: 103, 74: 104, 75: 105, 76: 106, 77: 107, 78: 108, 79: 109, 80: 110, 81: 111, 82: 112, 83: 113, 84: 114, 85: 115, 86: 115, 87: 115, 88: 116, 89: 117, 90: 118, 91: 118, 92: 119, 93: 120}


----------------------------------------------------------------------------------------------------
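
The second window above contains the answer to the question. As a rough sketch of how `token_to_orig_map` might be used afterwards, the snippet below maps a span of window positions back to whitespace-separated words of the context. The helper name `span_to_text` and the assumption that a model selected positions 53 to 60 are illustrative only. The map entries are a subset copied from the output above.

```python
# Context paragraph from the example, split on whitespace into original words.
context = (
    "Architecturally, the school has a Catholic character. Atop the Main "
    "Building's gold dome is a golden statue of the Virgin Mary. Immediately "
    "in front of the Main Building and facing it, is a copper statue of "
    'Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to '
    "the Main Building is the Basilica of the Sacred Heart. Immediately "
    "behind the basilica is the Grotto, a Marian place of prayer and "
    "reflection. It is a replica of the grotto at Lourdes, France where the "
    "Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. "
    "At the end of the main drive (and in a direct line that connects "
    "through 3 statues and the Gold Dome), is a simple, modern stone statue "
    "of Mary."
)
doc_tokens = context.split()

# Subset of the second window's token_to_orig_map (window position -> word index).
token_to_orig_map = {52: 89, 53: 90, 54: 91, 55: 91, 56: 91,
                     57: 92, 58: 92, 59: 92, 60: 92, 61: 93}

def span_to_text(start_pos, end_pos):
    """Hypothetical helper: join the original words covered by a token span."""
    orig_start = token_to_orig_map[start_pos]
    orig_end = token_to_orig_map[end_pos]
    return " ".join(doc_tokens[orig_start:orig_end + 1])

# Assumed prediction: 'saint' (position 53) through '##us' (position 60).
answer = span_to_text(53, 60)
print(answer)
```

Several WordPiece tokens map to the same word index, so the recovered text is always made of whole original words, even when the span starts or ends on a `##` piece.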
tokens: 22
 ['[CLS]', 'to', 'whom', 'did', 'the', 'virgin', 'mary', 'allegedly', 'appear', 'in', '1858', 'in', 'lou', '##rdes', 'france', '?', '[SEP]', 'statue', 'of', 'mary', '.', '[SEP]']


input_ids: 22
 [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6231, 1997, 2984, 1012, 102]


segment_ids: 22
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


token_is_max_context: 4
 {17: True, 18: True, 19: True, 20: True}


token_to_orig_maplen: 4
 {17: 121, 18: 122, 19: 123, 20: 123}


------------------------------------------------------------------
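
`_check_is_max_context` is called in the loop above but is not defined in this section. The sketch below is one plausible version, modeled on the scoring used in BERT's `run_squad.py` (the smaller of the left and right context, plus a small bonus for longer spans). Treat the details as an assumption rather than the exact helper used here. It also shows why every `token_is_max_context` value above is `True`: the three windows do not overlap, so each token has only one candidate window. The overlapping spans in the second half are hypothetical.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])

def check_is_max_context(doc_spans, cur_span_index, position):
    """Return True if the current span gives `position` the most context."""
    best_score = None
    best_span_index = None
    for span_index, span in enumerate(doc_spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue  # this span does not contain the token
        num_left = position - span.start
        num_right = end - position
        # Assumed scoring: balanced context wins, longer spans break ties.
        score = min(num_left, num_right) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score = score
            best_span_index = span_index
    return cur_span_index == best_span_index

# Spans from the run above (no overlap): every token has a single candidate.
plain = [DocSpan(0, 77), DocSpan(77, 77), DocSpan(154, 4)]
all_true = all(check_is_max_context(plain, i, s.start + k)
               for i, s in enumerate(plain) for k in range(s.length))

# Hypothetical overlapping spans (stride 32): token 70 sits in three windows.
overlapping = [DocSpan(0, 77), DocSpan(32, 77), DocSpan(64, 77), DocSpan(96, 62)]
flags = [check_is_max_context(overlapping, i, 70) for i in range(4)]
print(all_true, flags)
```

In the overlapping case only the second window is flagged for token 70, because that window places the token closest to its center.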
