
HuggingFace nlp library - Quick overview

Models come and go (linear models, LSTMs, Transformers, ...), but two core elements have consistently been the beating heart of Natural Language Processing: Datasets & Metrics

nlp is a lightweight and extensible library to easily share and load datasets and evaluation metrics, already providing access to ~100 datasets and ~10 evaluation metrics.

The library has several interesting features (besides easy access to datasets/metrics):

  • Built-in interoperability with PyTorch, TensorFlow 2, Pandas and NumPy
  • Small and fast library with a transparent and pythonic API
  • Thrive on large datasets: nlp naturally frees you from RAM memory limits; all datasets are memory-mapped on drive by default.
  • Smart caching with an intelligent tf.data-like cache: never wait for your data to be processed several times

nlp originated from a fork of the awesome Tensorflow-Datasets, and the HuggingFace team wants to deeply thank the team behind this amazing library and user API. We have tried to keep a layer of compatibility with tfds, and conversion from one format to the other is possible.

Main datasets API

This notebook is a quick dive into the main user API for loading datasets in nlp.

In [11]:
# install nlp
!pip install nlp

# Make sure that we have a recent version of pyarrow (>= 0.16) in the session before we continue;
# otherwise kill the runtime so Colab restarts it and picks up the newly installed version
import pyarrow
if int(pyarrow.__version__.split('.')[1]) < 16 and int(pyarrow.__version__.split('.')[0]) == 0:
    import os
    os.kill(os.getpid(), 9)
Requirement already satisfied: nlp in /usr/local/lib/python3.6/dist-packages (0.2.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from nlp) (1.18.4)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.6/dist-packages (from nlp) (2.23.0)
Requirement already satisfied: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from nlp) (0.7)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from nlp) (3.0.12)
Requirement already satisfied: dill in /usr/local/lib/python3.6/dist-packages (from nlp) (0.3.1.1)
Requirement already satisfied: pyarrow>=0.16.0 in /usr/local/lib/python3.6/dist-packages (from nlp) (0.17.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from nlp) (4.41.1)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->nlp) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->nlp) (2020.4.5.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->nlp) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->nlp) (2.9)
In [0]:
import logging
logging.basicConfig(level=logging.INFO)
In [0]:
# Let's import the library
import nlp
INFO:nlp.utils.file_utils:PyTorch version 1.5.0+cu101 available.
INFO:nlp.utils.file_utils:TensorFlow version 2.2.0 available.

Listing the currently available datasets and metrics

In [0]:
# Currently available datasets and metrics
datasets = nlp.list_datasets()
metrics = nlp.list_metrics()

print(f"🀩 Currently {len(datasets)} datasets are available on HuggingFace AWS bucket: \n" 
      + '\n'.join(dataset.id for dataset in datasets) + '\n')
print(f"🀩 Currently {len(metrics)} metrics are available on HuggingFace AWS bucket: \n" 
      + '\n'.join(metric.id for metric in metrics))
🀩 Currently 114 datasets are available on HuggingFace AWS bucket: 
aeslc
ai2_arc
anli
arcd
art
billsum
blimp
blog_authorship_corpus
boolq
break_data
c4
cfq
civil_comments
cmrc2018
cnn_dailymail
coarse_discourse
com_qa
commonsense_qa
coqa
cornell_movie_dialog
cos_e
cosmos_qa
crime_and_punish
csv
definite_pronoun_resolution
discofuse
drop
empathetic_dialogues
eraser_multi_rc
esnli
event2Mind
flores
fquad
gap
germeval_14
gigaword
glue
hansards
hellaswag
imdb
jeopardy
json
kor_nli
lc_quad
librispeech_lm
lm1b
math_dataset
math_qa
mlqa
movie_rationales
multi_news
multi_nli
multi_nli_mismatch
natural_questions
newsroom
openbookqa
opinosis
para_crawl
qa4mre
qangaroo
qanta
qasc
quarel
quartz
quoref
race
reclor
reddit
reddit_tifu
scan
scicite
scientific_papers
scifact
sciq
scitail
sentiment140
snli
social_i_qa
squad
squad_es
squad_it
squad_v1_pt
squad_v2
super_glue
ted_hrlr
ted_multi
tiny_shakespeare
trivia_qa
tydiqa
ubuntu_dialogs_corpus
webis/tl_dr
wiki40b
wiki_qa
wiki_split
wikihow
wikipedia
wikitext
winogrande
wiqa
wmt14
wmt15
wmt16
wmt17
wmt18
wmt19
wmt_t2t
wnut_17
x_stance
xcopa
xnli
xquad
xsum
xtreme
yelp_polarity

🤩 Currently 11 metrics are available on HuggingFace AWS bucket: 
bertscore
bleu
coval
gleu
glue
rouge
sacrebleu
seqeval
squad
squad_v2
xnli
In [0]:
# You can read a few attributes of the datasets before loading them (they are python dataclasses)
from dataclasses import asdict

for key, value in asdict(datasets[6]).items():
    print('πŸ‘‰ ' + key + ': ' + str(value))
πŸ‘‰ id: blimp
πŸ‘‰ key: nlp/datasets/blimp/blimp.py
πŸ‘‰ lastModified: 2020-05-14T14:57:19.000Z
πŸ‘‰ description: BLiMP is a challenge set for evaluating what language models (LMs) know about
major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each
containing 1000 minimal pairs isolating specific contrasts in syntax,
morphology, or semantics. The data is automatically generated according to
expert-crafted grammars.
πŸ‘‰ citation: @article{warstadt2019blimp,
  title={BLiMP: A Benchmark of Linguistic Minimal Pairs for English},
  author={Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei, and Wang, Sheng-Fu and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1912.00582},
  year={2019}
}
πŸ‘‰ size: 7307
πŸ‘‰ etag: "3659a5abbb1ca837439d94aa2217c5f2"
πŸ‘‰ siblings: [{'key': 'nlp/datasets/blimp/blimp.py', 'etag': '"3659a5abbb1ca837439d94aa2217c5f2"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 7307, 'rfilename': 'blimp.py'}, {'key': 'nlp/datasets/blimp/dataset_infos.json', 'etag': '"c6427bb29472ce40e7317ffb2da3eb8c"', 'lastModified': '2020-05-14T15:43:08.000Z', 'size': 140760, 'rfilename': 'dataset_infos.json'}, {'key': 'nlp/datasets/blimp/dummy/adjunct_island/0.1.0/dummy_data-zip-extracted/dummy_data/adjunct_island.jsonl', 'etag': '"c4963f9c5fc4e06b345e8fa7dd5c0f75"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 1674, 'rfilename': 'dummy/adjunct_island/0.1.0/dummy_data-zip-extracted/dummy_data/adjunct_island.jsonl'}, {'key': 'nlp/datasets/blimp/dummy/adjunct_island/0.1.0/dummy_data.zip', 'etag': '"4d9b4aebaabf5e8e879bd70c346ce444"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 693, 'rfilename': 'dummy/adjunct_island/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/anaphor_gender_agreement/0.1.0/dummy_data.zip', 'etag': '"ae6d58b49a06df42d5e1a195d2554090"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 713, 'rfilename': 'dummy/anaphor_gender_agreement/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/anaphor_number_agreement/0.1.0/dummy_data.zip', 'etag': '"407313731ea04097977cd570fb15fad6"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 713, 'rfilename': 'dummy/anaphor_number_agreement/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/animate_subject_passive/0.1.0/dummy_data.zip', 'etag': '"72591df75a765e08304c38ef534707fe"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 711, 'rfilename': 'dummy/animate_subject_passive/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/animate_subject_trans/0.1.0/dummy_data.zip', 'etag': '"3d28fc9d1d86d694eb93f594e4b6402b"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 707, 'rfilename': 'dummy/animate_subject_trans/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/causative/0.1.0/dummy_data.zip', 'etag': '"f93eac2f0293e71d4ada42b11e45b76a"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 683, 'rfilename': 'dummy/causative/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/complex_NP_island/0.1.0/dummy_data.zip', 'etag': '"a67e08505c35bdc5a7d309a9196984e5"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 699, 'rfilename': 'dummy/complex_NP_island/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/coordinate_structure_constraint_complex_left_branch/0.1.0/dummy_data.zip', 'etag': '"ec1b42518b4dfb6665236bfd114d8296"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 767, 'rfilename': 'dummy/coordinate_structure_constraint_complex_left_branch/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/coordinate_structure_constraint_object_extraction/0.1.0/dummy_data.zip', 'etag': '"158a029bb0fa569215686021e6c7ca6c"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 763, 'rfilename': 'dummy/coordinate_structure_constraint_object_extraction/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/determiner_noun_agreement_1/0.1.0/dummy_data.zip', 'etag': '"077540f947928995d7d7dac70599fc91"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 719, 'rfilename': 'dummy/determiner_noun_agreement_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/determiner_noun_agreement_2/0.1.0/dummy_data.zip', 'etag': '"4fea5ce64be1c0dd799e48cd4d3f11f0"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 719, 'rfilename': 'dummy/determiner_noun_agreement_2/0.1.0/dummy_data.zip'}, {'key': 
'nlp/datasets/blimp/dummy/determiner_noun_agreement_irregular_1/0.1.0/dummy_data.zip', 'etag': '"25c16f66bbe341f5d13915470d5a5826"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 739, 'rfilename': 'dummy/determiner_noun_agreement_irregular_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/determiner_noun_agreement_irregular_2/0.1.0/dummy_data.zip', 'etag': '"29708c976d3c7489a6723f0e4278a9ac"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 739, 'rfilename': 'dummy/determiner_noun_agreement_irregular_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/determiner_noun_agreement_with_adj_2/0.1.0/dummy_data.zip', 'etag': '"a78f9d469dbdf0a57f36b81330e71bdf"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 737, 'rfilename': 'dummy/determiner_noun_agreement_with_adj_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/determiner_noun_agreement_with_adj_irregular_1/0.1.0/dummy_data.zip', 'etag': '"c162a33eded087712aeb9b4b1ea2dbd7"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 757, 'rfilename': 'dummy/determiner_noun_agreement_with_adj_irregular_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/determiner_noun_agreement_with_adj_irregular_2/0.1.0/dummy_data.zip', 'etag': '"3a70bc956a304ba81d853dafe1de5d9e"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 757, 'rfilename': 'dummy/determiner_noun_agreement_with_adj_irregular_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/determiner_noun_agreement_with_adjective_1/0.1.0/dummy_data.zip', 'etag': '"a88a63f40066d2ae8f06528cc845f5d2"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 749, 'rfilename': 'dummy/determiner_noun_agreement_with_adjective_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/distractor_agreement_relational_noun/0.1.0/dummy_data.zip', 'etag': '"9860891071b0473b79194f6e6565d4d5"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 737, 'rfilename': 'dummy/distractor_agreement_relational_noun/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/distractor_agreement_relative_clause/0.1.0/dummy_data.zip', 'etag': '"bd95732a4096a1d44be5510e96e289e6"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 737, 'rfilename': 'dummy/distractor_agreement_relative_clause/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/drop_argument/0.1.0/dummy_data.zip', 'etag': '"4246cbc981d1cbf5d0328416a52b8147"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 691, 'rfilename': 'dummy/drop_argument/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/ellipsis_n_bar_1/0.1.0/dummy_data.zip', 'etag': '"6cbd75e3bb2c2d49a4fbe51471706a8f"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 697, 'rfilename': 'dummy/ellipsis_n_bar_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/ellipsis_n_bar_2/0.1.0/dummy_data.zip', 'etag': '"10c1c36cda5f08686a2348c7d82f6306"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 697, 'rfilename': 'dummy/ellipsis_n_bar_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/existential_there_object_raising/0.1.0/dummy_data.zip', 'etag': '"24793f8775ae82d66df3608f5b72d72f"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 729, 'rfilename': 'dummy/existential_there_object_raising/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/existential_there_quantifiers_1/0.1.0/dummy_data.zip', 'etag': '"aa24879f651e050eacda5c25ffafa387"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 727, 'rfilename': 'dummy/existential_there_quantifiers_1/0.1.0/dummy_data.zip'}, {'key': 
'nlp/datasets/blimp/dummy/existential_there_quantifiers_2/0.1.0/dummy_data.zip', 'etag': '"0705585e50d8f263f25867504c09c813"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 727, 'rfilename': 'dummy/existential_there_quantifiers_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/existential_there_subject_raising/0.1.0/dummy_data.zip', 'etag': '"fda3b97d380f5176f731ffda924ba9de"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 731, 'rfilename': 'dummy/existential_there_subject_raising/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/expletive_it_object_raising/0.1.0/dummy_data.zip', 'etag': '"2106674a6f17e4908c152fee32016a05"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 719, 'rfilename': 'dummy/expletive_it_object_raising/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/inchoative/0.1.0/dummy_data.zip', 'etag': '"24aba4b2aab4b1934d018d65e66fb47b"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 685, 'rfilename': 'dummy/inchoative/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/intransitive/0.1.0/dummy_data.zip', 'etag': '"21f3f19e72493b429e30c159f5e30a51"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 689, 'rfilename': 'dummy/intransitive/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/irregular_past_participle_adjectives/0.1.0/dummy_data.zip', 'etag': '"c586a9a3b493cc3f0e68a24bb666857f"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 737, 'rfilename': 'dummy/irregular_past_participle_adjectives/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/irregular_past_participle_verbs/0.1.0/dummy_data.zip', 'etag': '"f81d92fc913257f1957b97b51b80ad59"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 727, 'rfilename': 'dummy/irregular_past_participle_verbs/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/irregular_plural_subject_verb_agreement_1/0.1.0/dummy_data.zip', 'etag': '"960ff5ce201f4f3f40b90a939913b540"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 747, 'rfilename': 'dummy/irregular_plural_subject_verb_agreement_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/irregular_plural_subject_verb_agreement_2/0.1.0/dummy_data.zip', 'etag': '"9ba898073d2a5505c3b8556f4b55f8fb"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 747, 'rfilename': 'dummy/irregular_plural_subject_verb_agreement_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/left_branch_island_echo_question/0.1.0/dummy_data.zip', 'etag': '"a5bb57f0ea9714fbabba9224d33c4e56"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 729, 'rfilename': 'dummy/left_branch_island_echo_question/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/left_branch_island_simple_question/0.1.0/dummy_data.zip', 'etag': '"be10d837c20742074d65597ccd374340"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 733, 'rfilename': 'dummy/left_branch_island_simple_question/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/matrix_question_npi_licensor_present/0.1.0/dummy_data.zip', 'etag': '"300e1321aa464419c048c360591e0c1e"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 737, 'rfilename': 'dummy/matrix_question_npi_licensor_present/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/npi_present_1/0.1.0/dummy_data.zip', 'etag': '"7447d7643c939ac3c4bdf8c2b1e0784a"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 691, 'rfilename': 'dummy/npi_present_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/npi_present_2/0.1.0/dummy_data.zip', 'etag': '"a991d676c4d73aefa976390d5d7a6941"', 
'lastModified': '2020-05-14T14:57:19.000Z', 'size': 691, 'rfilename': 'dummy/npi_present_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/only_npi_licensor_present/0.1.0/dummy_data.zip', 'etag': '"7c53926215e0bc66c7094d1f06c315e2"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 715, 'rfilename': 'dummy/only_npi_licensor_present/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/only_npi_scope/0.1.0/dummy_data.zip', 'etag': '"613051e1d03e43f4b9ec943050ae5cc4"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 693, 'rfilename': 'dummy/only_npi_scope/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/passive_1/0.1.0/dummy_data.zip', 'etag': '"caba86741ee749d10da8c6a71343cd0e"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 683, 'rfilename': 'dummy/passive_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/passive_2/0.1.0/dummy_data.zip', 'etag': '"ce805ec1ecd2e1aaa02dfe7a35c08b13"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 683, 'rfilename': 'dummy/passive_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/principle_A_c_command/0.1.0/dummy_data.zip', 'etag': '"a221171ea87e33f5b6c0f846a79e2dd0"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 707, 'rfilename': 'dummy/principle_A_c_command/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/principle_A_case_1/0.1.0/dummy_data.zip', 'etag': '"3e83c3688b974786fd9d8d5209a0476b"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 701, 'rfilename': 'dummy/principle_A_case_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/principle_A_case_2/0.1.0/dummy_data.zip', 'etag': '"4dea33e17e8ef8f8c0d872757c13b3d6"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 701, 'rfilename': 'dummy/principle_A_case_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/principle_A_domain_1/0.1.0/dummy_data.zip', 'etag': '"7900b1564e4c529efb41d82be6809a57"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 705, 'rfilename': 'dummy/principle_A_domain_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/principle_A_domain_2/0.1.0/dummy_data.zip', 'etag': '"23ae73bd517a5bf0f6f2f0f59be52789"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 705, 'rfilename': 'dummy/principle_A_domain_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/principle_A_domain_3/0.1.0/dummy_data.zip', 'etag': '"ee728b9ae19c9948e2a71583a6b13999"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 705, 'rfilename': 'dummy/principle_A_domain_3/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/principle_A_reconstruction/0.1.0/dummy_data.zip', 'etag': '"87e77eb32d448ecf248d790233972129"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 717, 'rfilename': 'dummy/principle_A_reconstruction/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/regular_plural_subject_verb_agreement_1/0.1.0/dummy_data.zip', 'etag': '"89e82be1dcccf4356842072a4f32bf5c"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 743, 'rfilename': 'dummy/regular_plural_subject_verb_agreement_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/regular_plural_subject_verb_agreement_2/0.1.0/dummy_data.zip', 'etag': '"2c9e5b90e6f63af758fa496463cd9fb4"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 743, 'rfilename': 'dummy/regular_plural_subject_verb_agreement_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/sentential_negation_npi_licensor_present/0.1.0/dummy_data.zip', 'etag': '"1a1cf51238c21622ffc5fc49aa4b1a92"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 745, 
'rfilename': 'dummy/sentential_negation_npi_licensor_present/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/sentential_negation_npi_scope/0.1.0/dummy_data.zip', 'etag': '"657e58f6efdc4fe8044b3a98d101f513"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 723, 'rfilename': 'dummy/sentential_negation_npi_scope/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/sentential_subject_island/0.1.0/dummy_data.zip', 'etag': '"2025fd4b889dff3071d901d769ae2dfc"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 715, 'rfilename': 'dummy/sentential_subject_island/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/superlative_quantifiers_1/0.1.0/dummy_data.zip', 'etag': '"3e88f881e2c5b140a637cf77c5aac68e"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 715, 'rfilename': 'dummy/superlative_quantifiers_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/superlative_quantifiers_2/0.1.0/dummy_data.zip', 'etag': '"9d94cb3368370bddeff5173d4631d72f"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 715, 'rfilename': 'dummy/superlative_quantifiers_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/tough_vs_raising_1/0.1.0/dummy_data.zip', 'etag': '"e69cdfb4f2e71152048dca408916f89e"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 701, 'rfilename': 'dummy/tough_vs_raising_1/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/tough_vs_raising_2/0.1.0/dummy_data.zip', 'etag': '"581dc9ecd1c28099badda01a8a31d976"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 701, 'rfilename': 'dummy/tough_vs_raising_2/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/transitive/0.1.0/dummy_data.zip', 'etag': '"93e8e5ab817c9169f13bada583fba7d5"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 685, 'rfilename': 'dummy/transitive/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_island/0.1.0/dummy_data.zip', 'etag': '"985dbc9e4984c1d17e6a274eea718cb6"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 683, 'rfilename': 'dummy/wh_island/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_questions_object_gap/0.1.0/dummy_data.zip', 'etag': '"7604d63509f82b6447f060210d5a33d3"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 711, 'rfilename': 'dummy/wh_questions_object_gap/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_questions_subject_gap/0.1.0/dummy_data.zip', 'etag': '"188e6f8d00454d16258e82664cfa1d19"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 713, 'rfilename': 'dummy/wh_questions_subject_gap/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_questions_subject_gap_long_distance/0.1.0/dummy_data.zip', 'etag': '"9882a65b7e9b43c8ee27d95f6cc7e7d8"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 741, 'rfilename': 'dummy/wh_questions_subject_gap_long_distance/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_vs_that_no_gap/0.1.0/dummy_data.zip', 'etag': '"b18933a136b6f1505a0cf2c1bf015d1a"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 699, 'rfilename': 'dummy/wh_vs_that_no_gap/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_vs_that_no_gap_long_distance/0.1.0/dummy_data.zip', 'etag': '"56f297a9bef3726085ad06039f9b676e"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 727, 'rfilename': 'dummy/wh_vs_that_no_gap_long_distance/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_vs_that_with_gap/0.1.0/dummy_data.zip', 'etag': '"ecaf2faccae7968d073c717f32904436"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 703, 'rfilename': 
'dummy/wh_vs_that_with_gap/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/dummy/wh_vs_that_with_gap_long_distance/0.1.0/dummy_data.zip', 'etag': '"553219792801242448c22bfe4a5c5566"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 731, 'rfilename': 'dummy/wh_vs_that_with_gap_long_distance/0.1.0/dummy_data.zip'}, {'key': 'nlp/datasets/blimp/urls_checksums/checksums.txt', 'etag': '"eed3a912a5a68248de27d7fa1c540c8e"', 'lastModified': '2020-05-14T14:57:19.000Z', 'size': 11410, 'rfilename': 'urls_checksums/checksums.txt'}]
πŸ‘‰ author: None

An example with SQuAD

In [0]:
# Downloading and loading a dataset

dataset = nlp.load_dataset('squad', split='validation[:10%]')
INFO:filelock:Lock 139884110310704 acquired on /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py.lock
INFO:nlp.utils.file_utils:https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/tmpd52q9bes
INFO:nlp.utils.file_utils:storing https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py in cache at /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py
INFO:filelock:Lock 139884110310704 released on /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py.lock
INFO:filelock:Lock 139886448054000 acquired on /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c.lock
INFO:nlp.utils.file_utils:https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/dataset_infos.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/tmp9kaastvw

INFO:nlp.utils.file_utils:storing https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/dataset_infos.json in cache at /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c
INFO:filelock:Lock 139886448054000 released on /root/.cache/huggingface/datasets/9ba53336b6bc977097b39b8527b06ec6ba3f60a44230f2a0a918735fcd8ad902.893fb39fe374e4c574667dd71a3017b7e2e1d196f3a34fb00b56bac805447f7c.lock
INFO:nlp.load:Checking /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py for additional imports.
INFO:filelock:Lock 139886448054000 acquired on /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py.lock
INFO:nlp.load:Found main folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py at /usr/local/lib/python3.6/dist-packages/nlp/datasets/squad
INFO:nlp.load:Found specific version folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py at /usr/local/lib/python3.6/dist-packages/nlp/datasets/squad/c0327553d80335e3a3283527f64d9778df7ad04ab28f38148d072782712bb670
INFO:nlp.load:Found script file from https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py to /usr/local/lib/python3.6/dist-packages/nlp/datasets/squad/c0327553d80335e3a3283527f64d9778df7ad04ab28f38148d072782712bb670/squad.py
INFO:nlp.load:Copying dataset infos file from https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/dataset_infos.json to /usr/local/lib/python3.6/dist-packages/nlp/datasets/squad/c0327553d80335e3a3283527f64d9778df7ad04ab28f38148d072782712bb670/dataset_infos.json
INFO:nlp.load:Creating metadata file for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/squad/squad.py at /usr/local/lib/python3.6/dist-packages/nlp/datasets/squad/c0327553d80335e3a3283527f64d9778df7ad04ab28f38148d072782712bb670/squad.json
INFO:filelock:Lock 139886448054000 released on /root/.cache/huggingface/datasets/09ec6948d9db29db9a2dcd08df97ac45bccfa6aa104ea62d73c97fa4aaa5cd6c.f373b0de1570ca81b50bb03bd371604f7979e35de2cfcf2a3b4521d0b3104d9b.py.lock
INFO:nlp.builder:No config specified, defaulting to first: squad/plain_text
INFO:nlp.info:Loading Dataset Infos from /usr/local/lib/python3.6/dist-packages/nlp/datasets/squad/c0327553d80335e3a3283527f64d9778df7ad04ab28f38148d072782712bb670
INFO:nlp.builder:Generating dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0)

INFO:nlp.builder:Dataset not on Hf google storage. Downloading and preparing it from source
Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.75 MiB, total: 119.27 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0...
INFO:filelock:Lock 139884104848496 acquired on /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739.lock
INFO:nlp.utils.file_utils:https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmp74l2ywcp
INFO:nlp.utils.file_utils:storing https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json in cache at /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739
INFO:filelock:Lock 139884104848496 released on /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739.lock

INFO:filelock:Lock 139884104848328 acquired on /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747.lock
INFO:nlp.utils.file_utils:https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpfhou_5to
INFO:nlp.utils.file_utils:storing https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json in cache at /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747
INFO:nlp.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747
INFO:filelock:Lock 139884104848328 released on /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747.lock
INFO:nlp.utils.info_utils:All the checksums matched successfully.
INFO:nlp.builder:Generating split train

INFO:root:generating examples from = /root/.cache/huggingface/datasets/downloads/b8bb19735e1bb591510a01cc032f4c9f969bc0eeb081ae1b328cd306f3b24008.961f90ccac96b3e5df3c9ebb533f58da8f3ae596f5418c74cc814af15b348739
INFO:nlp.arrow_writer:Done writing 87599 examples in 79317110 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0.incomplete/squad-train.arrow.
INFO:nlp.builder:Generating split validation

INFO:root:generating examples from = /root/.cache/huggingface/datasets/downloads/9d5462987ef5f814fe15a369c1724f6ec39a2018b3b6271a9d7d2598686ca2ff.01470d0bbaa4753fc1435055451f474b824c23e0dc139470f39b1f233bde8747
INFO:nlp.arrow_writer:Done writing 10570 examples in 10472653 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0.incomplete/squad-validation.arrow.
INFO:nlp.utils.info_utils:All the splits matched successfully.
INFO:nlp.builder:Constructing Dataset for split validation[:10%], from /root/.cache/huggingface/datasets/squad/plain_text/1.0.0
Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0. Subsequent calls will reuse this data.

This call to nlp.load_dataset() performs the following steps under the hood:

  1. Download and import in the library the SQuAD python processing script from the HuggingFace AWS bucket, if it's not already stored in the library. You can find the SQuAD processing script here, for instance.

    Processing scripts are small python scripts which define the info (citation, description) and format of the dataset, and contain the URL to the original SQuAD JSON files as well as the code to load examples from them.

  2. Run the SQuAD python processing script, which will:

    • Download the SQuAD dataset from the original URL (see the script) if it's not already downloaded and cached.
    • Process and cache the whole of SQuAD in a structured Arrow table for each standard split, stored on the drive.

      Arrow tables are arbitrarily long tables, typed with types that can be mapped to numpy/pandas/python standard types, and they can store nested objects. They can be directly accessed from drive, loaded in RAM or even streamed over the web.

  3. Return a dataset built from the split(s) asked by the user (default: all); in the above example we create a dataset with the first 10% of the validation split (see the split-syntax sketch just below).
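
For reference, the split argument accepts the library's tfds-style split strings, so you can slice and combine splits directly at load time. Below is a minimal sketch, assuming the percent-slicing and '+' concatenation forms of that split syntax (the variable names are purely illustrative):

train_set = nlp.load_dataset('squad', split='train')                         # full training split
small_val = nlp.load_dataset('squad', split='validation[:10%]')              # percent-based slice, as used above
mixed_set = nlp.load_dataset('squad', split='train[:100]+validation[:100]')  # concatenation of two slices
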
In [0]:
# Information on the dataset (description, citation, size, splits, format...)
# are provided in `dataset.info` (as a python dataclass)
for key, value in asdict(dataset.info).items():
    print('πŸ‘‰ ' + key + ': ' + str(value))
πŸ‘‰ description: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

πŸ‘‰ citation: @article{2016arXiv160605250R,
       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}

πŸ‘‰ homepage: https://rajpurkar.github.io/SQuAD-explorer/
πŸ‘‰ license: 
πŸ‘‰ features: {'id': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'title': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'context': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'question': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'answers': {'feature': {'text': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'answer_start': {'dtype': 'int32', 'id': None, '_type': 'Value'}}, 'length': -1, 'id': None, '_type': 'Sequence'}}
πŸ‘‰ supervised_keys: None
πŸ‘‰ builder_name: squad
πŸ‘‰ config_name: plain_text
πŸ‘‰ version: {'version_str': '1.0.0', 'description': 'New split API (https://tensorflow.org/datasets/splits)', 'nlp_version_to_prepare': None, 'major': 1, 'minor': 0, 'patch': 0}
πŸ‘‰ splits: {'train': {'name': 'train', 'num_bytes': 79317110, 'num_examples': 87599, 'dataset_name': 'squad'}, 'validation': {'name': 'validation', 'num_bytes': 10472653, 'num_examples': 10570, 'dataset_name': 'squad'}}
πŸ‘‰ download_checksums: {'https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json': {'num_bytes': 30288272, 'checksum': '3527663986b8295af4f7fcdff1ba1ff3f72d07d61a20f487cb238a6ef92fd955'}, 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json': {'num_bytes': 4854279, 'checksum': '95aa6a52d5d6a735563366753ca50492a658031da74f301ac5238b03966972c9'}}
πŸ‘‰ download_size: 35142551
πŸ‘‰ dataset_size: 89789763
πŸ‘‰ size_in_bytes: 124932314

Inspecting and using the dataset: elements, slices and columns

The returned Dataset object is a memory-mapped dataset that behaves similarly to a normal map-style dataset. It is backed by an Apache Arrow table, which enables many interesting features.

In [0]:
print(dataset)
Dataset(schema: {'id': 'string', 'title': 'string', 'context': 'string', 'question': 'string', 'answers': 'struct<text: list<item: string>, answer_start: list<item: int32>>'}, num_rows: 1057)

You can query its length and get items or slices as you would normally do with a python mapping.

In [0]:
from pprint import pprint

print(f"πŸ‘‰Dataset len(dataset): {len(dataset)}")
print("\nπŸ‘‰First item 'dataset[0]':")
pprint(dataset[0])
πŸ‘‰Dataset len(dataset): 1057

πŸ‘‰First item 'dataset[0]':
{'answers': {'answer_start': [177, 177, 177],
             'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the '
            'champion of the National Football League (NFL) for the 2015 '
            'season. The American Football Conference (AFC) champion Denver '
            'Broncos defeated the National Football Conference (NFC) champion '
            'Carolina Panthers 24–10 to earn their third Super Bowl title. The '
            "game was played on February 7, 2016, at Levi's Stadium in the San "
            'Francisco Bay Area at Santa Clara, California. As this was the '
            '50th Super Bowl, the league emphasized the "golden anniversary" '
            'with various gold-themed initiatives, as well as temporarily '
            'suspending the tradition of naming each Super Bowl game with '
            'Roman numerals (under which the game would have been known as '
            '"Super Bowl L"), so that the logo could prominently feature the '
            'Arabic numerals 50.',
 'id': '56be4db0acb8001400a502ec',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'title': 'Super_Bowl_50'}
In [0]:
# Or get slices with several examples:
print("\nπŸ‘‰Slice of the two items 'dataset[10:12]':")
pprint(dataset[10:12])
πŸ‘‰Slice of the two items 'dataset[10:12]':
OrderedDict([('id', ['56bea9923aeaaa14008c91bb', '56beace93aeaaa14008c91df']),
             ('title', ['Super_Bowl_50', 'Super_Bowl_50']),
             ('context',
              ['Super Bowl 50 was an American football game to determine the '
               'champion of the National Football League (NFL) for the 2015 '
               'season. The American Football Conference (AFC) champion Denver '
               'Broncos defeated the National Football Conference (NFC) '
               'champion Carolina Panthers 24–10 to earn their third Super '
               "Bowl title. The game was played on February 7, 2016, at Levi's "
               'Stadium in the San Francisco Bay Area at Santa Clara, '
               'California. As this was the 50th Super Bowl, the league '
               'emphasized the "golden anniversary" with various gold-themed '
               'initiatives, as well as temporarily suspending the tradition '
               'of naming each Super Bowl game with Roman numerals (under '
               'which the game would have been known as "Super Bowl L"), so '
               'that the logo could prominently feature the Arabic numerals '
               '50.',
               'Super Bowl 50 was an American football game to determine the '
               'champion of the National Football League (NFL) for the 2015 '
               'season. The American Football Conference (AFC) champion Denver '
               'Broncos defeated the National Football Conference (NFC) '
               'champion Carolina Panthers 24–10 to earn their third Super '
               "Bowl title. The game was played on February 7, 2016, at Levi's "
               'Stadium in the San Francisco Bay Area at Santa Clara, '
               'California. As this was the 50th Super Bowl, the league '
               'emphasized the "golden anniversary" with various gold-themed '
               'initiatives, as well as temporarily suspending the tradition '
               'of naming each Super Bowl game with Roman numerals (under '
               'which the game would have been known as "Super Bowl L"), so '
               'that the logo could prominently feature the Arabic numerals '
               '50.']),
             ('question',
              ['What day was the Super Bowl played on?',
               'Who won Super Bowl 50?']),
             ('answers',
              [{'answer_start': [334, 334, 334],
                'text': ['February 7, 2016', 'February 7', 'February 7, 2016']},
               {'answer_start': [177, 177, 177],
                'text': ['Denver Broncos',
                         'Denver Broncos',
                         'Denver Broncos']}])])
In [0]:
# You can get a full column of the dataset by indexing with its name as a string:
print(dataset['question'][:10])
['Which NFL team represented the AFC at Super Bowl 50?', 'Which NFL team represented the NFC at Super Bowl 50?', 'Where did Super Bowl 50 take place?', 'Which NFL team won Super Bowl 50?', 'What color was used to emphasize the 50th anniversary of the Super Bowl?', 'What was the theme of Super Bowl 50?', 'What day was the game played on?', 'What is the AFC short for?', 'What was the theme of Super Bowl 50?', 'What does AFC stand for?']

The __getitem__ method will return different formats depending on the type of query:

  • Items like dataset[0] are returned as a dict of elements.
  • Slices like dataset[10:20] are returned as a dict of lists of elements.
  • Columns like dataset['question'] are returned as a list of elements.

This may seem surprising at first, but in our experiments it's actually a lot easier to use for data processing than returning the same format for each of these views on the dataset.

In particular, you can easily iterate along columns in slices, and also naturally permute consecutive indexings with identical results, as shown here by permuting column indexing with element and slice indexing:

In [0]:
print(dataset[0]['question'] == dataset['question'][0])
print(dataset[10:20]['context'] == dataset['context'][10:20])
True
True

Datasets are internally typed and structured

The dataset is backed by one (or several) Apache Arrow tables, which are typed and allow for fast retrieval and access as well as arbitrary-size memory mapping.

This means, respectively, that the format of the dataset is clearly defined and that you can load datasets of arbitrary size without worrying about RAM limitations (basically the dataset takes no space in RAM; it's read directly from drive when needed, with fast IO access).

In [0]:
# You can inspect the dataset column names and types
print(dataset.column_names)
print(dataset.schema)
['id', 'title', 'context', 'question', 'answers']
id: string not null
title: string not null
context: string not null
question: string not null
answers: struct<text: list<item: string>, answer_start: list<item: int32>> not null
  child 0, text: list<item: string>
      child 0, item: string
  child 1, answer_start: list<item: int32>
      child 0, item: int32

Additional misc properties

In [0]:
# Datasets also have a bunch of properties you can access
print("The number of bytes allocated on the drive is ", dataset.nbytes)
print("For comparison, here is the number of bytes allocated in memory which can be")
print("accessed with `nlp.total_allocated_bytes()`: ", nlp.total_allocated_bytes())
print("The number of rows", dataset.num_rows)
print("The number of columns", dataset.num_columns)
print("The shape (rows, columns)", dataset.shape)
The number of bytes allocated on the drive is  9855914
For comparison, here is the number of bytes allocated in memory which can be
accessed with `nlp.total_allocated_bytes()`:  0
The number of rows 1057
The number of columns 5
The shape (rows, columns) (1057, 5)

Additional misc methods

In [0]:
# We can list the unique elements in a column. This is done by the backend (so fast!)
print(f"dataset.unique('title'): {dataset.unique('title')}")

# This will drop the column 'id'
dataset.drop('id')  # Remove column 'id'
print(f"After dataset.drop('id'), remaining columns are {dataset.column_names}")

# This will flatten nested columns (in 'answers' in our case)
dataset.flatten()
print(f"After dataset.flatten(), column names are {dataset.column_names}")

# We can also "dictionary encode" a column if many of its elements are similar
# This will reduce its size by only storing the distinct elements (e.g. strings)
# It only affects the internal storage (no difference from a user point of view)
dataset.dictionary_encode_column('title')
dataset.unique('title'): ['Super_Bowl_50', 'Warsaw']
After dataset.drop('id'), remaining columns are ['title', 'context', 'question', 'answers']
After dataset.flatten(), column names are ['title', 'context', 'question', 'answers.text', 'answers.answer_start']

Cache

nlp datasets are backed by Apache Arrow cache files, which allow:

  • to load arbitrarily large datasets by using memory mapping (as long as the datasets can fit on the drive)
  • to use a fast backend to process the dataset efficiently
  • to do smart caching by storing and reusing the results of operations performed on the drive

Let's dive a bit into these aspects now.

You can check the current cache files backing the dataset with the .cache_files property.

In [0]:
dataset.cache_files
Out[0]:
({'filename': '/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/squad-validation.arrow',
  'skip': 0,
  'take': 1057},)

You can clean up the cache files in the current dataset directory (only keeping the currently used one) with .cleanup_cache_files().

Be careful that no other process is still using the other cache files when running this command.

In [0]:
dataset.cleanup_cache_files()  # Returns the number of removed cache files
INFO:nlp.arrow_dataset:Listing files in /root/.cache/huggingface/datasets/squad/plain_text/1.0.0
Out[0]:
0

Modifying the dataset with dataset.map

There is a powerful .map() method, inspired by the tf.data map method, which you can use to apply a function to each example, independently or in batches.

In [0]:
# `.map()` takes a callable accepting a dict as argument
# (same dict as returned by dataset[i])
# and iterates over the dataset by calling the function with each example.

# Let's print the length of each `context` string in our subset of the dataset
# (10% of the validation split, i.e. 1057 examples)

dataset.map(lambda example: print(len(example['context']), end=','))
775,775,
0it [00:00, ?it/s]
775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,775,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,637,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,347,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,394,179,179,179,179,179,179,179,179,179,179,179,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,168,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,638,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,326,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,704,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,917,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1271,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,1166,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,2060,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,929,704,704,704,704,704,704,704,704,704,704,704,704,704,704,353,353,353,353,353,353,353,353,353,353,353,353,353,353,353,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,464,306,306,306,306,306,306,306,306,306,306,306,306,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,372,496,496,496,496,496,496,496,496,496,496,496,496,496,496,496,260,260,260,260,260,260,260,260,260,874,874,874,874,874,874,874,874,874,874,874,874,874,874,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,1025,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,176,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,782,536,536,536,536,536,536,536,536,536,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,666,495,495,495,495,495,495,495,495,495,495,495,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,385,441,441,441,441,441,441,441,441,441,441,441,357,357,357,357,357,357,357,357,357,296,296,296,296,296,296,296,296,296,296,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,644,804,
637it [00:00, 6365.64it/s]
804,804,804,804,804,804,804,804,804,804,397,397,397,397,397,397,397,397,397,397,397,397,397,397,360,360,360,360,360,360,360,973,973,973,973,973,973,973,973,973,973,973,973,973,973,263,263,263,263,263,263,263,263,263,263,263,568,568,568,568,568,568,568,568,568,568,568,264,264,264,264,264,264,264,264,264,264,264,264,264,264,264,892,892,892,892,892,892,892,892,892,892,892,206,206,206,206,206,489,489,489,489,489,489,489,489,489,489,489,489,489,181,181,181,181,181,181,181,181,181,181,181,181,531,531,531,531,531,531,531,531,531,531,531,531,664,664,664,664,664,664,664,664,664,664,664,664,664,664,672,672,672,672,672,672,672,672,672,672,672,672,672,672,858,858,858,858,858,858,858,858,858,858,858,858,634,634,634,634,634,634,634,634,634,634,634,634,634,634,891,891,891,891,891,891,891,891,891,891,891,891,891,488,488,488,488,488,488,488,488,488,488,488,488,942,942,942,942,942,942,942,942,942,942,942,942,942,942,942,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1162,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,1353,522,522,522,522,522,1643,1643,1643,1643,1643,628,628,628,628,628,758,758,758,758,758,883,883,883,883,883,559,559,559,559,559,603,603,603,603,631,631,631,631,631,626,626,626,626,626,541,541,541,541,541,795,795,795,795,795,591,591,591,591,591,568,568,568,568,568,536,536,536,536,536,575,575,575,575,575,571,571,571,571,571,641,641,641,641,641,665,665,665,665,665
899it [00:00, 4413.66it/s]
,1088,1088,1088,1088,1088,1619,1619,1619,1619,1619,939,939,939,939,939,865,865,865,865,865,711,711,711,711,711,831,831,831,831,831,501,501,501,501,501,676,676,676,676,676,854,854,854,854,854,784,784,784,784,784,641,641,641,641,641,544,544,544,544,544,918,918,918,918,918,763,763,763,763,763,906,906,906,906,906,632,632,632,632,632,869,869,869,869,869,1044,1044,1044,1044,1044,760,760,760,760,760,715,715,715,715,715,838,838,838,838,838,881,881,881,881,881,940,940,940,940,940,618,618,618,618,618,1205,1205,1205,534,534,534,534,534,757,757,757,757,757,1239,1239,1239,1239,1239,609,609,609,609,609,798,798,798,798,798,613,613,613,613,613,613,
1057it [00:00, 4215.63it/s]
613,613,613,613,

Out[0]:
Dataset(schema: {'title': 'string', 'context': 'string', 'question': 'string', 'answers.text': 'list<item: string>', 'answers.answer_start': 'list<item: int32>'}, num_rows: 1057)

This is basically the same as doing

for example in dataset:
    function(example)

The above example had no effect on the dataset because the method we supplied to .map() didn't return a dict or an abc.Mapping that could be used to update the examples in the dataset.

In such a case, .map() will return the same dataset (self).

Now let's see how we can use a method that actually modifies the dataset.

Modifying the dataset example by example

The main interest of .map() is to update and modify the content of the table while leveraging smart caching and the fast backend.

To use .map() to update elements in the table you need to provide a function with the following signature: function(example: dict) -> dict.

In [0]:
# Let's add a prefix 'My cute title: ' to each of our titles

def add_prefix_to_title(example):
    example['title'] = 'My cute title: ' + example['title']
    return example

dataset = dataset.map(add_prefix_to_title)

print(dataset.unique('title'))
INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7fc546b401ec7a73d642e3460f4bcaa3.arrow
1057it [00:00, 13900.01it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 905032 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-7fc546b401ec7a73d642e3460f4bcaa3.arrow.
['My cute title: Super_Bowl_50', 'My cute title: Warsaw']

This call to .map() computes and returns the updated table. It will also store the updated table in a cache file indexed by the current state and the mapped function.

A subsequent call to .map() (even in another python session) will reuse the cached file instead of recomputing the operation.

You can test this by running the previous cell again: you will see that the result is loaded directly from the cache and not re-computed.

The updated dataset returned by .map() is (again) directly memory mapped from drive and not allocated in RAM.

The function you provide to .map() should accept an input with the format of an item of the dataset: function(dataset[0]) and return a python dict.

The columns and types of the outputs can be different from those of the input dict. In this case the new keys will be added as additional columns in the dataset.

Basically, each dataset example dict is updated with the dictionary returned by the function, like this: example.update(function(example)).

In [0]:
# Since the input example dict is updated with our function output dict,
# we can actually just return the updated 'title' field
dataset = dataset.map(lambda example: {'title': 'My cutest title: ' + example['title']})

print(dataset.unique('title'))
INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-e254729a165001477fc910898551132f.arrow
1057it [00:00, 12758.48it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 923001 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-e254729a165001477fc910898551132f.arrow.
['My cutest title: My cute title: Super_Bowl_50', 'My cutest title: My cute title: Warsaw']

Removing columns

You can also remove columns when running map with the remove_columns=List[str] argument.

In [0]:
# This will remove the 'title' column while doing the update (after having sent it to the mapped function, so you can still use it in your function!)
dataset = dataset.map(lambda example: {'new_title': 'Wouhahh: ' + example['title']},
                     remove_columns=['title'])

print(dataset.column_names)
print(dataset.unique('new_title'))
INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-319ffdab1a236b2101739c4b33dc26d8.arrow
1057it [00:00, 12976.87it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 932514 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-319ffdab1a236b2101739c4b33dc26d8.arrow.
['context', 'question', 'answers.text', 'answers.answer_start', 'new_title']
['Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'Wouhahh: My cutest title: My cute title: Warsaw']

Using example indices

With with_indices=True, dataset indices (from 0 to len(dataset)) will be supplied to the function, which must thus have the following signature: function(example: dict, index: int) -> dict

In [0]:
# This will add the index in the dataset to the 'question' field
dataset = dataset.map(lambda example, idx: {'question': f'{idx}: ' + example['question']},
                      with_indices=True)

print('\n'.join(dataset['question'][:5]))
INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-0d7046ac832c326979b2f70469eac9fa.arrow
1057it [00:00, 13039.70it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 937746 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-0d7046ac832c326979b2f70469eac9fa.arrow.
0: Which NFL team represented the AFC at Super Bowl 50?
1: Which NFL team represented the NFC at Super Bowl 50?
2: Where did Super Bowl 50 take place?
3: Which NFL team won Super Bowl 50?
4: What color was used to emphasize the 50th anniversary of the Super Bowl?

Modifying the dataset with batched updates

.map() can also work with batches of examples (slices of the dataset).

This is particularly interesting if you have a function that can handle batch of inputs like the tokenizers of HuggingFace tokenizers.

To work on batched inputs set batched=True when calling .map() and supply a function with the following signature: function(examples: Dict[List]) -> Dict[List] or, if you use indices, function(examples: Dict[List], indices: List[int]) -> Dict[List]).

Bascially, your function should accept an input with the format of a slice of the dataset: function(dataset[:10]).
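
As a minimal sketch before moving to the tokenizers, a hypothetical batched function (the context_length column is only illustrative) receives a dict of lists and returns a dict of lists with one entry per example in the slice:

# Hypothetical batched function: batch['context'] is a list of contexts for the current slice
def add_context_length(batch):
    return {'context_length': [len(context) for context in batch['context']]}

dataset_with_lengths = dataset.map(add_context_length, batched=True)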

In [0]:
!pip install transformers
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/12/b5/ac41e3e95205ebf53439e4dd087c58e9fd371fd8e3724f2b9b4cdb8282e5/transformers-2.10.0-py3-none-any.whl (660kB)
     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 665kB 3.5MB/s 
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.1MB 17.6MB/s 
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers) (3.0.12)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers) (1.18.4)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers) (4.41.1)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 890kB 25.9MB/s 
Requirement already satisfied: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from transformers) (0.7)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers) (2.23.0)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
     |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3.8MB 34.4MB/s 
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers) (2019.12.20)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (1.12.0)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (7.1.2)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (0.15.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2020.4.5.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2.9)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893260 sha256=7a20f1b539ae5c37ce1c58b61c7f0f1d942d292d2fa7f27f45cb064a66621ea5
  Stored in directory: /root/.cache/pip/wheels/29/3c/fd/7ce5c3f0666dab31a50123635e6fb5e19ceb42ce38d4e58f45
Successfully built sacremoses
Installing collected packages: sentencepiece, sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.43 sentencepiece-0.1.91 tokenizers-0.7.0 transformers-2.10.0
In [0]:
# Let's import a fast tokenizer that can work on batched inputs
# (the 'Fast' tokenizers in HuggingFace)
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
INFO:transformers.file_utils:PyTorch version 1.5.0+cu101 available.
INFO:transformers.file_utils:TensorFlow version 2.2.0 available.
INFO:filelock:Lock 139884348804680 acquired on /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpbrrc_uwe
INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt in cache at /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:filelock:Lock 139884348804680 released on /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1.lock
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /root/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1

In [0]:
# Now let's batch tokenize our dataset 'context'
dataset = dataset.map(lambda example: tokenizer.batch_encode_plus(example['context']),
                      batched=True)

print("dataset[0]", dataset[0])
INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-4c8436e14fee9674f678b8735b43c65e.arrow
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  3.54it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 4749270 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-4c8436e14fee9674f678b8735b43c65e.arrow.
dataset[0] {'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': '0: Which NFL team represented the AFC at Super Bowl 50?', 'answers.text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answers.answer_start': [177, 177, 177], 'new_title': 'Wouhahh: My cutest title: My cute title: Super_Bowl_50', 'input_ids': [101, 3198, 5308, 1851, 1108, 1126, 1237, 1709, 1342, 1106, 4959, 1103, 3628, 1104, 1103, 1305, 2289, 1453, 113, 4279, 114, 1111, 1103, 1410, 1265, 119, 1109, 1237, 2289, 3047, 113, 10402, 114, 3628, 7068, 14722, 2378, 1103, 1305, 2289, 3047, 113, 24743, 114, 3628, 2938, 13598, 1572, 782, 1275, 1106, 7379, 1147, 1503, 3198, 5308, 1641, 119, 1109, 1342, 1108, 1307, 1113, 1428, 128, 117, 1446, 117, 1120, 12388, 112, 188, 3339, 1107, 1103, 1727, 2948, 2410, 3894, 1120, 3364, 10200, 117, 1756, 119, 1249, 1142, 1108, 1103, 13163, 3198, 5308, 117, 1103, 2074, 13463, 1103, 107, 5404, 5453, 107, 1114, 1672, 2284, 118, 12005, 11751, 117, 1112, 1218, 1112, 7818, 28117, 20080, 16264, 1103, 3904, 1104, 10505, 1296, 3198, 5308, 1342, 1114, 2264, 183, 15447, 16179, 113, 1223, 1134, 1103, 1342, 1156, 1138, 1151, 1227, 1112, 107, 3198, 5308, 149, 107, 114, 117, 1177, 1115, 1103, 7998, 1180, 15199, 2672, 1103, 4944, 183, 15447, 16179, 1851, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
In [0]:
# we have added additional columns
print(dataset.column_names)
['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask']
In [0]:
# Let's show a more complex processing example: the full preparation of the SQuAD dataset
# for training a model from Transformers
def convert_to_features(batch):
    # Tokenize contexts and questions (as pairs of inputs)
    # keep offset mappings for evaluation
    input_pairs = list(zip(batch['context'], batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs,
                                            pad_to_max_length=True,
                                            return_offsets_mapping=True)

    # Compute start and end tokens for labels
    start_positions, end_positions = [], []
    for i, (text, start) in enumerate(zip(batch['answers.text'], batch['answers.answer_start'])):
        first_char = start[0]
        last_char = first_char + len(text[0]) - 1
        start_positions.append(encodings.char_to_token(i, first_char))
        end_positions.append(encodings.char_to_token(i, last_char))

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
    return encodings

dataset = dataset.map(convert_to_features, batched=True)
INFO:nlp.arrow_dataset:Caching processed dataset at /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-3cceeef76f89add124dd3c1c12d2f776.arrow
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  2.50it/s]
INFO:nlp.arrow_writer:Done writing 1057 examples in 21643250 bytes /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/cache-3cceeef76f89add124dd3c1c12d2f776.arrow.
In [0]:
# Now our dataset comprises the labels for the start and end positions
# as well as the offset mappings for converting tokens back
# to spans of the original string for evaluation
print("column_names", dataset.column_names)
print("start_positions", dataset[:5]['start_positions'])
column_names ['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']
start_positions [34, 45, 80, 34, 98]

Formatting outputs for numpy/torch/tensorflow

Now that we have tokenized our inputs, we probably want to use this dataset in a torch.utils.data.DataLoader or a tf.data.Dataset.

To be able to do this we need to tweak two things:

  • format the indexing (__getitem__) to return numpy/pytorch/tensorflow tensors, instead of python objects, and probably
  • format the indexing (__getitem__) to return only the subset of the columns that we need for our model inputs.

    We don't want the columns id or title as inputs to train our model, but we could still want to keep them in the dataset, for instance for the evaluation of the model.

This is handled by the .set_format(type: Union[None, str], columns: Union[None, str, List[str]]) method, where:

  β€’ type defines the return type for our dataset __getitem__ method and is one of [None, 'numpy', 'pandas', 'torch', 'tensorflow'] (None means return python objects), and
  β€’ columns defines the columns returned by __getitem__ and takes either the name of a single column in the dataset or a list of columns to return (None means return all columns).
In [0]:
columns_to_return = ['input_ids', 'token_type_ids', 'attention_mask',
                     'start_positions', 'end_positions']

dataset.set_format(type='torch',
                   columns=columns_to_return)

# Our dataset indexing output is now ready to be used in a PyTorch dataloader
print('\n'.join([' '.join((n, str(type(t)), str(t.shape))) for n, t in dataset[:10].items()]))
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to torch for ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'] columns  (when key is int or slice) and don't output other (un-formated) columns.
input_ids <class 'torch.Tensor'> torch.Size([10, 451])
token_type_ids <class 'torch.Tensor'> torch.Size([10, 451])
attention_mask <class 'torch.Tensor'> torch.Size([10, 451])
start_positions <class 'torch.Tensor'> torch.Size([10])
end_positions <class 'torch.Tensor'> torch.Size([10])
In [0]:
# Note that the columns are not removed from the dataset, just not returned when calling __getitem__
# Similarly, the inner type of the dataset is not changed to torch.Tensor; the conversion and filtering are done on-the-fly when querying the dataset
print(dataset.column_names)
['context', 'question', 'answers.text', 'answers.answer_start', 'new_title', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions']
In [0]:
# We can remove the formatting with `.reset_format()`
# or, identically, a call to `.set_format()` with no arguments
dataset.reset_format()

print('\n'.join([' '.join((n, str(type(t)))) for n, t in dataset[:10].items()]))
INFO:nlp.arrow_dataset:Set __getitem__(key) output type to python objects for no columns  (when key is int or slice) and don't output other (un-formated) columns.
context <class 'list'>
question <class 'list'>
answers.text <class 'list'>
answers.answer_start <class 'list'>
new_title <class 'list'>
input_ids <class 'list'>
token_type_ids <class 'list'>
attention_mask <class 'list'>
offset_mapping <class 'list'>
start_positions <class 'list'>
end_positions <class 'list'>
In [0]:
# The current format can be checked with `.format`,
# which is a dict of the type and formatting
dataset.format
Out[0]:
{'columns': ['context',
  'question',
  'answers.text',
  'answers.answer_start',
  'new_title',
  'input_ids',
  'token_type_ids',
  'attention_mask',
  'offset_mapping',
  'start_positions',
  'end_positions'],
 'output_all_columns': False,
 'type': 'python'}
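
The format dict above also exposes an output_all_columns flag. Assuming set_format accepts it as a keyword argument (an assumption here, mirroring the key shown in dataset.format), you can keep returning the un-formatted columns as python objects alongside the formatted ones:

# Assumed keyword argument mirroring the 'output_all_columns' key of dataset.format:
# the selected columns come out as torch.Tensor, the remaining columns as python objects
dataset.set_format(type='torch', columns=columns_to_return, output_all_columns=True)
print(list(dataset[0].keys()))
dataset.reset_format()  # back to plain python objects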

Wrapping this all up (PyTorch)

Let's wrap this all up with the full code to load and prepare SQuAD for training a PyTorch model from the HuggingFace transformers library.

In [13]:
!pip install transformers
Requirement already satisfied: transformers in /usr/local/lib/python3.6/dist-packages (2.10.0)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.6/dist-packages (from transformers) (0.1.91)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers) (3.0.12)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers) (2019.12.20)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers) (4.41.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers) (1.18.4)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.6/dist-packages (from transformers) (0.0.43)
Requirement already satisfied: tokenizers==0.7.0 in /usr/local/lib/python3.6/dist-packages (from transformers) (0.7.0)
Requirement already satisfied: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from transformers) (0.7)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (7.1.2)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (1.12.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (0.15.1)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2.9)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2020.4.5.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (3.0.4)
In [0]:
import nlp
import torch 
from transformers import BertTokenizerFast

# Load our training dataset and tokenizer
dataset = nlp.load_dataset('squad')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

def get_correct_alignement(context, answer):
    """ Some original examples in SQuAD have indices wrong by 1 or 2 character. We test and fix this here. """
    gold_text = answer['text'][0]
    start_idx = answer['answer_start'][0]
    end_idx = start_idx + len(gold_text)
    if context[start_idx:end_idx] == gold_text:
        return start_idx, end_idx       # When the gold label position is good
    elif context[start_idx-1:end_idx-1] == gold_text:
        return start_idx-1, end_idx-1   # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
        return start_idx-2, end_idx-2   # When the gold label is off by two characters
    else:
        raise ValueError()

# Tokenize our training dataset
def convert_to_features(example_batch):
    # Tokenize contexts and questions (as pairs of inputs)
    input_pairs = list(zip(example_batch['context'], example_batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True)

    # Compute start and end token positions for the labels using the fast tokenizers' alignment methods.
    start_positions, end_positions = [], []
    for i, (context, answer) in enumerate(zip(example_batch['context'], example_batch['answers'])):
        start_idx, end_idx = get_correct_alignement(context, answer)
        start_positions.append(encodings.char_to_token(i, start_idx))
        end_positions.append(encodings.char_to_token(i, end_idx-1))
    encodings.update({'start_positions': start_positions,
                      'end_positions': end_positions})
    return encodings

dataset['train'] = dataset['train'].map(convert_to_features, batched=True)

# Format our dataset to output torch.Tensor objects to train a PyTorch model
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
dataset['train'].set_format(type='torch', columns=columns)

# Instantiate a PyTorch Dataloader around our dataset
dataloader = torch.utils.data.DataLoader(dataset['train'], batch_size=8)
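
As an optional sanity check, we can peek at the first batch produced by the dataloader; every value should be a torch.Tensor with a leading batch dimension of 8:

# Inspect one batch from the dataloader (shapes only)
first_batch = next(iter(dataloader))
for name, tensor in first_batch.items():
    print(name, type(tensor), tensor.shape)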
In [0]:
# Let's load a pretrained Bert model and a simple optimizer
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('distilbert-base-cased')
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
INFO:filelock:Lock 139884094601256 acquired on /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp1uhk_b1k
INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json in cache at /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494
INFO:filelock:Lock 139884094601256 released on /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494.lock
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json from cache at /root/.cache/torch/transformers/774d52b0be7c2f621ac9e64708a8b80f22059f6d0e264e1bdc4f4d71c386c4ea.f44aaaab97e2ee0f8d9071a5cd694e19bf664237a92aea20ebe04ddb7097b494
INFO:transformers.configuration_utils:Model config BertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_act": "gelu",
  "hidden_dim": 3072,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "n_heads": 12,
  "n_layers": 6,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "type_vocab_size": 2,
  "vocab_size": 28996
}


INFO:filelock:Lock 139884094601256 acquired on /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d.lock
INFO:transformers.file_utils:https://cdn.huggingface.co/distilbert-base-cased-pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp8t3mu3iu
INFO:transformers.file_utils:storing https://cdn.huggingface.co/distilbert-base-cased-pytorch_model.bin in cache at /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d
INFO:transformers.file_utils:creating metadata file for /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d
INFO:filelock:Lock 139884094601256 released on /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d.lock
INFO:transformers.modeling_utils:loading weights file https://cdn.huggingface.co/distilbert-base-cased-pytorch_model.bin from cache at /root/.cache/torch/transformers/185eb053d63bc5c2d6994e4b2a8e5eb59f31af90db9c5fae5e38c32a986462cb.857b7d17ad0bfaa2eec50caf481575bab1073303fef16bd5f29bc5248b2b8c7d

INFO:transformers.modeling_utils:Weights of BertForQuestionAnswering not initialized from pretrained model: ['embeddings.word_embeddings.weight', 'embeddings.position_embeddings.weight', 'embeddings.token_type_embeddings.weight', 'embeddings.LayerNorm.weight', 'embeddings.LayerNorm.bias', 'encoder.layer.0.attention.self.query.weight', 'encoder.layer.0.attention.self.query.bias', 'encoder.layer.0.attention.self.key.weight', 'encoder.layer.0.attention.self.key.bias', 'encoder.layer.0.attention.self.value.weight', 'encoder.layer.0.attention.self.value.bias', 'encoder.layer.0.attention.output.dense.weight', 'encoder.layer.0.attention.output.dense.bias', 'encoder.layer.0.attention.output.LayerNorm.weight', 'encoder.layer.0.attention.output.LayerNorm.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.1.attention.self.query.weight', 'encoder.layer.1.attention.self.query.bias', 'encoder.layer.1.attention.self.key.weight', 'encoder.layer.1.attention.self.key.bias', 'encoder.layer.1.attention.self.value.weight', 'encoder.layer.1.attention.self.value.bias', 'encoder.layer.1.attention.output.dense.weight', 'encoder.layer.1.attention.output.dense.bias', 'encoder.layer.1.attention.output.LayerNorm.weight', 'encoder.layer.1.attention.output.LayerNorm.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.1.output.dense.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.1.output.LayerNorm.bias', 'encoder.layer.2.attention.self.query.weight', 'encoder.layer.2.attention.self.query.bias', 'encoder.layer.2.attention.self.key.weight', 'encoder.layer.2.attention.self.key.bias', 'encoder.layer.2.attention.self.value.weight', 'encoder.layer.2.attention.self.value.bias', 'encoder.layer.2.attention.output.dense.weight', 'encoder.layer.2.attention.output.dense.bias', 'encoder.layer.2.attention.output.LayerNorm.weight', 'encoder.layer.2.attention.output.LayerNorm.bias', 'encoder.layer.2.intermediate.dense.weight', 'encoder.layer.2.intermediate.dense.bias', 'encoder.layer.2.output.dense.weight', 'encoder.layer.2.output.dense.bias', 'encoder.layer.2.output.LayerNorm.weight', 'encoder.layer.2.output.LayerNorm.bias', 'encoder.layer.3.attention.self.query.weight', 'encoder.layer.3.attention.self.query.bias', 'encoder.layer.3.attention.self.key.weight', 'encoder.layer.3.attention.self.key.bias', 'encoder.layer.3.attention.self.value.weight', 'encoder.layer.3.attention.self.value.bias', 'encoder.layer.3.attention.output.dense.weight', 'encoder.layer.3.attention.output.dense.bias', 'encoder.layer.3.attention.output.LayerNorm.weight', 'encoder.layer.3.attention.output.LayerNorm.bias', 'encoder.layer.3.intermediate.dense.weight', 'encoder.layer.3.intermediate.dense.bias', 'encoder.layer.3.output.dense.weight', 'encoder.layer.3.output.dense.bias', 'encoder.layer.3.output.LayerNorm.weight', 'encoder.layer.3.output.LayerNorm.bias', 'encoder.layer.4.attention.self.query.weight', 'encoder.layer.4.attention.self.query.bias', 'encoder.layer.4.attention.self.key.weight', 'encoder.layer.4.attention.self.key.bias', 'encoder.layer.4.attention.self.value.weight', 'encoder.layer.4.attention.self.value.bias', 'encoder.layer.4.attention.output.dense.weight', 'encoder.layer.4.attention.output.dense.bias', 
'encoder.layer.4.attention.output.LayerNorm.weight', 'encoder.layer.4.attention.output.LayerNorm.bias', 'encoder.layer.4.intermediate.dense.weight', 'encoder.layer.4.intermediate.dense.bias', 'encoder.layer.4.output.dense.weight', 'encoder.layer.4.output.dense.bias', 'encoder.layer.4.output.LayerNorm.weight', 'encoder.layer.4.output.LayerNorm.bias', 'encoder.layer.5.attention.self.query.weight', 'encoder.layer.5.attention.self.query.bias', 'encoder.layer.5.attention.self.key.weight', 'encoder.layer.5.attention.self.key.bias', 'encoder.layer.5.attention.self.value.weight', 'encoder.layer.5.attention.self.value.bias', 'encoder.layer.5.attention.output.dense.weight', 'encoder.layer.5.attention.output.dense.bias', 'encoder.layer.5.attention.output.LayerNorm.weight', 'encoder.layer.5.attention.output.LayerNorm.bias', 'encoder.layer.5.intermediate.dense.weight', 'encoder.layer.5.intermediate.dense.bias', 'encoder.layer.5.output.dense.weight', 'encoder.layer.5.output.dense.bias', 'encoder.layer.5.output.LayerNorm.weight', 'encoder.layer.5.output.LayerNorm.bias', 'encoder.layer.6.attention.self.query.weight', 'encoder.layer.6.attention.self.query.bias', 'encoder.layer.6.attention.self.key.weight', 'encoder.layer.6.attention.self.key.bias', 'encoder.layer.6.attention.self.value.weight', 'encoder.layer.6.attention.self.value.bias', 'encoder.layer.6.attention.output.dense.weight', 'encoder.layer.6.attention.output.dense.bias', 'encoder.layer.6.attention.output.LayerNorm.weight', 'encoder.layer.6.attention.output.LayerNorm.bias', 'encoder.layer.6.intermediate.dense.weight', 'encoder.layer.6.intermediate.dense.bias', 'encoder.layer.6.output.dense.weight', 'encoder.layer.6.output.dense.bias', 'encoder.layer.6.output.LayerNorm.weight', 'encoder.layer.6.output.LayerNorm.bias', 'encoder.layer.7.attention.self.query.weight', 'encoder.layer.7.attention.self.query.bias', 'encoder.layer.7.attention.self.key.weight', 'encoder.layer.7.attention.self.key.bias', 'encoder.layer.7.attention.self.value.weight', 'encoder.layer.7.attention.self.value.bias', 'encoder.layer.7.attention.output.dense.weight', 'encoder.layer.7.attention.output.dense.bias', 'encoder.layer.7.attention.output.LayerNorm.weight', 'encoder.layer.7.attention.output.LayerNorm.bias', 'encoder.layer.7.intermediate.dense.weight', 'encoder.layer.7.intermediate.dense.bias', 'encoder.layer.7.output.dense.weight', 'encoder.layer.7.output.dense.bias', 'encoder.layer.7.output.LayerNorm.weight', 'encoder.layer.7.output.LayerNorm.bias', 'encoder.layer.8.attention.self.query.weight', 'encoder.layer.8.attention.self.query.bias', 'encoder.layer.8.attention.self.key.weight', 'encoder.layer.8.attention.self.key.bias', 'encoder.layer.8.attention.self.value.weight', 'encoder.layer.8.attention.self.value.bias', 'encoder.layer.8.attention.output.dense.weight', 'encoder.layer.8.attention.output.dense.bias', 'encoder.layer.8.attention.output.LayerNorm.weight', 'encoder.layer.8.attention.output.LayerNorm.bias', 'encoder.layer.8.intermediate.dense.weight', 'encoder.layer.8.intermediate.dense.bias', 'encoder.layer.8.output.dense.weight', 'encoder.layer.8.output.dense.bias', 'encoder.layer.8.output.LayerNorm.weight', 'encoder.layer.8.output.LayerNorm.bias', 'encoder.layer.9.attention.self.query.weight', 'encoder.layer.9.attention.self.query.bias', 'encoder.layer.9.attention.self.key.weight', 'encoder.layer.9.attention.self.key.bias', 'encoder.layer.9.attention.self.value.weight', 'encoder.layer.9.attention.self.value.bias', 'encoder.layer.9.attention.output.dense.weight', 
'encoder.layer.9.attention.output.dense.bias', 'encoder.layer.9.attention.output.LayerNorm.weight', 'encoder.layer.9.attention.output.LayerNorm.bias', 'encoder.layer.9.intermediate.dense.weight', 'encoder.layer.9.intermediate.dense.bias', 'encoder.layer.9.output.dense.weight', 'encoder.layer.9.output.dense.bias', 'encoder.layer.9.output.LayerNorm.weight', 'encoder.layer.9.output.LayerNorm.bias', 'encoder.layer.10.attention.self.query.weight', 'encoder.layer.10.attention.self.query.bias', 'encoder.layer.10.attention.self.key.weight', 'encoder.layer.10.attention.self.key.bias', 'encoder.layer.10.attention.self.value.weight', 'encoder.layer.10.attention.self.value.bias', 'encoder.layer.10.attention.output.dense.weight', 'encoder.layer.10.attention.output.dense.bias', 'encoder.layer.10.attention.output.LayerNorm.weight', 'encoder.layer.10.attention.output.LayerNorm.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.11.attention.self.query.weight', 'encoder.layer.11.attention.self.query.bias', 'encoder.layer.11.attention.self.key.weight', 'encoder.layer.11.attention.self.key.bias', 'encoder.layer.11.attention.self.value.weight', 'encoder.layer.11.attention.self.value.bias', 'encoder.layer.11.attention.output.dense.weight', 'encoder.layer.11.attention.output.dense.bias', 'encoder.layer.11.attention.output.LayerNorm.weight', 'encoder.layer.11.attention.output.LayerNorm.bias', 'encoder.layer.11.intermediate.dense.weight', 'encoder.layer.11.intermediate.dense.bias', 'encoder.layer.11.output.dense.weight', 'encoder.layer.11.output.dense.bias', 'encoder.layer.11.output.LayerNorm.weight', 'encoder.layer.11.output.LayerNorm.bias', 'pooler.dense.weight', 'pooler.dense.bias', 'qa_outputs.bias', 'qa_outputs.weight']
INFO:transformers.modeling_utils:Weights from pretrained model not used in BertForQuestionAnswering: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.weight', 'distilbert.transformer.layer.0.ffn.lin2.bias', 'distilbert.transformer.layer.0.output_layer_norm.weight', 'distilbert.transformer.layer.0.output_layer_norm.bias', 'distilbert.transformer.layer.1.attention.q_lin.weight', 'distilbert.transformer.layer.1.attention.q_lin.bias', 'distilbert.transformer.layer.1.attention.k_lin.weight', 'distilbert.transformer.layer.1.attention.k_lin.bias', 'distilbert.transformer.layer.1.attention.v_lin.weight', 'distilbert.transformer.layer.1.attention.v_lin.bias', 'distilbert.transformer.layer.1.attention.out_lin.weight', 'distilbert.transformer.layer.1.attention.out_lin.bias', 'distilbert.transformer.layer.1.sa_layer_norm.weight', 'distilbert.transformer.layer.1.sa_layer_norm.bias', 'distilbert.transformer.layer.1.ffn.lin1.weight', 'distilbert.transformer.layer.1.ffn.lin1.bias', 'distilbert.transformer.layer.1.ffn.lin2.weight', 'distilbert.transformer.layer.1.ffn.lin2.bias', 'distilbert.transformer.layer.1.output_layer_norm.weight', 'distilbert.transformer.layer.1.output_layer_norm.bias', 'distilbert.transformer.layer.2.attention.q_lin.weight', 'distilbert.transformer.layer.2.attention.q_lin.bias', 'distilbert.transformer.layer.2.attention.k_lin.weight', 'distilbert.transformer.layer.2.attention.k_lin.bias', 'distilbert.transformer.layer.2.attention.v_lin.weight', 'distilbert.transformer.layer.2.attention.v_lin.bias', 'distilbert.transformer.layer.2.attention.out_lin.weight', 'distilbert.transformer.layer.2.attention.out_lin.bias', 'distilbert.transformer.layer.2.sa_layer_norm.weight', 'distilbert.transformer.layer.2.sa_layer_norm.bias', 'distilbert.transformer.layer.2.ffn.lin1.weight', 'distilbert.transformer.layer.2.ffn.lin1.bias', 'distilbert.transformer.layer.2.ffn.lin2.weight', 'distilbert.transformer.layer.2.ffn.lin2.bias', 'distilbert.transformer.layer.2.output_layer_norm.weight', 'distilbert.transformer.layer.2.output_layer_norm.bias', 'distilbert.transformer.layer.3.attention.q_lin.weight', 'distilbert.transformer.layer.3.attention.q_lin.bias', 'distilbert.transformer.layer.3.attention.k_lin.weight', 'distilbert.transformer.layer.3.attention.k_lin.bias', 'distilbert.transformer.layer.3.attention.v_lin.weight', 'distilbert.transformer.layer.3.attention.v_lin.bias', 'distilbert.transformer.layer.3.attention.out_lin.weight', 'distilbert.transformer.layer.3.attention.out_lin.bias', 'distilbert.transformer.layer.3.sa_layer_norm.weight', 'distilbert.transformer.layer.3.sa_layer_norm.bias', 'distilbert.transformer.layer.3.ffn.lin1.weight', 
'distilbert.transformer.layer.3.ffn.lin1.bias', 'distilbert.transformer.layer.3.ffn.lin2.weight', 'distilbert.transformer.layer.3.ffn.lin2.bias', 'distilbert.transformer.layer.3.output_layer_norm.weight', 'distilbert.transformer.layer.3.output_layer_norm.bias', 'distilbert.transformer.layer.4.attention.q_lin.weight', 'distilbert.transformer.layer.4.attention.q_lin.bias', 'distilbert.transformer.layer.4.attention.k_lin.weight', 'distilbert.transformer.layer.4.attention.k_lin.bias', 'distilbert.transformer.layer.4.attention.v_lin.weight', 'distilbert.transformer.layer.4.attention.v_lin.bias', 'distilbert.transformer.layer.4.attention.out_lin.weight', 'distilbert.transformer.layer.4.attention.out_lin.bias', 'distilbert.transformer.layer.4.sa_layer_norm.weight', 'distilbert.transformer.layer.4.sa_layer_norm.bias', 'distilbert.transformer.layer.4.ffn.lin1.weight', 'distilbert.transformer.layer.4.ffn.lin1.bias', 'distilbert.transformer.layer.4.ffn.lin2.weight', 'distilbert.transformer.layer.4.ffn.lin2.bias', 'distilbert.transformer.layer.4.output_layer_norm.weight', 'distilbert.transformer.layer.4.output_layer_norm.bias', 'distilbert.transformer.layer.5.attention.q_lin.weight', 'distilbert.transformer.layer.5.attention.q_lin.bias', 'distilbert.transformer.layer.5.attention.k_lin.weight', 'distilbert.transformer.layer.5.attention.k_lin.bias', 'distilbert.transformer.layer.5.attention.v_lin.weight', 'distilbert.transformer.layer.5.attention.v_lin.bias', 'distilbert.transformer.layer.5.attention.out_lin.weight', 'distilbert.transformer.layer.5.attention.out_lin.bias', 'distilbert.transformer.layer.5.sa_layer_norm.weight', 'distilbert.transformer.layer.5.sa_layer_norm.bias', 'distilbert.transformer.layer.5.ffn.lin1.weight', 'distilbert.transformer.layer.5.ffn.lin1.bias', 'distilbert.transformer.layer.5.ffn.lin2.weight', 'distilbert.transformer.layer.5.ffn.lin2.bias', 'distilbert.transformer.layer.5.output_layer_norm.weight', 'distilbert.transformer.layer.5.output_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
In [0]:
# Now let's train our model

model.train()
for i, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    model.zero_grad()
    print(f'Step {i} - loss: {loss:.3}')
    if i > 3:
        break
Step 0 - loss: 6.42
Step 1 - loss: 5.64
Step 2 - loss: 5.09
Step 3 - loss: 5.59
Step 4 - loss: 4.81
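
If you want to keep the fine-tuned weights, the usual transformers call applies (the output directory below is just an example path):

# Save the fine-tuned model; it can be reloaded later with BertForQuestionAnswering.from_pretrained
model.save_pretrained('./bert_squad_finetuned')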

Wrapping this all up (Tensorflow)

Let's wrap this all up with the full code to load and prepare SQuAD for training a Tensorflow model (this works only from TensorFlow 2.2.0 onwards)

In [15]:
import tensorflow as tf
import nlp
from transformers import BertTokenizerFast

# Load our training dataset and tokenizer
train_tf_dataset = nlp.load_dataset('squad', split="train")
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# Tokenize our training dataset
# The only difference here is that start_positions and end_positions
# must be single-element lists => [[23], [45], ...]
# instead of plain values => [23, 45, ...]
def convert_to_tf_features(example_batch):
    # Tokenize contexts and questions (as pairs of inputs)
    input_pairs = list(zip(example_batch['context'], example_batch['question']))
    encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True, max_length=tokenizer.max_len)

    # Compute start and end token positions for the labels using the fast tokenizers' alignment methods.
    start_positions, end_positions = [], []
    for i, (context, answer) in enumerate(zip(example_batch['context'], example_batch['answers'])):
        start_idx, end_idx = get_correct_alignement(context, answer)
        start_positions.append([encodings.char_to_token(i, start_idx)])
        end_positions.append([encodings.char_to_token(i, end_idx-1)])
    
    if start_positions and end_positions:
      encodings.update({'start_positions': start_positions,
                        'end_positions': end_positions})
    return encodings

train_tf_dataset = train_tf_dataset.map(convert_to_tf_features, batched=True)

def remove_none_values(example):
  return not None in example["start_positions"] or not None in example["end_positions"]

train_tf_dataset = train_tf_dataset.filter(remove_none_values, load_from_cache_file=False)
columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
train_tf_dataset.set_format(type='tensorflow', columns=columns)
features = {x: train_tf_dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.max_len]) for x in columns[:3]} 
labels = {"output_1": train_tf_dataset["start_positions"].to_tensor(default_value=0, shape=[None, 1])}
labels["output_2"] = train_tf_dataset["end_positions"].to_tensor(default_value=0, shape=[None, 1])
tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 88/88 [00:38<00:00,  2.30it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 88/88 [00:38<00:00,  2.26it/s]
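
As with the PyTorch dataloader above, a minimal sketch to check the shapes of a single batch of the resulting tf.data.Dataset:

# Take one batch and print the tensor shapes for the features dict and the labels dict
for features_batch, labels_batch in tfdataset.take(1):
    print({name: tensor.shape for name, tensor in features_batch.items()})
    print({name: tensor.shape for name, tensor in labels_batch.items()})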
In [0]:
# Let's load a pretrained TF2 Bert model and a simple optimizer
from transformers import TFBertForQuestionAnswering

model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt,
              loss={'output_1': loss_fn, 'output_2': loss_fn},
              loss_weights={'output_1': 1., 'output_2': 1.},
              metrics=['accuracy'])
In [17]:
# Now let's train our model

model.fit(tfdataset, epochs=1, steps_per_epoch=3)
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_for_question_answering_1/bert/pooler/dense/kernel:0', 'tf_bert_for_question_answering_1/bert/pooler/dense/bias:0'] when minimizing the loss.
3/3 [==============================] - 97s 32s/step - loss: 12.2385 - output_1_loss: 6.0742 - output_2_loss: 6.1642 - output_1_accuracy: 0.0417 - output_2_accuracy: 0.0000e+00
Out[17]:
<tensorflow.python.keras.callbacks.History at 0x7f1a8b824908>

Metrics API

nlp also provides easy access and sharing of metrics.

This aspect of the library is still experimental and the API may still evolve more than the datasets API.

Like datasets, metrics are added as small scripts wrapping common metrics in a common API.

There are several reasons you may want to use metrics with nlp, in particular:

  β€’ metrics for specific datasets like GLUE or SQuAD are provided out-of-the-box in a simple, convenient and consistent way, integrated with the dataset,
  β€’ metrics in nlp leverage the same powerful backend to provide smart features out-of-the-box, like support for distributed evaluation in PyTorch.

Using metrics

Using metrics is pretty simple: they have two main methods, .compute(predictions, references) to directly compute the metric, and .add(prediction, reference) or .add_batch(predictions, references) to store intermediate results if you want to do the evaluation in one go at the end.

Here is a quick gist of a standard use of metrics (the simplest usage):

import nlp
bleu_metric = nlp.load_metric('bleu')

# If you only have a single iteration, you can easily compute the score like this
predictions = model(inputs)
score = bleu_metric.compute(predictions, references)

# If you have a loop, you can "add" your predictions and references at each iteration instead of having to save them yourself (the metric object stores them efficiently for you)
for batch in dataloader:
    model_input, targets = batch
    predictions = model(model_input)
    bleu_metric.add_batch(predictions, targets)
score = bleu_metric.compute()  # Compute the score from all the stored predictions/references

Here is a quick gist of using a metric in a distributed torch setup (it should actually work for any python multi-process setup). It's pretty much identical to the second example above:

import nlp
# You need to give the total number of parallel python processes (num_process) and the id of each process (process_id)
bleu_metric = nlp.load_metric('bleu', process_id=torch.distributed.get_rank(), num_process=torch.distributed.get_world_size())

for batch in dataloader:
    model_input, targets = batch
    predictions = model(model_input)
    bleu_metric.add_batch(predictions, targets)
score = bleu_metric.compute()  # Compute the score on the first node by default (can be set to compute on each node as well)

Example with a NER metric: seqeval

In [0]:
ner_metric = nlp.load_metric('seqeval')
references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
predictions =  [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
ner_metric.compute(predictions, references)

Adding a new dataset or a new metric

There are two ways to add new datasets and metrics in nlp:

  • datasets can be added with a Pull-Request adding a script in the datasets folder of the nlp repository

=> once the PR is merged, the dataset can be instantiated by its folder name, e.g. nlp.load_dataset('squad'). If you want HuggingFace to host the data as well, you will need to ask the HuggingFace team to upload it.

  • datasets can also be added with a direct upload using nlp CLI as a user or organization (like for models in transformers). In this case the dataset will be accessible under the gien user/organization name, e.g. nlp.load_dataset('thomwolf/squad'). In this case you can upload the data yourself at the same time and in the same folder.

We will add a full tutorial on how to add and upload datasets soon.

In [0]: