Preparing the dataset includes the following steps (an end-to-end sketch follows the list):

- Obtain textual data
- Process the dataset (Wikipedia or Bookcorpus) and combine it into a single text file using `process_data.py`
- Divide the data into N shards using `shard_data.py`
- Generate samples for training and testing the model using `generate_samples.py`
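For reference, a minimal end-to-end sketch of the pipeline, assuming an English Wikipedia XML dump as input; all paths are placeholders and the flag values simply mirror the examples below:

```bash
# 1. Process the raw Wikipedia XML into a single one-article-per-line text file
python process_data.py -f <path_to_xml> -o <processed_dir> --type wiki

# 2. Split the processed text into train/test shards
python shard_data.py \
    --dir <processed_dir> \
    -o <shards_dir> \
    --num_train_shards 256 \
    --num_test_shards 128 \
    --frac_test 0.1 \
    --type wiki

# 3. Generate masked-LM samples from the shards
python generate_samples.py \
    --dir <shards_dir> \
    -o <samples_dir> \
    --dup_factor 10 \
    --seed 42 \
    --vocab_file <path_to_vocabulary_file> \
    --do_lower_case 1 \
    --masked_lm_prob 0.15 \
    --max_seq_length 128 \
    --model_name bert-large-uncased \
    --max_predictions_per_seq 20 \
    --n_processes 16
```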
Any textual dataset can be processed and used for training a BERT-like model. In our experiments we trained models using the English section of Wikipedia and the Toronto Bookcorpus [REF]. Wikipedia dumps can be freely downloaded from https://dumps.wikimedia.org/ and can be processed (removing HTML tags, pictures, and other non-textual data) using `WikiExtractor.py`. We are unable to provide a source for the Bookcorpus dataset.
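As a sketch, one way to obtain and pre-process an English Wikipedia dump; the dump file name is an assumption (check https://dumps.wikimedia.org/ for the snapshot you want), and the archive is decompressed before processing:

```bash
# Download the latest English Wikipedia dump (file name is an assumption)
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
bzip2 -d enwiki-latest-pages-articles.xml.bz2

# Pre-process it into a single text file (see the Data Processing section below)
python process_data.py -f enwiki-latest-pages-articles.xml -o <output_dir> --type wiki
```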
`shard_data.py` is used to shard multiple text files (processed with the script above) into a pre-defined number of shards, and to divide the dataset into train and test sets. Use `shard_data.py` to easily import and shard a common corpus (e.g. Wikipedia and Bookcorpus) or a custom corpus. `shard_data.py` also supports one-click download and sharding of the Wikipedia and Bookcorpus datasets from Huggingface, without preparing the data in advance.
IMPORTANT NOTE: the number of shards is affected by the duplication factor used when generating the samples (with masked tokens). This means that if 10 training shards are generated with `shard_data.py` and samples are generated with a duplication factor of 5, the final number of training shards will be 50. This approach avoids intra-shard duplications that might overfit the model in each epoch.
IMPORTANT NOTE 2: the sharding script may be slow (we might fix this in the future) if you choose to generate a small number of shards (under 100 in our experiments). If you encounter this situation, we recommend generating 256+ shards and then merging them into fewer shards using the merging script we provide (`merge_shards.py`). See the next section for more info.
See `python shard_data.py -h` for the full list of options.
Example for downloading and sharding a Wikipedia dataset from Huggingface in one click, with the subset selected via `--huggingface_wiki_config`:
```bash
python shard_data.py \
    -o <output_dir> \
    --num_train_shards 256 \
    --num_test_shards 128 \
    --frac_test 0.1 \
    --type huggingface_wikipedia \
    --huggingface_wiki_config 20220301.simple
```
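Here `20220301.simple` is the Simple English snapshot of the Huggingface `wikipedia` dataset; other configs (e.g. `20220301.en` for the full English Wikipedia) should work the same way.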
Example for downloading and sharding the Bookcorpus dataset from Huggingface in one click:
```bash
python shard_data.py \
    -o <output_dir> \
    --num_train_shards 256 \
    --num_test_shards 128 \
    --frac_test 0.1 \
    --type huggingface_bookcorpus
```
The output of this command is `--num_train_shards` train files, `--num_test_shards` test files, and a one-article-per-line file. Users who want to reuse the dataset without redownloading it from Huggingface can run the custom-corpus command shown below with `--dir` set to the path of the directory containing the one-article-per-line file.
Example for sharding a user's own corpus, found in the input `--dir`, into 256 train shards and 128 test shards, with 10% of the samples held out for the test set:
```bash
python shard_data.py \
    --dir <path_to_text_files> \
    -o <output_dir> \
    --num_train_shards 256 \
    --num_test_shards 128 \
    --frac_test 0.1 \
    --type custom
```
The supported corpus formats can be found in the `custom_data_example` file. The code supports `txt` files with one article per line, one sentence per line, or multiple sentences per line. The most important thing is the blank lines that separate different articles. Examples are as follows:
```
<article 1>

<article 2>

<article 3>
...
```

or

```
<article 1, sentence 1>
<article 1, sentence 2-3>
<article 1, sentence 4>

<article 2, sentence 1>
<article 2, sentence 2>
<article 2, sentence 3-115>

<article 3, sentence 1>
<article 3, sentence 2>
<article 3, sentence 3>
...
```
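To sanity-check how many articles a custom file contains under this blank-line convention, a quick check using awk's paragraph mode (the file name is hypothetical):

```bash
# RS='' makes awk treat blank-line-separated blocks as records
awk -v RS='' 'END { print NR " articles" }' my_corpus.txt
```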
## Data Processing
Use `process_data.py` to pre-process the Wikipedia/Bookcorpus datasets into a single text file. See `python process_data.py -h` for the full list of options.
An example for pre-processing the English Wikipedia XML dataset:

```bash
python process_data.py -f <path_to_xml> -o <output_dir> --type wiki
```
An example for pre-processing the Bookcorpus dataset:
```bash
python process_data.py -f <path_to_text_files> -o <output_dir> --type bookcorpus
```
## Data Sharding
Example for sharding the Wikipedia corpus found in the input `--dir` into 256 train shards and 128 test shards, with 10% of the samples held out for the test set:
```bash
python shard_data.py \
    --dir <path_to_text_files> \
    -o <output_dir> \
    --num_train_shards 256 \
    --num_test_shards 128 \
    --frac_test 0.1 \
    --type wiki
```
For sharding Bookcorpus, use `--type bookcorpus`.
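For example, a sketch mirroring the Wikipedia command above (paths are placeholders):

```bash
python shard_data.py \
    --dir <path_to_text_files> \
    -o <output_dir> \
    --num_train_shards 256 \
    --num_test_shards 128 \
    --frac_test 0.1 \
    --type bookcorpus
```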
Merging existing shards into fewer shards (while maintaining 2^N shards, for example 256->128, a 2:1 ratio) can be done with the `merge_shards.py` script. See `python merge_shards.py -h` for the full list of options.
Example for randomly merging 2 shards into 1 shard:
```bash
python merge_shards.py \
    --data <path_to_shards_dir> \
    --output_dir <output_dir> \
    --ratio 2
```
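To reduce the shard count further while keeping 2^N shards, one option is to apply the 2:1 merge repeatedly, e.g. 256 -> 128 -> 64. A sketch, assuming each pass halves the shard count; the intermediate directory names are hypothetical:

```bash
# First pass: 256 -> 128 shards
python merge_shards.py --data <shards_256_dir> --output_dir <shards_128_dir> --ratio 2

# Second pass: 128 -> 64 shards
python merge_shards.py --data <shards_128_dir> --output_dir <shards_64_dir> --ratio 2
```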
Use `generate_samples.py` to generate samples compatible with the dataloaders used in the training script.
IMPORTANT NOTE: the chosen duplication factor multiplies the final number of shards. For example, 10 shards with a duplication factor of 5 will generate 50 shards (each shard with different randomly generated (masked) samples).
See `python generate_samples.py -h` for the full list of options.
Example for generating shards with a duplication factor of 10, lowercased tokens, a masked-LM probability of 15%, a max sequence length of 128, tokenization by the provided (Huggingface-compatible) model named `bert-large-uncased`, max predictions per sample of 20, and 16 parallel processes (for faster processing):
```bash
python generate_samples.py \
    --dir <path_to_shards> \
    -o <output_path> \
    --dup_factor 10 \
    --seed 42 \
    --vocab_file <path_to_vocabulary_file> \
    --do_lower_case 1 \
    --masked_lm_prob 0.15 \
    --max_seq_length 128 \
    --model_name bert-large-uncased \
    --max_predictions_per_seq 20 \
    --n_processes 16
```