Adding Sentence Order Prediction (#1061)
* misc run scripts * sbatch * sweep scripts * update * qa * update * update * update * update * update * sb file * moving update_metrics to outside scope of dataparallel * fixing micro_avg calculation * undo debugging * Fixing tests, moving update_metrics out of other tasks * remove extraneous change * MLM task * Added MLM task * update * fix multiple choice dataparallel forward * update * add _mask_id to transformers * Update * MLM update * adding update_metrics abstraction * delete update_metrics_ notation * fixed wrong index problem * removed unrelated files * removed unrelated files * removed unrelated files * fix PEP8 * Fixed get_pretained_lm_head for BERT and ALBERT * spelling check * black formatting * fixing tests * bug fix * Adding batch_size constraints to multi-GPU setting * adding documentation * adding batch size test * black correct version * Fixing batch size assertion * generalize batch size assertion for more than 2 GPU setting * reducing label loops in code * fixing span forward * Fixing span prediction forward for multi-GPU * fix commonsenseQA forward * MLM * adding function documentation * resolving nits, fixing seq_gen forward * remove nit * fixing batch_size assert and SpanPrediction task * Remove debugging * Fix batch size mismatch multi-GPU test * Fix order of assert checking for batch size mismatch * mlm training * update * sbatch * update * data parallel * update data parallel stuffs * using sequencelabel, using 1 paragraph per example * update label mapping * adding exmaples-porportion-mixing * changing dataloader to work with wikitext103 * weight sampling * add early stopping only onb one task * commit * Cleaning up code * Removing unecessarily tracked git folders * Removing unnecesary changes * revert README * revert README.md again * Making more general for Transformer-based embedders * torch.uint8 -> torch.bool * Fixing indexing issues * get rid of unecessary changes * black cleanup * update * Prevent updating update_metrics twice in one step * update * update * update * Fixing SOP to work with jiant * delete debugging * tying pooler weights from ALBERT * fixed SOP tie weight, and MLM vocab error * dataset update for SOP * removed pdb * Fix ALBERT -> MLM problem, reduce amount of times get_data_iter is called * delete debugging * adding utf-8 encoding * Removing two-layer MLM class hierarchy * MLM indexing bug * fixing MLM error * removed rest of the shifting code * adding * fixing batch[inputs] error * change corpus to wikipedia raw * change corpus to wikipedia raw * Finish merge * style * Revert rest of mlm_weight * Revert LM change * Revert * Merging SOP * Improving documentation * Revert base_roberta * revert unecessary change * Correcting documentation * revert unnecessary changes * Refactoring SOP to make clearer * Adding SOPClassifier * Fixing SOP Task * Adding further documentation * Adding more description of dataset * fixing merge conflict * cleaning up unnecessary files * Making documentation clearer about our implementation of ALBERT SOP * Fix docstring * Refactoring SOP back as a PairClassificationTask, adding more documentation * Adding more documentation, adding process_split * Fix typo in comment * Adding modified SOP code * fixing based on comments * fixing len(current_chunk)==1 condition * fixing len(current_chunk)==1 condition * documentation fix * minor fix * minor fix: tokenizer * minor fix: current_length update * minor fix: current_length update * minor fix * bug fix * bug fix * Fixing document leakage bug * Fixing document delimiting bug * Cleaning up test * Black style * Accurately updating current_length based on when len for_next_chunk > 2 * SOP data generation insturctions * Fix documentation * Fixing docstrings and adding source of code * Fixing typos and data script documentation * Revert merge mistake

Co-authored-by: phu-pmh <[email protected]>
Co-authored-by: Haokun Liu <[email protected]>
Co-authored-by: pruksmhc <[email protected]>
Co-authored-by: DeepLearning VM <[email protected]>
1 parent 14fae87 · commit ccad92a
Showing 7 changed files with 358 additions and 0 deletions.
@@ -0,0 +1,33 @@
# Downloading Wikipedia Corpus

We use the preprocessing code from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#getting-the-data,
and the bash scripts provided here help streamline the data generation steps in the NVIDIA repository.

First, `git clone https://github.com/NVIDIA/DeepLearningExamples.git`.
Then, move `scripts/sop/create_wiki_sop_data.sh` and `scripts/sop/get_small_english_wiki.sh` into `DeepLearningExamples/PyTorch/LanguageModeling/BERT/data`.

Then, follow the instructions below:

By default, the NVIDIA script downloads the latest Wikipedia dump; we use the Wikipedia dump 2020-03-01.
To download the 2020-03-01 dump, replace line 29 of `DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/WikiDownloader.py`:
`'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',` with `'en' : 'https://dumps.wikimedia.org/enwiki/20200301/enwiki-20200301-pages-articles.xml.bz2',`.
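
If you prefer not to edit the file by hand, a `sed` one-liner along these lines applies the same substitution. This is only a sketch: it assumes the URL on line 29 still matches the string quoted above and that you run it from the directory containing your `DeepLearningExamples` clone, so verify the line afterwards.

```bash
# Swap the 'latest' dump URL for the pinned 2020-03-01 dump in WikiDownloader.py.
sed -i "s|enwiki/latest/enwiki-latest-pages-articles.xml.bz2|enwiki/20200301/enwiki-20200301-pages-articles.xml.bz2|" \
    DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/WikiDownloader.py
```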

The data creation for SOP is almost the same as for MLM, except for the following edits.
In `DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/TextSharding.py`, replace line 55:
`self.articles[global_article_count] = line.rstrip()` with `self.articles[global_article_count] = line.rstrip() + "\n ========THIS IS THE END OF ARTICLE.========"`.
This is because SOP requires a signal marking the end of each Wikipedia article.
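
This edit can also be scripted. The command below is only a sketch (it assumes GNU `sed` and that line 55 still matches the original string exactly); the `\\n` in the replacement produces a literal `\n` in the Python source.

```bash
# Append the end-of-article marker that the SOP data generation relies on.
sed -i 's|self.articles\[global_article_count\] = line.rstrip()|self.articles[global_article_count] = line.rstrip() + "\\n ========THIS IS THE END OF ARTICLE.========"|' \
    DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/TextSharding.py
```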

Additionally, in line 80 of `DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/bertPrep.py`, replace '/workspace/wikiextractor/WikiExtractor.py' with 'wikiextractor/WikiExtractor.py'.
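
The same approach works here; again a sketch under the same assumptions, to be checked against line 80 of `bertPrep.py`:

```bash
# Point bertPrep.py at the locally cloned wikiextractor instead of the absolute /workspace path.
sed -i "s|/workspace/wikiextractor/WikiExtractor.py|wikiextractor/WikiExtractor.py|" \
    DeepLearningExamples/PyTorch/LanguageModeling/BERT/data/bertPrep.py
```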

Run `bash create_wiki_sop_data.sh $lang $save_directory`.
The NVIDIA code supports English (en) and Chinese (zh) Wikipedia.

For example, to download and process English Wikipedia and save it in the `~/Download` directory, run
`bash create_wiki_sop_data.sh en ~/Download`.

The above command will download the entire English Wikipedia.
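
The output location below is taken from the paths hardcoded in `create_wiki_sop_data.sh`; this is just a sketch of what to expect, assuming the example `~/Download` destination above.

```bash
# After the script finishes, the combined corpus files should be here:
ls ~/Download/sharded_training_shards_256_test_shards_256_fraction_0.2/wikicorpus_en/
# train_en.txt  test_en.txt
```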

In our experiments, we use only a small subset (around 5%) of the entire English Wikipedia, which has the same number of sentences as WikiText-103.
To get this subset, run `bash get_small_english_wiki.sh $path_to_wikicorpus_en`, where `$path_to_wikicorpus_en` is the directory where you saved the full processed `wikicorpus_en` corpus.
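
As an optional sanity check (a sketch that assumes the directory layout produced by the scripts above), you can confirm that the split sizes match the `head`/`tail` sizes in `get_small_english_wiki.sh` and that the end-of-article markers survived preprocessing:

```bash
# Expected line counts: train 3978309, test 10001, valid 8438.
wc -l $path_to_wikicorpus_en/wikipedia_sop_small/{train,test,valid}.txt
# Each article should end with the delimiter added in TextSharding.py.
grep -c "THIS IS THE END OF ARTICLE" $path_to_wikicorpus_en/wikipedia_sop_small/train.txt
```
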
@@ -0,0 +1,49 @@
#!/bin/bash

# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

lang=$1  # the language, 'en' for English wikipedia
export BERT_PREP_WORKING_DIR=$2

# clone wikiextractor if it doesn't exist
if [ ! -d "wikiextractor" ]; then
    git clone https://github.com/attardi/wikiextractor.git
fi

echo "Downloading $lang wikpedia in directory $save_dir" | ||
# Download
python3 bertPrep.py --action download --dataset wikicorpus_$lang


# Properly format the text files
python3 bertPrep.py --action text_formatting --dataset wikicorpus_$lang


# Shard the text files (group wiki+books then shard)
python3 bertPrep.py --action sharding --dataset wikicorpus_$lang


# Combine sharded files into one
save_dir=$BERT_PREP_WORKING_DIR/sharded_training_shards_256_test_shards_256_fraction_0.2/wikicorpus_$lang
cat $save_dir/*training*.txt > $save_dir/train_$lang.txt
cat $save_dir/*test*.txt > $save_dir/test_$lang.txt
rm -rf $save_dir/wiki*training*.txt
rm -rf $save_dir/wiki*test*.txt

# remove some remaining xml tags
sed -i 's/<[^>]*>//g' $save_dir/train_$lang.txt
sed -i 's/<[^>]*>//g' $save_dir/test_$lang.txt

echo "Your corpus is saved in $save_dir"
@@ -0,0 +1,6 @@
wiki_path=$1  # directory containing the processed train_en.txt and test_en.txt

mkdir -p $wiki_path/wikipedia_sop_small
# Take a small slice (around 5% of English Wikipedia, roughly WikiText-103-sized) for training,
# and carve test and validation splits out of the full files.
head -3978309 $wiki_path/train_en.txt > $wiki_path/wikipedia_sop_small/train.txt
head -10001 $wiki_path/test_en.txt > $wiki_path/wikipedia_sop_small/test.txt
tail -8438 $wiki_path/train_en.txt > $wiki_path/wikipedia_sop_small/valid.txt