Polishing, benchmarking & other preparation for publishing our work
Key folders & their intended use:
```
.
├── data              # everything that is not code or a notebook
│   ├── processed     # outputs of pipeline steps - intermediate or final
│   └── raw           # data downloaded from sources
├── models
│   ├── mixed         # models performing more than one task
│   ├── ner           # named entity recognizers
│   ├── pos           # part-of-speech taggers
│   └── trees         # dependency parsers
├── notebooks         # Jupyter notebooks performing analysis
├── scripts           # useful bash scripts
└── spacy_pl          # all Python source code (scripts and modules)
```
Let's keep the repo as clean as possible - if you feel a new folder is necessary, discuss it with everyone first.
All of the `.dvc` files should be stored in the repository root folder. Also, all scripts should be executed from the root folder - that way, all command-line options related to file paths can be specified as defaults and we'll keep commands short.
Pipelines should reflect the flow of data from raw downloaded files (the `data/raw` folder) through preprocessing (one or more items created in the `processed` folder), finally creating one or more models in the `models` folder.
Ideally, we shouldn't have to specify too many dependencies or outputs for any dvc command - to achieve this, let's keep grouping the outputs of a given script within one folder. For example, a script that has 3 output files should create a directory and specify that directory in dvc - not the individual files. Names should clearly reflect what the script does - so that people working on other pipelines can easily understand it.
Python files executing long or complicated processing logic should have a docstring at the beginning explaining what they do.
`.dvc` files should always be named `<verb>_<noun(s) separated by "_">` - an action and its output(s). Verb vocabulary:

- `add`: first step in the pipeline, adds new data from an external source
- `generate`: generates a piece of processed data from previously added data sources
- `cv`: creates a model using cross-validation on one dataset
- `train`: creates a model using separate train, dev and test datasets
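The naming convention above can be checked mechanically - a small sketch (the helper is ours, not part of the repo):

```python
import re

# Allowed verbs from the vocabulary above.
VERBS = ("add", "generate", "cv", "train")


def is_valid_dvc_name(filename: str) -> bool:
    """Check that a file name follows <verb>_<noun(s) separated by "_">.dvc."""
    match = re.fullmatch(r"([a-z]+)_([a-z0-9]+(?:_[a-z0-9]+)*)\.dvc",
                         filename, re.IGNORECASE)
    return bool(match) and match.group(1).lower() in VERBS
```

For instance, `add_fasttext_vectors.dvc` and `generate_pos_NKJP_justpos.dvc` pass, while a file missing a leading verb does not.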
If the step executes a single Python file, the Python file should be named the same as the `.dvc` script.
Whenever you struggle with doing something with dvc, describe a problem and its solution (step-by-step) here. Some example problems worth describing:
- how to change the dependencies of an already computed step
- how to update a `.dvc` file after renaming a dependency or output
- how to commit and push your new steps
- how to re-run the last step but pull all the previous ones from the remote
This way we'll create a nice knowledge base and speed up our work in the future.
- Run `pip install -r requirements.txt` to install dvc (and other dependencies)
- Find the Google Cloud key (let's name it `gc-key.json`) and place it in `path/to/a/folder/of/your/choice/`
- In a shell from which you wish to use dvc, run `export GOOGLE_APPLICATION_CREDENTIALS=path/to/a/folder/of/your/choice/gc-key.json` (on Windows, in CMD, run `set GOOGLE_APPLICATION_CREDENTIALS=path/to/a/folder/of/your/choice/gc-key.json`)
Extra tip: if you're using PyCharm, make sure to mark the `.dvc` folder as excluded - otherwise it will keep indexing your dvc files (including the cache).
1. `git fetch --all`
2. `git checkout master` - necessary for steps 3 & 4
3. `git pull --rebase`
4. `dvc pull` - get the latest updated data from master, this will take some time
5. `git checkout -b your-branch-#10`, where 10 is the number of the GitHub issue related to the branch
6. `dvc commit` - if this command changes anything in your repo, it means you messed something up
I'm just adding `data/raw/cc.pl.300.vec.gz` - what should I do to make it easy for others to work with?

- Place the file in `data/raw/cc.pl.300.vec.gz`
- Run `dvc add data/raw/cc.pl.300.vec.gz`
- Move the file created by dvc to the right location: `mv data/raw/cc.pl.300.vec.gz.dvc ./add_fasttext_vectors.dvc`
- Open the moved dvc file and make sure the path to the added data is correct (it should be `data/raw/cc.pl.300.vec.gz`)
- `git add add_fasttext_vectors.dvc data/raw/.gitignore` (note that the `.dvc` file was renamed in the previous step)
- Make sure the data file itself (`cc.pl.300.vec.gz`) is ignored in git, i.e. doesn't show up in `git status` output
- Check if it works: `dvc repro add_fasttext_vectors.dvc` - it should print something like "stage didn't change, using cache"
- `git commit`
- `dvc push -j 1` - push your changes as early as possible to prevent problems later; the `-j 1` option tells dvc to use 1 thread, which may be slower but provides a progress bar, so at least you know what is going on
- `git push` - same as for dvc: if you know the remote contains your work, you don't have to worry about breaking something locally :)
I have just created a `spacy_pl/tagset` module with 2 Python files that I want to use to generate a tagset and a conversion map from the nltk to the spaCy format (for the selected NKJP POS tags). To do this:

- I use click options in the Python script to specify all paths to dependencies and outputs
- However, I also specify default values for them (to write shorter commands in the shell later)
- I run the script normally to make sure it works; key things to check:
  - all output directories are created if they don't exist already
  - all paths are relative to the repository root folder (including Python imports)
  - this step could also be performed with dvc, but since we don't care about the result at first, why bother?
- Now I can run it with dvc: `dvc run -d data/raw/NKJP_1.2_nltk -d spacy_pl/tagset -o data/processed/tagset -f generate_pos_NKJP_justpos.dvc python spacy_pl/tagset/generate_tagset_and_conversion_map.py`. Note that:
  - I ran this from the root folder of the repository
  - even though I was running just one script, I specified the entire `spacy_pl/tagset` module as a dependency, as it contains related code
  - I output more than one file, so instead of listing them all, I group them by putting them inside one folder
  - I specified the name of the dvc file using the conventions described above
- After the script ends successfully, we can add and commit our changes: `git add spacy_pl/tagset generate_pos_NKJP_justpos.dvc data/processed/.gitignore`
- Make sure the data folder itself (`data/processed/tagset`) is ignored in git, i.e. doesn't show up in `git status` output
- Check if it works: `dvc repro generate_pos_NKJP_justpos.dvc` - it should print something like "stage didn't change, using cache"
- `git commit`
- `dvc push -j 1` - push your changes as early as possible to prevent problems later; the `-j 1` option tells dvc to use 1 thread
- `git push`
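The pattern of declaring every path as an option with a repo-root-relative default is what keeps the dvc commands above short. The project uses click for this; the sketch below uses stdlib argparse instead so it stays dependency-free, with the paths taken from the example above:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """All dependency/output paths are options with defaults relative to the
    repository root, so the usual invocation needs no arguments at all."""
    parser = argparse.ArgumentParser(
        description="Generate tagset and conversion map.")
    parser.add_argument("--nkjp-dir", default="data/raw/NKJP_1.2_nltk",
                        help="input corpus (dvc dependency)")
    parser.add_argument("--out-dir", default="data/processed/tagset",
                        help="output folder (dvc output)")
    return parser
```

With defaults in place, the script is run from the repo root as plain `python spacy_pl/tagset/generate_tagset_and_conversion_map.py`, and any path can still be overridden, e.g. `--out-dir some/other/folder`.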
If you followed the guidelines for adding files to dvc and running experiments, everything will work.
Some useful commands in case you messed something up (the documentation on dvc.org is OK for these):

- `dvc commit`
- `dvc pull`
- `dvc lock`
Assuming you're on the right branch (i.e. the model's dvc file exists on it). For example, to pull the cross-validation of the POS-only tagger using fastText vectors:

- `dvc pull`
- `dvc repro cv_pos_nkjp_justpos_fasttext.dvc`
Assuming you're on the right branch (i.e. the model's dvc file exists on it). For example, to re-run the cross-validation of the POS-only tagger using fastText vectors:

- `dvc pull`
- View the dependency tree of the POS tagger: `dvc pipeline show --ascii cv_pos_nkjp_justpos_fasttext.dvc`
- For each immediate dependency, run `dvc repro dependency-name.dvc`
- Make your changes to the code
- To track the training results, follow the steps from "How to run the experiment that I just wrote?" described above