- generate_results: This directory contains the sentence idx and the memorization score of each sentence. File names follow the format memorization_evals_{model_size}_deduped-v0_{context size}_{continuation size}_143000.csv, and each file has two columns, idx and scores.
- dedup_data: This contains the original deduplicated data.
- dedup_merge: This contains the merged deduplicated data.
- undeduped_data: This contains the original undeduplicated data.
- undedup_merge: This contains the merged undeduplicated data.
- pythia: The Pythia package.
- run_generate.sh: Launches the batch_generate.py script. The input parameters are model size, checkpoint (usually the last step), batch size (usually fixed), context size and continuation size.
- data_download.py: Downloads the pre-training data. It probably does not need to be run again.
- cluster.py: Samples memorized and unmemorized data points, applies dimensionality reduction, and shows the result in a figure.
- clmtraing.py: Trains a model on the causal language modelling task.
- embedding_obtain.py: Shows how to obtain hidden-state embeddings for Pythia or any other model.
- generate.py, csv_process.py, csv_reformat.py: Helper and format-conversion scripts that may not be needed again.
- example_explore.py: Shows how to generate a single example.
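
The file naming and two-column layout described for generate_results can be sketched as follows. This is only an illustration, not code from this repository: the helper names `results_filename` and `read_scores`, the example model size "70m", and the sample rows are all made up, and it assumes the CSV has a header row with integer idx values and numeric scores.

```python
import csv
import io


def results_filename(model_size, context_size, continuation_size):
    """Build a file name from the pattern listed under generate_results.

    Hypothetical helper; the trailing 143000 is copied from that pattern.
    """
    return (
        f"memorization_evals_{model_size}_deduped-v0_"
        f"{context_size}_{continuation_size}_143000.csv"
    )


def read_scores(fileobj):
    """Return {idx: score} from a CSV with 'idx' and 'scores' columns.

    Assumes a header row, integer idx values and numeric scores.
    """
    reader = csv.DictReader(fileobj)
    return {int(row["idx"]): float(row["scores"]) for row in reader}


# Made-up rows standing in for a real results file.
sample = io.StringIO("idx,scores\n0,1.0\n7,0.25\n")
scores = read_scores(sample)
name = results_filename("70m", 32, 32)

print(name)
print(scores)
```

To read a real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")` with a path inside generate_results.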