Memorization Study

Directory Explanation

generate_results: Contains the sentence indices and the memorization score of each sentence. File names follow the format memorization_evals_{model_size}_deduped-v0_{context size}_{continuation size}_143000.csv, and each file has two columns, idx and scores.
dedup_data: This contains the original deduplicated data.
dedup_merge: This contains the merged deduplicated data.
undeduped_data: This contains the original undeduplicated data.
undedup_merge: This contains the merged undeduplicated data.
pythia: The pythia package.
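
As a rough sketch of how the files in generate_results could be located and read, the snippet below builds a file name from the format given above and loads the idx and scores columns. The helper names results_filename and read_scores are hypothetical, as is the example model size "70m". It assumes that each CSV starts with a header row naming the columns idx and scores, and that scores are numeric.

```python
import csv
import io


def results_filename(model_size, context_size, continuation_size):
    """Build a results file name following the format described above.

    Hypothetical helper: the repository may build the name differently.
    """
    return (
        f"memorization_evals_{model_size}_deduped-v0_"
        f"{context_size}_{continuation_size}_143000.csv"
    )


def read_scores(handle):
    """Read idx/scores rows from an open CSV file into a dict.

    Assumes a header row with the column names idx and scores.
    """
    reader = csv.DictReader(handle)
    return {int(row["idx"]): float(row["scores"]) for row in reader}


# In-memory stand-in for a real results file (made-up values, for illustration only).
sample = io.StringIO("idx,scores\n0,0.25\n1,1.0\n")
scores = read_scores(sample)
print(results_filename("70m", 32, 32))
print(scores)
```

For a real file, open it with open(path, newline="") and pass the handle to read_scores.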

run_generate.sh: Launches the batch_generate.py script. The input parameters are model size, checkpoint (usually the last step), batch size (usually fixed), context size, and continuation size.
data_download.py: Downloads the pre-training data. It probably does not need to be run again.
cluster.py: Samples memorized and unmemorized data points, applies dimensionality reduction, and plots the result in a figure.
clmtraing.py: Trains a model on a causal language modelling task.
embedding_obtain.py: Shows how to obtain hidden-state embeddings for Pythia or any other model.
generate.py, csv_process.py, csv_reformat.py: Helper and format-conversion scripts that may not be needed again.
example_explore.py: Shows how to run generation on a single example.