Memorization Study

Directory Explanation

generate_results: Contains the sentence indices (idx) and the memorization score of each sentence. File names follow the format memorization_evals_{model_size}_deduped-v0_{context size}_{continuation size}_143000.csv, and each file has two columns, idx and scores.
dedup_data: This contains the original deduplicated data.
dedup_merge: This contains the merged deduplicated data.
undeduped_data: This contains the original undeduplicated data.
undedup_merge: This contains the merged undeduplicated data.
pythia: The pythia package.
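
The file naming scheme described for generate_results can be sketched as follows. This is a minimal illustration, not code from the repository: the helper names result_filename and load_scores are hypothetical, the example values (model size "70m", context size 32, continuation size 16) are chosen only to show the pattern, and the sketch assumes idx values are integers and scores are numeric.

```python
import csv


def result_filename(model_size, context_size, continuation_size):
    """Build a file name following the pattern described for generate_results."""
    return (
        f"memorization_evals_{model_size}_deduped-v0_"
        f"{context_size}_{continuation_size}_143000.csv"
    )


def load_scores(path):
    """Read a results CSV with columns idx and scores into a dict {idx: score}.

    Assumption: idx is an integer and scores is a number.
    """
    with open(path, newline="") as f:
        return {int(row["idx"]): float(row["scores"]) for row in csv.DictReader(f)}


# Hypothetical example values, used only to demonstrate the naming pattern.
name = result_filename("70m", 32, 16)
print(name)
```

Calling load_scores on a file inside generate_results would then give a lookup from sentence index to score, assuming the two-column layout described above.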

run_generate.sh: Launches the batch_generate.py script. The input parameters are model size, checkpoint (usually the last step), batch size (usually fixed), context size, and continuation size.
data_download.py: Downloads the pre-training data. It probably does not need to be run again.
cluster.py: Samples memorized and unmemorized data points, applies dimensionality reduction, and shows the result in a figure.
clmtraing.py: Trains a model on a causal language modelling task.
embedding_obtain.py: Shows how to obtain hidden-state embeddings for Pythia or any other model.
generate.py, csv_process.py, csv_reformat.py: Helper and format-conversion scripts that may not be needed again.
example_explore.py: Shows how to run generation on a single example.
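
The context size and continuation size passed to run_generate.sh suggest the following evaluation shape: the first tokens of a sequence serve as the prompt, and the tokens that follow are the reference continuation. The sketch below illustrates that split together with one plausible scoring rule, the fraction of generated tokens that match the reference position by position. Both function names are hypothetical and the scoring rule is an assumption; batch_generate.py may compute its scores differently.

```python
def split_example(token_ids, context_size, continuation_size):
    """Split a token sequence into (context, reference continuation)."""
    end = context_size + continuation_size
    if len(token_ids) < end:
        raise ValueError("sequence is shorter than context + continuation")
    return token_ids[:context_size], token_ids[context_size:end]


def match_fraction(generated, reference):
    """Fraction of reference positions where the generated token is identical.

    Assumption: this is only one possible definition of a memorization score.
    """
    if not reference:
        raise ValueError("reference continuation must not be empty")
    matches = sum(g == r for g, r in zip(generated, reference))
    return matches / len(reference)


# Toy token IDs stand in for a real tokenized sentence.
tokens = list(range(10))
context, reference = split_example(tokens, context_size=6, continuation_size=4)

# Pretend the model reproduced the first three continuation tokens only.
generated = [6, 7, 8, 0]
score = match_fraction(generated, reference)
print(context, reference, score)
```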
