This repository contains the implementation of the ideas described in the paper "Generative Language Models on Nucleotide Sequences of Human Genes" by Musa Nuri İhtiyar and Arzucan Özgür of the Computer Engineering Department of Boğaziçi University.
The "actualDatasets" folder contains the files used for training and evaluating the various models.
The "nGram", "laplaceSmoothing", "rnn" and "transformer" folders contain the code for the corresponding methods. Each of these folders contains a README file describing how to run the code, and each already includes the final models and results produced by running the programs.
Lastly, the "syntheticMutationDataset" folder contains the code for generating a synthetic mutation dataset and evaluating different models on it. It has a README file as well.
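
For readers unfamiliar with the simplest of the methods named above, the following is a minimal sketch of an n-gram language model with Laplace (add-one) smoothing over nucleotide sequences. It is not the code in the "nGram" or "laplaceSmoothing" folders; the function names, the four-letter alphabet "ACGT", and the use of perplexity as the evaluation measure are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed vocabulary for illustration; the actual datasets may use other symbols.
ALPHABET = "ACGT"


def train_counts(sequences, n):
    """Count every n-gram and its (n-1)-symbol context in the given sequences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def laplace_prob(gram, ngram_counts, context_counts, vocab_size=len(ALPHABET)):
    """P(last symbol | context) with add-one smoothing, so unseen n-grams get nonzero mass."""
    return (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + vocab_size)


def perplexity(seq, n, ngram_counts, context_counts):
    """Perplexity of one sequence under the smoothed n-gram model."""
    num_grams = len(seq) - n + 1
    if num_grams <= 0:
        raise ValueError("sequence is shorter than n")
    log_sum = 0.0
    for i in range(num_grams):
        log_sum += math.log(laplace_prob(seq[i:i + n], ngram_counts, context_counts))
    return math.exp(-log_sum / num_grams)


if __name__ == "__main__":
    # Toy sequences, not taken from the "actualDatasets" folder.
    counts, contexts = train_counts(["ACGT", "ACGA"], n=2)
    print(laplace_prob("AC", counts, contexts))
    print(perplexity("ACG", 2, counts, contexts))
```

Without smoothing, any n-gram absent from the training data would receive probability zero and make the perplexity undefined; adding one to every count avoids this at the cost of shifting some probability mass away from observed n-grams.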