Code for next-word prediction training on the BookMIA dataset. It is part of the code used for the experiments in the paper "Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?"
Dataset: Here the language models are trained with next-token prediction on older books, so that these books appear as copyrighted text in the model.
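
For readers unfamiliar with the objective, the following is a minimal sketch of next-token prediction in plain Python. It is not the training code from this repository. The function names (`make_next_token_pairs`, `cross_entropy`, `next_token_loss`), the toy token IDs, and the vocabulary size are all hypothetical and chosen only for illustration. It shows how a token sequence is shifted by one position to form inputs and targets, and how the average cross-entropy loss is computed from per-position logits.

```python
import math


def make_next_token_pairs(token_ids):
    """Shift the sequence by one: targets[i] is the token that follows inputs[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a training pair")
    return token_ids[:-1], token_ids[1:]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def next_token_loss(logits_per_position, targets):
    """Mean cross-entropy over all positions in the sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)


# Toy example (hypothetical values): a 4-token "book excerpt" and a vocabulary of 8 tokens.
token_ids = [5, 2, 7, 1]
vocab_size = 8
inputs, targets = make_next_token_pairs(token_ids)

# A model that knows nothing assigns equal logits to every token,
# so its loss equals log(vocab_size).
uniform_logits = [[0.0] * vocab_size for _ in inputs]
loss = next_token_loss(uniform_logits, targets)
```

Training lowers this loss on the book text, which is what makes the text recoverable from the model afterwards. In practice the logits would come from the language model and the loss would be minimized by gradient descent; that part is omitted here.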