This is a fork of gpt2-ml, a wonderful project that is no longer maintained. Hope @imcaspar is doing well. This fork fixes some broken download links and makes the pre-trained model persistent, which means you don't need to download the pre-trained files every time...
Try it now:
If it fails to run, check the dependencies:
- Simplified GPT2 training scripts (based on Grover, supporting TPUs)
- Ported BERT tokenizer, compatible with multilingual corpora
- 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps)
- Batteries-included Colab demo
- 1.5B GPT2 pretrained Chinese model (~30G corpus, 220k steps)
Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |
---|---|---|---|---|---|---|
1.5B Params | Chinese | ~30G | CLUE ( 8021 tokens ) | Google Drive | Baidu Pan (ffz6) | e698cc97a7f5f706f84f58bb469d614e 51d3c0ce5f9ab9bf77e01e3fcb41d482 |
1.5B Params | Chinese | ~15G | Bert ( 21128 tokens ) | Google Drive | Baidu Pan (q9vr) | 4a6e5124df8db7ac2bdd902e6191b807 a6983a7f5d09fb10ce011f9a073b183e |
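
The SHA256 column can be used to check a download before loading it. The following is a minimal sketch, not part of the original scripts. It assumes that the two space-separated halves shown in each table cell are one 64-character digest wrapped for display, and that the digest applies to the downloaded file as a whole. The keys `30G-CLUE` and `15G-Bert` and the function names are hypothetical labels chosen for this example.

```python
import hashlib

# Digests copied from the table above. Assumption: the two halves shown in
# each cell are a single SHA256 hex digest that was wrapped for display.
EXPECTED_SHA256 = {
    "30G-CLUE": "e698cc97a7f5f706f84f58bb469d614e" "51d3c0ce5f9ab9bf77e01e3fcb41d482",
    "15G-Bert": "4a6e5124df8db7ac2bdd902e6191b807" "a6983a7f5d09fb10ce011f9a073b183e",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, reading it in chunks so a
    multi-gigabyte checkpoint never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, model_key):
    """Compare a downloaded file against the digest listed for model_key
    (a hypothetical label, see EXPECTED_SHA256)."""
    return sha256_of_file(path) == EXPECTED_SHA256[model_key]
```

For example, `verify_checkpoint("path/to/downloaded/file", "30G-CLUE")` returns `True` only if the file matches the digest listed for the ~30G model; the path here is a placeholder for wherever the download was saved.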
Corpus from THUCNews and nlp_chinese_corpus
Trained for 220k steps on a Cloud TPU Pod v3-256
Because Google has reduced free GPU performance on Colab, the Colab demo may stall or become unresponsive unless you have a paid account. Paid accounts are billed by Google; neither I nor any of the contributors receive any money from them, since this is a completely free and open-source project.
With just 2 clicks (not including the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:
The contents of this repository are for academic research purposes, and we do not provide any conclusive remarks.
@misc{GPT2-ML,
author = {Zhibo Zhang and zxkmm},
title = {GPT2-ML: GPT-2 for Multiple Languages},
year = {2019},
note = {Fork updated in 2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}
https://github.com/google-research/bert
https://github.com/rowanz/grover
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)