FORGE: Pre-training Open Foundation Models for Science

Contributions

Best practices for end-to-end pre-training of LLMs for science on HPC systems
Open releases of a set of foundation models (and domain datasets) trained on a scientific corpus
Proposed science-related downstream benchmarks for evaluating LLMs for science
Heuristics for large-batch training and communication requirements
An evaluation of current practices, together with our observations

FORGE models

Model	#Params	#Tokens	Link
Forge-bio	1.44B	38B	download
Forge-che	1.44B	41B	download
Forge-eng	1.44B	29B	download
Forge-mat	1.44B	15B	download
Forge-phy	1.44B	32B	download
Forge-soc	1.44B	90B	download
Forge-s1	1.44B	10B	download
Forge-s2	1.44B	20B	download
Forge-s3	1.44B	30B	download
Forge-s4	1.44B	257B	download
Forge-m1	13B	30B	download
Forge-m2	13B	257B	download
Forge-l	22.4B	257B	download
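
To compare how much data each model saw relative to its size, the table above can be reduced to a tokens-per-parameter ratio. The sketch below is only illustrative arithmetic on the numbers listed in the table; the ratio is a common rule-of-thumb quantity and is not a metric reported in this README, and the names `FORGE_MODELS` and `tokens_per_parameter` are hypothetical.

```python
# (model name, parameters in billions, training tokens in billions),
# copied from the table above.
FORGE_MODELS = [
    ("Forge-bio", 1.44, 38),
    ("Forge-che", 1.44, 41),
    ("Forge-eng", 1.44, 29),
    ("Forge-mat", 1.44, 15),
    ("Forge-phy", 1.44, 32),
    ("Forge-soc", 1.44, 90),
    ("Forge-s1", 1.44, 10),
    ("Forge-s2", 1.44, 20),
    ("Forge-s3", 1.44, 30),
    ("Forge-s4", 1.44, 257),
    ("Forge-m1", 13.0, 30),
    ("Forge-m2", 13.0, 257),
    ("Forge-l", 22.4, 257),
]


def tokens_per_parameter(params_b: float, tokens_b: float) -> float:
    """Return training tokens divided by parameter count (both in billions)."""
    return tokens_b / params_b


# Print one line per model: name followed by its tokens-per-parameter ratio.
for name, params_b, tokens_b in FORGE_MODELS:
    print(f"{name:10s} {tokens_per_parameter(params_b, tokens_b):7.1f}")
```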

Data sources

CORE: https://core.ac.uk/documentation/dataset (core_2020-12-20)
MAG: https://www.microsoft.com/en-us/research/project/open-academic-graph/ (v2-1)
Aminer: https://www.microsoft.com/en-us/research/project/open-academic-graph/ (v2-1)
arXiv: https://huggingface.co/datasets/arxiv_dataset
Scopus: 6M abstracts for the DOIs extracted via Scopus API

Example usages

FORGE models can be loaded and used through the standard Hugging Face Transformers API:

from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast
model = GPTNeoXForCausalLM.from_pretrained("path_to_forge_model")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("path_to_forge_model")
prompt = "high entropy alloy applications include"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids,
                            do_sample=True,
                            temperature=0.7,
                            max_length=100)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)

high entropy alloy applications include high strength steels, alloys, composites, as well some metal alloys. In recent years, there has been much interest the use of such materials for manufacturing parts, components, machinery. For example, automotive sector an increasing number applications. most widely used is steels.
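
The snippet above decodes the prompt together with the generated tokens. If only the continuation is wanted, the prompt tokens can be sliced off before decoding. The helper below is a minimal sketch of that idea; `generate_continuation` is a hypothetical name and not part of this repository, and it assumes `model` and `tokenizer` are objects loaded as shown above.

```python
def generate_continuation(model, tokenizer, prompt,
                          max_new_tokens=60, temperature=0.7):
    """Sample a continuation of `prompt` and return only the new text.

    `model` and `tokenizer` are assumed to follow the Hugging Face
    Transformers causal-LM interface used in the example above.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    gen_tokens = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,  # limits generated tokens, excluding the prompt
    )
    # The output sequence starts with the prompt tokens; drop them.
    prompt_len = inputs.input_ids.shape[1]
    new_tokens = gen_tokens[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

With the model and tokenizer from the example above, it could be called as `print(generate_continuation(model, tokenizer, "high entropy alloy applications include"))`. Because sampling is enabled, the output differs from run to run.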

Pre-processing

Steps on preprocessing CORE, MAG and Aminer
Steps on domain partitioning

Training

Software environment, configurations, and steps for pre-training

Scientific downstream tasks

Domain subject and material phase classifications
Energy regression

Raw performance data and plots

The raw performance data, including computational performance, loss, and downstream evaluations, are available
A Jupyter notebook for plotting the data is also provided

Reference

@INPROCEEDINGS{10.1145/3581784.3613215,
  author={Junqi Yin and Sajal Dash and Feiyi Wang and Mallikarjun Shankar},
  title={FORGE: Pre-training Open Foundation Models for Science}, 
  booktitle={SC23: International Conference for High Performance Computing, Networking, Storage and Analysis}, 
  year={2023},
  doi={10.1145/3581784.3613215}}
