We propose domain subject classification and alloy phase classification tasks.
The labelled dataset is generated by randomly sampling domain journals from the CORE data.
- Fine-tune and validate on domain text, e.g.
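As a rough illustration of the random sampling described above, here is a minimal sketch. It is not the repository's actual data pipeline: the record fields `domain` and `text`, the function name `sample_labelled_examples`, and the toy records are all hypothetical and do not reflect the CORE schema.

```python
import random
from collections import defaultdict


def sample_labelled_examples(records, per_label, seed=0):
    """Randomly draw up to `per_label` texts for each domain label.

    `records` is an iterable of dicts with hypothetical keys
    "domain" (the label) and "text" (the document text).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group texts by their domain label.
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["domain"]].append(rec["text"])

    # Sample without replacement within each label.
    dataset = []
    for label in sorted(by_label):
        texts = by_label[label]
        k = min(per_label, len(texts))
        dataset.extend((text, label) for text in rng.sample(texts, k))

    rng.shuffle(dataset)  # mix the labels before splitting or training
    return dataset


# Toy records standing in for journal articles (illustrative only).
toy_records = (
    [{"domain": "physics", "text": f"physics abstract {i}"} for i in range(5)]
    + [{"domain": "biology", "text": f"biology abstract {i}"} for i in range(5)]
    + [{"domain": "materials", "text": "materials abstract 0"}]
)

toy_dataset = sample_labelled_examples(toy_records, per_label=2, seed=42)
```

Sampling per label rather than over the whole pool keeps the classes roughly balanced; whether the original dataset was balanced this way is an assumption of this sketch.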
python llm-classifier.py --model globuslabs/ScholarBERT --emb-size 1024
The labelled dataset is obtained from https://www.nature.com/articles/s41524-020-0308-7
DeepSpeed-parallelized versions of the fine-tuning code are also provided for both of the above tasks. For example, to run the phase classification task on Summit:
bsub phase/launch_classifier_phase.lsf