- Introduction
- Model and Data
- Environment
- Repository Introduction
- Quick start
- Fine-tuning instructions
- Citation
## Introduction

SciMind is a multimodal mixture-of-experts large language model built on Llama2 within the MindSpore framework.
Improvements over Llama2:

- SMILES molecules, protein molecules, and nucleic acid molecules are each treated as a separate modality; special tokens are constructed for them and added to the vocabulary.
- The feed-forward layer of Llama is partitioned into experts, and during the forward pass tokens of different modalities can be routed to different numbers of experts (a conceptual sketch follows this list).
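For intuition, the routing idea can be sketched in a few lines of NumPy. This is only a conceptual illustration under assumed names, shapes, and a made-up per-modality top-k policy; it is not the implementation in llama_layer.py.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, N_EXPERTS = 16, 4
# One tiny feed-forward "expert" per slot (random placeholder weights).
experts = [rng.standard_normal((HIDDEN, HIDDEN)) for _ in range(N_EXPERTS)]

# Illustrative policy: how many experts a token may use, keyed by its modality.
TOP_K_BY_MODALITY = {"text": 1, "smiles": 2, "protein": 2, "nucleic_acid": 2}

def moe_ffn(hidden_states, modalities, router_weight):
    """Route each token to top-k experts chosen by its modality and mix their outputs."""
    logits = hidden_states @ router_weight              # [num_tokens, N_EXPERTS]
    outputs = np.zeros_like(hidden_states)
    for i, modality in enumerate(modalities):
        k = TOP_K_BY_MODALITY[modality]                 # the modality decides how many experts
        top = np.argsort(logits[i])[-k:]                # indices of the k highest-scoring experts
        gate = np.exp(logits[i][top])
        gate /= gate.sum()                              # softmax over the selected experts only
        outputs[i] = sum(g * (hidden_states[i] @ experts[e]) for g, e in zip(gate, top))
    return outputs

tokens = rng.standard_normal((3, HIDDEN))               # three tokens: text, SMILES, protein
router = rng.standard_normal((HIDDEN, N_EXPERTS))
print(moe_ffn(tokens, ["text", "smiles", "protein"], router).shape)   # -> (3, 16)
```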
## Model and Data

- The pre-trained model (without fine-tuning) is available at: Scimind.ckpt
- The LPM-24 text2smiles fine-tuned model is available at: Scimind-text2smiles.ckpt
- The LPM-24 smiles2text fine-tuned model is available at: Scimind-smiles2text.ckpt
- For the LPM-24 dataset, see the Repository Introduction section.
- Other data:
| Dataset |
| --- |
| ChEBI |
| MoA |
| Protein-oriented_Instructions |
## Environment

- Hardware: Ascend 910A/B
- MindSpore: 2.2
- MindFormers version: 0.8.0
- Refer to requirement.txt for other dependencies (a quick installation check is sketched below)
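Before running anything heavier, a quick way to confirm the installation is working (a hedged sketch; the Ascend target and device_id=0 are assumptions about your setup):

```python
import mindspore as ms

# Point MindSpore at the Ascend backend (adjust device_id for your machine).
ms.set_context(device_target="Ascend", device_id=0)

print(ms.__version__)   # should report 2.2.x
ms.run_check()          # runs a tiny op to verify the installation works
```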
## Repository Introduction

- Files related to the model are located in ./mindformers/model/llama:

      llama
      ├── __init__.py
      ├── convert_weight.py      # Weight conversion script
      ├── llama.py               # Model definition
      ├── llama_config.py        # Model configuration
      ├── llama_layer.py         # Llama network layer definitions, with the added mixture-of-experts system
      ├── llama_processor.py     # Llama preprocessing
      ├── llama_tokenizer.py     # Tokenizer with the expanded vocabulary
      └── llama_transformer.py   # Transformer layers
- The model parameter configuration file is located in ./configs/llama:

      llama
      └── run_llama_7b_finetune.yaml   # Startup configuration for full fine-tuning of the 7B model
- LPM-24 data is located in ./LPM-24-data:

      LPM-24-data
      ├── smiles_to_text
      │   ├── LPM-24_train_qa_conversation.json
      │   ├── LPM-24_test_qa_conversation.json
      │   ├── LPM-24_test.json
      │   └── LPM-24_train.json
      ├── smiles_to_text_generate
      │   ├── LPM-24_smiles2text_generate.txt   # Already processed and ready for direct inference
      │   ├── eval-text.txt                     # LPM-24 original validation set
      │   └── eval-text-special_token.txt       # LPM-24 dataset processed with special tokens
      ├── text_to_smiles
      │   ├── LPM-24_train_qa_conversation.json
      │   ├── LPM-24_test_qa_conversation.json
      │   └── LPM-24_test.json
      └── text_to_smiles_generate
          ├── LPM-24_text2smile_generate.txt    # Already processed and ready for direct inference
          └── eval-molgen.txt                   # LPM-24 original validation set
## Quick start

To use our model quickly, first set the model path and the inference-acceleration option in the configuration file:

- Configuration file: ./mindformers/configs/llama2/run_llama2_7b_finetune.yaml
  - load_checkpoint: /path/to/your/model file
  - use_past: True

A small sanity check for these two settings is sketched below.
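If you want to script a quick check that both settings were picked up, here is a minimal sketch using PyYAML. The key names load_checkpoint and use_past come from the list above; how deeply they are nested in the YAML tree is not assumed, so the helper searches recursively.

```python
import yaml  # PyYAML

def find_key(node, key):
    """Recursively look up `key` anywhere in the parsed YAML tree."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for value in node.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(node, list):
        for value in node:
            found = find_key(value, key)
            if found is not None:
                return found
    return None

with open("./mindformers/configs/llama2/run_llama2_7b_finetune.yaml") as f:
    cfg = yaml.safe_load(f)

print("load_checkpoint:", find_key(cfg, "load_checkpoint"))   # expect your checkpoint path
print("use_past:", find_key(cfg, "use_past"))                 # expect True for fast inference
```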
Then prepare data in the correct format and run the inference script:

- For the smiles2text generation task, construct data like: ./LPM-24-data/smiles2text_generate/LPM-24_smile2text_generate.txt
- For the text2smiles generation task, construct data like: ./LPM-24-data/text2smiles_generate/LPM-24_text2smiles_generate.txt
- Run the script run_mindformer.py (on a single NPU):

      python run_mindformer.py \
          --config {config_file_path} \
          --run_mode predict \
          --predict_data {input_data_path} \
          --use_parallel False \
          --device_id 0 \
          --save_file {output_path}
Example:

    python run_mindformer.py \
        --config configs/llama2/run_llama2_7b_finetune.yaml \
        --run_mode predict \
        --predict_data ./LPM-24-data/smiles2text_generate/LPM-24_smile2text_generate.txt \
        --use_parallel False \
        --device_id 0 \
        --save_file ./LPM-24-data/smiles2text_generate/output_LPM_smiles2text.txt
## Fine-tuning instructions

- Process the original LPM-24 data into a conversational text format and save it as a JSON file. Reference JSON format: ./LPM-24-data/text2smiles/LPM-24_train_qa_conversation.json (a rough illustration of this conversion is sketched below).
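As a rough illustration only, a conversion from raw description/SMILES pairs into conversational Q&A records might look like the sketch below. The keys used here ("conversation", "role", "content") are placeholders, not the repository's actual schema; copy the structure from the reference file above instead.

```python
import json

def to_conversations(records):
    """Wrap raw (description, SMILES) pairs as question/answer turns.

    The keys "conversation", "role", and "content" are hypothetical; mirror the
    actual structure of LPM-24_train_qa_conversation.json instead.
    """
    out = []
    for rec in records:
        out.append({
            "conversation": [
                {"role": "user", "content": f"Generate a SMILES string for: {rec['description']}"},
                {"role": "assistant", "content": rec["smiles"]},
            ]
        })
    return out

if __name__ == "__main__":
    raw = [{"description": "phenol, an aromatic ring bearing a hydroxyl group", "smiles": "Oc1ccccc1"}]
    with open("LPM-24_train_qa_conversation.example.json", "w") as f:
        json.dump(to_conversations(raw), f, indent=2)
```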
- Further process the JSON file (e.g., ./LPM-24-data/text2smiles/LPM-24_train_qa_conversation.json) into a MindRecord file for model fine-tuning. Use the following preprocessing script to generate MindRecord training data (a read-back sanity check is sketched after the command):

      python llama_preprocess_2.py \
          --dataset_type qa \
          --input_glob {json/data/path} \
          --model_file ./mindformers/checkpoint_download/llama2/tokenizer.model \
          --seq_length 2048 \
          --output_file {output_path}example.mindrecord
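Before fine-tuning, it can help to read a few records back and confirm the MindRecord file looks sane. This is a minimal sketch using MindSpore's FileReader; the file name matches the --output_file example above, and the printed column names simply reflect whatever the preprocessing script wrote.

```python
from mindspore.mindrecord import FileReader

# Replace with the path passed to --output_file above.
reader = FileReader(file_name="example.mindrecord", num_consumer=1)
for i, record in enumerate(reader.get_next()):
    # Print each column's name and shape (or the raw value for scalars).
    print({key: getattr(value, "shape", value) for key, value in record.items()})
    if i >= 2:          # peek at the first few samples only
        break
reader.close()
```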
- Specify the model location and the MindRecord data location in the configuration file (./mindformers/configs/llama2/run_llama2_7b_finetune.yaml):
  - load_checkpoint: /path/to/your/model file
  - dataset_dir: /path/to/your/mindrecord file
- Change to ./mindformers/scripts and run the shell script:

      bash run_distribute.sh /user/config/jobstart_hccl.json ../configs/llama2/run_llama2_7b_finetune.yaml [0,8] train
## Citation

If you use SciMind in your work, please cite our paper:
    @inproceedings{xiong2024scimind,
    title={SciMind: A Multimodal Mixture-of-Experts Model for Advancing Pharmaceutical Sciences},
    author={Zhaoping Xiong and Xintao Fang and Haotian CHU and Xiaozhe Wan and Liwei Liu and Yameng Li and Wenkai Xiang and Mingyue Zheng},
    booktitle={ACL 2024 Workshop Language + Molecules},
    year={2024},
    url={https://openreview.net/forum?id=xbyPquFUB4}
    }