Advanced embedding model training pipeline based on large language models.

EN | 简体中文

Piccolo-GPT

Paper: Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training

🚀 New SOTA on CMTEB

🔥 Our general sentence embedding model selmisskilig/piccolo-gpt-zh achieves SOTA on the CMTEB leaderboard with an average score of 70.95. [2024/4/23] One of the embedding models based on large-model training has achieved an average score of 68.03. This repository contains the training pipeline code for Piccolo-GPT, a branch of Piccolo Embedding.

💡 Model Highlights

piccolo-gpt-zh, trained with Piccolo-GPT, outperforms most models in the combined evaluation of the six task categories on the CMTEB leaderboard and is currently in the top 10. piccolo-gpt-zh primarily utilises an efficient multi-task hybrid loss training method to effectively exploit text data and labels from different downstream tasks.

For the Hugging Face model, please refer to our space: selmisskilig/piccolo-gpt-zh. For training details, please refer to our tech report: https://arxiv.org/abs/2405.06932

📖 Repo Details

In this repo, we release our training code and implement some tricks that are helpful for embedding training, including:

  • Multi-task Hybrid Loss Training
  • Matryoshka Representation Learning (see the sketch after this list)
  • Embedding Dimension Scaling
  • Task-Homogeneous Dataset
  • Position Embedding Hierarchical Decomposition
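
To make the Matryoshka idea concrete, here is a minimal sketch of an in-batch InfoNCE loss averaged over nested embedding prefixes. This is not the repo's implementation; the dimension list, temperature, and the InfoNCE formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce(query: torch.Tensor, passage: torch.Tensor,
                       dims=(256, 512, 768, 1024), scale: float = 20.0) -> torch.Tensor:
    """Average an in-batch InfoNCE loss over nested embedding prefixes (MRL).

    query / passage: (B, D) unnormalized embeddings with D >= max(dims).
    dims and scale are illustrative values, not the repo's settings.
    """
    labels = torch.arange(query.size(0), device=query.device)
    losses = []
    for d in dims:
        q = F.normalize(query[:, :d], dim=-1)   # truncate to the prefix, then re-normalize
        p = F.normalize(passage[:, :d], dim=-1)
        logits = q @ p.T * scale                # (B, B); diagonal entries are the positives
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()
```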

To save memory, we use DeepSpeed ZeRO-1, gradient checkpointing, and mixed-precision training by default. We also provide SLURM scripts to help run distributed training.
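
As a rough sketch of how these memory-saving defaults fit together (this is not scripts/ft.sh or the shipped deepspeed_config.json; the checkpoint name, batch size, learning rate, and bf16 choice below are placeholder assumptions):

```python
import deepspeed
from transformers import AutoModel

# Placeholder checkpoint name; in this repo the model comes from MODEL_NAME_OR_PATH.
model = AutoModel.from_pretrained("your-base-model")
model.gradient_checkpointing_enable()        # trade extra compute for activation memory

ds_config = {                                 # minimal ZeRO-1 + mixed-precision config, illustrative values
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 1},        # shard optimizer states across data-parallel ranks
    "bf16": {"enabled": True},                # use "fp16" instead on GPUs without bf16 support
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```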

🔨 Best Practice

1. Environment

pip install -r requirements.txt

2. Prepare Dataset

We divide the dataset into three major categories: retrieval/reranking, clustering/classification, and sts/pair classification, and we adopt different loss functions for different categories. In the data_example dir, we provide examples of these three types of data; an illustrative sketch also follows the list below.

  1. Retri: For retrieval and reranking datasets, we implement standard InfoNCE with in-batch negatives. Each dataset contains 4 columns: text, text_pos, text_neg, type. Here 'type' is 'retri_contrast'.

  2. STS: For STS and pair-classification datasets, we implement the CoSENT loss. Each dataset contains 4 columns: text, text_pair, label, type. Here 'type' is 'cosent'.

  3. Cls: For classification and clustering datasets, we implement InfoNCE without in-batch negatives. Each dataset contains 4 columns: text, text_pos, text_neg, type. Here 'type' is 'cls_contrast'.
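
For a quick sense of what rows of each type look like, here is an illustrative sketch. The field contents are made up, and the authoritative format is the files under data_example (in particular, whether text_neg holds a single string or a list of negatives should be checked there):

```python
# Illustrative rows only; follow the files in data_example/ for the exact format.
retri_row = {
    "text": "what is matryoshka representation learning",   # query
    "text_pos": "a passage that answers the query",          # relevant passage
    "text_neg": "an unrelated passage",                      # hard negative (string or list, see data_example)
    "type": "retri_contrast",
}
sts_row = {
    "text": "A man is playing a guitar.",
    "text_pair": "Someone is playing an instrument.",
    "label": 1,                                               # similarity / pair label
    "type": "cosent",
}
cls_row = {
    "text": "This phone's battery dies within two hours.",
    "text_pos": "negative review",                            # text of the correct label/cluster
    "text_neg": "positive review",                            # text of a wrong label/cluster
    "type": "cls_contrast",
}
```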

The 'type' column indicates the type of the current dataset. We read the type of the current batch to decide which loss function to apply, as sketched below.
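
A minimal sketch of this dispatch is shown next. It is not the repo's actual code: the temperature, the treatment of hard negatives, and the exact CoSENT formulation are common conventions assumed here, not guaranteed to match the implementation.

```python
import torch
import torch.nn.functional as F

SCALE = 20.0  # illustrative temperature

def retri_contrast_loss(q, p):
    """InfoNCE with in-batch negatives: every other positive in the batch acts as a negative.
    Hard negatives from text_neg are omitted here for brevity."""
    logits = F.normalize(q, dim=-1) @ F.normalize(p, dim=-1).T * SCALE   # (B, B)
    return F.cross_entropy(logits, torch.arange(q.size(0), device=q.device))

def cls_contrast_loss(q, pos, neg):
    """InfoNCE without in-batch negatives: only each sample's own positive and negative."""
    q, pos, neg = (F.normalize(x, dim=-1) for x in (q, pos, neg))
    pos_sim = (q * pos).sum(-1, keepdim=True)                 # (B, 1)
    neg_sim = (q * neg).sum(-1, keepdim=True)                 # (B, 1); extend for multiple negatives
    logits = torch.cat([pos_sim, neg_sim], dim=-1) * SCALE    # column 0 is the positive
    return F.cross_entropy(logits, torch.zeros(q.size(0), dtype=torch.long, device=q.device))

def cosent_loss(a, b, labels):
    """CoSENT: pairs with higher labels should have higher cosine similarity."""
    sim = F.cosine_similarity(a, b) * SCALE                   # (B,)
    diff = sim[None, :] - sim[:, None]                        # diff[i, j] = sim_j - sim_i
    diff = diff[labels[:, None] > labels[None, :]]            # keep pairs where label_i > label_j
    zero = torch.zeros(1, device=sim.device)
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)    # log(1 + sum exp(diff))

def batch_loss(batch_type: str, enc: dict) -> torch.Tensor:
    """enc holds already-encoded embeddings for the batch's columns, plus label tensors."""
    if batch_type == "retri_contrast":
        return retri_contrast_loss(enc["text"], enc["text_pos"])
    if batch_type == "cls_contrast":
        return cls_contrast_loss(enc["text"], enc["text_pos"], enc["text_neg"])
    if batch_type == "cosent":
        return cosent_loss(enc["text"], enc["text_pair"], enc["label"])
    raise ValueError(f"unknown dataset type: {batch_type}")
```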

3. Training

We provide the training script at scripts/ft.sh. Below are explanations of the variables used in the script.

Environment Variables

  • ROOT: The absolute path of the repo on your machine.
  • GPUS_PER_NODE: Number of GPUs on a single machine. The default provided script is for single-machine training. For multi-node distributed training, you should additionally specify the environment variables below:
  • WORLD_SIZE: Number of nodes.
  • RANK: Rank of the current node, often obtained through SLURM environment variables.
  • MASTER_ADDR / MASTER_PORT: Address and port of the master node used for communication.

Training Variables

  • MODEL_NAME_OR_PATH: The absolute path of the pretrained model.
  • DS_PATH: The DeepSpeed config; we provide a default config at ./data_example/deepspeed_config.json.
  • META_PATHS: The list of datasets. We provide a sample in meta_lists/piccolo.txt. Each row consists of two columns: the relative path of the dataset and the number of repeats (see the example after this list).
  • ROOT_DIRS: The absolute path of the dataset directory.
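
As an illustration of the META_PATHS format described above, the snippet below writes a meta list with the two-column layout. The paths, repeat counts, and the whitespace separator are hypothetical; check meta_lists/piccolo.txt for the exact layout.

```python
# Hypothetical example of writing a meta list in the two-column format described above.
rows = [
    ("retri/my_retrieval_data.jsonl", 1),   # relative path under ROOT_DIRS, number of repeats
    ("sts/my_sts_data.jsonl", 2),
    ("cls/my_classification_data.jsonl", 1),
]
with open("meta_lists/my_meta.txt", "w") as f:
    for rel_path, repeats in rows:
        f.write(f"{rel_path} {repeats}\n")   # separator assumed; see meta_lists/piccolo.txt
```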

Run

bash scripts/ft.sh

🤗 Model List

| Model | Language | Description | Prompt |
|-------|----------|-------------|--------|
| selmisskilig/piccolo-gpt-zh | Chinese | Finetuned with multi-task hybrid loss training | None |

Citation

If you find our tech report, models, or code helpful, please cite our report or give us a star on GitHub or Hugging Face!

@misc{2405.06932,
  Author = {Junqin Huang and Zhongjie Hu and Zihao Jing and Mengya Gao and Yichao Wu},
  Title = {Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training},
  Year = {2024},
  Eprint = {arXiv:2405.06932},
}
