A collection of open-source Instruction-tuning dataset to train chat-based LLMs (ChatGPT,LLaMA,Alpaca).
This is an on-going project. We will soon add tags to classify the following datasets and continuously update our collection.
- The template
- The English Instruction Datasets
- tatsu-lab/Alpaca
- gururise/Cleaned Alpaca
- PhoebusSi/Alpaca-COT
- QingyiSi/Alpaca-CoT
- orhonovich/unnatural-instructions
- bigscience/PromptSource
- bigscience/P3
- allenai/natural-instructions
- allenai/super-natural-instructions
- google-research/FLAN 2021
- google-research/FLAN 2022 Collection
- tasksource-instruct
- LianjiaTech/BELLE 1.5M
- LianjiaTech/BELLE 10M
- XueFuzhao/InstructionWild
- ExMix
- UnifiedSKG
- MetaICL
- openai/InstructionGPT
- facebookresearch/metasqe/OPT-IML
- THUDM/GLM-130B
- laion/OIG
- baize/baize-chatbot
- lightaime/camel
- thunlp/UltraChat
- databrickslabs/doll
- Instruction-Tuning-with-GPT-4/GPT-4-LLM
- ShareGPT
- stanfordnlp/SHP
- Anthropic/hh-rlhf
- HuggingFaceH4/stack-exchange-preferences
- Hellp-SimpleAI/HC3
- f/awesome-chatgpt-prompts
- The Chinese Instruction Datasets
- The Miltilingual Instruction Datasets
- The Code Instruction Datasets
## [owner/project-name](https://github.com/link/to/project)
* Size:
* Language:
* Summary:
* Generation Method:
* Paper:
* HuggingFace: (if applicable)
* Demo: (if applicable)
* License:
- Size: 175 seed instructions, 52,002 instructions
- Language: EN
- Summary: Alpaca contains 52K instruction-following data, consisting of instruction, input and output.
- Generateion Method: Self-instruct with human written 175 seed tasks.
- Paper: Self-Instruct: Aligning Language Model with Self Generated Instructions
- HuggingFace: https://huggingface.co/datasets/tatsu-lab/alpaca
- License: CC BY NC 4.0
- Size: 51,713 instructions
- Language: EN
- Summary: Cleaned Alpaca Dataset helps solve the folowing issues: Hallucinations, Merged Instructions, Empty outputs, Empty code examples, Instructions to generate images, N/A outputs, Inconsistent input field, Wrong answers, Non-Sensical/Unclear instructions, and Extraneous escape and control characters.
- HuggingFace: https://huggingface.co/datasets/yahma/alpaca-cleaned
- License: CC BY NC 4.0
- Language: EN
- Summary: Alpaca-COT is a datset for Chain-of-Thoughts reasoning based on LLaMA and Alpaca.
- Generateion Method: Use the template provided by FLAN to change the original dataset into various Chain-of-Thoughts forms, and then convert them to the instruction-input-output triplets.
- HuggingFace: https://huggingface.co/datasets/QingyiSi/Alpaca-CoT
- License: Apache License
- Empty for now. Soon to update.
- Size: 240,000 instructions
- Language: EN
- Summary: Unnatural Instructions consist of a core dataset of 68,478 instruction-input-output triplets, and a full dataset.
- Generateion Method:
- Step 1 (Core Dataset Generation): Collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth, following a strict instruction-input-output format.
- Step 2 (Template Expansion): Prompt a language model to reformulate the tasks in the core dataset, and collect two alternative formulations for each generated task
- Paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
- License:
- Size: 180 tasks, 2,085 instructions
- Language: EN
- Summary: PromptSource aims at designing a prompt query such that the answer can be mapped onto the specific dataset
- Generateion Method:
- Five steps: Dataset Exploration, Prompt Writing, Prompt Documentation, Iteration and Variation, and Global Review.
- Paper: PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
- HuggingFace: https://huggingface.co/spaces/bigscience/promptsource/tree/main
- Demo: https://huggingface.co/spaces/bigscience/promptsource
- License:
- Size: 270 tasks, 2,085 instructions
- Language: EN
- Summary: P3 has a diverse set of NLP tasks, including multiple-choice QA, sentiment analysis or natural language inference.
- Generateion Method: A subset of the prompts available in Promptsource.
- Paper: Multitask Prompted Training Enables Zero-Shot Task Generalization
- HuggingFace: https://huggingface.co/datasets/bigscience/P3
- License:
- Size: 61 tasks, 61 instructions
- Language: EN
- Summary: Natural Instruct v1 is a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances.
- Generateion Method:
- Map exist datasets into Instruction Schema.
- Instruction Schema:
- Part I - Title + Definition + Things-to-Avoid + Emphasis-and-Caution
- Part II - Positive Example: Input + Output + Reason
- Part III - Negative Example: Input + Output + Reason + Suggestions to be modified to be positive
- Part IV - Prompt
- Paper: Cross-Task Generalization via Natural Language Crowdsourcing Instructions
- Demo: https://instructions.apps.allenai.org/
- License:
- Size: 1,616 tasks, 1,616 instructions
- Language: EN
- Summary: Super-Natural-Instruct v2 is built on Natural Instruct v1, has a simpler schema and contains over 1.5k tasks.
- Generateion Method:
- Map exist datasets into Instruction Schema.
- Instruction Schema:
- Part I - Definition
- Part II - Positive Example: Input + Output + Reason
- Part III - Negative Example: Input + Output + Reason + Suggestions to be modified to be positive
- Paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
- Demo: https://instructions.apps.allenai.org/
- License:
- Size: 62 tasks
- Language: EN
- Summary: FLAN 2021 aggregates 62 text datasets on Tensorflow Datasets into a single mixture. It is currently not public.
- Generateion Method: Map exist datasets into Instruction Schema.
- Paper: Finetuned Language Models Are Zero-Shot Learners
- License:
- Size: 1,836 tasks, 18,360 instructions
- Language: EN
- Summary: Flan 2022 Collection combines Flan 2021, P3 Dataset Family, Super-Natural Instructions, with some additional reasoning, dialog, and program synthesis datasets.
- Paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
- License:
- Size: 475 tasks, 4.7M instructions
- Language: EN
- Summary: Tasksource datasets recasted as instructions with minimal but unambiguous templates. This is particularly useful for classification and reasoning.
- HuggingFace: https://huggingface.co/datasets/tasksource/tasksource-instruct-v0
- Paper: tasksource: A Dataset Harmonization Framework for Streamlined NLP Multi-Task Learning and Evaluation
- License:
- Size: 175 seed instructions, 1.5M instructions
- Language: CH
- Summary: 1.5M Chinese instructions produced by BELLE, with various instruction types and domains.
- Generateion Method: Self-instruct with 175 Chinese seed tasks translated from the seed tasks in Alpaca, using text-davinci-003.
- Paper: Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases
- HuggingFace:
- Demo: https://github.com/LianjiaTech/BELLE/blob/main/chat/README.md
- License: https://github.com/LianjiaTech/BELLE/blob/main/DISCLAIMER
- Size: 10M instructions
- Language: CH
- Summary: 10M Chinese instructions produced by BELLE, with 4 subsets.
- Generateion Method:
- School Math: Chinese math questions and answers generated by ChatGPT.
- Multiturn Chat: Chinese multiturn chat generated by ChatGPT, with two characters Human and Assistant.
- Generated Chat: Chinese role-playing chat generated by ChatGPT.
- 2M Chinese instructions: Various Chinese instructions generated by ChatGPT.
- Paper: Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases
- HuggingFace:
- School Math https://huggingface.co/datasets/BelleGroup/school_math_0.25M
- Multiturn Chat https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M
- Generated Chat https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M
- 2M Chinese instructions https://huggingface.co/datasets/BelleGroup/train_2M_CN
- Demo: https://github.com/LianjiaTech/BELLE/blob/main/chat/README.md
- License: https://github.com/LianjiaTech/BELLE/blob/main/DISCLAIMER
- Size: 479 seed instructions, 52,191 Chinese instructions, 52,191 English instructions
- Language: CH, EN
- Summary: InstructionWild use the same format as Alpaca for fast and easy usage. Its instructions have no input field.
- Generateion Method:
- Pick 429 instructions over 700 noisy instructions from Twitter
- Use a similar method as Alpaca for generating the resulting instructions.
- License:
- Paper: ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
- Download: ExMix's official data is not open-sourced, but you can use the following URLs to download partial data in ExMiX.
- Paper: UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models
- Download: https://drive.google.com/drive/folders/1GXigUv3MU-Sh4XiY6Wz3xVeNT_s0SuON
- Size: 112,801 instructions
- Language: EN
- Generation Method: Human Annotated
- Paper: Training language models to follow instructions with human feedback
- Size: 1,667 tasks, 3,128 instructions
- Language: EN
- Summary: OPT-IML dataset expands the Super-Natural-Instructions benchmark with the task collections from multiple existing work on instruction-tuning, cross-task transfer studies, and area-specific task consolidation.
- Generation Method:
- Benchmarks included in OPT-IML are Super-Natural-Instructions, PromptSource, CrossFit, FLAN, ExMix, T5, UnifiedSKG, and Reasoning. Authors only kept partial tasks from CrossFit, ExMix and T5 due to the significant overlap.
- To organize the Instruction schema, authors broadly classify the instructions in these benchmarks into two categories, dataset-level and instance-level.
- Paper: OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
- License:
- Size: 74 tasks
- Language: Multilingual
- Paper: GLM-130B: An Open Bilingual Pre-trained Model
- Size: 30 tasks, 43M instructions
- Language: EN
- Summary: OIG contains instructions that are created using data augmentation from a diverse collection of data sources, and formatted in a dialogue style (… … pairs).
- Generation Method:
- OIG is created by various LAION community members, consisting of 30 datasets and 43M instructions, with the goal of reaching 1 trillion tokens.
- OIG dataset can be divided roughly into 75% academic datasets, such as P3, Natural instructions and FLAN, and 25% datasets composed of various tasks, such as high school math, python coding and peoty generation.
- HuggingFace: https://huggingface.co/datasets/laion/OIG
- Demo: https://github.com/LAION-AI/Open-Assistant
- License:
- Size: 3 tasks, 100K+ instructions
- Language: EN
- Summary: Baize dataset is a high-quality multi-turn chat corpus by leveraging ChatGPT to engage in a conversation with itself, named self-chatting.
- Generation Method:
- First apply a template to define the format and requirements of a conversation.
- Then use questions from Quora and Stack Overflow as seeds that set the topic for the chat.
- Paper: Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data
- HuggingFace: (if applicable)
- Demo: https://huggingface.co/spaces/project-baize/Baize-7B
- License:
- Size: 115K instructions
- Language: EN
- Summary: Camel dataset introduces a novel communicative agent framework named role-playing.
- Generation Method:
- The prompt engineering in Camel consists of three prompts, the task specifier prompt, the assistant system prompt, and the user system prompt. The scenarios in Camel include AI Society and Code.
- Authors also create Data Generation Prompts to generate meta data by LLMs. 50 assistant roles and 50 user roles are generated for AI Society. 20 programming languages and 50 domains are generated for Code.
- Paper: CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society
- HuggingFace: https://huggingface.co/camel-ai
- Demo: https://www.camel-ai.org/
- License:
- Size: 657K instructions
- Language: EN
- Summary: UltraChat is a multi-round dialogue dataset powered by Turbo APIs, composed of three sectors, namely Questions about the World, Writing and Creation, and Assistance on Existent Materials.
- Generation Method:
- Two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response.
- We instruct the user model with carefully designed prompts to mimic human user behavior and call the two APIs iteratively.
- HuggingFace: https://huggingface.co/datasets/stingning/ultrachat
- License:
- Size: 7 tasks, 15,000 instructions
- Language: EN
- Summary: Dolly is a human-generated corpus, whose categories are Creative Writing, Closed QA, Open QA, Summarization, Information Extraction, Classification and Brainstorming.
- Generation Method:
- Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
- For instruction categories that require an annotator to consult a reference text, contributors selected passages from Wikipedia for particular subsets of instruction categories.
- HuggingFace: https://huggingface.co/datasets/databricks/databricks-dolly-15k
- License:
- Summary: ShareGPT is an open-source Chrome Extension for you to share your wildest ChatGPT conversations with one click.
- Generation Method: Collect chats with ChatGPT from its users.
- Demo: https://sharegpt.com/
- Size: 18 tasks, 385K instructions
- Language: EN
- Summary: SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. It is used to train RLHF reward models and NLG evaluation models.
- Generation Method:
- The data is sourced from Reddit, which is a public forum organized into topic-specific fora called subreddits.
- Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post.
- Paper: Understanding Dataset Difficulty with V -Usable Information
- HuggingFace: https://huggingface.co/datasets/stanfordnlp/SHP
- License:
- Size: 169,550 instructions
- Language: EN
- Summary: HH-RLHF is a dataset of human preferences over models' responses to questions/instructions.
- Generation Method:
- Hire crowdworkers to interact with models through two interfaces, helpfulness interface and harmlessness (red-teaming) interface respectively.
- For the helpfulness dataset, ask crowdworkers to have open-ended conversations with our models, asking for help, advice, or for the model to accomplish a task, and to choose the model response that was more helpful.
- For the harmlessness (red-teaming) dataset, ask crowdworkers to attempt to elicit harmful responses from our models, and to choose the more harmful response offered by the models.
- Paper:
- HuggingFace: https://huggingface.co/datasets/Anthropic/hh-rlhf
- License:
- Size: 10M instructions
- Language: EN
- Summary: Stack-Exchange-Preferences dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.
- Generation Method:
- Paper: A General Language Assistant as a Laboratory for Alignment
- HuggingFace: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
- License:
- Size: 12 tasks, 37,175 instructions
- Language: EN, CH
- Summary: HC3 is a comparison corpus that consists of both human and ChatGPT answers to the same questions.
- Generation Method:
- Human Answers Collection: The first part is publicly available question-answering datasets, whose answers are given by experts or high-voted. The second part is built by constructing question-answer pairs from wiki sources.
- ChatGPT Answers Collection: use ChatGPT to generate answers to the questions in Human Answers Collection
- Paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- HuggingFace: https://huggingface.co/datasets/Hello-SimpleAI/HC3
- License: CC-BY-SA
- Empty for now. Soon to update.
- Size: 2K tasks, 191,191 instructions in total
- Language: CH
- Summary: Chinese Open Instruction Generalist (COIG) is a Chinese instruction dataset consisting of 4 sub-tasks.
- Generateion Method:
- Task 1: Translated Instructions (67,798)
- Translate the following datasets into Chinese: 1,616 task descriptions in Super-Natural-Instruct v2 along with a single instance for each of them; 175 seed tasks in Self-instruct; 66,007 instructions from Unnatural Instructions.
- Task 2: Exam Instructions (63,532)
- Exams include The Chinese National College Entrance Examination (高考), Middle School Entrance Examinations (中考), and Civil Servant Examination (公务员考试).
- Turn them into Chain-of-Thought (CoT) corpus by extracting six informative elements from original exam questions, including instruction, question context, question, answer, answer analysis, and coarse-grained subject.
- Task 3: Human Value Alignment Instructions (34,471)
- Select a set of samples that present shared human values in the Chinese-speaking world, and get 50 seed instructions and 3k resulting instructions.
- Some additional sets of samples that present regional-culture or country-specific human values are also added.
- Task 4: Counterfactural Correction Multi-round Chat (13,653)
- The aim is to alleviate and resolve the pain points of hallucination and factual inconsistency in current LLMs.
- Based on CN-DBpedia knowledge graph dataset, CCMC has ~13,000 dialogues with an average of 5 rounds per dialogue, resulting in ~65,000 rounds of chat.
- Leetcode Instructions (11,737)
- 2,589 programming questions from Leetcode.
- Task 1: Translated Instructions (67,798)
- Paper: Chinese Open Instruction Generalist: A Preliminary Release
- HuggingFace: https://huggingface.co/datasets/BAAI/COIG
- License: MIT License
- Size: 9 tasks, 73 instructions
- Language: CH
- Summary: pCLUE is a large-scale prompt-based dataset for multi-task and zero-shot learning in Chinese.
- Generation Method: pCLUE is based on existing datasets.
- HuggingFace: https://huggingface.co/datasets/wbbbbb/pclue
- Demo: https://cluebenchmarks.com/pclue.html
- License:
- Size: 4 tasks, 396,209 instructions
- Language: CH
- Summary: CSL is a large-scale Chinese scientific literature dataset.
- Generation Method:
- Obtain the paper’s meta-information from the National Engineering Research Center for Science and Technology Resources Sharing Service (NSTR) dated from 2010 to 2020.
- Label papers with categories and disciplines, with the assistance of volunteers.
- The data format in CSL is <T,A,K,c,d>, where T is the title, A is the abstract, K is a list of keywords, c is the category label and d is the discipline label.
- Paper: CSL: A Large-scale Chinese Scientific Literature Dataset
- License:
- Size: 23 tasks, 1.1M instructions
- Language: CH
- Summary: Firefly dataset is a high-quality Chinese instruction-tuning dataset.
- Generation Method: For each task, human experts write many templates to ensure the quality and diversity of Firefly dataset.
- HuggingFace: https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M
- License:
- Size: 18 tasks
- Language: CH
- Summary: CUGE selects and organizes datasets in a language capability-task-dataset hierarchical framework, covering 7 language capabilities, 18 mainstream NLP tasks and 21 representative datasets.
- Paper: CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark
- Demo: http://cuge.baai.ac.cn/#/
- License:
- Language: Multilingual
- License:
- Paper: ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
- License:
- Empty for now. Soon to update.
- Size: 20,456 instructions
- Language: CH
- Generateion Method: Translate Alpaca into Chinese by machine and then clean.
- Size: 19,442 instructions
- Language: CH
- Generateion Method: Translate Alpaca into Chinese by ChatGPT, and check them by humans
- Size: 51,458 instructions
- Language: CH
- Generateion Method: Translate Alpaca into Chinese by ChatGPT, and discard some of them.
- Size: 51,672 instructions
- Language: CH
- Generateion Method: Translate Stanford Alpaca dataset into Chinese by ChatGPT.
- Size: 20,465 instructions
- Language: TC
- Generateion Method: Translate Stanford Alpaca dataset into traditional Chinese using OpenCC.
- Size: 124,469 instructions
- Language: EN, TC
- Generateion Method: Combine the English instruction/input and traditional Chinese output by ChatGPT.
- Size: 52,002 instructions
- Language: EN, TC
- Generateion Method: A Traditional-Chinese version of the Alpaca dataset, whose instruction part is left as English.
- Size: 52,002 instructions
- Language: EN, TC
- Generateion Method: An Traditional-Chinese version of the Alpaca dataset, where there are English and traditional Chinese versions of one single instruction.
- Size: 83 tasks
- Language: Multilingual (46 languages)
- Summary:
- xP3 is a mixture of 13 training tasks in 46 languages with English prompts.
- Moreover, there is a xP3 Dataset Family, including the following two datasets:
- xP3mt is a mixture of 13 training tasks in 46 languages with prompts in 20 languages;
- xP3all consists of xP3 itself and evaluation datasets adding an additional 3 tasks.
- Generateion Method: Build on the P3 task taxonomy and add 28 new multilingual datasets.
- Paper: Crosslingual Generalization through Multitask Finetuning
- HuggingFace: https://huggingface.co/datasets/bigscience/xP3
- License:
- Size: 380,835 instructions in total
- Language: CH, DE, EN, JA, TC
- Summary: Guanaco dataset builds upon the 175 tasks from Alpaca, containing 3 versions with different sizes and methods.
- Generateion Method:
- Original Version (48967): Rewrite 175 Alpaca seed tasks in different languages, and add new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.
- Mixed Version (279644): The original 175 tasks were translated into 4 versions and regenerated independently, excluding Deutsch.
- MIni Version (52224): 52K instrucrion dataset, which is included in the Mixed Version.
- HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main
- License:
- Size: 205,999 instructions in total
- Language: CH, DE, EN, JA
- Summary: The Paper/General-QA dataset is a collection of questions and answers constructed for AI-generated papers or general texts in 4 languages. The purpose of this dataset is to generate paragraph-level answers to questions posed about lengthy documents such as PDFs.
- Generateion Method:
- The question dataset contains 106,707 questions, and the answer dataset contains 99,292 answers.
- Similar questions are combined to form a tree-like structure, and graph theory algorithms are used to process user questions, content summaries, and contextual logic.
- HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main/additional
- License:
- Size: 20,023 instructions
- Language: EN
- Summary:
- Generateion Method: Self-instuct with prompts to focus on code generation/edting/optimization tasks, using text-davinci-003.
- HuggingFace:
- License: