Beemo (Benchmark of expert-edited machine-generated outputs) is a benchmark for fine-grained machine-generated text detection. It consists of 6.5k texts written by humans, generated by ten open-source instruction-finetuned LLMs, and edited by expert annotators for various use cases. In addition, each machine-generated text is edited by two state-of-the-art LLMs using several diverse editing prompts, which results in 13.1k machine-generated & LLM-edited texts. We make one of the first attempts to address more practical machine-generated text detection scenarios, where the user refines the LLM output or uses another LLM to make it more human-like.
Beemo is available in the HuggingFace datasets library and in this GitHub repository. Please refer to our paper for details on our benchmark creation approach, general statistics, and empirical evaluation results.
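For quick access, the snippet below sketches one way to load the benchmark with the 🤗 Datasets library. The dataset ID and the printed structure are assumptions; please check the dataset card on the HuggingFace Hub for the exact identifiers.

```python
# A minimal sketch of loading Beemo with the Hugging Face `datasets` library.
# NOTE: the dataset ID ("toloka/beemo") is an assumption; see the dataset card
# on the HuggingFace Hub for the exact identifier, splits, and columns.
from datasets import load_dataset

beemo = load_dataset("toloka/beemo")
print(beemo)  # inspect the available splits and columns
```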
Our benchmark is named after BMO (abbreviated from "Be MOre", phonetically spelled "Beemo"), one of the main characters of Adventure Time.
- 📊 Curated by: Toloka, Penn State University, and University of Oslo.
- 🌐 Language(s): English
- 🗞️ Paper: TBA
- 🪪 License: MIT
- 07.11.2024: the release of Beemo, which adds machine-generated & LLM-edited texts and a preprint on arXiv.
- 17.09.2024: the initial release and evaluation of 11 detectors on Beemo.
Beemo's creation approach involves:
- (a) 🤖 Machine-generated Text Collection: prompting an instruction-finetuned LLM;
- (b) 👩🏻🔬 Expert-based Editing: editing the LLM's output by an expert annotator;
- (c) 🦾 LLM-based Editing: editing the LLM's output by two state-of-the-art LLMs.
🤖 Machine-generated Text Collection
The No Robots 🙅‍♂️🤖 dataset is used as the source of prompts and corresponding human-written texts across the following categories: Generation, Rewrite, Summarize, Open QA, and Closed QA. For each prompt, we generate an output with one of ten randomly sampled open-source instruction-finetuned LLMs, using the default 🤗 HuggingFace chat templates and inference hyperparameters.
| Name | Base | SFT corpus | License | Paper |
|---|---|---|---|---|
| HuggingFaceH4/zephyr-7b-beta | Mistral-7B-v0.1 | UltraChat, UltraFeedback | MIT | Tunstall et al. (2023) |
| allenai/tulu-2-7b | Llama 2 7B | human-written and synthetic | AI2 ImpACT | Ivison et al. (2023) |
| allenai/tulu-2-13b | Llama 2 13B | human-written and synthetic | AI2 ImpACT | Ivison et al. (2023) |
| google/gemma-2b-it | Gemma 2B | human-written and synthetic | Gemma license | Gemma Team et al. (2024) |
| google/gemma-7b-it | Gemma 7B | human-written and synthetic | Gemma license | Gemma Team et al. (2024) |
| meta-llama/Llama-2-7b-chat-hf | Llama 2 7B | Misc. | Llama license | Touvron et al. (2023) |
| meta-llama/Llama-2-13b-chat-hf | Llama 2 13B | Misc. | Llama license | Touvron et al. (2023) |
| meta-llama/Llama-2-70b-chat-hf | Llama 2 70B | Misc. | Llama license | Touvron et al. (2023) |
| mistralai/Mistral-7B-Instruct-v0.1 | Mistral-7B-v0.1 | Misc. | Apache-2.0 | Jiang et al. (2023) |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral 8x7B | Misc. | Apache-2.0 | Jiang et al. (2024) |
| meta-llama/Llama-3.1-70B-Instruct | Llama 3.1 70B | Misc. | Llama license | Dubey et al. (2024) |
| GPT-4o | GPT-4 | Misc. | OpenAI | OpenAI (2024) |

Table 1: Overview of the instruction-finetuned LLMs used to create Beemo. GPT-4o and meta-llama/Llama-3.1-70B-Instruct are used only for LLM-based editing.
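For illustration, here is a minimal sketch of the generation step with one of the models from Table 1, using its default 🤗 HuggingFace chat template. The prompt and the sampling hyperparameters below are placeholders, not the exact configuration used to build Beemo.

```python
# A minimal sketch of the machine-generated text collection step with one model
# from Table 1. The prompt and sampling parameters are illustrative placeholders;
# the benchmark uses prompts from No Robots and the models' default settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Summarize the following article in two sentences: ..."  # a No Robots-style prompt
messages = [{"role": "user", "content": prompt}]

# Apply the model's default chat template and generate a completion.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```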
👩🏻🔬 Expert-based Editing
The machine-generated texts are edited by an in-house team of annotators with extensive experience in refining LLM-generated content.
🦾 LLM-based Editing
The machine-generated texts are "humanized" by GPT-4o
and meta-llama/Llama-3.1-70B-Instruct
using three editing prompts.
- P1: You are given a prompt and a text generated by AI using this prompt. Your task is to edit the AI-generated text to make it sound human-like and error-free. Ensure your overall edits do not exceed 40% of the generated text and the edited text follows the user request. Output only the edited text and do not explain your edits.\n\nPrompt: {prompt}\n\nAI text: {model_output}
- P2: You are given a pair containing two components: (1) a user prompt for an AI assistant and (2) the AI assistant's response. Refine the AI-generated response to make it sound more natural. Vary your editing patterns and the portions of text you choose to modify, and ensure your overall edits are 20-40% of the words in the response.\n\nUser prompt: {prompt}\n\nAI-generated response: {model_output}
- P3: Modify a machine-generated response to a given prompt to make it appear more like it was written by a native English speaker. Ensure the revised version follows the user's intent. You should just give me the revised version without any other words.\n\nPrompt: {prompt}\n\nMachine-generated response: {model_output}
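As an illustration, the sketch below shows how an editing prompt such as P1 could be sent to GPT-4o with the OpenAI Python client. The client setup and generation parameters are assumptions, not the exact configuration used to build the benchmark.

```python
# A minimal sketch of the LLM-based editing step with GPT-4o via the OpenAI
# Python client. The P1 template is filled with a source prompt and a model
# output; the API parameters here are illustrative, not the exact ones used.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

P1 = (
    "You are given a prompt and a text generated by AI using this prompt. "
    "Your task is to edit the AI-generated text to make it sound human-like and error-free. "
    "Ensure your overall edits do not exceed 40% of the generated text and the edited text "
    "follows the user request. Output only the edited text and do not explain your edits."
    "\n\nPrompt: {prompt}\n\nAI text: {model_output}"
)

def humanize(prompt: str, model_output: str) -> str:
    """Return a GPT-4o-edited ("humanized") version of a machine-generated text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": P1.format(prompt=prompt, model_output=model_output)}
        ],
    )
    return response.choices[0].message.content
```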
- The prompts and human-written texts from No Robots 🙅♂️🤖 are under the original dataset's license: CC-BY-NC-4.0.
- The machine-generated texts and their LLM-edited versions are subject to the underlying instruction-finetuned LLMs' licensing terms mentioned in Table 1.
- The expert-edited machine-generated texts are available under the MIT license, unless otherwise specified in the underlying instruction-finetuned LLMs' licensing terms.
- Vladislav Mikhailov ([email protected])
- Ekaterina Artemova ([email protected])