1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We’ve updated the `LLM` interface so that `LLM`s using an external platform that offers a batch service can now be integrated in `distilabel`. In addition, `OpenAILLM` has been updated so it can use the OpenAI Batch API to get a 50% cost reduction.
[Video: distilabel-offline-batch-generation.mp4]
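A minimal sketch of how this looks in a pipeline (the model name and data are illustrative; `use_offline_batch_generation` is the new attribute that enables the batch service):

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="offline-batch-generation") as pipeline:
    load_data = LoadDataFromDicts(
        data=[{"instruction": "Write a haiku about the sea."}]
    )

    text_generation = TextGeneration(
        llm=OpenAILLM(
            model="gpt-4o-mini",
            # Route generations through the OpenAI Batch API instead of the
            # regular endpoint; results are retrieved once the batch is done.
            use_offline_batch_generation=True,
        )
    )

    load_data >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run()
```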
Improved cache for maximum output reusability
We all know that running `LLM`s is costly, and most of the time we want to reuse their outputs as much as we can. Before this release, `distilabel`'s cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the `Distiset` generated by one that finished its execution and was re-executed.
In this release, we've greatly improved the cache so the outputs of all the `Step`s are cached and can therefore be reused in other pipeline executions, even if the pipeline has changed:
In addition, we've added a `use_cache` attribute to the `Step`s that allows toggling the use of the cache at the step level.
Steps can generate artifacts
In some cases, a `Step` produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate and could be reused in the future. That’s why we’ve added a new method called `Step.save_artifact` that can be called within the step to store the artifacts it generates. The artifacts generated by the `Step` will also get uploaded to the Hugging Face Hub.
```python
from typing import List, TYPE_CHECKING

import matplotlib.pyplot as plt

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate a plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```
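Running a pipeline that contains the step above and pushing the resulting `Distiset` uploads the saved artifact along with the data. A sketch (the repository id is a placeholder):

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts

with Pipeline(name="count-characters") as pipeline:
    loader = LoadDataFromDicts(data=[{"text": "Hello, distilabel!"}])
    counter = CountTextCharacters()
    loader >> counter

distiset = pipeline.run()
# The figure saved with `save_artifact` is uploaded together with the dataset.
distiset.push_to_hub("my-username/text-character-counts")
```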
New Tasks: CLAIR, APIGen and many more!
- New `CLAIR` task: CLAIR uses an AI system to minimally revise a solution A→A′ such that the resulting preference pair (A, A′) is much more contrastive and precise.
- New tasks to replicate the APIGen framework: `APIGenGenerator`, `APIGenSemanticChecker`, `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper "APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets".
- New `URIAL` task that allows using non-instruct models to generate a response for an instruction.
- New `TextClassification` task to perform zero-shot text classification based on a predefined but highly customizable prompt (see the sketch after this list).
- New `TextClustering` task to generate clusters from text and group your generations, discovering labels from your data. It comes with 2 steps to run the UMAP and DBSCAN algorithms.
- Updated `TextGeneration` to simplify the customization of tasks that don’t require further post-processing.
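As a taste of the new tasks, here is a minimal sketch of `TextClassification`; the model id is illustrative, and parameters such as `context` and `available_labels` follow the component gallery but should be treated as indicative:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextClassification

text_classification = TextClassification(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"
    ),
    # Optional hints that get templated into the zero-shot prompt.
    context="Classify the sentiment of customer reviews.",
    available_labels=["positive", "negative", "neutral"],
)
text_classification.load()

result = next(
    text_classification.process([{"text": "I loved this product!"}])
)
```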
New Steps to sample data in your pipelines and remove duplicates
- New `DataSampler` step to sample data from other datasets, which can be useful to inject different few-shot examples into your prompts.
- New `EmbeddingDedup` step to remove duplicates based on embeddings and a distance metric.
- New `MinHashDedup` step to remove near-duplicate texts based on the MinHash and MinHashLSH algorithms (see the sketch after this list).
- New `TruncateTextColumn` step to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New `CombineOutputs` step to combine the outputs of two or more steps into a single output.
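A minimal sketch wiring `MinHashDedup` into a pipeline (the data and `threshold` value are illustrative):

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, MinHashDedup

with Pipeline(name="minhash-dedup") as pipeline:
    loader = LoadDataFromDicts(
        data=[
            {"text": "distilabel is a framework for synthetic data."},
            {"text": "distilabel is a framework for synthetic data!"},
        ]
    )
    # Near duplicates are flagged in a boolean column so they can be
    # filtered out downstream.
    dedup = MinHashDedup(threshold=0.9)
    loader >> dedup

distiset = pipeline.run()
```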
Generate text embeddings using vLLM
- Now you can generate embeddings using `vLLMEmbeddings`!
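A minimal sketch, assuming `vLLMEmbeddings` is importable from `distilabel.embeddings` (the model id is illustrative):

```python
from distilabel.embeddings import vLLMEmbeddings

embeddings = vLLMEmbeddings(model="intfloat/e5-mistral-7b-instruct")
embeddings.load()

# Returns one embedding (a list of floats) per input text.
results = embeddings.encode(inputs=["Hello, how are you?"])
```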
Extra things
- Easily visualize the tasks’ prompts using the `Task.print` method (see the sketch after this list).
- New `use_default_structured_output` flag in tasks to automatically use structured generation in those tasks that can benefit from it.
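A sketch of `Task.print`, assuming the task is loaded first (the task and model are illustrative):

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import UltraFeedback

task = UltraFeedback(
    llm=OpenAILLM(model="gpt-4o-mini"),
    aspect="overall-rating",
)
task.load()
# Renders the formatted prompt the task would send to the LLM.
task.print()
```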
What's Changed
- Make `ClientvLLM.model_name` a `cached_property` by @gabrielmbmb in #862
- Pass dataset to dry_run method by @plaguss in #863
- Add default structured output for `GenerateSentencePair` task by @plaguss in #868
- Complexity scorer default structured output by @plaguss in #870
- Quality scorer default structured output by @plaguss in #873
- Ultrafeedback default structured output by @plaguss in #876
- Remove use of `default_chat_template` by @gabrielmbmb in #888
- Temporary fix for installing `llama-cpp-python` by @gabrielmbmb in #886
- Fix unit tests after release of `transformers==4.44.0` by @gabrielmbmb in #891
- Fix default structured output by @plaguss in #892
- Send as many batches as possible to input queues by @gabrielmbmb in #895
- Exclude `repo_id` from `LoadDataFromFileSystem` by @plaguss in #898
- Fix loader to read from a glob pattern by @plaguss in #877
- Add `save_artifact` method to `_Step` by @gabrielmbmb in #871
- Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by @plaguss in #903
- New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by @plaguss in #902
- Update `inputs` and `outputs` interface to allow returning dict indicating optionality by @gabrielmbmb in #883
- Update mistrallm by @plaguss in #904
- Deepseek prover by @plaguss in #907
- Update `RewardModelScore.inputs` property by @gabrielmbmb in #908
- Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in #893
- Fix load data from disk by @plaguss in #910
- docs: minor fixes by @davidberenstein1957 in #913
- Add `URIAL` task by @gabrielmbmb in #921
- Add `vLLMEmbeddings` by @plaguss in #920
- docs: add tutorials preference and clean by @sdiazlor in #917
- Fix `StructuredGeneration` examples and internal check by @plaguss in #912
- Generate deterministic pipeline name when it's not given by @plaguss in #878
- Add custom errors by @plaguss in #911
- Docs/tutorials fix by @sdiazlor in #922
- Add `revision` runtime parameter to `LoadDataFromHub` by @gabrielmbmb in #928
- Add plausible as replacement for GA by @davidberenstein1957 in #929
- Add minhash related steps to deduplicate texts by @plaguss in #931
- docs: API reference review by @sdiazlor in #932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in #937
- Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in #936
- Add `CombineOutputs` step by @gabrielmbmb in #939
- fix: regex expression in POSITIVE_NEGATIVE by @sdiazlor in #940
- Offline batch generation by @gabrielmbmb in #923
- Fix applying input mapping when mapping overrides another column by @gabrielmbmb in #938
- Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacementMixin` by @gabrielmbmb in #941
- Fix empty load stage when two `GlobalStep`s are chained by @gabrielmbmb in #945
- Add `system_prompt` attribute to `TextGeneration` by @gabrielmbmb in #950
- Add step to deduplicate records based on embeddings by @plaguss in #946
- Updated setup_logging to use UTF-8 in FileHandler by @dameikle in #952
- Add more generation parameters to `vLLM` by @gabrielmbmb in #955
- Fix `Magpie` generating different columns names depending on `LLM` output by @gabrielmbmb in #965
- Docs/962 docs create a smoother transition from index installation quickstart by @davidberenstein1957 in #968
- Add `logging_handlers` argument by @gabrielmbmb in #969
- [DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints by @plaguss in #973
- Add `TextClassification`, `UMAP`, `DBSCAN` and `TextClustering` tasks by @plaguss in #948
- [FEATURE] Simplify customizing the `TextGeneration` task with custom prompts by @plaguss in #974
- Update `system_prompt` attribute for adding probabilities in `MagpieBase` by @gabrielmbmb in #981
- Fix unloading steps with more than 1 replica by @gabrielmbmb in #982
- docs: 960 docs add a glossary concept section by @davidberenstein1957 in #970
- Fix missing `system_prompt_key` column in `Magpie` tasks by @gabrielmbmb in #983
- docs: update component gallery by @davidberenstein1957 in #987
- fix missing batch when last batch arrive early by @zye1996 in #989
- Fine personas socialai tutorial by @plaguss in #992
- feat: add basic draw implementation to pipline by @davidberenstein1957 in #966
- Fix schema inference structured generation by @davidberenstein1957 in #994
- [DOCS] Add developer documentation section in the docs by @plaguss in #999
- Fix `vllm` installation in CI by @gabrielmbmb in #1009
- fix metadata writeout when llm error by @zye1996 in #1003
- Add example of custom text generation step in quickstart by @plaguss in #984
- feat: 985 feature argillalabeller task by @davidberenstein1957 in #986
- Fix `llvmlite` install with `uv` by @gabrielmbmb in #1018
- fix: failing tests argilla labeller by @davidberenstein1957 in #1017
- fix inpute when output_mapping is not empty by @zye1996 in #1015
- Add Tasks to replicate `APIGen` by @plaguss in #925
- Pretty print by @plaguss in #934
- Add `CLAIR` task by @plaguss in #926
- Add cache at `Step` level by @plaguss in #766
- Fix `IndexError` when overriding inputs and `group_generations=False` by @plaguss in #1022
- Update `Pipeline` cache docs by @gabrielmbmb in #1023
- 1.4.0 by @gabrielmbmb in #1024
New Contributors
Full Changelog: 1.3.2...1.4.0