1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We’ve updated the `LLM` interface so that `LLM`s using an external platform that offers a batch service can now be integrated in `distilabel`. In addition, `OpenAILLM` has been updated so it can use the OpenAI Batch API to get a 50% cost reduction.
[Video: distilabel-offline-batch-generation.mp4]
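A minimal sketch of how this looks in a pipeline (the model name and data are illustrative; `use_offline_batch_generation` is the new attribute that enables the batch service):

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="offline-batch-generation") as pipeline:
    load_data = LoadDataFromDicts(
        data=[{"instruction": "Write a haiku about the sea."}]
    )

    text_generation = TextGeneration(
        llm=OpenAILLM(
            model="gpt-4o-mini",
            # Route generations through the OpenAI Batch API instead of the
            # regular endpoint; results are retrieved once the batch is done.
            use_offline_batch_generation=True,
        )
    )

    load_data >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run()
```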
Improved cache for maximum output reusability
We all know that running `LLM`s is costly, and most of the time we want to reuse their outputs as much as we can. Before this release, `distilabel`'s cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the `Distiset` generated by one that finished its execution and was re-executed.
In this release, we've greatly improved the cache so the outputs of all the `Step`s are cached and can therefore be reused in other pipeline executions, even if the pipeline has changed:
In addition, we've added a `use_cache` attribute to the `Step`s that allows toggling the use of the cache at the step level.
Steps can generate artifacts
In some cases, a `Step` produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate and could be reused in the future. That’s why we’ve added a new method called `Step.save_artifact` that can be called within the step to store the artifacts it generates. The artifacts generated by the `Step` will also get uploaded to the Hugging Face Hub.
```python
from typing import List, TYPE_CHECKING

import matplotlib.pyplot as plt

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate a plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```
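Running a pipeline that contains the step above and pushing the resulting `Distiset` uploads the saved artifact along with the data. A sketch (the repository id is a placeholder):

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts

with Pipeline(name="count-characters") as pipeline:
    loader = LoadDataFromDicts(data=[{"text": "Hello, distilabel!"}])
    counter = CountTextCharacters()
    loader >> counter

distiset = pipeline.run()
# The figure saved with `save_artifact` is uploaded together with the dataset.
distiset.push_to_hub("my-username/text-character-counts")
```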
New Tasks: CLAIR, APIGen and many more!
- New `CLAIR` task: CLAIR uses an AI system to minimally revise a solution A→A′ such that the resulting preference pair (A, A′) is much more contrastive and precise.
- New tasks to replicate the APIGen framework: `APIGenGenerator`, `APIGenSemanticChecker`, `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper "APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets".
- New `URIAL` task that allows using non-instruct models to generate a response for an instruction.
- New `TextClassification` task to perform zero-shot text classification based on a predefined but highly customizable prompt (see the sketch after this list).
- New `TextClustering` task to generate clusters from text and group your generations, discovering labels from your data. It comes with 2 steps to run the UMAP and DBSCAN algorithms.
- Updated `TextGeneration` to simplify the customization of tasks that don’t require further post-processing.
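As a taste of the new tasks, here is a minimal sketch of `TextClassification`; the model id is illustrative, and parameters such as `context` and `available_labels` follow the component gallery but should be treated as indicative:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import TextClassification

text_classification = TextClassification(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"
    ),
    # Optional hints that get templated into the zero-shot prompt.
    context="Classify the sentiment of customer reviews.",
    available_labels=["positive", "negative", "neutral"],
)
text_classification.load()

result = next(
    text_classification.process([{"text": "I loved this product!"}])
)
```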
New Steps to sample data in your pipelines and remove duplicates
- New `DataSampler` step to sample data from other datasets, which can be useful to inject different few-shot examples into your prompts.
- New `EmbeddingDedup` step to remove duplicates based on embeddings and a distance metric.
- New `MinHashDedup` step to remove near-duplicate texts based on the MinHash and MinHashLSH algorithms (see the sketch after this list).
- New `TruncateTextColumn` step to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New `CombineOutputs` step to combine the outputs of two or more steps into a single output.
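A minimal sketch wiring `MinHashDedup` into a pipeline (the data and `threshold` value are illustrative):

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, MinHashDedup

with Pipeline(name="minhash-dedup") as pipeline:
    loader = LoadDataFromDicts(
        data=[
            {"text": "distilabel is a framework for synthetic data."},
            {"text": "distilabel is a framework for synthetic data!"},
        ]
    )
    # Near duplicates are flagged in a boolean column so they can be
    # filtered out downstream.
    dedup = MinHashDedup(threshold=0.9)
    loader >> dedup

distiset = pipeline.run()
```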
Generate text embeddings using vLLM
- Now you can generate embeddings using `vLLMEmbeddings`!
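A minimal sketch, assuming `vLLMEmbeddings` is importable from `distilabel.embeddings` (the model id is illustrative):

```python
from distilabel.embeddings import vLLMEmbeddings

embeddings = vLLMEmbeddings(model="intfloat/e5-mistral-7b-instruct")
embeddings.load()

# Returns one embedding (a list of floats) per input text.
results = embeddings.encode(inputs=["Hello, how are you?"])
```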
Extra things
- Easily visualize the tasks’ prompts using the `Task.print` method (see the sketch after this list).
- New `use_default_structured_output` flag in tasks to automatically use structured generation in those tasks that can benefit from it.
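A sketch of `Task.print`, assuming the task is loaded first (the task and model are illustrative):

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import UltraFeedback

task = UltraFeedback(
    llm=OpenAILLM(model="gpt-4o-mini"),
    aspect="overall-rating",
)
task.load()
# Renders the formatted prompt the task would send to the LLM.
task.print()
```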
What's Changed
- Make `ClientvLLM.model_name` a `cached_property` by @gabrielmbmb in #862
- Pass dataset to dry_run method by @plaguss in #863
- Add default structured output for `GenerateSentencePair` task by @plaguss in #868
- Complexity scorer default structured output by @plaguss in #870
- Quality scorer default structured output by @plaguss in #873
- Ultrafeedback default structured output by @plaguss in #876
- Remove use of `default_chat_template` by @gabrielmbmb in #888
- Temporary fix for installing `llama-cpp-python` by @gabrielmbmb in #886
- Fix unit tests after release of `transformers==4.44.0` by @gabrielmbmb in #891
- Fix default structured output by @plaguss in #892
- Send as many batches as possible to input queues by @gabrielmbmb in #895
- Exclude `repo_id` from `LoadDataFromFileSystem` by @plaguss in #898
- Fix loader to read from a glob pattern by @plaguss in #877
- Add `save_artifact` method to `_Step` by @gabrielmbmb in #871
- Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by @plaguss in #903
- New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by @plaguss in #902
- Update `inputs` and `outputs` interface to allow returning dict indicating optionality by @gabrielmbmb in #883
- Update mistrallm by @plaguss in #904
- Deepseek prover by @plaguss in #907
- Update `RewardModelScore.inputs` property by @gabrielmbmb in #908
- Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in #893
- Fix load data from disk by @plaguss in #910
- docs: minor fixes by @davidberenstein1957 in #913
- Add `URIAL` task by @gabrielmbmb in #921
- Add `vLLMEmbeddings` by @plaguss in #920
- docs: add tutorials preference and clean by @sdiazlor in #917
- Fix `StructuredGeneration` examples and internal check by @plaguss in #912
- Generate deterministic pipeline name when it's not given by @plaguss in #878
- Add custom errors by @plaguss in #911
- Docs/tutorials fix by @sdiazlor in #922
- Add `revision` runtime parameter to `LoadDataFromHub` by @gabrielmbmb in #928
- Add plausible as replacement for GA by @davidberenstein1957 in #929
- Add minhash related steps to deduplicate texts by @plaguss in #931
- docs: API reference review by @sdiazlor in #932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in #937
- Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in #936
- Add `CombineOutputs` step by @gabrielmbmb in #939
- fix: regex expression in POSITIVE_NEGATIVE by @sdiazlor in #940
- Offline batch generation by @gabrielmbmb in #923
- Fix applying input mapping when mapping overrides another column by @gabrielmbmb in #938
- Fix all replicas had the same `_llm_identifier` for `CudaDevicePlacementMixin` by @gabrielmbmb in #941
- Fix empty load stage when two `GlobalStep`s are chained by @gabrielmbmb in #945
- Add `system_prompt` attribute to `TextGeneration` by @gabrielmbmb in #950
- Add step to deduplicate records based on embeddings by @plaguss in #946
- Updated setup_logging to use UTF-8 in FileHandler by @dameikle in #952
- Add more generation parameters to `vLLM` by @gabrielmbmb in #955
- Fix `Magpie` generating different columns names depending on `LLM` output by @gabrielmbmb in #965
- Docs/962 docs create a smoother transition from index installation quickstart by @davidberenstein1957 in #968
- Add `logging_handlers` argument by @gabrielmbmb in #969
- [DOCS] Add tips in the docs to avoid overloading Free Serverless Endpoints by @plaguss in #973
- Add `TextClassification`, `UMAP`, `DBSCAN` and `TextClustering` tasks by @plaguss in #948
- [FEATURE] Simplify customizing the `TextGeneration` task with custom prompts by @plaguss in #974
- Update `system_prompt` attribute for adding probabilities in `MagpieBase` by @gabrielmbmb in #981
- Fix unloading steps with more than 1 replica by @gabrielmbmb in #982
- docs: 960 docs add a glossary concept section by @davidberenstein1957 in #970
- Fix missing `system_prompt_key` column in `Magpie` tasks by @gabrielmbmb in #983
- docs: update component gallery by @davidberenstein1957 in #987
- fix missing batch when last batch arrive early by @zye1996 in #989
- Fine personas socialai tutorial by @plaguss in #992
- feat: add basic draw implementation to pipline by @davidberenstein1957 in #966
- Fix schema inference structured generation by @davidberenstein1957 in #994
- [DOCS] Add developer documentation section in the docs by @plaguss in #999
- Fix `vllm` installation in CI by @gabrielmbmb in #1009
- fix metadata writeout when llm error by @zye1996 in #1003
- Add example of custom text generation step in quickstart by @plaguss in #984
- feat: 985 feature argillalabeller task by @davidberenstein1957 in #986
- Fix `llvmlite` install with `uv` by @gabrielmbmb in #1018
- fix: failing tests argilla labeller by @davidberenstein1957 in #1017
- fix inpute when output_mapping is not empty by @zye1996 in #1015
- Add Tasks to replicate `APIGen` by @plaguss in #925
- Pretty print by @plaguss in #934
- Add `CLAIR` task by @plaguss in #926
- Add cache at `Step` level by @plaguss in #766
- Fix `IndexError` when overriding inputs and `group_generations=False` by @plaguss in #1022
- Update `Pipeline` cache docs by @gabrielmbmb in #1023
- 1.4.0 by @gabrielmbmb in #1024
New Contributors
Full Changelog: 1.3.2...1.4.0