feat: implement basic SFT pipeline based on synthetic data generator #1059

burtenshaw · 2024-11-13T10:49:03Z

This is just an idea for how to make distilabel crazy-user-friendly. In this example, an SFT pipeline is abstracted away behind a single class to make it easy to use, like this:

from distilabel.presets import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()

distiset = pipeline.run()

Maybe we could do this for other core tasks like DPO, classification, and retrieval.

codspeed-hq · 2024-11-13T10:57:44Z

CodSpeed Performance Report

Merging #1059 will not alter performance

Comparing feat/pipeline-usecase-abstraction (79a13d5) with main (844165f)

Summary

✅ 1 untouched benchmarks

src/distilabel/presets/sft.py

davidberenstein1957 · 2024-11-18T08:56:31Z

@burtenshaw what pipelines do you intend to create?

SFT
DPO
Classification

Should we also have individual components like completion generation (if you only want an additional completion), instruction generation (which can be done with magpie and inheriting with only_instruction=True), preference (without generating instruction/completions from scratch)?

I can imagine people doing these things in steps and reviewing the quality of the generations in between before racking up costs with a complete pipeline.
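The step-by-step workflow described above can be sketched in plain Python. This is only an illustration of the idea, not distilabel code: `fake_llm`, `generate_instructions`, `review`, and `generate_responses` are all hypothetical names, and the "LLM" is a stub so the example runs without any model or API key.

```python
from typing import Callable, Dict, List


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; it just wraps the prompt.
    return f"<{prompt}>"


def generate_instructions(llm: Callable[[str], str], num_rows: int) -> List[str]:
    # Stage 1: produce instructions only (cheap, easy to inspect).
    return [llm(f"Write instruction {i}") for i in range(num_rows)]


def review(instructions: List[str], keep: Callable[[str], bool]) -> List[str]:
    # Checkpoint: drop anything the reviewer (human or rule) rejects.
    return [text for text in instructions if keep(text)]


def generate_responses(
    llm: Callable[[str], str], instructions: List[str]
) -> List[Dict[str, str]]:
    # Stage 2: pay for completions only on approved instructions.
    return [{"instruction": text, "response": llm(text)} for text in instructions]


instructions = generate_instructions(fake_llm, num_rows=4)
approved = review(instructions, keep=lambda text: "2" not in text)
dataset = generate_responses(fake_llm, approved)
```

Splitting the stages this way means the expensive completion step only sees rows that passed the review.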

burtenshaw · 2024-11-18T11:17:51Z

> @burtenshaw what pipelines do you intend to create?
>
> SFT
> DPO
> Classification

@burtenshaw In this PR, I plan to just implement SFT and share that. But I think we could move on to DPO and classification straight after.

> Should we also have individual components like completion generation (if you only want an additional completion), instruction generation (which can be done with magpie and inheriting with only_instruction=True), preference (without generating instruction/completions from scratch)?
>
> I can imagine people doing these things in steps and reviewing the quality of the generations in between before racking up costs with a complete pipeline.

I agree that this is useful to more basic users. But this is also potentially complex. I don't want to end up abstracting the Magpie API in multiple ways, because the user could just use a proper pipeline. I would prefer to get feedback first.

davidberenstein1957 · 2024-11-18T11:59:54Z

> > @burtenshaw what pipelines do you intend to create?
> >
> > SFT
> > DPO
> > Classification
>
> @burtenshaw In this PR, I plan to just implement SFT and share that. But I think we could move on to DPO and classification straight after.
>
> > Should we also have individual components like completion generation (if you only want an additional completion), instruction generation (which can be done with magpie and inheriting with only_instruction=True), preference (without generating instruction/completions from scratch)?
> >
> > I can imagine people doing these things in steps and reviewing the quality of the generations in between before racking up costs with a complete pipeline.
>
> I agree that this is useful to more basic users. But this is also potentially complex. I don't want to end up abstracting the Magpie API in multiple ways, because the user could just use a proper pipeline. I would prefer to get feedback first.

Sure, sounds fair, I just wanted to see what your thoughts were surrounding the features.

src/distilabel/pipeline/templates/instruction.py

davidberenstein1957 · 2024-11-18T12:09:09Z

src/distilabel/pipeline/templates/__init__.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .instruction import InstructionResponsePipeline  # noqa: F401


Will you also add this to the CLI commands?
Something like the following would be cool:
distilabel pipelines sft --num-rows 100 --n-turns 1

burtenshaw · 2024-11-18

That's nice, but we should do it in another PR.
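The command proposed in this thread could be parsed roughly as follows. This is a dependency-free sketch using the standard library's `argparse`, not a description of how distilabel's CLI is actually built. The `pipelines sft` subcommand and both flags come from the suggestion above; the default values and help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Top-level program name mirrors the suggested command.
    parser = argparse.ArgumentParser(prog="distilabel")
    commands = parser.add_subparsers(dest="command", required=True)

    # `distilabel pipelines ...` groups the preset pipelines.
    pipelines = commands.add_parser("pipelines", help="Run a preset pipeline")
    presets = pipelines.add_subparsers(dest="preset", required=True)

    # `distilabel pipelines sft ...`; the defaults below are hypothetical.
    sft = presets.add_parser("sft", help="Instruction/response (SFT) preset")
    sft.add_argument("--num-rows", type=int, default=10)
    sft.add_argument("--n-turns", type=int, default=1)
    return parser


args = build_parser().parse_args(
    ["pipelines", "sft", "--num-rows", "100", "--n-turns", "1"]
)
print(args.preset, args.num_rows, args.n_turns)
```

The parsed values could then be passed to the preset's constructor, which keeps the CLI a thin layer over the Python API.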

plaguss · 2024-11-19

I like the idea! Just left some small notes. If you can test it and generate a sample dataset that we can link in the docs, that would be perfect.

docs/sections/getting_started/quickstart.md

src/distilabel/pipeline/templates/instruction.py

burtenshaw · 2024-11-19T10:19:45Z

@sdiazlor this is the abstraction for SFT in distilabel.

burtenshaw · 2024-11-19T13:35:52Z

> I like the idea! Just left some small notes. If you can test it and generate a sample dataset that we can link in the docs, that would be perfect.

Here's a dataset: https://huggingface.co/datasets/argilla/distilabel-sft-easy

burtenshaw and others added 2 commits November 13, 2024 06:48

feat: implement basic SFT pipeline based on synthetic data generator

6ee1c92

[pre-commit.ci] auto fixes from pre-commit.com hooks

b18a98f

for more information, see https://pre-commit.ci

burtenshaw requested review from dvsrepo and gabrielmbmb November 13, 2024 10:51

burtenshaw and others added 2 commits November 13, 2024 13:47

feat: expose more params in init for config

a0dfa76

[pre-commit.ci] auto fixes from pre-commit.com hooks

2cecb13

davidberenstein1957 reviewed Nov 18, 2024

src/distilabel/presets/sft.py (outdated, 2 resolved threads)

burtenshaw and others added 2 commits November 18, 2024 12:08

feat: rename and document

e463b11

[pre-commit.ci] auto fixes from pre-commit.com hooks

7fe42f9

burtenshaw added 2 commits November 18, 2024 12:42

docs: add to quickstart in docs

4244dd7

feat: expand logic for turns and llms

e631e07

davidberenstein1957 reviewed Nov 18, 2024

burtenshaw marked this pull request as ready for review November 18, 2024 16:21

burtenshaw requested review from plaguss and removed request for dvsrepo November 18, 2024 16:21

plaguss reviewed Nov 19, 2024

plaguss added the enhancement label (New feature or request) Nov 19, 2024

burtenshaw and others added 5 commits November 19, 2024 12:01

feat: drop llm validation

a30bd19

[pre-commit.ci] auto fixes from pre-commit.com hooks

03b6dd5

docs: improve documentation

048ebf0

open kwargs in run

26c7e53

fix unused variable

79a13d5

plaguss changed the base branch from main to develop November 19, 2024 14:41

plaguss approved these changes Nov 19, 2024

burtenshaw merged commit cfe8c05 into develop Nov 19, 2024
7 of 8 checks passed

burtenshaw deleted the feat/pipeline-usecase-abstraction branch November 19, 2024 18:35

burtenshaw restored the feat/pipeline-usecase-abstraction branch November 19, 2024 19:07
