feat: implement basic SFT pipeline based on synthetic data generator #1059

Merged
merged 13 commits into from
Nov 19, 2024

Conversation

burtenshaw
Contributor

@burtenshaw burtenshaw commented Nov 13, 2024

This is just an idea for how to make distilabel crazy-user-friendly. In this example, an SFT pipeline is abstracted away to make it easy to use. Like this:

from distilabel.presets import InstructionResponsePipeline

pipeline = InstructionResponsePipeline()

distiset = pipeline.run()

Maybe we could do this for other core tasks like: DPO, classification, retrieval
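
To make the idea concrete, here is a minimal, library-independent sketch of the preset pattern: a class whose defaults hide the wiring of a fuller pipeline while every argument stays overridable. Everything in it (`PresetInstructionResponsePipeline`, `stub_generate`, the parameter names and defaults) is hypothetical and stands in for real distilabel components; it is not the code in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
GenerateFn = Callable[[str], str]


def stub_generate(prompt: str) -> str:
    """Deterministic placeholder so the sketch runs without a model."""
    return f"[generated from: {prompt}]"


@dataclass
class PresetInstructionResponsePipeline:
    """Illustrative preset: defaults hide the wiring, arguments stay overridable."""

    generate: GenerateFn = stub_generate
    system_prompt: str = "You are a helpful assistant."  # assumed default
    num_rows: int = 3  # assumed default

    def run(self) -> List[Dict[str, str]]:
        rows = []
        for i in range(self.num_rows):
            # Step 1: ask the generator for an instruction.
            instruction = self.generate(f"{self.system_prompt} | instruction {i}")
            # Step 2: ask it to answer that instruction.
            response = self.generate(instruction)
            rows.append({"instruction": instruction, "response": response})
        return rows


# Zero-configuration usage, mirroring the two-line example above.
rows = PresetInstructionResponsePipeline().run()
```

The same shape would presumably carry over to DPO or classification presets: only the body of `run` and the default arguments change.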


codspeed-hq bot commented Nov 13, 2024

CodSpeed Performance Report

Merging #1059 will not alter performance

Comparing feat/pipeline-usecase-abstraction (79a13d5) with main (844165f)

Summary

✅ 1 untouched benchmarks

src/distilabel/presets/sft.py: 2 review threads (outdated, resolved)
@davidberenstein1957
Member

davidberenstein1957 commented Nov 18, 2024

@burtenshaw what pipelines do you intend to create?

  • SFT
  • DPO
  • Classification

Should we also have individual components like completion generation (if you only want an additional completion), instruction generation (which can be done with magpie and inheriting with only_instruction=True), preference (without generating instruction/completions from scratch)?

I can imagine people doing these things in steps and reviewing the quality of the generations in between before racking up costs with a complete pipeline.
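
The step-by-step workflow described here can be sketched without any library: generate instructions only, filter them after a review, then pay for completions on the survivors alone. The function names and the stubbed generator below are hypothetical, and the simple string check stands in for a real human or automated review.

```python
from typing import Callable, Dict, List, Tuple

GenerateFn = Callable[[str], str]  # hypothetical stand-in for an LLM call


def make_counting_stub() -> Tuple[GenerateFn, Dict[str, int]]:
    """Return a stub generator plus a counter of how many calls it served."""
    calls = {"n": 0}

    def generate(prompt: str) -> str:
        calls["n"] += 1
        return f"[output for: {prompt}]"

    return generate, calls


def generate_instructions(generate: GenerateFn, topics: List[str]) -> List[str]:
    """Stage 1: produce instructions only, no completions yet."""
    return [generate(f"Write an instruction about {topic}") for topic in topics]


def review(instructions: List[str], accept: Callable[[str], bool]) -> List[str]:
    """Checkpoint: keep only the instructions the reviewer accepts."""
    return [text for text in instructions if accept(text)]


def generate_completions(
    generate: GenerateFn, instructions: List[str]
) -> List[Dict[str, str]]:
    """Stage 2: spend generation calls only on approved instructions."""
    return [{"instruction": text, "response": generate(text)} for text in instructions]


generate, calls = make_counting_stub()
drafts = generate_instructions(generate, ["sorting", "", "caching"])
# Placeholder review rule: reject the draft that came from an empty topic.
approved = review(drafts, accept=lambda text: "about ]" not in text)
dataset = generate_completions(generate, approved)
```

The counter makes the cost argument visible: the rejected draft never triggers a completion call.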

@burtenshaw
Contributor Author

> @burtenshaw what pipelines do you intend to create?
>
>   • SFT
>   • DPO
>   • Classification

@burtenshaw In this PR, I plan to just implement SFT and share that. But I think we could move on to DPO and classification straight after.

> Should we also have individual components like completion generation (if you only want an additional completion), instruction generation (which can be done with magpie and inheriting with only_instruction=True), preference (without generating instruction/completions from scratch)?
>
> I can imagine people doing these things in steps and reviewing the quality of the generations in between before racking up costs with a complete pipeline.

I agree that this is useful to more basic users. But this is also potentially complex. I don't want to end up abstracting the Magpie API in multiple ways, because the user could just use a proper pipeline. I would prefer to get feedback first.

@davidberenstein1957
Member

> > @burtenshaw what pipelines do you intend to create?
> >
> >   • SFT
> >   • DPO
> >   • Classification
>
> @burtenshaw In this PR, I plan to just implement SFT and share that. But I think we could move on to DPO and classification straight after.
>
> > Should we also have individual components like completion generation (if you only want an additional completion), instruction generation (which can be done with magpie and inheriting with only_instruction=True), preference (without generating instruction/completions from scratch)?
> >
> > I can imagine people doing these things in steps and reviewing the quality of the generations in between before racking up costs with a complete pipeline.
>
> I agree that this is useful to more basic users. But this is also potentially complex. I don't want to end up abstracting the Magpie API in multiple ways, because the user could just use a proper pipeline. I would prefer to get feedback first.

Sure, sounds fair, I just wanted to see what your thoughts were surrounding the features.

src/distilabel/pipeline/templates/instruction.py: 2 review threads (outdated, resolved)
# See the License for the specific language governing permissions and
# limitations under the License.

from .instruction import InstructionResponsePipeline # noqa: F401

will you also add this to CLI commands?
something like the following would be cool
distilabel pipelines sft --num-rows 100 --n-turns 1
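
A minimal sketch of how such a command line could be parsed, using only Python's standard `argparse`. The `pipelines sft` subcommand layout and the `--num-rows` / `--n-turns` flags mirror the suggestion above; they are not an existing distilabel CLI, and the default values are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `distilabel pipelines <preset>` parser."""
    parser = argparse.ArgumentParser(prog="distilabel")
    commands = parser.add_subparsers(dest="command", required=True)

    pipelines = commands.add_parser("pipelines", help="Run a preset pipeline")
    presets = pipelines.add_subparsers(dest="preset", required=True)

    sft = presets.add_parser("sft", help="Instruction/response (SFT) preset")
    sft.add_argument("--num-rows", type=int, default=10)  # assumed default
    sft.add_argument("--n-turns", type=int, default=1)  # assumed default
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or fall back to sys.argv when it is None."""
    return build_parser().parse_args(argv)


# Equivalent to: distilabel pipelines sft --num-rows 100 --n-turns 1
args = parse(["pipelines", "sft", "--num-rows", "100", "--n-turns", "1"])
```

The parsed values could then be handed to the preset's constructor, which keeps the CLI a thin layer over the Python API.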

Contributor Author

That's nice, but we should do it in another PR.

@burtenshaw burtenshaw marked this pull request as ready for review November 18, 2024 16:21
@burtenshaw burtenshaw requested review from plaguss and removed request for dvsrepo November 18, 2024 16:21
Contributor

@plaguss plaguss left a comment

I like the idea! I just left some small notes. If you can test it and generate a sample dataset that we can link in the docs, that would be perfect.

docs/sections/getting_started/quickstart.md: 2 review threads (outdated, resolved)
src/distilabel/pipeline/templates/instruction.py: 2 review threads (outdated, resolved)
@plaguss plaguss added the enhancement New feature or request label Nov 19, 2024
@burtenshaw
Contributor Author

burtenshaw commented Nov 19, 2024

@sdiazlor this is the abstraction for SFT in distilabel.

@burtenshaw
Contributor Author

> I like the idea! I just left some small notes. If you can test it and generate a sample dataset that we can link in the docs, that would be perfect.

Here's a dataset: https://huggingface.co/datasets/argilla/distilabel-sft-easy

@plaguss plaguss changed the base branch from main to develop November 19, 2024 14:41
@burtenshaw burtenshaw merged commit cfe8c05 into develop Nov 19, 2024
7 of 8 checks passed
@burtenshaw burtenshaw deleted the feat/pipeline-usecase-abstraction branch November 19, 2024 18:35
@burtenshaw burtenshaw restored the feat/pipeline-usecase-abstraction branch November 19, 2024 19:07