-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implenent basic SFT pipeline based on synthetic data generator #1059
Conversation
CodSpeed Performance ReportMerging #1059 will not alter performanceComparing 🎉 Hooray!
|
@burtenshaw what pipelines do you intend to create?
Should we also have individual components like completion generation (if you only want an additional completion), instruction generation (which can be done with magpie and inheriting with I can imagine people doing these things in steps and reviewing the quality of the generations in between before racking up costs with a complete pipeline. |
for more information, see https://pre-commit.ci
@burtenshaw In this PR, I plan to just implement SFT and share that. But I think we could move on to DPO and classification straight after.
I agree that this is useful to more basic users. But this is also potentially complex. I don't want to end up abstracting the Magpie API in multiple ways, because the user could just use a proper pipeline. I would prefer to get feedback first. |
Sure, sounds fair, I just wanted to see what your thoughts were surrounding the features. |
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from .instruction import InstructionResponsePipeline # noqa: F401 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will you also add this to CLI commands?
something like the following would be cool
distilabel pipelines sft --num-rows 100 --n-turns 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's nice, but we should do it another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea! Just let some small notes. If you can test it and generate a sample dataset that we can link in the docs that would be perfect
@sdiazlor this is the abstraction for SFT in distilabel. |
Here's a dataset: https://huggingface.co/datasets/argilla/distilabel-sft-easy |
This is just an idea for how to make distilabel crazy-user-friendly. In this example, an SFT pipeline is abstracted away to make it easy to use. Like this:
Maybe we could do this for other core tasks like: DPO, classification, retrieval