[Epic] Automated Data Science bootstrap from curated content sets #402

Open
10 tasks
codificat opened this issue May 3, 2022 · 12 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/key-result This is a Key Result we want to achieve. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/user-experience Issues or PRs related to the User Experience of our Services, Tools, and Libraries. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@codificat
Member

codificat commented May 3, 2022

Problem statement

As a Data Scientist,
I want a service that provides me an easy mechanism to bootstrap a new Data Science project, starting from a curated software stack that is appropriate for my project’s goals and available in a shared environment,
so that I can quickly start working on the Data Science project tasks without having to invest time in preparing a working environment, and I can be confident that the project is reproducible and maintainable.

High-level Goals

When starting a new Data Science project from scratch, the user interacts with a git forge to obtain a git repository populated from a relevant curated software stack. Bots keep the repository up to date with recommendations and make the project readily available for work on the Data Science tasks.

This involves:

  • A catalog of curated software stacks (currently we have predictable stacks for Image Processing, Computer Vision and Natural Language Processing)
  • Template repositories that can be used to bootstrap the DS project
  • A "bootstrap" command for the bot
  • Pipelines that create a working build of the project content
  • (optional) an online Open Data Hub environment that hosts a running version of the project

Proposal description

Phase 1

As a Data Scientist,
I want to be able to bootstrap a new GitHub repository from an existing template that contains a curated software stack that is relevant to my project.

  1. User is pointed to the relevant template repository
  2. User initializes a repo from the template
  3. The user's new repo contains a ready-to-use software stack with clearly documented next steps
  4. User installs Kebechet in the repo so that it receives automated PRs in the future with update recommendations

Phase 1.5 is: automate phase 1 with a script.
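
A minimal sketch of what such a script could look like, assuming the template is a GitHub template repository and using GitHub's REST endpoint for creating a repository from a template (`POST /repos/{template_owner}/{template_repo}/generate`). The owner and repository names in the usage note are hypothetical placeholders, and the function only builds the request so it can be inspected before anything is sent.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_generate_request(template_owner, template_repo, new_owner, new_name,
                           token, private=False):
    """Build (but do not send) the request that creates a repo from a template."""
    url = f"{GITHUB_API}/repos/{template_owner}/{template_repo}/generate"
    payload = {"owner": new_owner, "name": new_name, "private": private}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def create_repo_from_template(request):
    """Send the prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `create_repo_from_template(build_generate_request("example-org", "template-nlp", "my-user", "my-nlp-project", token))` would create `my-user/my-nlp-project`, assuming `example-org/template-nlp` exists and is marked as a template. Installing Kebechet (step 4) is not covered by this sketch.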

Phase 2

As a Data Scientist,
I want to open an Issue "please create an Image Processing notebook" on GitHub that triggers Thoth bot to start populating my repository:

  1. User initializes an empty repo and installs Kebechet
  2. User opens an issue in the repo, e.g. "New content set"
  3. Bot bootstraps the repo
  4. Bot kicks off the Bring-Your-Own-Notebook workflow
  5. User enters the notebook spawner on ODH@op1st, sees the spawnable notebook image and can start it
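
As a rough illustration of steps 2 and 3, the bot needs some way to map the issue text to an entry in the catalog. The sketch below is only one possible approach, a plain substring match on the issue title. The catalog contents and template names are hypothetical placeholders, and none of this reflects how Kebechet actually handles issues.

```python
import re

# Hypothetical catalog: stack name -> template location (placeholder values).
CATALOG = {
    "image processing": "example-org/template-image-processing",
    "computer vision": "example-org/template-computer-vision",
    "natural language processing": "example-org/template-nlp",
}


def match_stack(issue_title):
    """Return the catalog key named in the issue title, or None if none matches."""
    # Lowercase and collapse whitespace so "Image   Processing" still matches.
    normalized = re.sub(r"\s+", " ", issue_title.lower())
    for name in CATALOG:
        if name in normalized:
            return name
    return None
```

With this sketch, the title "please create an Image Processing notebook" resolves to the "image processing" entry, while a generic title such as "New content set" resolves to nothing, so the bot would have to ask the user which stack they want.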

Alternatives

User manually doing each step

Additional context

Acceptance Criteria

  • A service entry point / welcome page provides:
    • A catalog of curated software stacks.
    • Clear and concise instructions on how to use the service
    • Additional documentation and references to the components involved (Thoth advise, pipelines, byon/odh...)
  • Template repositories contain the curated software stacks
  • Tooling exists to streamline the creation of the new repo from the templates
  • A "bootstrap" command for the bot causes the repo to be pre-populated with the chosen stack content
  • Build pipelines create a working build of the new project content
  • (optional) an online Open Data Hub environment hosts a running version of the project
@codificat codificat added the kind/feature Categorizes issue or PR as related to a new feature. label May 3, 2022
@sesheta sesheta added needs-triage Indicates an issue or PR lacks a `triage/...` label and requires one. needs-sig labels May 3, 2022
@codificat
Member Author

/kind key-result

@sesheta sesheta added the kind/key-result This is a Key Result we want to achieve. label May 3, 2022
@fridex
Contributor

fridex commented May 3, 2022

We have discussed an approach that would reuse logic for template projects. Let's sync if we want to develop and maintain this type of logic in Kebechet.

@durandom

durandom commented May 4, 2022

A "bootstrap" command for the bot causes the repo to be pre-populated with the chosen stack content

I suggest making this the first milestone. It's ok to assume the user knows which stacks are available. So that

  • I start with an empty repo
  • I open an issue and ask the bot to create a PR for e.g. the Image Recognition Stack
  • I can merge the PR and have a configured repo, which receives update recommendations via PRs in the future

does that make sense?

@goern
Member

goern commented May 5, 2022

/sig user-experience

@sesheta sesheta added sig/user-experience Issues or PRs related to the User Experience of our Services, Tools, and Libraries. and removed needs-sig labels May 5, 2022
@codificat
Member Author

A "bootstrap" command for the bot causes the repo to be pre-populated with the chosen stack content

I suggest making this the first milestone.

There is a prerequisite to this: the stack content must be readily usable. This fits with Frido's comment about template projects.

So that

  • I start with an empty repo

... and install the bot.

  • I open an issue and ask the bot to create a PR for e.g. the Image Recognition Stack
  • I can merge the PR and have a configured repo,

These 3 steps can potentially be reduced to one by just using the template project logic.

As this is a prerequisite anyway, I suggest we make that the first milestone. The bot automation can be a follow-up.

I updated the description a bit to reflect that, with the template logic being "phase 1".

Makes sense?

@codificat codificat changed the title Automated Data Science bootstrap from curated content sets [Epic] Automated Data Science bootstrap from curated content sets May 6, 2022
@MichaelClifford

Not sure how you plan to implement this, but it sounds like it would require the addition of an ever-growing set of template repos (is that right?). Have you considered using a cookie-cutter [1] like repo that would serve as a single repo, with dynamic options that can be applied on new repo creation based on the user's specific needs?

You can also look at [2] our ds cookie cutter repo for a very simple example.

[1] https://github.com/cookiecutter/cookiecutter
[2] https://github.com/aicoe-aiops/cookiecutter-data-science
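
To make the "single repo with dynamic options" idea concrete, here is a minimal standard-library sketch. It does not use cookiecutter's own API or template format. It only illustrates one template being rendered differently depending on options chosen at creation time, and all names and file contents are hypothetical.

```python
from string import Template

# One shared template; the chosen options fill in the placeholders.
README_TEMPLATE = Template(
    "# $project_name\n\nBootstrapped from the $stack stack.\n"
)


def render_project(project_name, stack, with_overlays=False):
    """Return a mapping of relative file path -> content for the new repo."""
    files = {
        "README.md": README_TEMPLATE.substitute(
            project_name=project_name, stack=stack
        ),
    }
    # Optional parts are only generated when the user asks for them.
    if with_overlays:
        files["overlays/README.md"] = f"Overlays for the {stack} stack go here.\n"
    return files
```

Each additional stack or option adds a branch or a placeholder to the single template instead of adding a whole new repository, which is the trade-off this suggestion raises.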

@codificat
Member Author

Not sure how you plan to implement this, but it sounds like it would require the addition of an ever-growing set of template repos (is that right?).

Correct, that was the idea: start initially with the 3 "predictable stacks" we have been working on, but eventually offer more options.

Have you considered using a cookie-cutter [1] like repo

I saw cookie-cutter and the work you are doing with it, and I was planning to look closer at it, starting with one of the repos (see mention of cookie-cutter/your template in thoth-station/ps-nlp#154 as an option to explore).

But I was still thinking of separate repos. Honestly, this did not occur to me:

that would serve as a single repo, with dynamic options that can be applied on new repo creation based on the user's specific needs?

Thanks for the suggestion! I will look closer.

An initial question that comes to mind, though: wouldn't that single repo become too big/complex? e.g. the NLP stack alone already has 4 overlays. One goal of this functionality is to be simple, easy to understand - it is meant to bootstrap/get started, and I am wondering if we would potentially be over-complicating the starting point.

@MichaelClifford

wouldn't that single repo become too big/complex? e.g. the NLP stack alone already has 4 overlays.

It's certainly a trade-off to consider: managing one complex repo vs. the complexity of managing multiple simple repos. Again, it depends on how you plan to implement this. I was just presenting a possible suggestion/alternative.

Would it be as or more complex than https://github.com/operate-first/apps ?

@codificat
Member Author

It's certainly a trade-off to consider: managing one complex repo vs. the complexity of managing multiple simple repos. Again, it depends on how you plan to implement this. I was just presenting a possible suggestion/alternative.

Yes, and thanks again for the suggestion, it is being considered.

Would it be as or more complex than https://github.com/operate-first/apps ?

It would not be as complex as that one, no.

/milestone OKR review Q2 2022

@codificat
Member Author

/milestone OKR review Q2 2022

@sesheta sesheta added this to the OKR review Q2 2022 milestone May 30, 2022
@codificat
Member Author

/triage accepted
/lifecycle active
/assign

@sesheta sesheta added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/...` label and requires one. labels May 31, 2022
@codificat
Member Author

/remove-lifecycle active
as the focus this quarter is on a different KR

@sesheta sesheta removed the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Aug 23, 2022
@Gkrumbach07 Gkrumbach07 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Aug 23, 2022
@codificat codificat moved this to 📋 Backlog in Planning Board Sep 26, 2022
Projects
Status: 📋 Backlog
Development

No branches or pull requests

7 participants