Tool that does content addressable storage only? #152
Replies: 2 comments
-
Hi @indigoviolet! Sorry for the late response; I was away on vacation for a while. Thanks for the thoughtful proposal!

I very much agree that managing data pipelines--especially large and/or complex ones--is a stumbling point for both Dud and DVC. Many production/"real world" scenarios rely on purpose-built workflow systems such as Airflow, Argo, Kubeflow, etc. The list of such workflow systems is long and growing at an accelerating rate these days. (FWIW, I'm most excited about Flyte.) It is largely impractical to rely on Dud/DVC for these scenarios. I wrestled with omitting the pipeline functionality altogether in Dud for the exact same reason you state.

Ultimately, I decided to include pipeline functionality to keep some level of parity with DVC. And crucially, by design, the pipeline feature in Dud is orthogonal to data management. That is, you could ignore pipelines altogether (i.e., never add

It's worth mentioning the main benefit you miss by not using content-aware pipelines (such as in Dud or DVC): work avoidance. Decoupled pipeline and data management tools will need some sort of common API to communicate when artifacts are out of date. I would love to talk more about this topic too; for example, how would we make Airflow integrate with Dud?

Thanks again for putting some thought into this! I definitely appreciate the idea of further applying the UNIX philosophy to tools in this area. I'm looking forward to continuing our discussion!
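To make the "work avoidance" point concrete, here is a minimal sketch of the kind of check a decoupled orchestrator could ask a data tool to perform before running a step. It is not Dud's or DVC's actual implementation or API: the function names, the use of SHA-256, and the shape of the `recorded` mapping are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_up_to_date(recorded: dict[str, str], root: Path) -> bool:
    """Report whether every artifact still matches its recorded checksum.

    `recorded` is a hypothetical mapping of relative artifact paths to the
    checksums stored when the step last ran. A missing or changed file
    means the step has to run again.
    """
    for rel_path, expected in recorded.items():
        artifact = root / rel_path
        if not artifact.is_file() or checksum(artifact) != expected:
            return False
    return True
```

An external workflow system could call something like `is_up_to_date` at the start of a task and skip the task when it returns `True`. Agreeing on where `recorded` lives and who updates it is the "common API" question raised above.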
-
@indigoviolet Did you happen to make that `cad` tool, by any chance? ^^ I'm trying to replace `dvc`, and some of my colleagues work on Windows.
-
@kevin-hanselman: Thanks for making `dud`, I agree with you about the problems with `dvc` and was very glad to find someone had already built a solid tool to fix them.

In thinking about this, I wanted to go even further: make a tool that only did the data versioning, and leave the pipeline orchestration to other tools. Since you have spent a lot of time thinking about this space, I'd love to get your thoughts on this.

I will illustrate with an imaginary tool called `cad`, but this is obviously extremely similar to `dud` or `dvc`.

You provide it with a `.cadrc`, which is a gitignore-like file that specifies patterns of files that should not be checked in, and are instead managed by `cad`. (This is different from `dud` or `dvc`, where stage outputs define the managed files.)

`cad` also has a config file specifying a local store and optionally a remote store.

When you run `cad` in a folder, it (recursively) `<file>.cad` is present and contains the hash of the corresponding data file.

As with `dud` or `dvc`, you check in the `.cad` files.

What problems do you see with such a tool? (And if you like this minimalist approach, would you consider having a mode in `dud` that does this? You could auto-create the .yml files, pretending it was a kind of trivial "identity" stage, and then do the rest of the sync dance.)
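
A minimal sketch of the core of the imaginary `cad` described above, to make the proposal easier to discuss. Everything specific here is an assumption, not part of the proposal: SHA-256 as the hash, `fnmatch`-style globs standing in for gitignore-like patterns, a store laid out as `<store>/<first two hex characters>/<rest of hash>`, and the function names themselves. Remote stores and checkout are left out.

```python
from __future__ import annotations

import fnmatch
import hashlib
import shutil
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_patterns(root: Path) -> list[str]:
    """Read glob patterns from the hypothetical .cadrc, skipping blanks and comments."""
    rc = root / ".cadrc"
    if not rc.is_file():
        return []
    lines = (line.strip() for line in rc.read_text().splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def commit(root: Path, store: Path) -> dict[Path, str]:
    """Copy each file matching .cadrc into the store and write `<file>.cad`.

    Returns a mapping of managed file paths to their content hashes.
    """
    patterns = load_patterns(root)
    committed: dict[Path, str] = {}
    # sorted() materializes the listing first, so the pointer files written
    # below are not picked up during the same walk.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix == ".cad" or path.name == ".cadrc":
            continue
        if store in path.parents:  # do not re-ingest the store itself
            continue
        rel = path.relative_to(root).as_posix()
        if not any(fnmatch.fnmatch(rel, pat) or fnmatch.fnmatch(path.name, pat)
                   for pat in patterns):
            continue
        digest = file_hash(path)
        dest = store / digest[:2] / digest[2:]
        if not dest.exists():  # identical content is stored only once
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, dest)
        # The pointer file is what would be checked in to git.
        path.with_name(path.name + ".cad").write_text(digest + "\n")
        committed[path] = digest
    return committed
```

With a `.cadrc` containing `*.bin`, calling `commit(Path("."), Path("/some/store"))` would leave a `.cad` pointer next to each `.bin` file. A real tool would also need to keep the managed files out of git, for instance by deriving ignore rules from the same patterns.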