Tool that does content addressable storage only? #152

indigoviolet · 2023-07-22T07:03:18Z

@kevin-hanselman: Thanks for making dud. I agree with you about the problems with dvc and was very glad to find someone had already built a solid tool to fix them.

In thinking about this, I wanted to go even further: make a tool that only did the data versioning, and leave the pipeline orchestration to other tools. Since you have spent a lot of time thinking about this space, I'd love to get your thoughts on this.

I will illustrate with an imaginary tool called cad, but this is obviously extremely similar to dud or dvc.

1. You provide it with a .cadrc, which is a gitignore-like file that specifies patterns of files that should not be checked in, and are instead managed by cad. (This is different from dud or dvc, where stage outputs define the managed files.)
2. cad also has a config file specifying a local store and optionally a remote store.
3. When you run cad in a folder, it (recursively):
   - finds all data files matching the (nearest) .cadrc, and ensures that a corresponding metadata file <file>.cad is present and contains the hash of the corresponding data file
   - replaces every data file with a symlink to the file in the local store (which is named for the hash)
   - for any metadata files missing their data files, creates the symlink to the local store
   - syncs the local store to the remote store as necessary
   - potentially: manages a section in a nearby .gitignore to filter out all the files in .cadrc
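
To make the steps above concrete, here is a rough Python sketch of one pass of the imaginary cad tool. It is only an illustration under several assumptions that are not part of the proposal: a single .cadrc at the root (not the nearest one), patterns matched against file names with `fnmatch` rather than full gitignore semantics, SHA-256 as the hash, and a store located by a function argument instead of a config file. Remote sync and .gitignore management are left out. All function names are hypothetical.

```python
import fnmatch
import hashlib
import os
import shutil
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_patterns(cadrc: Path) -> list[str]:
    """Read non-empty, non-comment lines from a .cadrc file (assumed format)."""
    stripped = (line.strip() for line in cadrc.read_text().splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


def cad(root: Path, store: Path) -> None:
    """Run one pass of the hypothetical cad tool over `root`."""
    root, store = root.resolve(), store.resolve()
    store.mkdir(parents=True, exist_ok=True)
    patterns = load_patterns(root / ".cadrc")
    # Snapshot the tree first so files created below are not revisited.
    entries = [p for p in sorted(root.rglob("*")) if store not in p.parents]

    # Hash each matching data file, write <file>.cad, and swap in a symlink.
    for path in entries:
        if path.is_symlink() or not path.is_file() or path.suffix == ".cad":
            continue
        if not any(fnmatch.fnmatch(path.name, pat) for pat in patterns):
            continue
        digest = file_hash(path)
        Path(f"{path}.cad").write_text(digest + "\n")
        stored = store / digest
        if stored.exists():
            path.unlink()  # identical content is already in the store
        else:
            shutil.move(str(path), str(stored))
        path.symlink_to(stored)

    # Recreate symlinks for metadata files whose data file is missing.
    for meta in list(root.rglob("*.cad")):
        if store in meta.parents:
            continue
        data = meta.with_suffix("")  # "x.bin.cad" -> "x.bin"
        if not os.path.lexists(data):
            data.symlink_to(store / meta.read_text().strip())
```

Calling `cad(Path("."), Path.home() / ".cad-store")` would then play the role of running cad in a folder. Note that the second loop creates the symlink even if the object is not yet in the local store; a real tool would presumably fetch it from the remote first.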

As with dud or dvc, you check in the .cad files.

What problems do you see with such a tool? (and if you like this minimalist approach, would you consider having a mode in dud that does this? You could auto-create the .yml files, pretending it was a kind of trivial "identity" stage, and then do the rest of the sync dance).

kevin-hanselman · 2023-08-13T21:55:38Z

Maintainer

Hi @indigoviolet! Sorry for the late response; I was away on vacation for a while.

Thanks for the thoughtful proposal! I very much agree that managing data pipelines--especially large and/or complex ones--is a stumbling point for both Dud and DVC. Many production/"real world" scenarios rely on purpose-built workflow systems such as Airflow, Argo, Kubeflow, etc. The list of such workflow systems is long and growing at an accelerating rate these days. (FWIW, I'm most excited about Flyte.) It is largely impractical to rely on Dud/DVC for these scenarios.

I wrestled with omitting the pipeline functionality altogether in Dud for the exact same reason you state. Ultimately, I decided to include pipeline functionality to keep some level of parity with DVC. And crucially, by design, the pipeline feature in Dud is orthogonal to data management. That is, you could ignore pipelines altogether (i.e., never add commands to your stages) and just use the data versioning functionality. I think this is what you're alluding to in your final paragraph.

If you're up for it, I would challenge you to build this so-called cad as a thin wrapper around Dud; maybe just start with pseudo-code or a shell script. I hypothesize that cad may boil down to managing one stage file, probably with #23 implemented for quality-of-life 😄.

It's worth mentioning the main benefit you miss by not using content-aware pipelines (such as in Dud or DVC): work avoidance. Decoupled pipeline and data management tools will need some sort of common API to communicate when artifacts are out of date. I would love to talk more about this topic too; for example, how would we make Airflow integrate with Dud?
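
As a starting point for that conversation, here is a minimal Python sketch of what a shared "is this artifact out of date?" check could look like. This is not Dud's or Airflow's API; the names `is_stale` and `record`, the JSON record file, and the use of SHA-256 are all assumptions made for illustration. The idea is that the data tool records content hashes after a step runs, and the orchestrator asks before re-running it.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record(record_path: Path, inputs: list[Path], outputs: list[Path]) -> None:
    """Store the current hashes of a step's inputs and outputs (hypothetical format)."""
    hashes = {str(p): file_hash(p) for p in [*inputs, *outputs]}
    record_path.write_text(json.dumps(hashes, indent=2))


def is_stale(record_path: Path, inputs: list[Path], outputs: list[Path]) -> bool:
    """Return True if the step should be re-run.

    A step is considered stale when no record exists, when any input or
    output is missing, or when any current hash differs from the recorded one.
    """
    if not record_path.exists():
        return True
    recorded = json.loads(record_path.read_text())
    for path in [*inputs, *outputs]:
        if not path.exists() or recorded.get(str(path)) != file_hash(path):
            return True
    return False
```

An orchestrator task could call `is_stale(...)` first, skip its work when the result is `False`, and call `record(...)` after a successful run. Whether such a check belongs in the data tool, the orchestrator, or a small shared library is exactly the open design question.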

Thanks again for putting some thought into this! I definitely appreciate the idea of further applying the UNIX philosophy to tools in this area. I'm looking forward to continuing our discussion!

mpizenberg · 2023-11-04T08:05:12Z

@indigoviolet Did you happen to make that cad tool, by any chance? ^^ I'm trying to replace dvc, and some of my colleagues work on Windows.
