Replies: 3 comments
-
Hi, @veriditin, and thanks for the thoughtful post! You make a good point. Committing links to Git shouldn't inherently cause problems. However, I think the biggest risk is that you could accidentally commit something other than a link to Git. For example, what if you forget to
Ultimately, the decision to gitignore files tracked by Dud is a safeguard against simple mistakes. I still recommend that you do so, but it's your decision to make for your projects.
Regarding the perceived tedium of gitignoring every binary, I would recommend using glob patterns in your .gitignore files, and multiple .gitignore files, to make this a lot easier. For example, if your datasets are composed of image files, you might add
I think this thread is a better fit for a discussion, so I'll be transferring it over there. Thanks again for sharing your thoughts!
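The risk described in this reply, a regular file being committed where a link was expected, can be checked mechanically. Below is a minimal sketch in Python, assuming the tracked artifacts are left in the workspace as symbolic links (hard links cannot be detected this way). The helper `find_non_links` and the file names are hypothetical and are not part of Dud.

```python
import os
import tempfile
from typing import Iterable, List


def find_non_links(paths: Iterable[str]) -> List[str]:
    """Return the paths that exist on disk but are not symbolic links."""
    return [p for p in paths if os.path.lexists(p) and not os.path.islink(p)]


def demo() -> List[str]:
    """Build a throwaway directory holding one symlink and one regular file."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cached_blob")  # stands in for a cache entry
        link = os.path.join(tmp, "data.bin")       # a link, as expected after a commit
        stray = os.path.join(tmp, "edited.bin")    # a regular file left by mistake
        for path in (cached, stray):
            with open(path, "wb") as f:
                f.write(b"payload")
        os.symlink(cached, link)
        # Only the regular file should be reported; the symlink passes.
        offenders = find_non_links([link, stray])
        return [os.path.basename(p) for p in offenders]


if __name__ == "__main__":
    print(demo())
```

A check like this could run before `git add` on whatever list of artifact paths you maintain; it complements .gitignore entries and does not replace them.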
-
Fair point about the discussion, I did not know this existed!
The large binary problem will forever be inherent to git. Even without dud you want to avoid committing large binaries to git, so in comes git-lfs (and adding the file pattern to In this list
As a tangent, talking about this makes me realize how similar dud and git-lfs actually are. Content seems to be stored similarly, although the git-lfs cache is contained within the
-
Just for those reading this discussion and thinking the way I was: it's actually a good idea to add the files that are managed by dud to your .gitignore. Symlinks are based on the contents of the file, but you will sometimes need to edit the file contents and then you need to
Be warned :)
-
Dear developer(s)/Kevin,
We are evaluating tooling to use in our data pipelines, and after a discussion I saw on Hacker News, dud seemed like the tool to try due to its small scope and composability, as opposed to some other well-known tool in this space :)
The getting started guide is very nice; it's interesting to see how easily it works, and the decision to delegate the storage syncing to rclone is imo a great one. So thanks a lot :)
I am currently confused by one thing.
In the getting started guide you mention needing to add the files tracked by dud to your .gitignore, if you want to commit your data pipeline to a git repository (obviously, we do).
However, having played around with it: the files managed by dud are always in the cache, which is ignored by default, and once the files are dud-committed only hard-links remain in the actual repository. So what are the downsides of committing these hard-links to the git repo?
E.g. the viewer in GitLab handles it quite nicely, suggesting that it is indeed a hardlink to a content-addressed file in the cache:
and when trying to use the file in any script, we can clearly see that some step is still missing (e.g. performing the untarring manually):
which will immediately trigger you to think: Ah! dud fetch/dud pull
Adding all these files manually to a .gitignore is quite an annoying process, especially once your data pipeline is built up from multiple sources of data that are processed by different stages and produce results that are difficult to know completely in advance, and thus to ignore.
So, is committing the hardlinks to git completely fine, or am I missing something?