Feature idea: datalad get from other local source without copying files again #7674

Open
lnnrtwttkhn opened this issue Oct 23, 2024 · 2 comments

lnnrtwttkhn commented Oct 23, 2024

Description

Hi everyone, when following YODA principles, I regularly run into the following "issue": to keep datasets modular, I usually add them as subdatasets in an inputs directory, while they also exist at the project-directory level. Here is an example: I have a BIDS DataLad dataset (bids) that I add as a subdataset to my fmriprep DataLad dataset:

myproject
.
├── fmriprep
│   ├── code
│   └── inputs
│       └── bids (5992f12) # same DataLad dataset as below
└── bids (5992f12)

Now when I run fMRIPrep, I give it ./fmriprep/inputs/bids as the input path. But this requires running datalad get to actually retrieve the files of the BIDS dataset into that location. To speed this up, I usually configure a local DataLad sibling for ./fmriprep/inputs/bids with: datalad siblings add -s local --url ../../../bids. Then datalad get can retrieve the data from the sibling named local. But then the full BIDS dataset exists in two locations, which takes up additional disk space. Of course, I could datalad drop the files again afterwards. But, and here comes the idea, maybe there is a way to adjust the path so that the data does not have to be retrieved and copied again, while still staying in line with YODA principles.
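For reference, the current workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (run, fetch_from_local_sibling, drop_local_copy) and the DRY_RUN switch are hypothetical and not part of DataLad; the paths and the sibling name local are taken from the example tree. By default the commands are only printed, so the sketch can be read without touching any dataset.

```shell
# Print commands instead of executing them unless DRY_RUN=0 is set.
# (DRY_RUN and run are hypothetical helpers for this sketch only.)
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: path to the subdataset, e.g. fmriprep/inputs/bids
# $2: location of the other local copy, relative to the subdataset
fetch_from_local_sibling() {
    (
        cd "$1" &&
        # Register the project-level dataset as a sibling named "local" ...
        run datalad siblings add -s local --url "$2" &&
        # ... and retrieve file content from that sibling only.
        run datalad get -s local .
    )
}

# $1: path to the subdataset whose file content should be dropped again
drop_local_copy() {
    (cd "$1" && run datalad drop .)
}

# Example, run from myproject/:
#   fetch_from_local_sibling fmriprep/inputs/bids ../../../bids
#   ... run fMRIPrep on fmriprep/inputs/bids ...
#   drop_local_copy fmriprep/inputs/bids
```

Between the get and the drop, the file content exists twice on disk, which is exactly the overhead this issue is about.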

I am not even sure whether this is something that can or should be handled on the DataLad side, but maybe you know other nice workarounds for this? Thanks!

Member

yarikoptic commented Oct 23, 2024

Edit: as for the organization of subdatasets, have a look at the derived datasets produced by the OpenNeuro folks: https://github.com/OpenNeuroDerivatives/

@yarikoptic
Member

What struck me afterwards is that --reckless=ephemeral wouldn't work with datalad-containers as is: it would need ad-hoc adjustments to bind-mount some higher-level folder that the symlinked .git/annex points into. So it wouldn't be a complete solution on its own.
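To make the bind-mount concern concrete, here is a small sketch for checking where the annex of a clone physically lives. It assumes, as far as I understand ephemeral clones, that .git/annex in the clone is a symlink into the origin dataset. The helper names annex_target and annex_inside are hypothetical, and the clone command in the comment is only an illustrative invocation using the paths from the example tree.

```shell
# An ephemeral clone could be created along these lines (from myproject/):
#   datalad clone -d fmriprep --reckless ephemeral bids fmriprep/inputs/bids

# Print the physical directory that <clone>/.git/annex resolves to.
# $1: path to the clone
annex_target() {
    (cd "$1/.git/annex" && pwd -P)
}

# Succeed if the annex of <clone> lives below <root>, i.e. if bind-mounting
# <root> into a container would also expose the symlinked annex content.
# $1: path to the clone, $2: candidate folder to bind-mount
annex_inside() {
    target=$(annex_target "$1") || return 1
    root=$(cd "$2" && pwd -P) || return 1
    case "$target/" in
        "$root"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, run from myproject/:
#   annex_inside fmriprep/inputs/bids fmriprep || echo "mount a higher-level folder"
```

In the example layout, mounting only fmriprep would leave the symlinks dangling inside the container, whereas mounting myproject would cover both the clone and the annex it points to.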
