
duplicate data in directory structure #16

Open
rgreminger opened this issue Mar 3, 2020 · 2 comments

@rgreminger

The proposed directory structure stores the same data in multiple locations, since it gets duplicated into (possibly multiple) input folders. The input folders add clarity to the workflow, but an approach like this could use up a lot of disk space very quickly (unless symbolic links are used, though I doubt those work reliably across different platforms).
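For illustration, something like the following sketch (the folder names are placeholders, not the actual template) would try a symlink first and fall back to a plain copy on platforms where symlinks are not available, which is roughly the trade-off I have in mind:

```r
# Sketch: populate an input folder via a symlink where possible,
# falling back to a copy (i.e., duplication) otherwise.
link_or_copy <- function(from, to) {
  dir.create(dirname(to), recursive = TRUE, showWarnings = FALSE)
  ok <- suppressWarnings(file.symlink(from, to))
  if (!isTRUE(ok)) {
    file.copy(from, to, overwrite = TRUE)  # fallback: duplicate the file
  }
}

# placeholder paths for a prep -> analysis hand-off
link_or_copy("data-preparation/output/dataset.csv",
             "analysis/input/dataset.csv")
```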

@hannesdatta
Collaborator

The key is to keep different pipeline stages portable, i.e., you can work on the analysis while I have prepped the dataset. I know that for the main project you do end up with a lot of duplicate files. I'm kind of fine with that because disk space is cheap, but if you can find a better solution, let me know.

Another issue: for this minimal example, we could host a zip with the raw data on TilburgScienceHub, as we strictly want to avoid teaching that you can store your data on GitHub. Makes sense? Downloading the data via an R script is platform-independent...
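Something along these lines (the URL is just a placeholder, not where the zip would actually live) runs the same way on any OS:

```r
# Sketch: grab the raw-data zip and unpack it into data/raw.
url  <- "https://example.com/raw-data.zip"   # placeholder; would point to the zip on TilburgScienceHub
dest <- file.path(tempdir(), "raw-data.zip")

download.file(url, destfile = dest, mode = "wb")  # "wb" so Windows doesn't mangle the zip
unzip(dest, exdir = "data/raw")
```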

@rgreminger
Author

Portability is definitely a good point. I'll try to implement this in the example sometime soon, but one thing is a bit unclear to me from the site (though I might just have missed it): what is the best approach to keep the input folders up to date with upstream changes? Should this be done by the upstream stage, or by the downstream one?
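To make the question concrete: one option would be for the downstream stage to refresh its input folder from the upstream output at the start of its script, roughly like this (paths are placeholders, and I'm not sure this is the intended pattern):

```r
# Sketch: downstream stage pulls the latest upstream output into its own input/.
upstream_out <- "../data-preparation/output"   # placeholder path
local_in     <- "input"                        # placeholder path

dir.create(local_in, showWarnings = FALSE)
file.copy(list.files(upstream_out, full.names = TRUE),
          local_in, overwrite = TRUE)
```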

Good idea regarding the raw data. I'll try adding the zip to the page through a PR, and will update the example.
