I've suggested that this lesson be split into two parts: one on workflows and one on environments. Part of this comes from user interviews I have done, which indicate that arranging files, code, etc. is one of the most important untaught skills. Plus, the lesson is kind of long and already has two distinct parts.
This goes a bit beyond "code" and is more about data. It's up for debate if this is on topic for us.
Here is my current backwards lesson design on the new first half (on workflows):
For who:
a) new researcher who is starting from nowhere, and needs to organize their work and use the different systems available to them properly (they have many choices).
b) existing researcher who has stuff spread all over and has made a mess
c) group leader who needs to keep their group's stuff in line.
Misc topics which may need covering, unordered:
How to arrange stuff
each "topic" gets a short name (slug)
you have different super-directories that can contain projects: ~/git, cluster:/scratch/, cluster:/project/, version control host, etc. Each possible machine/location has different trade-offs: backed up or not, long-term or not, shareable or not.
flat organization: each system/filesystem has one place to put stuff, non-nested.
A directory can be single use or multi-use:
single-use cases: code, software package repo, data
project dir: has subdirs for different purposes: code, data, scratch, results, papers
names should be unique and shouldn't be reused for different purposes. You can, however, reuse a name for directories in different locations that belong to the same project (e.g. in one location the directory holds the data). Ideally there are no duplicate files unless they are identical copies.
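A minimal sketch of the project-dir layout described above, assuming Python and a hypothetical slug (`wordcount`). The subdirectory names come from the list; the function name `create_project` is made up for illustration and is not part of any existing lesson material.

```python
from pathlib import Path

# Subdirectory names taken from the "project dir" item above.
PROJECT_SUBDIRS = ["code", "data", "scratch", "results", "papers"]


def create_project(root: Path, slug: str) -> Path:
    """Create root/slug with one subdirectory per purpose and return its path.

    `root` stands for one of the super-directories (e.g. ~/git or a cluster
    project area); `slug` is the short, unique project name.
    """
    project = root / slug
    for name in PROJECT_SUBDIRS:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (project / name).mkdir(parents=True, exist_ok=True)
    return project


if __name__ == "__main__":
    import tempfile

    # Demonstrate in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp), "wordcount")  # hypothetical slug
        print(sorted(p.name for p in project.iterdir()))
```

Using the same slug under each super-directory keeps the flat organization: one entry per project per filesystem, with the purpose-specific subdirectories inside it.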
how to synchronize things across systems:
small stuff: version control. This is always preferable
original data: can be synced manually. Always try to avoid manually syncing things that can change.
other synchronizers: unison, but is there anything more modern?
try as hard as you can to avoid
How the named directories can relate:
One can use another as a code library
One can use another as a data source
...?
Multi-person projects
sharing editable code is not a good idea. Sharing original data OK. sharing scratch data risky.
use a version control system to sync; each person has their own working copies, e.g. user1/proj1, user2/proj1, etc.
If you have a shared directory, each user makes their own workspace inside with their working stuff. e.g. proj/user1/, proj/user2/, etc. Each of these user dirs would have e.g. code/, scratch/, etc.
Avoid duplication and copy and paste
In order to explain the above, we need to invent consistent terms for the name directories and any other types of directories.
arranging files within directories
If your project is anything other than trivial, you will eventually want to automate it. Plan for that already.
types of automation: single code multiple data, multiple code single data, and combinations.
each file has certain source files
arrange your data into "parallel series" which you have a single command to run to generate output from inputs
TODO this needs to be finished and we need some way to explain this.
The snakemake example goes over the automation but, last I checked, does not cover the motivation/data setup enough. It may be enough to learn the arrangement passively, but we should make sure it is pointed out.
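One way to explain the "parallel series" idea could be a sketch like the following, assuming Python and the single-code-multiple-data case: every input in data/ maps to one output with the same stem in results/, and one call regenerates all of them. The word-count step, the `.txt`/`.count` extensions, and the function names are assumptions chosen for illustration, not the actual snakemake example.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """The single piece of code applied to every input."""
    return len(text.split())


def output_path(input_path: Path, results_dir: Path) -> Path:
    """Map data/NAME.txt to results/NAME.count (the parallel series)."""
    return results_dir / (input_path.stem + ".count")


def run_all(data_dir: Path, results_dir: Path) -> list[Path]:
    """Generate one output per input file and return the output paths."""
    results_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(data_dir.glob("*.txt")):
        dst = output_path(src, results_dir)
        dst.write_text(f"{count_words(src.read_text())}\n")
        outputs.append(dst)
    return outputs
```

Because the output name is derived from the input name, adding a new data file needs no change to the code, which is the property a workflow tool such as snakemake builds on.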
Possible exercises:
[more needed... could include "what is wrong with this setup", "organize these files", and so on.]
Interpret several makefiles and say whether they are SIMD, MISD, or MIMD, and how they work.
What type of problems can easily fit in the SIMD and MISD paradigms?
Which of these are not a good system for sharing files/code/data, and what can go wrong with it: github, email, personal webspaces, archive
Who has had stories about disasters of organization?
given a list of a lot of file names for a sample project, decide which go into which subdirs.
Evaluate a sample project with some sort of flaw... perhaps no separation of data (original and scratch) and code.
One episode should explain the concept of the word count repo: organize the current word count example into the directories needed to do the automation in the snakemake example.
There are many good thoughts here, but the risk is that this issue is too big for anybody to tackle a few days before teaching it. A symptom of this is that it has been open for almost 4 months now.
Also, before we go into a big redesign, we should check whether we are trying to reinvent something that The Turing Way is already doing; if so, we could contribute to their lessons instead.
If I leave it open, it will stay open forever. If I close it, I risk being rude and missing some good points. I know there are many good points here, but some go beyond the 2 h format we have. Maybe RD can point out the 2 or 3 most important points which we should absolutely implement?