Pixel file format #6
I will slightly update the structure of the test directory so it is easier to see which part of the code the test corresponds to.
Should we use the slightly more elegant way for fixtures described here?
Looks good from my side, no major changes. Minor refactors and text edits have been performed. The fixture question can be addressed in the next PR on snipping.
@Mittmich thank you for this PR, that's a lot of work and conceptual development! Generally, I'd approve this PR except for minor code-related issues, but I would like to discuss concepts in depth, especially relating to the organisation of the data folder and column names.
Some of the concepts I'd like to cover:
- Many things are shared between `Pixels` and `Contacts`. Would it make sense to create a parent class with the shared methods and then inherit `Pixels` and `Contacts` from it?
- What's the purpose of `ContactsParameters` and `PixelsParameters`? They are convenient for storing parameters and enforcing data schemas (like `dataclass`), but they look quirky in `FileManager.load_contacts` -- usually people specify function parameters with kwargs. Seems a bit non-pythonic to me, but that is also not a big deal here.
- Folder structure: I'd like to have human-readable names for the parquet files. That is good for visual inspection of files, for developing alternative libraries in other languages, and for future-proofing against changes in `json.dumps` and `hash`. To reduce the complexity of the nested folder structure, I propose that all tables within one folder are either Pixels or Contacts, have the same state of global parameters, and differ only in contact orders and compositions of labels. This means that a folder doesn't contain all the data from a single experiment, but rather only one transformed view of it.
- `metadata_combi` -- does it contain only sister labels, or will it also contain other fragment-level features, like 5mC marks? If not, shall we distinguish between "molecule" labels (sisterA/sisterB, DNA/RNA, etc.) and "fragment" labels (5mC level, open/closed chromatin, etc.)?
Otherwise, I generally agree with the proposed architecture! In my view, it would be great to discuss the code and our future directions so we can build a roadmap based on the proposal for the file format and query engine 😉
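To make the folder-structure point concrete, here is a minimal sketch of what a human-readable file name could look like if tables in one folder differ only in contact order and label composition. The helper `readable_name` and the naming scheme are hypothetical illustrations, not part of spoc:

```python
from typing import Optional, Sequence


def readable_name(kind: str, order: int,
                  labels: Optional[Sequence[str]] = None) -> str:
    """Build a human-readable parquet file name (hypothetical scheme)."""
    # Global parameters are assumed to be shared by the whole folder,
    # so only the contact order and label composition go into the name.
    parts = [kind, f"order-{order}"]
    if labels:
        parts.append("labels-" + "-".join(labels))
    return ".".join(parts) + ".parquet"


name = readable_name("pixels", 3, labels=["A", "B", "B"])
unlabelled = readable_name("contacts", 2)
```

A name built this way can be parsed by eye or by a library in another language, and it does not depend on how `json.dumps` or `hash` behave across versions.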
spoc/io.py
"""
metadata = self._load_metadata(path)
# find matching pixels
for pixel_path, value in metadata.items():
I was thinking about why this search is O(N), when it could be O(1). Seems like we cannot hash all PixelParameters. But why? For example, for `metadata_combi` we could use `tuple` instead of `list`, since this collection seems to be immutable by design. Then `PixelParameters` can be hashed to be `dict` keys, so `_load_metadata` will return `Dict[PixelParameters: path]`.
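A minimal sketch of the hashing idea, assuming a simplified stand-in for the parameters class. The class name `PixelParametersSketch`, its fields, and the file names are illustrative assumptions, not the actual spoc definitions:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


# Hypothetical stand-in for the parameters class; not the actual spoc schema.
@dataclass(frozen=True)
class PixelParametersSketch:
    number_fragments: int
    binsize: int
    # A tuple (not a list) keeps the whole object hashable.
    metadata_combi: Optional[Tuple[str, ...]] = None


# frozen=True together with the default eq=True generates __hash__,
# so instances can be used as dict keys.
index: Dict[PixelParametersSketch, str] = {
    PixelParametersSketch(2, 10_000, ("A", "B")): "pixels_a.parquet",
    PixelParametersSketch(3, 10_000): "pixels_b.parquet",
}

# Lookup by value is a single dict access instead of a scan over all entries.
query = PixelParametersSketch(2, 10_000, ("A", "B"))
found = index[query]
```

If `metadata_combi` were a list, constructing the object would still work, but hashing it (and therefore using it as a key) would raise `TypeError`.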
We change it so that the metadata file keys are global parameters and the values are filenames.
A JSON dump can't encode parameters as keys, because objects are not allowed as keys. We would need to serialize the object to a string and write a custom implementation for that. Since this search operation is trivial (the number of parameter sets is likely going to be small), I would vote for leaving it as is.
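For reference, a small sketch of the limitation and of the string-key workaround mentioned above. The parameter names and the helper `params_key` are hypothetical, not the actual spoc metadata layout:

```python
import json

# Hypothetical parameter set; keys are illustrative only.
params = {"number_fragments": 3, "binsize": 10000, "metadata_combi": ["A", "B", "B"]}

# json.dumps only accepts str, int, float, bool or None as dict keys,
# so a tuple (or any other object) as a key raises TypeError.
try:
    json.dumps({("A", "B"): "file.parquet"})
    tuple_key_ok = True
except TypeError:
    tuple_key_ok = False


def params_key(p: dict) -> str:
    """Serialize parameters to a canonical string usable as a JSON key."""
    # sort_keys makes the string independent of dict insertion order.
    return json.dumps(p, sort_keys=True, separators=(",", ":"))


metadata = {params_key(params): "file.parquet"}
encoded = json.dumps(metadata)

# Reading back: decode the outer mapping, then decode the key itself.
decoded = json.loads(encoded)
roundtrip = json.loads(next(iter(decoded)))
```

This is the extra serialization step the comment refers to; with only a handful of parameter sets per file, the linear scan avoids it at negligible cost.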
Ok, let's leave it like this for now
Here is a quick summary of some more points to discuss today:
One thing I have been thinking about since our discussion last week relates to the flat file structure and global parameters. I would say that the first analytical task researchers would like to do is to visualize all multi-way contacts (original, not expanded) in some genomic locus. Hence, there should be a way to pull the whole dataset from a Contacts or Pixels container. However, there are no guards, implemented or planned, to ensure that what a user pulls from the container represents the whole dataset. So, who should be responsible for the integrity of the data?
I think it should be the first case, but we have to be transparent about that. Yet it feels like this has to be hardcoded into some specific data representation ("this file comprises the whole original dataset") to guard users from their mistakes. Or has it, really? I would expect a dichotomy in use cases: it's either a complete dataset of original multi-way contacts in the form of either Contacts or Pixels with original coordinates or different bin sizes, or it's expanded contacts of a specified order (doublets, triplets or quadruplets) with different bin sizes (and maybe original coordinates). To specify that, I would originally have suggested the following strategy:
After our discussion, I would suggest another strategy: specifying "presets" of tables in a container. Say, we have a preset for "original contacts", "expanded triplets", "expanded 2-3-4-5-way contacts", etc. Each preset is generated with a specific method in spoc from Fragments, and the name of the preset is stored in the file as well. Thus we hardcode some "flavours" of multi-way contact data. This info can be used for verification by the user or by downstream programs. Also, to distinguish between original and expanded contacts, we should add another global parameter "expanded" with the options "no", "combinatorically", "adjacently", "non-adjacently" (or whatever would be neat and grammatically correct). What do you think about all that?
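A rough sketch of how the preset idea and the "expanded" parameter could be modelled. Everything here is an assumption for discussion: the names `Expanded`, `TablePreset`, `PRESETS` and `is_complete_dataset` do not exist in spoc, and the expansion mode assigned to each expanded preset is a guess:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Expanded(str, Enum):
    """Proposed global parameter; option names taken from the comment above."""
    NO = "no"
    COMBINATORICALLY = "combinatorically"
    ADJACENTLY = "adjacently"
    NON_ADJACENTLY = "non-adjacently"


@dataclass(frozen=True)
class TablePreset:
    name: str
    # None means "all contact orders present in the original data" (assumption).
    contact_orders: Optional[Tuple[int, ...]]
    expanded: Expanded


# Hypothetical registry; the expansion mode of the expanded presets is assumed.
PRESETS: Dict[str, TablePreset] = {
    p.name: p
    for p in (
        TablePreset("original contacts", None, Expanded.NO),
        TablePreset("expanded triplets", (3,), Expanded.COMBINATORICALLY),
        TablePreset("expanded 2-3-4-5-way contacts", (2, 3, 4, 5),
                    Expanded.COMBINATORICALLY),
    )
}


def is_complete_dataset(preset_name: str) -> bool:
    """Check whether a stored preset name denotes the unexpanded original data."""
    return PRESETS[preset_name].expanded is Expanded.NO
```

A downstream program could read the preset name stored in the file and call something like `is_complete_dataset` before plotting all multi-way contacts in a locus.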
Hey!
Nice, thanks @Mittmich! Everything seems good. With regard to keys: let's leave it like this for now, and if we encounter any performance issues, we will change it later.
Implemented the pixel file format (completes the following Trello card) based on the considerations here. The general ideas have been updated in the datastructures IPython notebook.