Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pixel file format #6

Merged
merged 40 commits into from
Nov 11, 2023
Merged

Pixel file format #6

merged 40 commits into from
Nov 11, 2023

Conversation

Mittmich
Copy link
Contributor

Implemented pixel file format (completes the following trello card) based on the considerations here. The general ideas have been updated in the datastructures ipython notebook.

@Mittmich Mittmich requested a review from dmitrymyl September 18, 2023 18:47
setup.py Outdated Show resolved Hide resolved
@cchlanger cchlanger self-requested a review October 31, 2023 19:05
spoc/cli.py Outdated Show resolved Hide resolved
@cchlanger
Copy link
Contributor

I will slightly update the structure of the test directory so it is easier to see which part of the code the test corresponds to.

@cchlanger cchlanger self-requested a review November 2, 2023 10:12
@cchlanger cchlanger requested review from cchlanger and removed request for cchlanger November 2, 2023 10:34
@cchlanger
Copy link
Contributor

cchlanger commented Nov 2, 2023

Should we use the slightly more elegant way for fixtures described here?
https://stackoverflow.com/questions/46089480/pytest-fixtures-redefining-name-from-outer-scope-pylint
Then we do not have use :
# pylint: disable=redefined-outer-name

@cchlanger cchlanger requested review from cchlanger and removed request for cchlanger November 2, 2023 12:33
@cchlanger cchlanger requested review from cchlanger and removed request for cchlanger November 2, 2023 12:54
Copy link
Contributor

@cchlanger cchlanger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my side, no major changes. Minor refactors and text edits have been performed. The fixture question can be addressed in the next PR on snipping.

Copy link
Contributor

@cchlanger cchlanger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my side, no major changes. Minor refactors and text edits have been performed. The fixture question can be addressed in the next PR on snipping.

spoc/io.py Show resolved Hide resolved
Copy link
Contributor

@dmitrymyl dmitrymyl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mittmich thank you for this PR, that's a lot of work and conceptual development! Generally, I'd approve this PR except for minor code-related issues, but I would like to discuss concepts in depth, especially relating to the organisation of the data folder and column names.

Some of the concepts I'd like to cover:

  1. Many things are shared between Pixels and Contacts. Will it make sense to create a parental class with shared methods and then inherit Pixels and Contacts from it?
  2. What's the purpose of ContactsParameters and PixelsParameters? They are convenient to store parameters and enforce data schemas (like dataclass), but they look quirky in FileManager.load_contacts -- usually people specify function parameters with kwargs. Seems a bit non-pythonic to me, but that is also not a big deal here.
  3. Folder structure: I'd like to have a human-readable names of parquet files. It is good for visual inspection of files, developing alternative libraries in other languages and future-proofing against changes in json.dumps and hash. To reduce the complexity of nested folder structure, I propose that all tables within one folder are either Pixels or Contacts, have the same state of global parameters and differ only in contact orders and compositions of labels. This means that Folder doesn't contain all the data from a single experiment, but rather only one transformed view of it.
  4. metadata_combi -- does it contain only sister labels or will it also contain other fragment-level features, like 5mC marks? If not, Shall we distinguish between "molecule" labels (sisterA/sisterB, DNA/RNA, etc) and "fragment" labels (5mC level, open/close chromatin etc.)?

Otherwise, I generally agree with proposed architecture! In my view, it will be great if we discuss the code and our future directions to build a roadmap based on the proposal for the file format and query engine 😉

spoc/contacts.py Outdated Show resolved Hide resolved
spoc/contacts.py Show resolved Hide resolved
spoc/io.py Outdated Show resolved Hide resolved
spoc/io.py Outdated Show resolved Hide resolved
spoc/io.py Outdated
"""
metadata = self._load_metadata(path)
# find matching pixels
for pixel_path, value in metadata.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about why this search is O(N), when it could be O(1). Seems like we cannot hash all PixelParameters. But why? For example, for metadata_combi we could use tuple instead of list, since this collection seems to be immutable by design. Then PixelParameters can be hashed to be dict keys, so _load_metadata will return Dict[PixelParameters: path].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We change Metadata file keys are gloal parameters and values are filenames.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Json dump can't encode parameters as keys as objects are not allowed as keys. We would need to serialize the object to a string and write a custom implemntation for that. Since this search operation is trivial (length of parameters is likely going to be small), I would vote for leaving it as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's leave it like this for now

spoc/io.py Show resolved Hide resolved
spoc/models/file_parameter_models.py Outdated Show resolved Hide resolved
spoc/models/file_parameter_models.py Show resolved Hide resolved
spoc/models/file_parameter_models.py Outdated Show resolved Hide resolved
spoc/pixels.py Show resolved Hide resolved
@cchlanger
Copy link
Contributor

Here is a quick summary of some more points to discuss today:

  • Direct access to Contacts and Pixels or use Filemanager only?
  • Why hashing?
  • When to inherit?

@dmitrymyl
Copy link
Contributor

One thing that I am thinking about after our discussion last week, is related to the flat file structure and global parameters. I would say that the first analytical task that researchers would like to do is to visualize all multi-way contacts (original, not expanded) in some genomic locus. Hence, there should be a way to pull the whole dataset from a Contacts or Pixels container. However, there are no guards implemented or planned that what a users pulls from the container represents the whole dataset. So, who should be responsible for the integrity of the data?

  1. The user, who specifies which container file they generate from Fragments, and we provide a function to do that.
  2. spoc, and we have to add info about the composition of the original dataset (contact orders and labelling layouts) into the container.

I think it should be the first case, but we have to be transparent about that. Yet it feels like this has to be hardcoded into some specific data representation ("this file comprises the whole original dataset") to guard users from their mistakes. Or has it, really?

I would expect a dichotomy in use cases: it's either a complete dataset of original multi-way contacts in a form of either Contacts or Pixels with original coordinates or different bin sizes, or it's expanded contacts of specified order (either doublets, triplets or quadruplets) with different bin sizes (and maybe original coordinates). To somehow specify that I would originally suggest the following strategy:

  1. Hardcode a hierarchical structure for the original dataset of multi-way contacts (can be stored as flat, but virtually pops up as hierarchical).
  2. Expanded contacts are stored in individual containers per contact order.

After our discussion, I would suggest another strategy: specifying "presets" of tables in a container. Say, we have a preset for "original contacts", "expanded triplets", "expanded 2-3-4-5-way contacts", etc. Each preset is generated with a specific method in spoc from Fragments and the name of the preset is stored in the file as well. Thus we hardcode some "flavours" of multi-way contact data. This info can be used for verification by user or by downstream programs.

Also, to have distinctions between original and expanded contacts, we should add another global parameter "expanded" with options "no", "combinatorically", "adjacently", "non-adjancently" (or whatever would be neat and grammatically correct).

What do you think about all that?

@Mittmich Mittmich requested a review from cchlanger November 11, 2023 11:28
@Mittmich
Copy link
Contributor Author

Hey!
I added all the changes that we discussed, with the exception of the switch of metadata keys and values. The reasoning is that json dump can't deal with complex python objects as keys and I would say that the small search speed benefit doesn't justify the implementation of a custom serialiser :)

@dmitrymyl
Copy link
Contributor

Nice, thanks @Mittmich! Everything seems good. With regard to keys — let's leave it like this for now and if we encounter any performance issues, we will change it later.

@Mittmich Mittmich merged commit bd2b6f8 into master Nov 11, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants