Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up and extend normalizations #47

Open
mbuttner opened this issue Sep 6, 2023 · 27 comments
Open

Speed up and extend normalizations #47

mbuttner opened this issue Sep 6, 2023 · 27 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@mbuttner
Copy link
Collaborator

mbuttner commented Sep 6, 2023

Current normalization tools in pytometry use the FlowKit implementation and are FlowJo input compatible, but could use a speed up, especially the functions for logicle and biexponential transformations. The Cytotransform package implements commonly used transformations and is much faster. An adaptation of these functions to the anndata format would be amazing.

@mbuttner mbuttner added enhancement New feature or request help wanted Extra attention is needed labels Sep 6, 2023
@burtonrj
Copy link
Collaborator

burtonrj commented Sep 7, 2023

Cytotransform uses the FastLogicle C++ implementation from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4761345/. How do you feel about making cytotransform a dependency of pytometry? Might be easier to manage rather than ingesting the C++ code into this repo?

It also accepts a Pandas DataFrame as input so a wrapper for anndata should be easy to create. I'm happy to pick this one up.

@mbuttner
Copy link
Collaborator Author

mbuttner commented Sep 7, 2023

I'm happy to make CytoTransform a dependency of pytometry. The wrapper should be relatively easy because most data matrices for flow data are dense. My only concern is that the pandas DataFrame is less memory efficient than a numpy 2D matrix, which might become problematic as soon as you use 10M+ cells. So I think that using the 2D numpy matrix as input would be preferable, especially when the C++ implementation will convert the input to a 2D array anyways.

@burtonrj
Copy link
Collaborator

That makes perfect sense. It will be a simple refactor to cytotransform to fix this since under the hood it acts on the numpy array of the pandas dataframe anyway. I'll refactor cytotransform base class to accept a dataframe or numpy array and skip wrangling the output into a dataframe when numpy arrays are provided as input - see burtonrj/cytotransform#5

@whitews
Copy link

whitews commented Sep 21, 2023

Hi Maren and Ross,

It's great to see the Python cytometry ecosystem growing. However, I'm concerned that we may be unnecessarily fracturing our efforts. It seems we have 3 different implementations of transformations here. Perhaps there is a way to work together?

Kind regards,
Scott

@mbuttner
Copy link
Collaborator Author

Hi Scott,
thanks for reaching out! The current normalization implementation in pytometry largely follow your implementation in the FlowKit package, and I respect your concern. Happy to talk and join forces.

Best,
Maren

@whitews
Copy link

whitews commented Sep 22, 2023

Maren,

Most transformations in FlowKit are modeled after the GatingML 2.0 specification, and their implementation is defined in the FlowUtils package. The exception is the FlowJo biex transformation, as you've discovered. The history of FlowUtils comes from inheriting the original fcm code base, which I broke apart with the hope that the individual parts would be more useful to others.

I realize the requirements / needs for projects may differ, but the transformation functions are one of those rare, well-defined requirements that seem better served by many eyes on one set of implementations. I can move the FJ biex xform to FlowUtils if that would help. The reason I kept it separate was the added dependency of SciPy, where FlowUtils has the singular dependency of NumPy.

And, I see Ross has parallelized some of the other xforms in the CytoTransform package. I also thought about doing this as the parallelization for transforming FCM events is obvious, but multiprocessing (mp) in Python has a double-edge. Nesting mp in Python is "here be dragons" territory. I use mp in FlowKit for processing multiple FCS samples using the same gating strategy, with the rationale that a user would unlikely mp different gating strategies. Adding mp to the transformations that are necessary for the gating functions threw a wrench in the works. Still, it is an interesting idea Ross has for making the xform classes optionally mp.

Enough rambling by me, curious what you both think and what ideas you have.

-Scott

@burtonrj
Copy link
Collaborator

Hi guys!

Thanks for the input Scott! Much appreciated and I too agree transformations is an area where we should all be aligned regardless of what else we are doing - maybe also on things like compensations, various utilities, gate definitions, but we can discuss that another time - then we should allow the user some choice around what frameworks they want to work in. Maren is doing great work expanding cytometry into the scanpy ecosystem and I am right behind her on that! In fact I even hope to retire CytoPy and point people this direction.

As for cytotransform, I took inspiration from your work Scott and tried to make it GatingML 2.0 compatible and also performing testing against the GatingML 2.0 definitions (https://github.com/burtonrj/cytotransform/blob/main/tests/test_transformer.py). The mp approach is an ambitious attempt to brute force transforms on enormous datasets....but mp in Python is a swamp I will agree (hopefully to be fixed in 3.13 onwards https://peps.python.org/pep-0703/). Recognising that mp is not always desirable, cytotransform will just execute the function in a single process if n_jobs is 0 or 1. I've also made this the default behaviour if you have less than 10000 events....maybe it should just be the default behaviour anyway? MP should be an explicit choice? Interested to hear peoples thoughts on this.

@burtonrj
Copy link
Collaborator

Oh, also I've had far less installation woes with the FastLogicle.cpp implementation coupled with PyBind11. I haven't compared the performance of FastLogicle vs your C implementation though @whitews, not sure which is faster or more precise.

@mbuttner
Copy link
Collaborator Author

mbuttner commented Oct 2, 2023

Hi there,

thank you all for your suggestions. I might be more taciturn, but I enjoy the discussion.

  1. About multiprocessing:
    I had some good experience with multiprocessing in Python on our Linux cluster so far, but I do understand that this might not be generalizable to other architectures. So I agree with @whitews that we should disable multiprocessing for normalization by default and leave it to the user - as @burtonrj stated "make it an active choice".

  2. Normalization functions as defined in FlowUtils/FlowKit:
    I agree that the definitions of the data transformation leaves no room for variation. I do realize that the current implementation of the arcsinh does not quite follow the GatingML 2.0 standard, so that should be adapted. When I looked into FlowKit, I realized that pytometry follows much less the object-oriented coding paradigm as FlowKit does. That is why I decided to adapt these transformations to fit with the anndata format and the scanpy coding style instead of relying on FlowUtils/FlowKit directly. That results into a parallel development, obviously, but I haven't figured out a more elegant way to do this (or at least I didn't when I first implemented the functions).
    Second, I extended the arcsinh and biExp transform functions such that every channel can be transformed individually.That's something that FlowJo does for the biExp function and which is a common practice for spectral flow data. My concern is that when we decide to rely more on FlowUtils, we might be less flexible to respond to user requests to extend the functionality (see issue #54). I'm leaning more towards keeping the pytometry package separate from FlowKit and follow through with @burtonrj's implementations in cytoTransform.

@whitews
Copy link

whitews commented Oct 3, 2023

Maren & Ross,

It sounds like we are pretty much on the same page. What if I move the transform classes from FlowKit to FlowUtils, along with moving the FlowJo Biex implementation there. This adds SciPy as a dependency, which is why it was not done previously, but does put all the transforms in one place.

Ross, would you be available to help maintain FlowUtils if we did this? We could add the optional mp as well. I do feel that the more eyes we have on these critical implementations the better, inline with the Torvalds quote "given enough eyeballs, all bugs are shallow".

One other point is that the Sample class in FlowKit does allow for a dictionary of transforms mapped to individual channels. The Gate classes are also designed to be used independently from the heavy Session & Workspace classes. Not sure if it was clear, but FlowKit is more or less feature complete. It will never expand to have clustering algos, a GUI, etc. This was a decision made early on after being a bit stung by the algo chase building ReFlow.

@mbuttner
Copy link
Collaborator Author

mbuttner commented Oct 3, 2023

Hi Scott,

that sounds good to me. It would be great if the FlowJo Biexp implementation would also be in FlowUtils. pytometry has the scipy dependency already, so that doesn't change much. I guess that I have to spend more time with FlowKit to fully understand its functionality and implementation - thank you for pointing this out 🤷‍♀️
@burtonrj you'll be making the call as this decision will affect you the most. If yes, this issue turns into "inherit FlowUtils normalizations".

@burtonrj
Copy link
Collaborator

burtonrj commented Oct 4, 2023

Hi all,

I'm happy with this call, I was going to have to maintain cytotransform anyway so it makes no difference to me, other than I won't be maintaining it on my own - completely agree with you @whitews, the more eyes the better!

Let me know next steps RE changes to FlowUtils.

@whitews
Copy link

whitews commented Oct 5, 2023

Great, I've started a discussion on the FlowUtils repo here:
whitews/FlowUtils#11

We can plan the changes there (and avoid spamming this issue :)

@whitews
Copy link

whitews commented Oct 8, 2023

Bumping this as I plan to release new versions of FlowUtils and FlowKit soon to support Python 3.12. I've discussed the proposed changes with my colleagues at Duke and have the go ahead. We are also considering moving the base Gate classes to FlowUtils, as these are also well-defined and would likely be helpful to others...though there are some implementation details to be worked out, particularly around the ratio dimension.

@alefrol638
Copy link

Do you also plan to add autologicle normalisation, like used in the R package cytofkit (cytofkit::autoLgcl, https://github.com/JinmiaoChenLab/cytofkit/blob/master/R/cytof_preProcess.R)?

@burtonrj
Copy link
Collaborator

I haven't personally looked into parameter optimisation for logicle transform, I've always left parameters as options for the user.

I would say we could incorporate this but first we need to develop a consensus in methodologies which (as discussed above) appears to be FlowUtils

@mbuttner
Copy link
Collaborator Author

mbuttner commented May 7, 2024

Do you also plan to add autologicle normalisation, like used in the R package cytofkit (cytofkit::autoLgcl, https://github.com/JinmiaoChenLab/cytofkit/blob/master/R/cytof_preProcess.R)?

@alefrol638 The autologicle normalization function is now implemented in pytometry v.0.1.5

@grst
Copy link
Collaborator

grst commented May 8, 2024

I would be in favor of unifying normalization functions between the different ecosystems, which if I followed this discussion correctly, means relying on FlowUtils. In that I think it would be good if someone beyond @whitews had commit/release rights on that repo to avoid a single point of failure.

About multiprocessing

I think multiprocessing is important when moving towards datasets with 10s to 100s of millions of cells. I agree that Python multiprocessing is not perfect. Ideally the C functions can be paralellized (where applicable), or numba could be an option for this kind of code. I also made good experiences with replacing python multiprocessing with joblib.Parallel which is more robust and supports different backends, including dask.

@whitews
Copy link

whitews commented May 8, 2024

@grst

I agree on unifying the well-defined components like the transforms and that it is not ideal to have such a critical repo with a single person. Equally important IMO is the stability of such a well-defined codebase, i.e. there's no need to change it unless something is broken.

On that point, I did make a local branch of FlowUtils with class-based transforms (something mentioned earlier in the thread). In the end, I decided against it. It was awkward to have an end user of FlowKit use those "pass through" classes, as the docstrings are in a second tier import. Seems easy enough to wrap the transform functions in a class as close to the end user as possible. The important part is to unify the underlying math of the transforms.

One other point, related to #65, is FlowUtils already has inverse functions for all the transforms. There are also a few other functions, including compensation and a GatingML compatible points_in_polygon and points_in_ellipsoid that might be useful.

For optimization, I think it would be useful to define some requirements / expectations prior to any implementation. Some data sets and benchmarks would be good as part of that as well. There are several ways to implement this, each with pros and cons. A good case can be made that parallelization could be a downstream responsibility, it is rather easy to batch transforms per sample or even on a more granular level.

Just a proposal, and I'd need to verify this on my end, but what if we move FlowUtils to scverse? I'd be in support of this if we can agree to keep it stable and focused, avoiding breaking changes for downstream packages such as pytometry and FlowKit.

@grst
Copy link
Collaborator

grst commented May 13, 2024

Equally important IMO is the stability of such a well-defined codebase, i.e. there's no need to change it unless something is broken

Fully agree.

Just a proposal, and I'd need to verify this on my end, but what if we move FlowUtils to scverse? I'd be in support of this if we can agree to keep it stable and focused, avoiding breaking changes for downstream packages such as pytometry and FlowKit.

I'd be up for that, but this is something I would like to ask in the core team first. I'll put it on the agenda for next week.

For optimization, I think it would be useful to define some requirements / expectations prior to any implementation. Some data sets and benchmarks would be good as part of that as well.

We now have some infrastructure for continuous benchmarking in place for scverse repos. See in action e.g. here:
scverse/scanpy#3031 (comment). This would be another advantag of moving it to the scverse org.

Wherever it makes sense, I would be slightly in favor of implementing paralellization in the central repo. As discussed above, it's trivial to do parallelization, but doing it well might not be that easy.

@flying-sheep
Copy link
Member

If you need help with the benchmark setup, ask @ilan-gold or me. It’s easy to set up but you might need help with permissions. It’s also currently in alpha state so expect it to break sometimes.

@grst
Copy link
Collaborator

grst commented May 21, 2024

Hi @whitews,

core team is fine with moving FlowUtils to scverse. One concern that came up is that the release process is well documented (we don't have a lot of experience with packages that have C extensions) and ideally automated via GitHub actions.

So we all agree that this is the way forward and pytometry will replace its normalization code with FlowUtils? @mbuttner @quentinblampey?

@mbuttner
Copy link
Collaborator Author

Hi @grst

Thank you for the update. I am happy with the suggested action to replace pytometry's normalization code with FlowUtils.

@quentinblampey
Copy link
Contributor

Hello @grst, I'm also fine with this! One small question though: since the Python package is called flowutils, should we rename the GitHub repository as flowutils as well? I think this would also make more sense regarding python naming conventions

@whitews
Copy link

whitews commented May 22, 2024

@grst OK, so how would this work? Would I remain the lead on FlowUtils?

@grst
Copy link
Collaborator

grst commented May 22, 2024

Of course! You remain (co-)owner of the repository. You can also add a section to the readme to make clear you are the lead developer.

I guess @quentinblampey would get commit rights, too. Additionally all scverse core developers will have committ rights in theory, but not use them without your approval as long as you are reachable.

@grst
Copy link
Collaborator

grst commented May 22, 2024

To initiate the transfer it's probably easiest if you make me an owner of the flowutils repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

7 participants