Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using Pint for units #1483

Open
ilan-gold opened this issue Apr 29, 2024 · 9 comments
Open

Using Pint for units #1483

ilan-gold opened this issue Apr 29, 2024 · 9 comments

Comments

@ilan-gold
Copy link
Contributor

Please describe your wishes and possible alternatives to achieve the desired result.

Continuing the discussion from #1481, we are considering Pint as both an in-memory data structure and something serializable.

It seems there are a few things that would have to be worked out?

  1. Is there a global "AnnData registry" so that objects are interoperable with one another? Could a context manager e.g., from settings help override behavior as needed?
  2. What would the relationship (if any) of this be to SpatialData
  3. What should the behavior of operations like concat be which involve more than one object?

cc @mruffalo @ivirshup @flying-sheep

@ilan-gold
Copy link
Contributor Author

I would like to add here that I could see this as an enhancement for spatialdata. The two need not be opposed. I think it's super cool to be able to see units on arrays and is super useful for obvious readability reasons.

@ilan-gold
Copy link
Contributor Author

In terms of the issues at play, I don't think merging two different things with Pint-typed arrays i.e., in obsm should do anything i.e., harmonization of coordinates for different omics experiments. I am not even sure the registries need to be compatible. To this end, I think it could be worth doing some sort of versioning of the registries as well (even though they are instantiated). The only reason I could see them changing is if we introduce something like contexts for conversions across domains, although it seems unlikely we would use this.

@Zethson
Copy link
Member

Zethson commented May 2, 2024

This is also a use-case for ehrapy, where we're constantly dealing with various units for measurements

@mruffalo
Copy link

mruffalo commented May 2, 2024

Thanks very much @ilan-gold!

I don't have much of an opinion about SpatialData yet, nor am I too familiar with its transformation methods and coordinate systems. If it isn't already using Pint, I'd expect that to be a significant improvement there also.

For your other items:

1: It's probably no surprise, but I advocate for the approach I took in #1481, in which all AnnData objects instantiated by the package use the same UnitRegistry object stored in some easily-accessible place. Users could adjust that registry's defaults, add new units, whatever they'd like. In my opinion this is the best way to ensure that users see the addition of physical units as an enhancement, making it convenient to handle physical quantities properly -- rather than an obstacle, especially when loading multiple AnnData objects from disk for integrative analysis.

3: anndata.concat already concatenates matrices in .obsm when present in multiple objects, with potentially-meaningless results for things like PCA or UMAP coordinates. If this is to be reconsidered, I strongly believe this should be done independently of adding support for physical units.

I consider this code snippet, included in #1481 and duplicated here for convenience, to be purely an improvement:

In [1]: import numpy as np

In [2]: from pint import UnitRegistry

In [3]: reg = UnitRegistry()

In [4]: coords_1 = np.array([[1, 2], [2, 1]]) * reg.mm

In [5]: coords_1
Out[5]: 
array([[1, 2],
       [2, 1]]) <Unit('millimeter')>

In [6]: coords_2 = np.array([[0, 3], [4, 0]]) * reg.um

In [7]: coords_2
Out[7]: 
array([[0, 3],
       [4, 0]]) <Unit('micrometer')>

In [8]: np.vstack([coords_1, coords_2])
Out[8]: 
array([[1.   , 2.   ],
       [2.   , 1.   ],
       [0.   , 0.003],
       [0.004, 0.   ]]) <Unit('millimeter')>

@mruffalo
Copy link

Any other stakeholders, or opinions about the implementation in the draft PR?

Does anyone else agree or disagree that these items should be considered separately:

  • Adding support for physical units
  • Disallowing concatenation of matrices in .obsm

I think a global unit registry instantiated by anndata makes it much easier to use units and potentially compare across AnnData objects, but the documentation will need to be clear about the existence and intended usage of anndata.units.

@ilan-gold
Copy link
Contributor Author

ilan-gold commented May 22, 2024

Disallowing concatenation of matrices in .obsm

After reading your above thoughts, and also thinking about it a bit, I don't think this is such a big deal. Allowing concatenation is fine, because people have to opt-in to using units (and there is already some meaninglessness in the operation on occasion as you mention).

I am more curious what perhaps should happen if one array has units and the other doesn't. What's the result then? I think this should be resolved in the PR.

@ilan-gold
Copy link
Contributor Author

I am more curious what perhaps should happen if one array has units and the other doesn't. What's the result then? I think this should be resolved in the PR.

Independent of whether we even allow the operation, this problem would need to be resolved.

@mruffalo
Copy link

Thanks for the feedback @ilan-gold!

Combining arrays with and without units is not allowed -- Pint throws an exception.

This:

>>> import numpy as np
>>> from pint import UnitRegistry
>>> reg = UnitRegistry()
>>> a1 = np.array([[1, 2], [3, 4]])
>>> a2 = np.array([[0, 1], [2, 3]]) * reg.mm
>>> np.vstack([a1, a2])

results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 180, in vstack

... uninformative stack frames omitted ...

  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/numpy/numpy_func.py", line 86, in <listcomp>
    return [convert_arg(item, pre_calc_units) for item in arg]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/numpy/numpy_func.py", line 93, in convert_arg
    raise DimensionalityError("dimensionless", pre_calc_units)
pint.errors.DimensionalityError: Cannot convert from 'dimensionless' to 'millimeter'

@ilan-gold
Copy link
Contributor Author

Great @mruffalo - I think you have answered most of my questions then. I think I have a few comments open on the PR. Another thing I'd be curious about is how this works with dask and scipy sparse. Probably want to add an i/o test there for that and/or try out locally something like to_zarr from dask. These aren't really concerns, more just things we should be clear about to end-users. I would be fine moving back over to the PR now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants