
DataFrame reduction methods broken #174

Closed
coroa opened this issue May 12, 2023 · 15 comments
@coroa
Contributor

coroa commented May 12, 2023

With the versions:

In [20]: pint_pandas.show_versions()
{'numpy': '1.24.2',
 'pandas': '1.5.3',
 'pint': '0.20.1',
 'pint_pandas': '0.4.dev32+gc58a7fc'}

I am unable to run any dataframe-wise aggregation functions along axis=0:

In [19]: pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()
/Users/coroa/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/blocks.py:369: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  res_values = np.array([[result]])
---------------------------------------------------------------------------
DimensionalityError                       Traceback (most recent call last)
File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pint/facets/plain/quantity.py:702, in PlainQuantity.__int__(self)
    701     return int(self._convert_magnitude_not_inplace(UnitsContainer()))
--> 702 raise DimensionalityError(self._units, "dimensionless")

DimensionalityError: Cannot convert from 'meter' to 'dimensionless'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11797, in NDFrame._add_numeric_operations.<locals>.sum(self, axis, skipna, level, numeric_only, min_count, **kwargs)
  11777 @doc(
  11778     _num_doc,
  11779     desc="Return the sum of the values over the requested axis.\n\n"
   (...)
  11795     **kwargs,
  11796 ):
> 11797     return NDFrame.sum(
  11798         self, axis, skipna, level, numeric_only, min_count, **kwargs
  11799     )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11501, in NDFrame.sum(self, axis, skipna, level, numeric_only, min_count, **kwargs)
  11492 def sum(
  11493     self,
  11494     axis: Axis | None = None,
   (...)
  11499     **kwargs,
  11500 ):
> 11501     return self._min_count_stat_function(
  11502         "sum", nanops.nansum, axis, skipna, level, numeric_only, min_count, **kwargs
  11503     )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11483, in NDFrame._min_count_stat_function(self, name, func, axis, skipna, level, numeric_only, min_count, **kwargs)
  11467     warnings.warn(
  11468         "Using the level keyword in DataFrame and Series aggregations is "
  11469         "deprecated and will be removed in a future version. Use groupby "
   (...)
  11472         stacklevel=find_stack_level(),
  11473     )
  11474     return self._agg_by_level(
  11475         name,
  11476         axis=axis,
   (...)
  11480         numeric_only=numeric_only,
  11481     )
> 11483 return self._reduce(
  11484     func,
  11485     name=name,
  11486     axis=axis,
  11487     skipna=skipna,
  11488     numeric_only=numeric_only,
  11489     min_count=min_count,
  11490 )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/frame.py:10856, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  10852 ignore_failures = numeric_only is None
  10854 # After possibly _get_data and transposing, we are now in the
  10855 #  simple case where we can use BlockManager.reduce
> 10856 res, _ = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
  10857 out = df._constructor(res).iloc[0]
  10858 if out_dtype is not None:

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/managers.py:1569, in BlockManager.reduce(self, func, ignore_failures)
   1567 res_blocks: list[Block] = []
   1568 for blk in self.blocks:
-> 1569     nbs = blk.reduce(func, ignore_failures)
   1570     res_blocks.extend(nbs)
   1572 index = Index([None])  # placeholder

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/blocks.py:369, in Block.reduce(self, func, ignore_failures)
    365     raise
    367 if self.values.ndim == 1:
    368     # TODO(EA2D): special case not needed with 2D EAs
--> 369     res_values = np.array([[result]])
    370 else:
    371     res_values = result.reshape(-1, 1)

ValueError: setting an array element with a sequence.

Similarly, df.min(), df.max(), and so on fail.

This might be connected with:
hgrecco/pint#1128 (comment)

@andrewgsavage
Collaborator

Can this be closed now? Or we could leave it open until df.min(), df.max(), etc. are tested.

@coroa
Contributor Author

coroa commented May 15, 2023

No, this is still broken under pandas~=1.5 as well as pandas~=2.0. I don't think anything short of a change in the __array__ or __array_function__ implementation of pint would help here.

@andrewgsavage
Collaborator

andrewgsavage commented May 17, 2023

What versions was this last working with? I am unsure whether it was a pandas change or a pint change that caused this.
I'm leaning towards a pint change, as I don't see anything obvious in pandas that would cause it. We could also try implementing __array_function__ on PintArray.

In Block.reduce, pandas 1.4.x has a try/except clause:
https://github.com/pandas-dev/pandas/blob/1.4.x/pandas/core/internals/blocks.py#L413
which is also present in 1.5.x:
https://github.com/pandas-dev/pandas/blob/1.5.x/pandas/core/internals/blocks.py#L360
but not in 2.0:
https://github.com/pandas-dev/pandas/blob/2.0.x/pandas/core/internals/blocks.py#L340
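For reference, the __array_function__ protocol suggested above lets an array-like intercept NumPy functions such as np.sum so the unit survives instead of being stripped. A minimal sketch with a hypothetical stand-in class (illustrative names only, not the pint-pandas API):

```python
import numpy as np

# Hypothetical stand-in for a unit-carrying array; this illustrates the
# NumPy __array_function__ dispatch protocol, not the actual PintArray.
class UnitArray:
    def __init__(self, magnitudes, unit):
        self.magnitudes = np.asarray(magnitudes, dtype=float)
        self.unit = unit

    def __array_function__(self, func, types, args, kwargs):
        # Intercept np.sum so the result keeps its unit instead of being
        # downcast to a bare ndarray.
        if func is np.sum:
            return UnitArray([np.sum(self.magnitudes)], self.unit)
        return NotImplemented

total = np.sum(UnitArray([0, 1, 2], "m"))
print(total.magnitudes[0], total.unit)  # 3.0 m
```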

@coroa
Contributor Author

coroa commented May 23, 2023

Note that it is failing outside the try-except block. result is produced successfully (as a scalar Quantity), but wrapping it in np.array([[result]]) then raises at https://github.com/pandas-dev/pandas/blob/778ab8234059059d77eb5d71442c959f8d15c1e0/pandas/core/internals/blocks.py#L369. I do not know how or if this worked previously.

@andrewgsavage
Collaborator

I tried several versions, none of which worked:

# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.21', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.3.0', 'pint': '0.21', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.19', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.19', 'pint_pandas': '0.3'} fails

I note that running pd.DataFrame([[0, 1, 2]]).astype("pint[m]").sum() works

@andrewgsavage
Collaborator

Looks like pandas is working on this already and should be in the next version!
pandas-dev/pandas#52788

@MichaelTiemannOSC
Collaborator

MichaelTiemannOSC commented Jul 23, 2023

On it... this has one extraneous change (preserve dtype when converting NA to an array of Quantities), but I offer these as essentially the minimal changes needed to work with the upcoming Pandas 2.1. (This won't work with any Pandas < 2.1.)

I will be bundling this with a PR that's already slated to land in Pint 0.23 (along with other PRs in Pandas, Pint-Pandas, and uncertainties). But if you want to apply and test it by itself...

(itr_env) iMac-Pro:pint-pandas michael$ git diff
diff --git a/pint_pandas/pint_array.py b/pint_pandas/pint_array.py
index a445b77..4605704 100644
--- a/pint_pandas/pint_array.py
+++ b/pint_pandas/pint_array.py
@@ -208,8 +208,8 @@ dtypemap = {
     float: pd.Float64Dtype(),
     np.float64: pd.Float64Dtype(),
     np.float32: pd.Float32Dtype(),
-    np.complex128: pd.core.dtypes.dtypes.PandasDtype("complex128"),
-    np.complex64: pd.core.dtypes.dtypes.PandasDtype("complex64"),
+    np.complex128: pd.core.dtypes.dtypes.NumpyEADtype("complex128"),
+    np.complex64: pd.core.dtypes.dtypes.NumpyEADtype("complex64"),
     # np.float16: pd.Float16Dtype(),
 }
 dtypeunmap = {v: k for k, v in dtypemap.items()}
@@ -520,7 +520,10 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
                 # magnitude is in fact an array scalar, which will get rejected by pandas.
                 fill_value = fill_value[()]
 
-        result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            # Turn off warning that PandasArray is deprecated for ``take``
+            result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
 
         return PintArray(result, dtype=self.dtype)
 
@@ -990,7 +993,7 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
         qtys = [
             self._Q(item, self._dtype.units)
             if item is not self.dtype.na_value.m
-            else item
+            else self.dtype.na_value
             for item in self._data
         ]
         with warnings.catch_warnings(record=True):
@@ -1048,7 +1051,7 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
             value = [item.to(self.units).magnitude for item in value]
         return arr.searchsorted(value, side=side, sorter=sorter)
 
-    def _reduce(self, name, **kwds):
+    def _reduce(self, name, *, skipna: bool = True, keepdims: bool = False, **kwds):
         """
         Return a scalar result of performing the reduction operation.
 
@@ -1092,14 +1095,18 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
 
         if isinstance(self._data, ExtensionArray):
             try:
-                result = self._data._reduce(name, **kwds)
+                result = self._data._reduce(name, skipna=skipna, keepdims=keepdims, **kwds)
             except NotImplementedError:
                 result = functions[name](self.numpy_data, **kwds)
 
         if name in {"all", "any", "kurt", "skew"}:
             return result
         if name == "var":
+            if keepdims:
+                return PintArray(result, f"pint[({self.units})**2]")
             return self._Q(result, self.units**2)
+        if keepdims:
+            return PintArray(result, self.dtype)
         return self._Q(result, self.units)
 
     def _accumulate(self, name: str, *, skipna: bool = True, **kwds):
diff --git a/pint_pandas/testsuite/test_issues.py b/pint_pandas/testsuite/test_issues.py
index c288e71..384f84f 100644
--- a/pint_pandas/testsuite/test_issues.py
+++ b/pint_pandas/testsuite/test_issues.py
@@ -202,3 +202,16 @@ def test_issue_139():
     assert np.all(a_m[0:4] == a_cm[0:4])
     for x, y in zip(a_m[4:], a_cm[4:]):
         assert unp.isnan(x) == unp.isnan(y)
+
+class TestIssue174(BaseExtensionTests):
+    def test_sum(self):
+        a = pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]")
+        row_sum = a.sum(axis=0)
+        expected_1 = pd.Series([3, 5, 7], dtype="pint[m]")
+
+        self.assert_series_equal(row_sum, expected_1)
+
+        col_sum = a.sum(axis=1)
+        expected_2 = pd.Series([3, 12], dtype="pint[m]")
+
+        self.assert_series_equal(col_sum, expected_2)
diff --git a/pint_pandas/testsuite/test_pandas_extensiontests.py b/pint_pandas/testsuite/test_pandas_extensiontests.py
index 7c92751..a638c5b 100644
--- a/pint_pandas/testsuite/test_pandas_extensiontests.py
+++ b/pint_pandas/testsuite/test_pandas_extensiontests.py
@@ -523,7 +523,7 @@ class TestArithmeticOps(base.BaseArithmeticOpsTests):
                 divmod(s, other)
 
     def _get_exception(self, data, op_name):
-        if data.data.dtype == pd.core.dtypes.dtypes.PandasDtype("complex128"):
+        if data.data.dtype == pd.core.dtypes.dtypes.NumpyEADtype("complex128"):
             if op_name in ["__floordiv__", "__rfloordiv__", "__mod__", "__rmod__"]:
                 return op_name, TypeError
         if op_name in ["__pow__", "__rpow__"]:
@@ -627,6 +627,10 @@ class TestNumericReduce(base.BaseNumericReduceTests):
             expected = expected_m
         assert result == expected
 
+    @pytest.mark.skip("tests not written yet")
+    def check_reduce_frame(self, ser: pd.Series, op_name: str, skipna: bool):
+        pass
+
     @pytest.mark.parametrize("skipna", [True, False])
     def test_reduce_scaling(
         self, data, all_numeric_reductions, skipna, USE_UNCERTAINTIES

MichaelTiemannOSC added a commit to MichaelTiemannOSC/pint-pandas that referenced this issue Jul 23, 2023
Issue hgrecco#174 reports that DataFrame reduction was broken by the latest Pint-Pandas changes.  This commit adapts Pint-Pandas to work with upcoming Pandas 2.1, currently scheduled for release Aug 20, 2023.

Signed-off-by: Michael Tiemann <[email protected]>
@topper-123
Contributor

Just a small note that pandas-dev/pandas#52788 has been merged in pandas. I think that should help fix this issue in pint-pandas?

@topper-123
Contributor

I agree this should be possible in v2.1 (not released yet), but pint_pandas still needs to implement a keepdims parameter on PintArray._reduce to make it happen.

I could take a stab at a PR for this if I can get someone to review the PR. I assume the tests should be added to pint_pandas/testsuite/test_pandas_extensiontests.py in e.g. a new TestReduction class?
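The keepdims contract introduced by pandas-dev/pandas#52788 can be sketched with a toy extension-array stand-in (illustrative names, not the real pint-pandas implementation): with keepdims=True, _reduce returns a length-1 array instead of a scalar, which is what DataFrame-level reductions need so per-block results can be stacked.

```python
import numpy as np

# Toy stand-in (not the real PintArray) sketching the pandas >= 2.1
# ExtensionArray._reduce signature with the new keepdims parameter.
# skipna handling is omitted for brevity.
class ToyArray:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def _reduce(self, name, *, skipna=True, keepdims=False, **kwds):
        result = getattr(np, name)(self.data)  # e.g. np.sum, np.min
        if keepdims:
            # DataFrame reductions: preserve array-ness for block stacking
            return ToyArray([result])
        # Series reductions: plain scalar
        return result

arr = ToyArray([0, 1, 2])
print(arr._reduce("sum"))                      # 3.0
print(arr._reduce("sum", keepdims=True).data)  # [3.]
```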

@MichaelTiemannOSC
Collaborator

Hey, that was me 😄

PR #140 has the keepdims changes. You could split them out and merge them independently. The TestReductions class (note the 's' at the end) does need to be incorporated. It looks like the Pandas base test case makes up its own data, which is not as trivial as data, so a TestReductions class in Pint-Pandas would take some work and would be welcome!

@andrewgsavage
Collaborator

Yeah, go for it.
It would be good to split Michael's changes out into separate PRs to make them easier to review.

I was holding out for an rc (which doesn't look like it's happening) before taking a look, as otherwise it takes ages for the CI to build pandas.

@MichaelTiemannOSC
Collaborator

I've split out the changes here: #196

@MichaelTiemannOSC
Collaborator

Pandas 2.1.0rc0 is now available. Could be a good time to sync up with our own release candidates.

bors bot added a commit that referenced this issue Aug 12, 2023
195: fix pandas v2.1 NumpyEADtype issue r=andrewgsavage a=topper-123

- [x] Closes #194
- [x] Executed `pre-commit run --all-files` with no errors
- [x] The change is fully covered by automated unit tests
- [ ] Documented in docs/ as appropriate
- [ ] Added an entry to the CHANGES file

Precursor to working on #174.


Co-authored-by: Terji Petersen <[email protected]>
@andrewgsavage
Collaborator

Closed by #195:

pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()
0    3.0
1    5.0
2    7.0
dtype: pint[meter]

@coroa
Contributor Author

coroa commented Aug 15, 2023

Amazing, thanks @topper-123 !
