Huge performance penalty with pd.arrays.SparseArray
#228
On the other hand, I saw that reading the definitions file is cached in the latest pint release. Can you try the different construction methods? It would be good to add a note in the docs suggesting the most performant one.
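For reference, a minimal sketch of opting into the definitions cache. This assumes a recent pint release (0.20+), where `UnitRegistry` accepts a `cache_folder` argument; `":auto:"` picks a platform-appropriate cache directory and may require an extra dependency:

```python
import pint

# Reuse a cached, pre-parsed copy of the unit definitions across runs
# instead of re-parsing the definitions file on every registry creation.
ureg = pint.UnitRegistry(cache_folder=":auto:")
```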
You can also use the SparseArray as the magnitude of the PintArray:

```python
import numpy as np
import pandas as pd
import pint_pandas

M = 1_000  # repetition count for the sample data
sa = pd.arrays.SparseArray([1, 2, 3] * M, fill_value=np.nan, dtype=np.float64)
pa = pint_pandas.PintArray(sa, dtype="pint[rpm]")
type(pa.data)  # inspect what the magnitude is stored as
```

If you want better support for storing data in SparseArrays or other Arrays, do comment in #192.
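To illustrate, a self-contained usage sketch based on the snippet above (the `speed` column name is just an example). Per the comment above, the sparse magnitude should survive being placed in a DataFrame:

```python
import numpy as np
import pandas as pd
import pint_pandas

# Build a PintArray whose magnitude is a SparseArray, then wrap it in a DataFrame.
sa = pd.arrays.SparseArray([1.0, 2.0, 3.0], fill_value=np.nan)
pa = pint_pandas.PintArray(sa, dtype="pint[rpm]")

df = pd.DataFrame({"speed": pa})
print(df.dtypes)                      # speed    pint[rpm]
print(type(df["speed"].values))       # the PintArray itself
print(type(df["speed"].values.data))  # the underlying magnitude container
```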
Sure, I did a quick benchmark test. Ordered from quickest to slowest (test_series is just the standard `pd.Series` construction, as a baseline). Code used:

```python
# Requires pytest, pytest-benchmark + pint-related dependencies
import numpy as np
import pandas as pd
import pint_pandas
import pytest

PA_ = pint_pandas.PintArray
ureg = pint_pandas.PintType.ureg
Q_ = ureg.Quantity


@pytest.fixture
def M() -> int:
    return 1_000


def test_series(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0] * M, dtype=np.float64)}))


def test_A(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"A": pd.Series([0] * M, dtype="pint[m]")}))


def test_B(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"B": pd.Series([0] * M).astype("pint[m]")}))


def test_C(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"C": PA_([0] * M, dtype="pint[m]")}))


def test_D(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"D": PA_([0] * M, dtype="m")}))


def test_E(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"E": PA_([0] * M, dtype=ureg.m)}))


def test_F(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"F": PA_.from_1darray_quantity(Q_([0] * M, ureg.m))}))


def test_G(M: int, benchmark) -> None:
    benchmark(lambda: pd.DataFrame({"G": PA_(Q_([0] * M, ureg.m))}))
```
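To reproduce, save the snippet as, say, `test_bench.py` (the file name is arbitrary, but pytest needs the `test_` prefix) and run `pytest test_bench.py`; pytest-benchmark collects the timings and prints a comparison table at the end of the run, sorted by minimum time by default.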
I would like to use this tool, but such a big performance penalty makes it unusable with big sparse arrays. Below is the code I use to benchmark the issue. I usually work with DataFrames with over 1M columns, but the benchmark uses just 100k. Although the same tendency can also be seen with `pd.Series` (as expected), sparse arrays suffer a much bigger performance hit. The output of `pyinstrument` suggests the main penalty is creating a list in which N `Quantity` objects are constructed. I tried some quick alternatives but did not manage to find anything that does not break. Any ideas? Wouldn't it be possible to just create a NumPy array with one assigned quantity instead of N `Quantity` objects? Thanks!
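To make the question concrete, here is a minimal sketch (using plain pint, outside pint-pandas) contrasting the two representations; sizes and variable names are illustrative:

```python
import numpy as np
import pint

ureg = pint.UnitRegistry()
Q_ = ureg.Quantity

N = 100_000
values = np.zeros(N)

# One Quantity wrapping the whole ndarray: a single Python object whose
# magnitudes stay in one contiguous NumPy buffer.
q_array = Q_(values, "m")

# A list of N scalar Quantity objects: N separate Python objects, which is
# the pattern the profile above points at as the bottleneck.
q_list = [Q_(v, "m") for v in values]
```

As far as I understand, PintArray's own storage is closer to the first form (one magnitude array plus units in the dtype); the cost shows up on construction paths that go element by element.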