
Improve performance of tabmat.from_pandas for sparse columns #378

Open
fbrueggemann-axa opened this issue Jul 30, 2024 · 1 comment

fbrueggemann-axa commented Jul 30, 2024

Instead of

if (coldata != 0).mean() <= sparse_threshold:

we should consider

if isinstance(coldata.dtype, pd.SparseDtype) and coldata.sparse.fill_value == 0:
    sparse_density = coldata.sparse.density
else:
    sparse_density = (coldata != 0).mean()

if sparse_density <= sparse_threshold:
    ...

pandas.Series.sparse.density operates in constant time (e.g., about 3e-5 s on my machine), while pandas.Series.ne and pandas.Series.mean are linear in the number of non-zero entries of a sparse column (e.g., about 3e-3 s on my machine for one million rows with density 0.1).
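
For illustration, a rough benchmark sketch of the two checks (not tabmat code; the setup and variable names are made up, and the exact timings will vary by machine):

import numpy as np
import pandas as pd
from timeit import timeit

# One million rows, roughly 10% non-zero, stored as a pandas sparse column.
rng = np.random.default_rng(0)
dense = np.where(rng.random(1_000_000) < 0.1, 1.0, 0.0)
coldata = pd.Series(dense).astype(pd.SparseDtype("float64", 0.0))

# Constant time: density is derived from the stored sparse metadata.
print(timeit(lambda: coldata.sparse.density, number=100))
# Linear in the stored values: builds a comparison result and averages it.
print(timeit(lambda: (coldata != 0).mean(), number=100))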

The proposed changes reduce the time to convert a pandas.DataFrame with 130 sparse columns (and a few categoricals) and one million rows from 0.5s to 0.15s.

Edit: pandas.Series.sparse.sp_values can contain zeros (e.g., when multiplying two sparse arrays, the product's sp_index is the union rather than the intersection of the factors' sp_index), so coldata.sparse.density and (coldata != 0).mean() are not equivalent in those cases. However, because the scipy.sparse.coo_matrix is built from pandas.Series.sparse.sp_values, coldata.sparse.density appears to be the more sensible measure. Consider the example below, where coldata.sparse.density = 1 and (coldata != 0).mean() = 0: with the current tabmat version it produces a "sparse" matrix storing 2000 explicit zeros.

import pandas as pd
import tabmat as tm

df = (
    pd.DataFrame(
        {
            "col_a": [1, 0] * 1000,
            "col_b": [0, 1] * 1000,
        }
    )
    .astype(pd.SparseDtype(fill_value=0))
    # col_c is identically zero, but its sp_index is the union of the factors'
    # sp_index, so all 2000 stored values are explicit zeros:
    # col_c.sparse.density == 1 while (col_c != 0).mean() == 0.
    .assign(col_c=lambda x: x["col_a"] * x["col_b"])
)
# With the current density check, col_c still passes as sparse, and the
# resulting "sparse" matrix stores 2000 explicit zeros.
tm.from_pandas(df.filter(["col_c"]))._array.shape
MatthiasSchmidtblaicherQC commented Aug 7, 2024

I like your suggestion. To address the issue pointed out in the edit, we should consider adding coldata.eliminate_zeros() before the second line of the suggested code. This may reduce performance, but it shouldn't have a significant impact if self.data is indeed sparse. It will also ensure that sparse_density has the same meaning for sparse and dense coldata.
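
One rough sketch of that idea (not the tabmat implementation; eliminate_zeros() is the scipy.sparse spelling and pandas sparse columns don't expose it directly, so this counts only the non-zero stored values, which has the same effect; the function name is illustrative):

import pandas as pd

def effective_sparse_density(coldata: pd.Series) -> float:
    """Fraction of truly non-zero entries, ignoring explicitly stored zeros."""
    if isinstance(coldata.dtype, pd.SparseDtype) and coldata.sparse.fill_value == 0:
        # sp_values holds only the explicitly stored entries, so this scan is
        # linear in the number of stored values rather than the number of rows.
        return float((coldata.sparse.sp_values != 0).sum() / len(coldata))
    return float((coldata != 0).mean())

This keeps the fast path for genuinely sparse columns while giving sparse_density the same meaning for sparse and dense coldata.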
