
Improve performance of tabmat.from_pandas for sparse columns #378

Open
fbrueggemann-axa opened this issue Jul 30, 2024 · 1 comment

fbrueggemann-axa commented Jul 30, 2024

Instead of

if (coldata != 0).mean() <= sparse_threshold:

we should consider

if isinstance(coldata.dtype, pd.SparseDtype) and coldata.sparse.fill_value == 0:
    sparse_density = coldata.sparse.density
else:
    sparse_density = (coldata != 0).mean()

if sparse_density <= sparse_threshold:
    ...

pandas.Series.sparse.density operates in constant time (e.g., about 3e-5 s on my machine), while pandas.Series.ne and pandas.Series.mean are linear in the number of non-zero entries of a sparse column (e.g., about 3e-3 s on my machine for one million rows with density 0.1).
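
For illustration, a rough benchmark sketch of the two checks (not tabmat code; the setup and variable names are made up, and the exact timings will vary by machine):

import numpy as np
import pandas as pd
from timeit import timeit

# One million rows, roughly 10% non-zero, stored as a pandas sparse column.
rng = np.random.default_rng(0)
dense = np.where(rng.random(1_000_000) < 0.1, 1.0, 0.0)
coldata = pd.Series(dense).astype(pd.SparseDtype("float64", 0.0))

# Constant time: density is derived from the stored sparse metadata.
print(timeit(lambda: coldata.sparse.density, number=100))
# Linear in the stored values: builds a comparison result and averages it.
print(timeit(lambda: (coldata != 0).mean(), number=100))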

The proposed changes reduce the time to convert a pandas.DataFrame with 130 sparse columns (and a few categoricals) and one million rows from 0.5s to 0.15s.

Edit: pandas.Series.sparse.sp_values can contain zeros (e.g., when multiplying two sparse arrays, the product's sp_index is the union rather than the intersection of the factors' sp_index), so coldata.sparse.density and (coldata != 0).mean() are not equivalent in those cases. However, because the scipy.sparse.coo_matrix is built from pandas.Series.sparse.sp_values, coldata.sparse.density appears to be the more sensible measure. Consider the example below, where coldata.sparse.density = 1 and (coldata != 0).mean() = 0: with the current tabmat version it produces a "sparse" matrix storing 2000 explicit zeros.

import pandas as pd
import tabmat as tm

df = (
    pd.DataFrame(
        {
            "col_a": [1, 0] * 1000,
            "col_b": [0, 1] * 1000,
        }
    )
    .astype(pd.SparseDtype(fill_value=0))
    # col_c is identically zero, but its sp_index is the union of the factors'
    # sp_index, so all 2000 stored values are explicit zeros:
    # col_c.sparse.density == 1 while (col_c != 0).mean() == 0.
    .assign(col_c=lambda x: x["col_a"] * x["col_b"])
)
# With the current density check, col_c still passes as sparse, and the
# resulting "sparse" matrix stores 2000 explicit zeros.
tm.from_pandas(df.filter(["col_c"]))._array.shape
MatthiasSchmidtblaicherQC commented Aug 7, 2024

I like your suggestion. To address the issue pointed out in the edit, we should consider adding coldata.eliminate_zeros() before the second line of the suggested code. This may reduce performance, but it shouldn't have a significant impact if self.data is indeed sparse. It will also ensure that sparse_density has the same meaning for sparse and dense coldata.
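
One rough sketch of that idea (not the tabmat implementation; eliminate_zeros() is the scipy.sparse spelling and pandas sparse columns don't expose it directly, so this counts only the non-zero stored values, which has the same effect; the function name is illustrative):

import pandas as pd

def effective_sparse_density(coldata: pd.Series) -> float:
    """Fraction of truly non-zero entries, ignoring explicitly stored zeros."""
    if isinstance(coldata.dtype, pd.SparseDtype) and coldata.sparse.fill_value == 0:
        # sp_values holds only the explicitly stored entries, so this scan is
        # linear in the number of stored values rather than the number of rows.
        return float((coldata.sparse.sp_values != 0).sum() / len(coldata))
    return float((coldata != 0).mean())

This keeps the fast path for genuinely sparse columns while giving sparse_density the same meaning for sparse and dense coldata.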
