`pandas.Series.sparse.density` operates in constant time (e.g., 3e-5 s on my machine), while `pandas.Series.ne` and `pandas.Series.mean` are linear in the number of non-zero entries of sparse columns (e.g., 3e-3 s on my machine for one million rows with density 0.1).
The proposed changes reduce the time to convert a `pandas.DataFrame` with 130 sparse columns (and a few categoricals) and one million rows from 0.5 s to 0.15 s.
Edit: It is possible for `pandas.Series.sparse.sp_values` to contain zeros (e.g., when multiplying two sparse arrays, the product's `sp_index` is the union rather than the intersection of the factors' `sp_index` values). In particular, `coldata.sparse.density` and `(coldata != 0).mean()` are not equivalent in those cases. Because `scipy.sparse.coo_matrix` uses `pandas.Series.sparse.sp_values`, though, `coldata.sparse.density` appears to be the more sensible choice. Consider the example below, with `coldata.sparse.density == 1` and `(coldata != 0).mean() == 0`, which with the current `tabmat` version results in a "sparse" matrix with 2000 explicitly stored zero entries.
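A minimal sketch of how such a column can arise (the exact example from the issue is not reproduced here; this reconstruction follows its description, multiplying two complementary sparse patterns so the product is identically zero while every position stays explicitly stored):

```python
import numpy as np
import pandas as pd

n = 1000
# a is non-zero on even positions, b on odd positions, so the
# elementwise product a * b is identically zero.
a = pd.Series(pd.arrays.SparseArray(np.tile([1.0, 0.0], n), fill_value=0.0))
b = pd.Series(pd.arrays.SparseArray(np.tile([0.0, 1.0], n), fill_value=0.0))

# The product's sp_index is the union of both patterns, so all
# 2000 positions are explicitly stored even though every value is 0.
c = a * b

print(c.sparse.density)  # 1.0 -- every position is explicitly stored
print((c != 0).mean())   # 0.0 -- no entry is actually non-zero
```

Feeding `c` to `scipy.sparse.coo_matrix` via `sp_values` would therefore produce a "sparse" matrix with 2000 stored zeros.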
I like your suggestion. To address the issue pointed out in the edit, we should consider adding `coldata.eliminate_zeros()` before the second line of the suggested code. This may reduce performance, but the impact should be negligible if `self.data` is indeed sparse. It would also ensure that `sparse_density` has the same meaning for sparse and dense `coldata`.
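Note that `eliminate_zeros` is a `scipy.sparse` method; the pandas sparse accessor has no direct equivalent. A hedged sketch of one way to get the same effect on a pandas column (`eliminate_stored_zeros` is a hypothetical helper, not tabmat or pandas API, and it round-trips through a dense array, so it is O(n)):

```python
import pandas as pd

def eliminate_stored_zeros(coldata: pd.Series) -> pd.Series:
    # Hypothetical helper: rebuild the sparse column so that explicitly
    # stored zeros are dropped. Densifies first, so this costs O(n);
    # that may be acceptable once per column during conversion.
    sparse = pd.arrays.SparseArray(coldata.to_numpy(), fill_value=0.0)
    return pd.Series(sparse, index=coldata.index)

# A product of complementary sparse patterns stores zeros explicitly:
a = pd.Series(pd.arrays.SparseArray([1.0, 0.0] * 5, fill_value=0.0))
b = pd.Series(pd.arrays.SparseArray([0.0, 1.0] * 5, fill_value=0.0))
c = a * b

clean = eliminate_stored_zeros(c)
# After cleaning, density agrees with the fraction of non-zero entries.
assert clean.sparse.density == (clean != 0).mean()
```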
Instead of the current check at `tabmat/src/tabmat/constructor.py`, line 158 (commit `c0c8626`), we should consider
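Since the referenced snippet is not reproduced here, the following is only a sketch of what the suggested change amounts to (the helper name and the branch structure are illustrative assumptions, not tabmat's actual code): use the O(1) `.sparse.density` for sparse columns and fall back to the O(n) scan only for dense ones.

```python
import pandas as pd

def column_density(coldata: pd.Series) -> float:
    # Hypothetical helper illustrating the proposed change, not tabmat code.
    if isinstance(coldata.dtype, pd.SparseDtype):
        # O(1): reads the stored-value count instead of scanning the column.
        return float(coldata.sparse.density)
    # O(n) fallback: scans every entry of a dense column.
    return float((coldata != 0).mean())

sparse_col = pd.Series(pd.arrays.SparseArray([1.0] + [0.0] * 9, fill_value=0.0))
dense_col = pd.Series([1.0] * 10)

print(column_density(sparse_col))  # 0.1
print(column_density(dense_col))   # 1.0
```

As noted in the edit above, this assumes the sparse column carries no explicitly stored zeros; otherwise `.sparse.density` overcounts.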