implement similar functions for polars #1352

Open
20 of 26 tasks
samukweku opened this issue Apr 21, 2024 · 9 comments

Comments

@samukweku
Collaborator

samukweku commented Apr 21, 2024

In relation to #1343, this is a list of pyjanitor functions that are missing for polars and could be implemented:

  • clean_names
  • pivot_longer
  • pivot_wider
  • xlsx_tables
  • xlsx_cells
  • read_commandline
  • conditional_join: polars has join_where (DataFrame.join_where) to cover this
  • complete
  • expand_grid: DataFrame.join with how='cross' covers this
  • convert_excel_date
  • convert_matlab_date
  • convert_unix_date: pl.from_epoch covers this
  • bin_numeric: pl.Expr.cut covers this
  • concatenate_columns: can be replicated with pl.concat_str
  • deconcatenate_columns: pl.Expr.str.split covers this
  • factorize_columns: pl.Expr.rank(method='dense') or pl.Expr.to_physical covers this
  • get_dupes: pl.Expr.is_duplicated() covers this
  • jitter
  • limit_column_characters
  • min_max_scale
  • move: can be replicated with polars' selectors
  • row_to_names
  • shuffle: pl.Expr.shuffle covers this
  • sort_naturally
  • take_first: group_by(...).first() covers this
  • also

Care should be taken not to create a function if an equivalent already exists in polars, whether under a different name or as a combination of existing polars functions that covers all of its use cases.

@samukweku
Collaborator Author

samukweku commented Jun 14, 2024

the current pivot_longer implementation is not good enough. I'll submit a PR with improvements

@samukweku
Collaborator Author

I assumed (wrongly) that polars' join maintains order (it only does so for left join). need to rethink the computation logic for complete. I'll submit a PR with improvements

@3SMMZRjWgS

Eagerly awaiting the 0.28.0 release!

@samukweku
Collaborator Author

@3SMMZRjWgS version 0.28.0 is released. would love feedback on the functions - would also love PRs if you are interested.

@samukweku
Collaborator Author

samukweku commented Sep 8, 2024

A possible improvement to jn.polars.pivot_longer, based on pola-rs/polars#18519? The performance hit (about 2x) seems acceptable, considering that the jn.pivot_longer implementation is flexible enough to deal with multiple columns that are not '.value'. Perhaps add a note advising users about this?

@samukweku
Collaborator Author

example below about the performance hit for a single column extraction:

import polars as pl
import janitor.polars
In [58]: df = pl.DataFrame(
    ...:     {
    ...:         "Sepal.Length": [5.1, 5.9],
    ...:         "Sepal.Width": [3.5, 3.0],
    ...:         "Petal.Length": [1.4, 5.1],
    ...:         "Petal.Width": [0.2, 1.8],
    ...:         "Species": ["setosa", "virginica"],
    ...:     }
    ...: )
    ...: df
Out[58]: 
shape: (2, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

DF = pl.concat([df] * 5_000_000, rechunk=True)

orig = (
    DF.select(
        'Species',
        pl.struct(Length='Sepal.Length', Width='Sepal.Width').alias('Sepal'),
        pl.struct(Length='Petal.Length', Width='Petal.Width').alias('Petal'),
    )
    .unpivot(index='Species', variable_name='part')
    .unnest('value')
)

other = DF.pivot_longer(index='Species', names_sep='.', names_to=('part', '.value'))

In [72]: orig.sort(pl.all()).equals(other.sort(pl.all()))
Out[72]: True

In [73]: %timeit orig=DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal')).unpivot(index='Species', variable_name='part').unnest('value')
95.9 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [75]: %timeit other=DF.pivot_longer(index='Species', names_sep='.', names_to = ('part', '.value'))
188 ms ± 6.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is roughly a 2x slowdown for pivot_longer compared with writing the reshape by hand. The limitation of the orig approach is that, as far as I'm aware, it cannot be extended to extract multiple columns from the column names. Take a silly example below:

In [76]: another=DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to = ('part1', 'part2', '.value'))

In [77]: another
Out[77]: 
shape: (20_000_000, 5)
┌───────────┬───────┬───────┬────────┬───────┐
│ Species   ┆ part1 ┆ part2 ┆ Length ┆ Width │
│ ---       ┆ ---   ┆ ---   ┆ ---    ┆ ---   │
│ str       ┆ str   ┆ str   ┆ f64    ┆ f64   │
╞═══════════╪═══════╪═══════╪════════╪═══════╡
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ virginica ┆ Pe    ┆ tal   ┆ 5.1    ┆ 1.8   │
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ virginica ┆ Pe    ┆ tal   ┆ 5.1    ┆ 1.8   │
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ …         ┆ …     ┆ …     ┆ …      ┆ …     │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
│ setosa    ┆ Se    ┆ pal   ┆ 5.1    ┆ 3.5   │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
│ setosa    ┆ Se    ┆ pal   ┆ 5.1    ┆ 3.5   │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
└───────────┴───────┴───────┴────────┴───────┘

In [78]: %timeit another=DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to = ('part1', 'part2', '.value'))
204 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [86]: DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0,length=2), part2=pl.col.variable.str.slice(offset=2)).drop('variable')
Out[86]: 
shape: (20_000_000, 5)
┌───────────┬────────┬───────┬───────┬───────┐
│ Species   ┆ Length ┆ Width ┆ part1 ┆ part2 │
│ ---       ┆ ---    ┆ ---   ┆ ---   ┆ ---   │
│ str       ┆ f64    ┆ f64   ┆ str   ┆ str   │
╞═══════════╪════════╪═══════╪═══════╪═══════╡
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ virginica ┆ 5.9    ┆ 3.0   ┆ Se    ┆ pal   │
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ virginica ┆ 5.9    ┆ 3.0   ┆ Se    ┆ pal   │
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ …         ┆ …      ┆ …     ┆ …     ┆ …     │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
│ setosa    ┆ 1.4    ┆ 0.2   ┆ Pe    ┆ tal   │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
│ setosa    ┆ 1.4    ┆ 0.2   ┆ Pe    ┆ tal   │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
└───────────┴────────┴───────┴───────┴───────┘

In [85]: %timeit DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0,length=2), part2=pl.col.variable.str.slice(offset=2)).drop('variable')
301 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's just a crude example of where jn.pivot_longer can be advantageous, especially when there are multiple columns to be extracted

@samukweku
Collaborator Author

the long rant above does leave a question though - can we possibly speed up jn.pivot_longer within rust? it might be possible 🤷

@Phil-Garmann

@samukweku I just wanted to let you know that polars' join_where does not offer the same flexibility as conditional_join, as the former is implemented strictly as an inner join. At least this is the behavior in the current stable version of Polars, but it could change in the future (see the Polars documentation).

Other join types for join_where have been requested on the polars repo, but whether and when they will be picked up by the developers is still to be determined.
Whether this changes the list of features to be implemented for polars in pyjanitor is for you to decide.

Either way, I am looking forward to using pyjanitor alongside polars in the future 🚀

@samukweku
Collaborator Author

@Phil-Garmann thanks for the feedback; it is much appreciated. I'll keep an eye on the progress for join-where and see how it evolves.
