[ENH] polars complete function #1367

Merged: 78 commits, Jun 14, 2024
78 commits
862c7dd
add make_clean_names function that can be applied to polars
Apr 19, 2024
01531cc
add examples for make_clean_names
Apr 20, 2024
0fb440e
changelog
Apr 20, 2024
5e944b2
limit import location for polars
Apr 20, 2024
501d9c6
limit import location for polars
Apr 20, 2024
9506832
fix polars in environment-dev.yml
Apr 20, 2024
1ae8edd
install polars in doctest
Apr 20, 2024
3b1829b
limit polars imports - user should have polars already installed
Apr 20, 2024
52fd80c
use subprocess.run
Apr 20, 2024
2dce78b
add subprocess.devnull to docstrings
Apr 20, 2024
37b3feb
add subprocess.devnull to docstrings
Apr 20, 2024
0953f2d
add subprocess.devnull to docstrings
Apr 20, 2024
d7c71b6
add subprocess.devnull to docstrings
Apr 20, 2024
40b8502
add os.devnull
Apr 20, 2024
4f11d09
add polars as requirement for docs
Apr 20, 2024
54b179c
add polars to tests requirements
Apr 20, 2024
25b39b9
delete irrelevant folder
Apr 20, 2024
a09f34b
changelog
Apr 20, 2024
1b375f8
create submodule for polars
Apr 21, 2024
799532f
fix doctests
Apr 21, 2024
dbce4b9
fix tests; add polars to documentation
Apr 21, 2024
1c642e6
fix tests; add polars to documentation
Apr 21, 2024
407d21b
import janitor.polars
Apr 21, 2024
aedfc65
control docs output for polars submodule
Apr 21, 2024
db9b486
exclude functions in docs rendering
Apr 21, 2024
6a91e67
exclude functions in docs rendering
Apr 21, 2024
7a88078
show_submodules=true
Apr 21, 2024
6d7885e
fix docstring rendering for polars
Apr 21, 2024
944fa02
Expression -> expression
Apr 21, 2024
b9aefaa
Merge dev into samukweku/polars_clean_names
ericmjl Apr 23, 2024
e9c370a
rename functions.py
Apr 23, 2024
e3021dd
add support for lazyframe
May 4, 2024
55b9a43
add support for lazyframe
May 4, 2024
25ea8d0
update typing to include lazyframe
May 4, 2024
5ca7581
update typing to include lazyframe
May 4, 2024
18e6be8
separate namespaces for lazyframe and eager dataframe
May 5, 2024
f50369f
separate namespaces for lazyframe and eager dataframe
May 5, 2024
1820fab
separate namespaces for lazyframe and eager dataframe
May 5, 2024
2a7b033
separate namespaces for lazyframe and eager dataframe
May 5, 2024
5f21470
make edits to docs
May 5, 2024
122e960
make edits to docs
May 5, 2024
6295526
use LazyFrame constructor
May 5, 2024
e2356c5
use LazyFrame constructor
May 5, 2024
c28260e
Merge dev into samukweku/polars_clean_names
ericmjl May 6, 2024
4d0e2ca
Merge dev into samukweku/polars_clean_names
ericmjl May 10, 2024
8a5552e
Merge dev into samukweku/polars_clean_names
ericmjl May 19, 2024
54da3c9
complete for polars
May 23, 2024
7ec88ef
add tests
May 24, 2024
89affbb
remove config in docs
May 24, 2024
63ee595
remove numpy from docs
May 24, 2024
c9aa55b
use with clause for config
May 24, 2024
47f1066
use with clause for config
May 24, 2024
39142e1
use with clause for config
May 24, 2024
412e70f
use with clause for config
May 24, 2024
68e0093
check for lazyframe
May 24, 2024
d6d875b
fix expand_selector usage
May 25, 2024
7030d05
keep changes to polars
May 25, 2024
eb00f67
fix column name selection in group by
May 25, 2024
692538d
Merge dev into samukweku/polars_complete
ericmjl May 27, 2024
9e99e5e
fix branching
May 31, 2024
bd6c32a
fix annotations
May 31, 2024
5a584e1
fix annotations
May 31, 2024
f5c8492
use full join
Jun 1, 2024
142674f
modify code for structs
Jun 2, 2024
cebe6e6
add examples
Jun 2, 2024
55019ce
add examples
Jun 2, 2024
38d0c06
add examples
Jun 2, 2024
2e4cb14
add examples
Jun 2, 2024
c069dfd
polars expression should ensure uniqueness
Jun 2, 2024
d4464b1
Merge branch 'dev' into samukweku/polars_complete
samukweku Jun 3, 2024
4b2382b
fix conflicts
Jun 3, 2024
5126d00
fix tests
Jun 3, 2024
4e8a804
Merge dev into samukweku/polars_complete
ericmjl Jun 4, 2024
69f7ca1
update docs and code
Jun 5, 2024
af09c7a
update docs
Jun 5, 2024
75c775b
Merge branch 'dev' into samukweku/polars_complete
samukweku Jun 11, 2024
90d2033
fix docs
Jun 11, 2024
0f24a6e
Merge dev into samukweku/polars_complete
ericmjl Jun 12, 2024
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,8 +1,10 @@
 # Changelog
 
 ## [Unreleased]
-- [ENH] Added a `row_to_names` method for polars. Issue #1352
-- [ENH] `read_commandline` function now supports polars - Issue #1352
+
+- [ENH] Added a `complete` method for polars. - Issue #1352 @samukweku
+- [ENH] Added a `row_to_names` method for polars. Issue #1352
+- [ENH] `read_commandline` function now supports polars - Issue #1352
 
 - [ENH] Improved performance for non-equi joins when using numba - @samukweku PR #1341
 - [ENH] Added a `clean_names` method for polars - it can be used to clean the column names, or clean column values . Issue #1343 @samukweku
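
For readers unfamiliar with what a `complete` operation does, here is a minimal plain-Python sketch of the idea: build every combination of the distinct key values and add a filled row for each combination that is absent from the data. It is not the polars implementation from this PR. The helper name `complete_rows` and the sample data are hypothetical, and the sketch assumes each key combination appears at most once in the input.

```python
from itertools import product


def complete_rows(rows, keys, fill_value=None):
    """Return one row per combination of the distinct values of `keys`.

    Hypothetical helper for illustration only; assumes key combinations
    are unique in `rows` and that every row has the same columns.
    """
    # Index the existing rows by their key tuple.
    existing = {tuple(row[key] for key in keys): row for row in rows}
    # Sorted distinct values for each key column.
    levels = [sorted({row[key] for row in rows}) for key in keys]
    other_columns = [name for name in rows[0] if name not in keys]

    completed = []
    for combination in product(*levels):
        if combination in existing:
            # The combination is already present: keep the row as is.
            completed.append(dict(existing[combination]))
        else:
            # The combination is implicitly missing: add a filled row.
            new_row = dict(zip(keys, combination))
            new_row.update({name: fill_value for name in other_columns})
            completed.append(new_row)
    return completed


rows = [
    {"group": 1, "item": "a", "value": 10},
    {"group": 1, "item": "b", "value": 30},
    {"group": 2, "item": "b", "value": 20},
]
result = complete_rows(rows, keys=["group", "item"], fill_value=0)
```

Here the combination `(2, "a")` does not occur in `rows`, so it is the only row added, with `value` set to the fill value.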
125 changes: 125 additions & 0 deletions janitor/polars/complete.py
@@ -0,0 +1,125 @@
"""complete implementation for polars."""

from __future__ import annotations

from typing import Any

from janitor.utils import check, import_message

try:
    import polars as pl
    import polars.selectors as cs
    from polars.type_aliases import ColumnNameOrSelector
except ImportError:
    import_message(
        submodule="polars",
        package="polars",
        conda_channel="conda-forge",
        pip_install=True,
    )


def _complete(
    df: pl.DataFrame | pl.LazyFrame,
    columns: tuple[ColumnNameOrSelector],
    fill_value: dict | Any | pl.Expr,
    explicit: bool,
    sort: bool,
    by: ColumnNameOrSelector,
) -> pl.DataFrame | pl.LazyFrame:
    """
    This function computes the final output for the `complete` function.

    A DataFrame, with rows of missing values, if any, is returned.
    """
    if not columns:
        return df

    check("sort", sort, [bool])
    check("explicit", explicit, [bool])
    _columns = []
    for column in columns:
        if isinstance(column, str):
            col = pl.col(column).unique()
            if sort:
                col = col.sort()
            _columns.append(col)
        elif cs.is_selector(column):
            col = column.as_expr().unique()
            if sort:
                col = col.sort()
            _columns.append(col)
        elif isinstance(column, pl.Expr):
            _columns.append(column)
        else:
            raise TypeError(
                f"The argument passed to the columns parameter "
                "should either be a string, a column selector, "
                "or a polars expression, instead got - "
                f"{type(column)}."
            )
    by_does_not_exist = by is None
    if by_does_not_exist:
        _columns = [column.implode() for column in _columns]
        uniques = df.select(_columns)
        _columns = uniques.columns
    else:
        uniques = df.group_by(by, maintain_order=sort).agg(_columns)
        _by = uniques.select(by).columns
        _columns = uniques.select(pl.exclude(_by)).columns
    for column in _columns:
        uniques = uniques.explode(column)

    _columns = [
        column
        for column, dtype in zip(_columns, uniques.select(_columns).dtypes)
        # this way we ensure there is no tampering with existing struct columns
        if (dtype == pl.Struct) and (column not in df.columns)
    ]

    if _columns:
        for column in _columns:
            uniques = uniques.unnest(columns=column)

    if fill_value is None:
        return uniques.join(df, on=uniques.columns, how="full", coalesce=True)
    idx = None
    columns_to_select = df.columns
    if not explicit:
        idx = "".join(df.columns)
        df = df.with_row_index(name=idx)
    df = uniques.join(df, on=uniques.columns, how="full", coalesce=True)
    # only inspect columns that were not used
    # to generate the combinations
    exclude_columns = uniques.columns
    if idx:
        exclude_columns.append(idx)
    expression = pl.exclude(exclude_columns).is_null().any()
    booleans = df.select(expression)
    if isinstance(booleans, pl.LazyFrame):
        booleans = booleans.collect()
    # names of the remaining columns that contain at least one null
    _columns = [
        column
        for column in booleans.columns
        if booleans.get_column(column).item()
    ]
    if _columns and isinstance(fill_value, dict):
        # key each fill expression by its column name, so that a dict
        # covering only some columns cannot be paired with the wrong column
        fill_value = {
            column_name: pl.col(column_name).fill_null(value=value)
            for column_name, value in fill_value.items()
            if column_name in _columns
        }
    elif _columns:
        fill_value = {
            column: pl.col(column).fill_null(value=fill_value)
            for column in _columns
        }
    if _columns and not explicit:
        # rows created by the join have no original row index;
        # only those rows are filled
        condition = pl.col(idx).is_null()
        fill_value = {
            column_name: pl.when(condition)
            .then(_fill_value)
            .otherwise(pl.col(column_name))
            for column_name, _fill_value in fill_value.items()
        }
    if _columns:
        df = df.with_columns(list(fill_value.values()))

    return df.select(columns_to_select)