[ENH]improve conditional join performance #1419

samukweku · 2024-11-02T14:05:30Z

PR Description

Please describe the changes proposed in the pull request:

Improve `conditional_join` performance when numba is used and there are more than two join conditions.

This PR resolves #1415.

In [2]: import pandas as pd; import janitor as jn; import numpy as np; import numba

In [3]: def east_west():
   ...:     num_rows_left, num_rows_right = 5_000_000, 5_000_000
   ...:     rng = np.random.default_rng(42)
   ...:
   ...:     # Generate two separate datasets where revenue/cost are linearly related to
   ...:     # duration/time, but add some noise to the west table so that there are some
   ...:     # rows where the cost for the same or greater time will be less than the east table.
   ...:     east_dur = rng.integers(1_000, 10_000_000, num_rows_left)
   ...:     east_rev = (east_dur * 0.123).astype(np.int32)
   ...:     west_time = rng.integers(1_000, 500_000, num_rows_right)
   ...:     west_cost = west_time * 0.123
   ...:     west_cost += rng.normal(0.0, 1.0, num_rows_right)
   ...:     west_cost = west_cost.astype(np.int32)
   ...:
   ...:     east = pd.DataFrame(
   ...:         {
   ...:             "id": np.arange(0, num_rows_left),
   ...:             "dur": east_dur,
   ...:             "rev": east_rev,
   ...:             "cores": rng.integers(1, 10, num_rows_left),
   ...:         }
   ...:     )
   ...:     west = pd.DataFrame(
   ...:         {
   ...:             "t_id": np.arange(0, num_rows_right),
   ...:             "time": west_time,
   ...:             "cost": west_cost,
   ...:             "cores": rng.integers(1, 10, num_rows_right),
   ...:         }
   ...:     )
   ...:     return east, west
   ...: east,west = east_west()

# this PR
In [4]: %timeit east.conditional_join(west, ('dur','time','<='),('rev','cost','>='), ('id','t_id', '>'), use_numba=False)
2.64 s ± 5.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit east.conditional_join(west, ('dur','time','<='),('rev','cost','>='), ('id','t_id', '>'), use_numba=True)
3.41 s ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
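
For anyone checking what the three conditions in the benchmark select, here is a brute-force sketch in plain pandas. It is not the implementation in this PR and is only practical on tiny frames: it builds the full cross join and filters it. The helper name `naive_conditional_join` and the toy data are made up for illustration, and the helper assumes the columns named in the conditions do not appear in both frames.

```python
import operator

import pandas as pd

# Map the operator strings used in the condition tuples to Python comparisons.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def naive_conditional_join(left, right, *conditions):
    """Hypothetical reference helper: cross join, then keep rows meeting every condition.

    Each condition is a (left_column, right_column, operator_string) tuple.
    Memory use is len(left) * len(right), so this is for small inputs only.
    """
    crossed = left.merge(right, how="cross", suffixes=("_left", "_right"))
    mask = pd.Series(True, index=crossed.index)
    for left_col, right_col, op in conditions:
        mask &= OPS[op](crossed[left_col], crossed[right_col])
    return crossed[mask].reset_index(drop=True)


# Toy data, invented for this example.
east = pd.DataFrame({"id": [0, 1, 2], "dur": [10, 12, 20], "rev": [5, 9, 6]})
west = pd.DataFrame({"t_id": [0, 1, 2], "time": [15, 25, 5], "cost": [4, 7, 0]})

result = naive_conditional_join(
    east,
    west,
    ("dur", "time", "<="),
    ("rev", "cost", ">="),
    ("id", "t_id", ">"),
)
print(result)
```

On small inputs, a helper like this can serve as a ground truth to compare against the `use_numba=False` and `use_numba=True` paths.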

Please tag maintainers to review.

@ericmjl

ericmjl · 2024-11-02T14:07:52Z

🚀 Deployed on https://deploy-preview-1419--pyjanitor.netlify.app

samuel.oranyeli added 3 commits November 1, 2024 14:13

rework non equi joins in numba to improve perf. for >2 conditions

eb97854

handle nulls outside numba

309e749

remove return result comment

df50a95

samukweku requested review from ericmjl, thatlittleboy and a team November 2, 2024 14:05

samukweku self-assigned this Nov 2, 2024

samukweku linked an issue Nov 2, 2024 that may be closed by this pull request

[ENH] improve conditional_join performance #1415

Open

samuel.oranyeli and others added 2 commits November 3, 2024 18:20

minor refactoring - keep _numba.py as majorly numba only code

44d9548

Merge dev into 1415-enh-improve-conditional_join-performance

9813125
