Use pre-optimized query plan by benchmark LazyFrame for the evaluation of another LazyFrame #19906

Open
mathjhshin opened this issue Nov 21, 2024 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mathjhshin

Description

Currently, a LazyFrame's query plan is re-optimized on every collection, even when the same query is executed repeatedly.
In my case the query plan is very complex, so optimizing it is expensive.
Suppose I need to apply the same LazyFrame function to different input DataFrames that share the same schema.
There is then no need to optimize the query plan every time; once would be enough.
Here is an example of this case:

def my_function(df: pl.LazyFrame) -> pl.LazyFrame:
    df = df.with_columns(
        # ...
    )
    return df

input_path = "..."
input_df = pl.read_parquet(input_path)
input_ldf = input_df.lazy()
output_ldf = my_function(input_ldf)
optimized_plan = output_ldf.explain()

case_path = "..."
case_df = pl.read_parquet(case_path)  # same schema as input_df
case_ldf = case_df.lazy()
assert input_ldf.collect_schema() == case_ldf.collect_schema()

# proposed API: swap only the input LazyFrame in the already-optimized plan
plan_for_case = optimized_plan.replace(input=case_ldf)
# proposed API: execute the plan on the new input without re-optimizing
case_output: pl.DataFrame = plan_for_case.execute()

This feature would be very useful for real-time serving of ML features.

@mathjhshin mathjhshin added the enhancement New feature or an improvement of an existing feature label Nov 21, 2024
@mathjhshin mathjhshin changed the title Use pre-optimized plan for collection of LazyFrame Use pre-optimized plan by benchmark LazyFrame for the evaluation of another LazyFrame Nov 21, 2024
@mathjhshin mathjhshin changed the title Use pre-optimized plan by benchmark LazyFrame for the evaluation of another LazyFrame Use pre-optimized query plan by benchmark LazyFrame for the evaluation of another LazyFrame Nov 21, 2024
@mcrumiller
Contributor

mcrumiller commented Nov 21, 2024

I think what you're asking for is a general LazyFrame query plan that you can pre-optimize once and then reuse by swapping out the input source. This would definitely be nice; something like:

source = pl.LazyFrame.Schema({ ... })  # specify LDF input schema
output = source.with_columns(...).optimize()  # frame operations, optimize query

output.with_source(input1).collect()  # use input with specified schema
output.with_source(input2).collect()  # use different input, same schema
