Currently, the optimized query plan of a LazyFrame is recomputed on every collection, even when the LazyFrame is the same.
In my case the query plan is very complex, so optimizing it is very expensive.
Suppose I need to apply the same LazyFrame function to different input DataFrames that share the same schema.
Then there is no need to optimize the query plan every time.
Once would be enough.
Here is an example of the case:
import polars as pl

def my_function(df: pl.LazyFrame) -> pl.LazyFrame:
    df = df.with_columns(
        # ...
    )
    return df

input_path = "..."
input_df = pl.read_parquet(input_path)
input_ldf = input_df.lazy()
output_ldf = my_function(input_ldf)
# proposed: today explain() only returns the plan as a string
optimized_plan = output_ldf.explain()

case_path = "..."
case_df = pl.read_parquet(case_path)  # same schema as input_df
case_ldf = case_df.lazy()
assert input_ldf.collect_schema() == case_ldf.collect_schema()

# proposed API: swap only the input LazyFrame in optimized_plan
plan_for_case = optimized_plan.replace(input=case_ldf)
# proposed API: execute the plan for the case without re-optimizing it
case_output: pl.DataFrame = plan_for_case.execute()
This feature would be very powerful for real-time serving of ML features.
mathjhshin changed the title from "Use pre-optimized plan for collection of LazyFrame" to "Use pre-optimized plan by benchmark LazyFrame for the evaluation of another LazyFrame" (Nov 21, 2024)
mathjhshin changed the title from "Use pre-optimized plan by benchmark LazyFrame for the evaluation of another LazyFrame" to "Use pre-optimized query plan by benchmark LazyFrame for the evaluation of another LazyFrame" (Nov 21, 2024)
I think what you're asking for is a general LazyFrame query plan that you can pre-optimize once and then run against different input sources. This would definitely be nice, something like:
source = pl.LazyFrame.Schema({ ... })  # specify LDF input schema
output = source.with_columns(...).optimize()  # frame operations, optimize query
output.with_source(input1).collect()  # use input with specified schema
output.with_source(input2).collect()  # use different input, same schema
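To make the "optimize once, bind many inputs" idea concrete, here is a toy model in plain Python. It is only a sketch of the pattern and does not use Polars: `PlanTemplate`, `OptimizedPlan`, `with_column`, `optimize` and `run` are hypothetical names invented for this illustration, and the "optimizer" is a stand-in that merely counts how often it was invoked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Step = Tuple[str, Callable[[Row], float]]


@dataclass(frozen=True)
class OptimizedPlan:
    """Hypothetical: a plan optimized once, with no data attached."""

    columns: Tuple[str, ...]
    steps: Tuple[Step, ...]

    def run(self, rows: List[Row]) -> List[Row]:
        """Bind the plan to concrete rows and execute it without re-optimizing."""
        out = []
        for row in rows:
            # Reject inputs whose columns differ from the schema the plan was built for.
            if tuple(sorted(row)) != self.columns:
                raise ValueError("input does not match the plan's schema")
            new_row = dict(row)
            for name, fn in self.steps:
                new_row[name] = fn(new_row)
            out.append(new_row)
        return out


class PlanTemplate:
    """Hypothetical: a lazy query defined against a schema instead of data."""

    optimize_calls = 0  # how often the (pretend) expensive optimizer pass ran

    def __init__(self, columns: List[str]) -> None:
        self.columns = tuple(sorted(columns))
        self.steps: List[Step] = []

    def with_column(self, name: str, fn: Callable[[Row], float]) -> "PlanTemplate":
        self.steps.append((name, fn))
        return self

    def optimize(self) -> OptimizedPlan:
        # Stand-in for a real optimizer, which would rewrite the steps here.
        PlanTemplate.optimize_calls += 1
        return OptimizedPlan(self.columns, tuple(self.steps))


# Build and "optimize" the plan once, against a schema only.
plan = (
    PlanTemplate(["a", "b"])
    .with_column("total", lambda r: r["a"] + r["b"])
    .with_column("ratio", lambda r: r["a"] / r["total"])
    .optimize()
)

# Reuse the same optimized plan for two different inputs with the same schema.
first = plan.run([{"a": 1.0, "b": 3.0}])
second = plan.run([{"a": 2.0, "b": 2.0}, {"a": 3.0, "b": 1.0}])
```

The point of the split is that everything expensive happens in `optimize()`, which depends only on the schema, while `run()` only checks that the input matches that schema and executes the stored steps. Whether Polars could expose its plans this way is exactly the open question of this issue.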