[Re-opened elsewhere] Handle nullable types and empty partitions before Dask-ML predict #783

sarahyurick · 2022-09-21T17:18:52Z

Changes in create_model.py handle nullable types, such as for this example:

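import pandas as pd
from dask_sql import Context

c = Context()
GPU = False  # assumed flag: True selects cuML, False selects scikit-learn

df = pd.DataFrame({
    "rough_day_of_year": pd.Series([0, 1, 2, 3], dtype='Int32'),  # nullable integer dtype
    "prev_day_inches_rained": pd.Series([0, 1, 2, 3], dtype='float32'),
    "rained": pd.Series([False, False, False, True])
})
c.create_table("train_set", df)

# Build the quoted class path, e.g. 'sklearn.linear_model.LogisticRegression'
model_class = ".linear_model.LogisticRegression'"
if GPU:
    model_class = "'cuml" + model_class
else:
    model_class = "'sklearn" + model_class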

c.sql(f"""
CREATE OR REPLACE MODEL model WITH (
    model_class = {model_class},
    wrap_predict = True,
    wrap_fit = False,
    target_column = 'rained'
) AS (
    SELECT *
    FROM train_set
)
""")

c.sql("""
SELECT * FROM PREDICT(
  MODEL model,
  SELECT * FROM train_set
)
""").compute()

Changes in predict.py handle empty partitions, modeled on this Dask-ML PR: dask/dask-ml#912

sarahyurick · 2022-09-21T17:21:00Z

dask_sql/physical/rel/custom/create_model.py

@@ -183,7 +184,13 @@ def convert(self, rel: "LogicalPlan", context: "dask_sql.Context") -> DataContai

            delayed_model = [delayed(model.fit)(x_p, y_p) for x_p, y_p in zip(X_d, y_d)]
            model = delayed_model[0].compute()
-            model = ParallelPostFit(estimator=model)
+            output_meta = np.array([])


With this, output_meta is always []. Should this maybe be in some sort of try/except block since we're only handling the CPU case?

I don't think we can just hardcode output_meta to np.array([]). We also use cuML for this case, and that outputs a cuDF Series.
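
One way to avoid a hardcoded meta, sketched under assumptions: run the estimator on a small non-empty sample and keep a zero-length slice of whatever it returns, so the container type follows the estimator. `ArrayModel`, `SeriesModel`, and `infer_predict_meta` are hypothetical stand-ins; the Series-returning model only imitates a backend whose predict returns a Series rather than a NumPy array.

```python
import numpy as np
import pandas as pd


class ArrayModel:
    """Stand-in for an estimator whose predict returns a NumPy array."""

    def predict(self, X):
        return np.zeros(len(X), dtype="int64")


class SeriesModel:
    """Stand-in for an estimator whose predict returns a Series."""

    def predict(self, X):
        return pd.Series(np.zeros(len(X)), index=X.index, name="target")


def infer_predict_meta(estimator, sample):
    """Derive an empty output of the same type and dtype as predict's result."""
    out = estimator.predict(sample)
    # Series-like outputs are sliced positionally; arrays use plain slicing
    return out.iloc[:0] if hasattr(out, "iloc") else out[:0]


sample = pd.DataFrame({"x": [1.0, 2.0]})
array_meta = infer_predict_meta(ArrayModel(), sample)
series_meta = infer_predict_meta(SeriesModel(), sample)
print(type(array_meta).__name__, type(series_meta).__name__)
```

With a Dask DataFrame, the non-empty sample could come from the frame's non-empty meta, which is what the reviewer's later suggestion in this thread uses.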

sarahyurick · 2022-09-21T17:22:32Z

dask_sql/physical/rel/custom/predict.py

-        prediction = model.predict(df[training_columns])
+        part = df[training_columns]
+        output_meta = model.predict_meta
+        if part.shape[0].compute() == 0 and output_meta is not None:


compute() is needed on the Delayed object to get the number of rows in the partition. I believe that right now, output_meta will always be []?

You don't need to compute to do this; we can do it lazily too.

sarahyurick · 2022-09-21T17:43:00Z

dask_sql/physical/rel/custom/predict.py

@@ -59,7 +60,13 @@ def convert(self, rel: "LogicalPlan", context: "dask_sql.Context") -> DataContai

        model, training_columns = context.schema[schema_name].models[model_name]
        df = context.sql(sql_select)
-        prediction = model.predict(df[training_columns])
+        part = df[training_columns]
+        output_meta = model.predict_meta


AttributeError: 'KMeans' object has no attribute 'predict_meta'

VibhuJawa

We should not hardcode any meta values, and we should only handle the case where the model is a ParallelPostFit.

VibhuJawa · 2022-09-21T18:27:56Z

dask_sql/physical/rel/custom/create_model.py

@@ -183,7 +184,13 @@ def convert(self, rel: "LogicalPlan", context: "dask_sql.Context") -> DataContai

            delayed_model = [delayed(model.fit)(x_p, y_p) for x_p, y_p in zip(X_d, y_d)]
            model = delayed_model[0].compute()
-            model = ParallelPostFit(estimator=model)
+            output_meta = np.array([])


I don't think we can just hardcode output_meta to np.array([]). We also use cuML for this case, and that outputs a cuDF Series.

VibhuJawa · 2022-09-21T18:28:56Z

dask_sql/physical/rel/custom/predict.py

-        prediction = model.predict(df[training_columns])
+        part = df[training_columns]
+        output_meta = model.predict_meta
+        if part.shape[0].compute() == 0 and output_meta is not None:


You don't need to compute to do this; we can do it lazily too.

VibhuJawa · 2022-09-21T18:52:12Z

dask_sql/physical/rel/custom/predict.py

+            empty_output = self.handle_empty_partitions(output_meta)
+            if empty_output is not None:
+                return empty_output
+        prediction = model.predict(part)


We should wrap predict like the following, only for the case where we have a ParallelPostFit model.

if isinstance(model, ParallelPostFit):
    predict_meta = model.predict_meta
    if predict_meta is None:
        # Infer the output type from a non-empty dummy partition
        predict_meta = model.estimator.predict(part._meta_nonempty)
    prediction = part.map_partitions(_predict, predict_meta, model.estimator, meta=predict_meta)

def _predict(part, predict_meta, estimator):
    # Skip the estimator for empty partitions and return the empty placeholder
    if part.shape[0] == 0 and predict_meta is not None:
        empty_output = handle_empty_partitions(predict_meta)
        return empty_output
    return estimator.predict(part)

VibhuJawa

Please add more tests

initial changes

85f2d7a

sarahyurick requested review from ayushdg, charlesbluca and galipremsagar as code owners September 21, 2022 17:18

sarahyurick commented Sep 21, 2022

randerzander requested a review from VibhuJawa September 21, 2022 17:24

sarahyurick commented Sep 21, 2022

fix failures

c56a155

VibhuJawa suggested changes Sep 21, 2022

charlesbluca deleted the branch dask-contrib:datafusion-sql-planner September 21, 2022 20:57

charlesbluca closed this Sep 21, 2022

sarahyurick mentioned this pull request Sep 22, 2022

Handle nullable types and empty partitions before Dask-ML predict #799

Closed

sarahyurick changed the title ~~Handle nullable types and empty partitions before Dask-ML predict~~ [Re-opened elsewhere] Handle nullable types and empty partitions before Dask-ML predict Sep 22, 2022

sarahyurick deleted the predict_bug branch May 26, 2023 22:13
