[pyspark] rework transform to reuse same code #9292

wbo4958 · 2023-06-12T03:28:36Z

wbo4958 · 2023-06-12T06:25:53Z

@WeichenXu123 @trivialfis could you please help review this PR? Previously, SparkXGBClassifier overrode the entire _transform function, which duplicated common code. This PR unifies the two implementations.
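
For readers without the diff at hand, the idea of sharing one transform and overriding only the model-specific steps can be sketched roughly as follows. This is a minimal illustration without Spark; `BaseModel`, `ClassifierModel`, and the placeholder `_predict` are hypothetical names, not the actual classes in `xgboost.spark`. Only the `_post_transform` hook name is borrowed from the diff in this PR.

```python
from typing import Dict, List


class BaseModel:
    """Hypothetical base class; not the real xgboost.spark model class."""

    def transform(self, rows: List[float]) -> List[Dict[str, float]]:
        # The shared steps live here once, instead of being copied
        # into every subclass that needs a slightly different output.
        preds = [self._predict(x) for x in rows]
        return [self._post_transform(p) for p in preds]

    def _predict(self, x: float) -> float:
        # Placeholder "model": doubles the input.
        return 2.0 * x

    def _post_transform(self, pred: float) -> Dict[str, float]:
        # Default post-processing: a single prediction column.
        return {"prediction": pred}


class ClassifierModel(BaseModel):
    """Hypothetical classifier; overrides only the post-processing hook."""

    def _post_transform(self, pred: float) -> Dict[str, float]:
        # Classifier-specific output: keep the raw score and add a label.
        return {"rawPrediction": pred, "prediction": float(pred > 1.0)}


base_out = BaseModel().transform([1.0])
clf_out = ClassifierModel().transform([0.25, 1.0])
```

With this layout the classifier no longer replaces `transform` wholesale; it only supplies the step that differs.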

trivialfis

Does the refactor make the code cleaner or easier to understand? I find it quite hacky, but that might be a general issue with PySpark-based libraries.

python-package/xgboost/spark/core.py

trivialfis · 2023-06-14T22:23:17Z

+            result = data[pred.prediction]
+            if pred_contrib_col_name:
+                contribs = pred_contribs(model, X, base_margin)
+                data[pred.pred_contrib] = pd.Series(list(contribs))


Not sure how it works. Wouldn't this be super slow?

Any thoughts on how to rework this? Maybe using contribs.tolist()?
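
To make the two options concrete, here is a small sketch of what each one stores in the pandas column. The `contribs` matrix below is a made-up stand-in for the output of `pred_contribs`, and no timing is claimed here; which variant is faster would need to be measured on real data.

```python
import numpy as np
import pandas as pd

# Stand-in for a contributions matrix: 4 rows, 3 values per row (made-up data).
contribs = np.arange(12, dtype=np.float32).reshape(4, 3)

# Variant from the diff: iterating a 2-D array yields its rows,
# so each cell of the Series holds a 1-D NumPy array.
as_arrays = pd.Series(list(contribs))

# Variant from the reply: tolist() converts to nested Python lists,
# so each cell of the Series holds a plain list of Python floats.
as_lists = pd.Series(contribs.tolist())
```

Both variants produce an object-dtype Series with one entry per input row; they differ only in the type of each cell.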

python-package/xgboost/spark/core.py

trivialfis · 2023-06-14T22:28:14Z

+    def _post_transform(self, dataset: DataFrame, pred_col: Column) -> DataFrame:
+        """Post process of transform"""
+        prediction_col_name = self.getOrDefault(self.predictionCol)
+        single_pred = "," not in self._out_schema()


That's a bit, hmm, unconventional. Is there a way to refactor this to make it more "standard"?

_out_schema is type-hinted to return a str, which is a DDL-formatted string, so if there are multiple columns it must contain at least one ",". Checking whether "," is in the schema is therefore a conventional way to tell whether it is a single column.

We are meta-programming by manipulating strings here.

Maybe just use the pred_contrib_col_name as a predicate?

fixed this issue.
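
A rough sketch of the two checks discussed in this thread, in plain Python. The schema strings and helper names are hypothetical and only illustrate the point; they are not taken from `core.py`. The third schema shows one way the string check can misfire: a single column whose DDL type itself contains a comma.

```python
from typing import Optional


def single_pred_by_string(schema: str) -> bool:
    # String-based check: treats any comma as a column separator.
    return "," not in schema


def single_pred_by_predicate(pred_contrib_col_name: Optional[str]) -> bool:
    # Predicate-based check: a single prediction column is produced
    # exactly when no contribution column name is configured.
    return not pred_contrib_col_name


# Hypothetical DDL-formatted schemas, for illustration only.
one_col = "double"
two_cols = "prediction double, pred_contrib array<double>"
one_col_with_comma = "prediction decimal(10,2)"

string_results = [
    single_pred_by_string(one_col),
    single_pred_by_string(two_cols),
    single_pred_by_string(one_col_with_comma),  # misclassified as multi-column
]
predicate_results = [
    single_pred_by_predicate(None),
    single_pred_by_predicate("pred_contribs"),
]
```

The predicate version depends only on the parameter that actually controls the output shape, so it does not need to parse the schema text.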

python-package/xgboost/spark/core.py

trivialfis · 2023-09-02T00:35:27Z

+        pred_contrib_col_name = self._get_pred_contrib_col_name()
+
+        def _predict(
+            model: XGBModel, X: ArrayLike, base_margin: Optional[np.ndarray]


base_margin is not necessarily np.ndarray. Let's stick with ArrayLike.
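
As an illustration of why the looser annotation fits, the sketch below accepts lists, tuples, and arrays alike and normalizes them in one place. It uses `numpy.typing.ArrayLike` as a stand-in, which is an assumption: the `ArrayLike` referenced in the diff is the project's own alias and may be defined differently. The helper `to_margin` is hypothetical.

```python
from typing import Optional

import numpy as np
from numpy.typing import ArrayLike  # stand-in for the project's own alias


def to_margin(base_margin: Optional[ArrayLike]) -> Optional[np.ndarray]:
    """Hypothetical helper: normalize an optional array-like margin."""
    if base_margin is None:
        return None
    # np.asarray accepts lists, tuples, and existing arrays.
    return np.asarray(base_margin, dtype=np.float32)


from_list = to_margin([0.5, 1.5])
from_tuple = to_margin((1, 2))
from_none = to_margin(None)
```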

trivialfis · 2023-09-02T00:44:29Z

python-package/xgboost/spark/core.py

+    def _post_transform(self, dataset: DataFrame, pred_col: Column) -> DataFrame:
+        """Post process of transform"""
+        prediction_col_name = self.getOrDefault(self.predictionCol)
+        single_pred = "," not in self._out_schema()


Maybe just use the pred_contrib_col_name as a predicate?

Co-authored-by: Bobby Wang <[email protected]>

trivialfis reviewed Jun 14, 2023

wbo4958 added 2 commits September 2, 2023 05:44

[pyspark] rework transform to reuse same code

1b9675e

comments

1b4dd5a

trivialfis reviewed Sep 2, 2023

comments

c8d353b

wbo4958 force-pushed the rework-transform branch from 62aebbd to c8d353b Compare September 3, 2023 00:48

wbo4958 requested a review from trivialfis September 4, 2023 06:11

trivialfis approved these changes Sep 4, 2023

trivialfis merged commit 419e052 into dmlc:master Sep 4, 2023
21 checks passed

trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Sep 7, 2023

[backport] [pyspark] rework transform to reuse same code (dmlc#9292)

01105eb

trivialfis added a commit that referenced this pull request Sep 7, 2023

[backport] [pyspark] rework transform to reuse same code (#9292) (#9558)

4d387cb

Co-authored-by: Bobby Wang <[email protected]>

wbo4958 deleted the rework-transform branch April 23, 2024 07:43


wbo4958 commented Jun 12, 2023

wbo4958 commented Jun 12, 2023

trivialfis left a comment

trivialfis Jun 14, 2023

wbo4958 Jun 19, 2023

trivialfis Jun 14, 2023

wbo4958 Jun 19, 2023

trivialfis Sep 1, 2023

trivialfis Sep 2, 2023

wbo4958 Sep 3, 2023

trivialfis Sep 2, 2023

wbo4958 Sep 3, 2023

trivialfis Sep 2, 2023
