
Spark XGB converter accesses missing param improperly #697

Open
addisonklinke opened this issue Jul 1, 2024 · 4 comments
addisonklinke commented Jul 1, 2024

Description

#373 introduced this constraint on the converter

if hasattr(xgb_node, 'missing') and not np.isnan(xgb_node.missing):
    raise RuntimeError("Cannot convert a XGBoost model where missing values are not "
                       "nan but {}.".format(xgb_node.missing))

Even when I initialize SparkXGBClassifier(missing=np.nan), this check still fails:

TypeError: ufunc 'isnan' not supported for the input types, 
and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
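The error can be reproduced in isolation: np.isnan only accepts numeric input, so passing any non-numeric object (such as a pyspark Param) raises the same TypeError the converter hits. A minimal standalone demonstration:

```python
import numpy as np

# np.isnan requires a numeric input; an arbitrary Python object (here a
# plain object() standing in for a pyspark Param) triggers the ufunc
# TypeError seen in the converter.
try:
    np.isnan(object())
except TypeError as e:
    print(type(e).__name__)  # prints "TypeError"
```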

Upon inspection, xgb_node.missing is of type pyspark.ml.param.Param, so it makes sense that numpy can't apply the function. @xadupre, can you provide some context on this change? I couldn't find much in the PR or the linked issues, but it seems the check is missing a step such as resolving the Param to its value before passing it to numpy.
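For context, Spark ML stores a Param descriptor as a class attribute, while the actual value lives in a param map resolved via getOrDefault(). A minimal sketch of that pattern (hypothetical mock classes, not pyspark's real implementation):

```python
import math

class Param:
    """Mock of a pyspark-style Param descriptor (illustrative only)."""
    def __init__(self, name):
        self.name = name

class MockEstimator:
    # The class attribute is the Param object itself, not the value.
    missing = Param("missing")

    def __init__(self):
        # The actual value lives in an instance-level param map.
        self._param_map = {"missing": float("nan")}

    def getOrDefault(self, name):
        return self._param_map[name]

est = MockEstimator()
print(type(est.missing))            # the Param descriptor, not a float
print(est.getOrDefault("missing"))  # nan, the actual configured value
```

This is why np.isnan(xgb_node.missing) fails: the attribute access never reaches the configured value.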

Example

Reproducible script using the following library versions

numpy==1.26.4
onnxconverter_common==1.14.0
onnxmltools==1.12.0
pyspark==3.5.1
xgboost==2.1.0

import numpy as np
from onnxconverter_common.data_types import FloatTensorType
from onnxconverter_common.data_types import Int64TensorType
from onnxconverter_common.registration import register_converter, register_shape_calculator
from onnxmltools import convert_sparkml
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost
from onnxmltools.convert.sparkml.ops_names import sparkml_operator_name_map
from onnxmltools.convert.sparkml.ops_input_output import io_name_map
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier, SparkXGBClassifierModel

spark = SparkSession.builder.appName("example").getOrCreate()


def calculate_xgb_output_shape(operator):
    n = operator.inputs[0].type.shape[0]
    operator.outputs[0].type = Int64TensorType(shape=[n])
    operator.outputs[1].type = FloatTensorType(shape=[n, 1])


# Train dummy model
df = spark.createDataFrame([(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)], ["input", "input_2"])
pipe = Pipeline(stages=[
    VectorAssembler(outputCol="features", inputCols=["input", "input_2"]),
    SparkXGBClassifier(
        features_col="features",
        label_col="input",
        missing=np.nan,  # <-- doesn't take effect
    )
])
fitted_pipe = pipe.fit(df)

# Register converters so SparkML recognizes XGB
sparkml_operator_name_map[SparkXGBClassifierModel] = "XGBOOST"
register_converter("XGBOOST", convert_xgboost, overwrite=True)
register_shape_calculator("XGBOOST", calculate_xgb_output_shape, overwrite=True)
io_name_map["XGBOOST"] = (
    lambda model: [model.getFeaturesCol()],
    lambda model: [model.getPredictionCol(), model.getProbabilityCol()],
)

# Convert to ONNX
initial_types = [
    ("input", FloatTensorType([None, 1])),
    ("input_2", FloatTensorType([None, 1]))
]
onnx_pipe = convert_sparkml(fitted_pipe, "pipe", initial_types)  # Triggers numpy TypeError
xadupre commented Jul 1, 2024

It may be related to the fact that onnxruntime does not support sparse features. What do you suggest as a fix?

addisonklinke (Author) commented
I saw your comments about sparsity, so I understand the motivation, but it seems the implementation has a bug. Do you recall any tests that were able to pass this check?

I'm not particularly familiar with pyspark, but it seems we cannot operate directly on the Param type without numpy complaining.

addisonklinke (Author) commented
After some research, the fix is to access the value via xgb_node.getOrDefault("missing") instead of using xgb_node.missing directly. Let me submit a PR with the changes.
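A sketch of what the corrected check could look like. The function name check_missing and the getOrDefault fallback logic are assumptions for illustration, not the actual onnxmltools code; the real fix would live inside convert_xgboost:

```python
import numpy as np

def check_missing(xgb_node):
    """Sketch: resolve the 'missing' param before testing it with numpy.

    Spark XGBoost estimators expose getOrDefault(); native XGBoost models
    expose a plain numeric attribute. Both names here follow the issue's
    proposal, not the merged implementation.
    """
    if hasattr(xgb_node, "getOrDefault"):
        missing = xgb_node.getOrDefault("missing")
    else:
        missing = getattr(xgb_node, "missing", float("nan"))
    if not np.isnan(missing):
        raise RuntimeError(
            "Cannot convert a XGBoost model where missing values are not "
            "nan but {}.".format(missing)
        )
```

With this resolution step, a SparkXGBClassifier(missing=np.nan) passes the check, while a non-nan missing value still raises the intended RuntimeError.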

@addisonklinke addisonklinke changed the title Spark XGB converter fails to recognize missing param Spark XGB converter accesses missing param improperly Jul 2, 2024
addisonklinke (Author) commented
Should public access be open for feature branches? I'm having trouble pushing mine.
