Error "cannot convert float NaN to integer" when training (fit) an Isolation Forest model #1138

edr1 · 2024-01-26T08:57:43Z

Hi,
When training an Isolation Forest model with VerticaPy version 1.0.1, I get the error "cannot convert float NaN to integer", even though none of the 4 dataframe columns used for training contain NULL or NaN values. I confirmed this with the 'zeroifnull' function, and I also stored the dataframe in a Vertica table to verify that the 4 columns have no NULL or NaN values.

After some tests, I noticed that the occurrence of this issue depends on the position of the columns in the list passed as the 'X' parameter to the 'fit' function. This is odd because all 4 columns have the same Numeric data type. When column 'recurrencia_solicita_norm' is the first element of the list passed as X, the error does not occur; when it is the last element, the error appears.
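To narrow down an order dependence like the one described above, it can help to try every ordering of the column list and record which ones fail. The following is only a sketch: `find_failing_orders` and `fit_fn` are hypothetical names, not part of VerticaPy, and `fit_fn` is assumed to be a callable that trains a fresh model on the given column order and raises ValueError on failure.

```python
from itertools import permutations


def find_failing_orders(columns, fit_fn):
    """Call fit_fn on every ordering of `columns`.

    Returns a list of (ordering, error message) pairs for the orderings
    where fit_fn raised ValueError.
    """
    failing = []
    for order in permutations(columns):
        try:
            fit_fn(list(order))
        except ValueError as exc:
            failing.append((list(order), str(exc)))
    return failing


# Stand-in for a real fit call, used here only so the sketch runs
# without a Vertica connection. It fails when "c" is the last column.
def fake_fit(cols):
    if cols[-1] == "c":
        raise ValueError("cannot convert float NaN to integer")


failures = find_failing_orders(["a", "b", "c"], fake_fit)
for order, message in failures:
    print(order, "->", message)
```

In practice `fit_fn` could wrap the `model_if.drop()` and `model_if.fit(...)` calls from the trace below. With 4 columns that means 24 training runs, so a small `n_estimators` is advisable while testing.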

The error trace follows:

print(PROPERTIES['COLUMNAS_MODELO_1'])
['frecuencia_1d_norm', 'frecuencia_1s_norm', 'frecuencia_1m_norm', 'recurrencia_solicita_norm']

ValueError Traceback (most recent call last)
Cell In[186], line 11
9 #model_if = IsolationForest(name=PROPERTIES['MODELO_ANOMALIAS_1'],overwrite_model=True,n_estimators=100,max_depth=10,nbins=32,sample=0.632,col_sample_by_tree=0.8)
10 model_if.drop()
---> 11 model_if.fit(input_relation=vdf_vpy_pcX_max, X=PROPERTIES['COLUMNAS_MODELO_1'])
12 model_if.predict(vdf_vpy_pcX_max, X=PROPERTIES['COLUMNAS_MODELO_1'], name='is_anomaly', contamination=0.9, inplace=True)
13 model_if.decision_function(vdf_vpy_pcX_max, X=PROPERTIES['COLUMNAS_MODELO_1'], name='anomaly_score', inplace=True)

File ~/verticapyenv/lib/python3.11/site-packages/verticapy/machine_learning/vertica/base.py:7980, in Unsupervised.fit(self, input_relation, X, return_report)
7976 if "init_method" in parameters and not (
7977 isinstance(parameters["init_method"], str)
7978 ):
7979 drop(name_init, method="table")
-> 7980 self._compute_attributes()
7981 if self._is_native:
7982 report = self.summarize()

File ~/verticapyenv/lib/python3.11/site-packages/verticapy/machine_learning/vertica/ensemble.py:4235, in IsolationForest.compute_attributes(self)
4233 trees = []
4234 for i in range(self.n_estimators):
-> 4235 tree = self.compute_trees_arrays(
4236 self.get_tree(i),
4237 self.X,
4238 )
4239 tree_d = {
4240 "children_left": tree[0],
4241 "children_right": tree[1],
(...)
4245 "psy": self.psy,
4246 }
4247 for idx in range(len(tree[5])):

File ~/verticapyenv/lib/python3.11/site-packages/verticapy/machine_learning/vertica/base.py:2487, in Tree._compute_trees_arrays(self, tree, X, return_probability)
2482 for i in range(n):
2483 if not isinstance(tree.values["leaf_path_length"][i], NoneType):
2484 tree.values["prediction"] += [
2485 [
2486 int(float(tree.values["leaf_path_length"][i])),
-> 2487 int(float(tree.values["training_row_count"][i])),
2488 ]
2489 ]
2490 else:
2491 tree.values["prediction"] += [None]

ValueError: cannot convert float NaN to integer
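The final frame of the trace fails in the expression `int(float(...))`, which raises exactly this ValueError in plain Python whenever the float is NaN. The sketch below reproduces that behaviour in isolation and shows one defensive way to convert such values. The helper `to_int_or_none` is a hypothetical illustration, not the fix that was later applied in VerticaPy or Vertica.

```python
import math


def to_int_or_none(value):
    """Convert a number or numeric string to int.

    Returns None when the value is missing (None) or parses to NaN,
    instead of raising ValueError.
    """
    if value is None:
        return None
    number = float(value)
    if math.isnan(number):
        return None
    return int(number)


# The failing pattern from the traceback, reproduced without VerticaPy.
try:
    int(float("nan"))
except ValueError as exc:
    message = str(exc)

print(message)
print(to_int_or_none("12.0"), to_int_or_none("nan"), to_int_or_none(None))
```

This suggests the "training_row_count" value returned for some tree node was NaN, even though the input columns themselves contained no NaN values.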

edr1 · 2024-01-29T13:28:01Z

Author

Hello, I would like to give you an update based on further tests. The issue occurs when the variables used to train the Isolation Forest model have floating-point values rather than integer values. In our case, the variables are first normalized with the Z-score method, which produces floating-point values, and are then passed to the model.
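As a small illustration of why Z-score normalization turns integer inputs into floating-point values, here is a standalone sketch in plain Python. It assumes the population standard deviation; whether the normalization used in this thread relies on the population or the sample form is not stated, and the sample data is made up for the example.

```python
from statistics import mean, pstdev


def zscore(values):
    """Return (v - mean) / population standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot normalize a constant column")
    return [(v - mu) / sigma for v in values]


# Illustrative integer counts (mean 5, population standard deviation 2).
counts = [2, 4, 4, 4, 5, 5, 7, 9]
scores = zscore(counts)
print(scores)
```

Every input is an integer, yet every normalized value is a float, which matches the observation that the model only ever sees floating-point columns after scaling.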

6 replies

mail4umar Feb 2, 2024
Maintainer

Thank you so much for reporting this. We have investigated this issue and found out that there is a minor bug at the Vertica Server side. We will fix it on priority.

Meanwhile in order to work with your model we recommend converting your INTEGER data type columns that you are using as input to NUMERIC. Do it before scaling. This should work.
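One way to apply the workaround above is to cast the INTEGER columns to NUMERIC in SQL before any scaling step. The sketch below only builds the SELECT statement as a string. The helper `build_cast_select`, the table name `db_ml.source_table`, the column `raw_count`, and the chosen precision and scale are all assumptions made for illustration; the resulting query would still need to be run against Vertica and the result used as the input relation.

```python
def build_cast_select(table, int_columns, other_columns=(), precision=20, scale=0):
    """Build a SELECT that casts `int_columns` to NUMERIC(precision, scale).

    Columns in `other_columns` are passed through unchanged.
    """

    def quote(name):
        # Double-quote identifiers, escaping embedded double quotes.
        return '"' + name.replace('"', '""') + '"'

    parts = [
        f"CAST({quote(col)} AS NUMERIC({precision},{scale})) AS {quote(col)}"
        for col in int_columns
    ]
    parts += [quote(col) for col in other_columns]
    return f"SELECT {', '.join(parts)} FROM {table}"


# Hypothetical table and column names, for illustration only.
query = build_cast_select(
    "db_ml.source_table",
    int_columns=["raw_count"],
    other_columns=["cve_empleado"],
)
print(query)
```

The cast has to happen before normalization, as recommended above, so that the scaled columns are derived from NUMERIC inputs rather than INTEGER ones.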

Answer selected by mail4umar

edr1 Feb 8, 2024
Author

Hello, thank you for your reply and for prioritizing this, as we are implementing this for a client.

As an interim workaround, I would like to apply your recommendation, but the input variables to the Isolation Forest model are already of the Numeric type, as the dtypes function confirms:

print(vdf_vpy_pcX_max.dtypes())
None                      dtype
"cve_empleado"            integer
"dia_anio"                integer
"frecuencia_1d"           numeric(20)
"frecuencia_1s"           numeric(20)
"frecuencia_1m"           numeric(20)
"total_registro"          numeric(20)
"frecuencia_1d_norm"      numeric(64)
"frecuencia_1s_norm"      numeric(63)
"frecuencia_1m_norm"      numeric(63)
"total_registro_norm"     numeric(63)

model_if = IsolationForest(name='db_ml.MODEL_IF_PC1',overwrite_model=True,n_estimators=100,max_depth=10,nbins=32,sample=0.9,col_sample_by_tree=0.9)
model_if.fit(input_relation=vdf_vpy_pcX_max, X=['frecuencia_1d_norm','frecuencia_1s_norm','frecuencia_1m_norm','total_registro_norm'])
model_if.predict(vdf_vpy_pcX_max, X=['frecuencia_1d_norm','frecuencia_1s_norm','frecuencia_1m_norm','total_registro_norm'], name='is_anomaly', contamination=0.9, inplace=True)
model_if.decision_function(vdf_vpy_pcX_max, X=['frecuencia_1d_norm','frecuencia_1s_norm','frecuencia_1m_norm','total_registro_norm'], name='anomaly_score', inplace=True)

Are you referring to converting them from Numeric to Integer before training the model?

edr1 Feb 9, 2024
Author

@mail4umar, do you have any news on my last question? We would like to apply your recommendation while we wait for the fix.

mail4umar Feb 26, 2024
Maintainer

Hi @edr1,

It has been addressed in the PR below which is already merged.

ac2824e

edr1 Mar 5, 2024
Author

Hello @mail4umar,
Thank you for your reply.

We are releasing a project that implements ML models using the Isolation Forest algorithm in VerticaPy version 0.12, but we would strongly prefer to release it with version 1.x. Do you have an estimated date for at least a minor release that includes this fix?

Many thanks
