visualize_as_dataframe(show_only_changes=True) does not work when categorical data is composed of numbers #384

mimicarina · 2023-07-08T02:34:57Z

DiCE/dice_ml/data_interfaces/private_data_interface.py

Line 336 in e9e7147

levels.append(self.categorical_levels[cat_feature])

When categorical columns contain numeric levels (e.g. yes = 1, no = 0), visualize_as_dataframe(show_only_changes=True) (and also visualize_as_list()) does not work, because the string category values are encoded as numbers in the counterfactual output.

Example dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00573/SouthGermanCredit.zip.

During data prep, categorical values are encoded as the 'category' data type (see the query instance below). The counterfactual uses a numeric representation, so a value is shown as 'changed' even though it is the same category (e.g. '2' vs 2).

Query instance (original outcome : 1)
['1', 21, '2', '2', 3599, '1', '4', '1', '2', '1', '4', '3', '3', '1', '1', '2', '2', '1', '2', '1', 1]

Diverse Counterfactual set (new outcome: 0.0)
[1, '-', 2, 2, 17507, 1, 4, 1, 2, 1, 4, 3, 3, 1, 1, 2, 2, '2', 2, 1, 0]
[1, '-', '0', 2, '-', 1, '-', 1, 2, 1, 4, 3, 3, 1, 1, 2, 2, 1, 2, 1, 0]
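The effect can be reproduced without DiCE. The following is a minimal sketch, not DiCE's actual implementation: mask_unchanged is a hypothetical helper, and the counterfactual row assumes the value hidden behind '-' in the output above was 21. It shows why a plain equality check treats '2' and 2 as different, and how comparing on a common string representation would hide the unchanged values.

```python
def mask_unchanged(query, counterfactual, normalize=False):
    """Replace counterfactual values that equal the query value with '-'.

    Hypothetical helper for illustration only; not part of dice_ml.
    """
    masked = []
    for original, new in zip(query, counterfactual):
        if normalize:
            # Compare on a common string representation: '2' == str(2)
            same = str(original) == str(new)
        else:
            # Plain equality: '2' != 2, so the value looks changed
            same = original == new
        masked.append("-" if same else new)
    return masked


# First five values of the query instance and first counterfactual above
# (assumption: the masked second value was 21 before masking)
query = ["1", 21, "2", "2", 3599]
counterfactual = [1, 21, 2, 2, 17507]

print(mask_unchanged(query, counterfactual))
# [1, '-', 2, 2, 17507]
print(mask_unchanged(query, counterfactual, normalize=True))
# ['-', '-', '-', '-', 17507]
```

String normalization is only one possible choice; it would still treat 1.0 and 1 as different values.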

This happens because train_data[cat_feature].cat.categories.tolist() returns integers rather than strings; for the sample dataset above, the categorical features and their levels are:

['credit_history', 'foreign_worker', 'housing', 'other_debtors', 'other_installment_plans', 'people_liable', 'personal_status_sex', 'purpose', 'savings', 'status', 'telephone', 'employment_duration', 'installment_rate', 'job', 'number_credits', 'present_residence', 'property']
[[0, 1, 2, 3, 4], [1, 2], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2, 3, 4], [0, 1, 2, 3, 4, 5, 6, 8, 9, 10], [1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2], [1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
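A small pandas-only sketch of the same behaviour, assuming a toy column named housing with the levels listed above (the data values are made up for illustration): the categories of an integer column stay integers, so a string value from the query instance never matches them.

```python
import pandas as pd

# Toy data: an integer-coded categorical column (values are illustrative)
df = pd.DataFrame({"housing": [1, 2, 3, 2]})
df["housing"] = df["housing"].astype("category")

# The levels keep the original integer type
levels = df["housing"].cat.categories.tolist()
print(levels)  # [1, 2, 3]

# A string value, as it appears in the query instance, does not match
query_value = "2"
print(query_value in levels)  # False

# Converting the levels to strings makes the comparison succeed
str_levels = [str(level) for level in levels]
print(query_value in str_levels)  # True
```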


ascripter · 2024-09-02T08:21:42Z

I have a probably related issue with categorical columns that contain integer numbers. Calling Dice.generate_counterfactuals raises:

ValueError: Found unknown categories ['9', '2', '13', '7', '5', '12', '11', '15', '18', '3', '1', '14', '8', '10', '17', '4', '16'] in column 2 during transform

I realised that Data.permitted_range already has the integers of categorical columns converted to strings, which is probably the root cause of the problem. With only number and category type columns in my dataframe, I fixed it with:

import dice_ml

data = dice_ml.Data(dataframe=df_train, continuous_features=df_train.select_dtypes("number").columns, outcome_name="y")
# Replace the stringified levels with the original category values
for col in df_train.select_dtypes("category").columns:
    data.permitted_range[col] = df_train[col].cat.categories

Edit: This only works for Dice(method="random") not for "genetic" or "kdtree".

Edit2: The actual culprit may be PublicData._set_feature_dtypes, where each column in categorical_feature_names is converted to str before being converted to category. However, when I tweak the source code and omit the string conversion, I get another error from the genetic algorithm's LabelEncoder, which encodes to int64, which in turn cannot be handled in a numpy-internal np.isnan check.
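To illustrate what a str-then-category conversion does to integer levels, here is a pandas-only sketch. It mirrors the conversion described above rather than quoting the DiCE source, and the series values are made up.

```python
import pandas as pd

# Illustrative integer-coded categorical values
s = pd.Series([1, 2, 10, 2])

# Converting directly to category keeps integer levels in numeric order
int_levels = s.astype("category").cat.categories.tolist()
print(int_levels)  # [1, 2, 10]

# Converting to str first yields string levels in lexicographic order
str_levels = s.astype(str).astype("category").cat.categories.tolist()
print(str_levels)  # ['1', '10', '2']

# The two representations share no common values
print(set(int_levels) & set(str_levels))  # set()
```

Any component fitted on one representation will therefore see the other as entirely unknown categories.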
