Correlations widget returns wrong values if fields are missing #6891

ruvilonix · 2024-09-15T17:15:15Z

What's wrong?

Maybe I'm just a novice at statistics and this is how it's supposed to work, but it seems like a bug. When I connect a Correlations widget to data with a missing value in one of the features, the correlations differ from those I get for the same data with the incomplete row removed. I would expect that when a correlation is computed between two features over six rows, and one row is missing the second feature, only the five complete rows would factor into the correlation. But the value is not the same as for a file containing only those five rows.

I'll try to illustrate. Here is the test csv file, missing one value for score:

name,age,score
Bob,24,78
Gill,32,89
Fred,33,93
Julie,25,75
Sandra,20,98
Lucy,45,

Here is Orange:

The Scatter Plot shows the correct r value for the regression line of age and score (0.09). The upper Correlations widget shows an incorrect Pearson correlation (0.050). The lower Select Rows excludes the undefined "Lucy" row, then connects to another Correlations widget, which shows the correct value (0.090).

How can we reproduce the problem?

missing_row_test.ows.zip

Look at the correlations in the two Correlations widgets.

What's your environment?

Operating system: Ubuntu 24.04.1 LTS
Orange version: 3.37.0
How you installed Orange: mamba/conda

ruvilonix · 2024-10-13T02:39:38Z

Ah, it imputes the values using the mean.

owcorrelations.py

self.cont_data = SklImpute()(cont_data)

preprocess.py

class SklImpute(Preprocess):
    __wraps__ = SimpleImputer

    def __init__(self, strategy='mean'):
        self.strategy = strategy

.....

        imputer = SimpleImputer(strategy=self.strategy)

It'd be great to have an option to choose the imputation method, or be able to exclude missing rows.
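
For anyone who wants to check the numbers outside Orange, here is a minimal sketch in plain Python (no Orange or scikit-learn calls; the `pearson` helper and the variable names are my own, not anything from the Orange code base). It assumes that mean imputation simply replaces Lucy's missing score with the mean of the five known scores, and compares that with dropping her row.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The test CSV from the report; None marks Lucy's missing score.
age = [24, 32, 33, 25, 20, 45]
score = [78, 89, 93, 75, 98, None]

# Option A: drop the incomplete row and correlate the five complete rows.
complete = [(a, s) for a, s in zip(age, score) if s is not None]
r_dropped = pearson([a for a, _ in complete], [s for _, s in complete])

# Option B: fill the gap with the mean of the known scores, keep all six rows.
known = [s for s in score if s is not None]
mean_score = sum(known) / len(known)
imputed = [mean_score if s is None else s for s in score]
r_imputed = pearson(age, imputed)

print(f"rows dropped:  {r_dropped:.3f}")  # 0.090
print(f"mean imputed:  {r_imputed:.3f}")  # 0.050
```

The imputed row adds nothing to the covariance term (its score sits exactly on the mean), but Lucy's age of 45 still inflates the variance of age, which is why the coefficient shrinks from 0.090 to 0.050.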

janezd · 2024-10-14T17:40:21Z

The widget should show a warning when it imputes values. I'm marking this as a bug.

To choose the imputation method, put the widget Impute before Correlations. There you can either select the imputation method or remove the rows with missing values.

janezd · 2024-10-14T17:41:22Z

Thanks for reporting as well as for diagnosing!

ruvilonix · 2024-10-14T21:45:46Z

So the Impute widget works as I would hope for the simple csv I posted. But I run into an issue if there are more than two columns:

name,age,score,height
Bob,24,78,100
Gill,32,89,94
Fred,33,93,103
Julie,25,75,120
Sandra,20,98,
Lucy,45,,82

In this case, Sandra is missing a value for height, and Lucy is missing a value for score. Ideally, I would like it to calculate correlations between pairs of columns, excluding a row only if it is missing a value in one of the two columns in that pair. So for the correlation of age and score, it should still show 0.090, using all rows except Lucy.

But the Impute widget with "Remove instances" excludes Sandra as well because the height value is missing, even though height is not used in this correlation. The correlation given for age to score is 0.968.

The Scatter Plot widget (without Impute) provides a correlation for the linear regression as I would expect. It excludes instances with missing values only for the columns of interest. But I would love to see the ordered list of these same correlations in the Correlations widget.
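
To make the difference concrete, here is a rough sketch in plain Python of the two deletion strategies on the four-column file above. The function names `pairwise_r` and `listwise_r` are hypothetical labels of mine: "listwise" is what I understand "Remove instances" to do (drop any row with any missing value), and "pairwise" is the behaviour I'm asking for.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# (age, score, height) per person; None marks a missing value.
rows = [
    (24, 78, 100),   # Bob
    (32, 89, 94),    # Gill
    (33, 93, 103),   # Fred
    (25, 75, 120),   # Julie
    (20, 98, None),  # Sandra: height missing
    (45, None, 82),  # Lucy: score missing
]
columns = {"age": 0, "score": 1, "height": 2}


def pairwise_r(rows, i, j):
    """Drop a row only if column i or column j is missing in that row."""
    kept = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    return pearson([a for a, _ in kept], [b for _, b in kept])


def listwise_r(rows, i, j):
    """Drop every row that has a missing value in any column."""
    kept = [r for r in rows if None not in r]
    return pearson([r[i] for r in kept], [r[j] for r in kept])


for (name_a, i), (name_b, j) in combinations(columns.items(), 2):
    print(f"{name_a}/{name_b}: "
          f"pairwise={pairwise_r(rows, i, j):.3f} "
          f"listwise={listwise_r(rows, i, j):.3f}")
```

For age and score this gives 0.090 with pairwise deletion (Sandra kept, Lucy dropped) and 0.968 with listwise deletion (both dropped), matching the values above.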

janezd · 2024-10-15T17:54:08Z

Huh, you're right. The Impute widget is not a general solution. I hadn't thought about this.

We could add a checkbox to exclude rows with missing values (in pairwise fashion) as an alternative to imputing means. For any other imputation method, there's a special widget. And we'd add a warning if there are any missing values, regardless of the checkbox state.
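
Roughly, the behaviour could look like the sketch below. This is plain Python for illustration only, not the widget code: the function `correlation`, its `exclude_missing_pairwise` flag, and the helper `fill_mean` are hypothetical names, and the warning text is a placeholder.

```python
import math
import warnings


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fill_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def correlation(xs, ys, exclude_missing_pairwise=False):
    """Hypothetical stand-in for the proposed checkbox behaviour."""
    # Warn whenever anything is missing, whatever the flag says.
    if any(v is None for v in xs) or any(v is None for v in ys):
        warnings.warn("Data contains missing values.")
    if exclude_missing_pairwise:
        # Checkbox ticked: keep only rows where both values are present.
        kept = [(x, y) for x, y in zip(xs, ys)
                if x is not None and y is not None]
        xs = [x for x, _ in kept]
        ys = [y for _, y in kept]
    else:
        # Checkbox unticked: impute column means, as described above.
        xs, ys = fill_mean(xs), fill_mean(ys)
    return pearson(xs, ys)


age = [24, 32, 33, 25, 20, 45]
score = [78, 89, 93, 75, 98, None]
print(f"{correlation(age, score):.3f}")                                  # 0.050
print(f"{correlation(age, score, exclude_missing_pairwise=True):.3f}")   # 0.090
```

Both calls emit the warning; only the returned coefficient depends on the flag.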

ruvilonix · 2024-10-15T18:15:48Z

Yes, I think a simple checkbox like that would be a good solution.

ruvilonix added the "bug report" label (Bug is reported by user, not yet confirmed by the core team) Sep 15, 2024

janezd assigned lanzagar Sep 20, 2024

janezd added the "bug" label (A bug confirmed by the core team) and removed the "bug report" label (Bug is reported by user, not yet confirmed by the core team) Oct 14, 2024

janezd unassigned lanzagar Oct 14, 2024

janezd added the "snack" label (This will take an hour or two) Oct 14, 2024
