Wrong percentage calculated when categorical column includes missing values. #114

Closed
rherman9 opened this issue Feb 27, 2021 · 5 comments

Comments

@rherman9

rherman9 commented Feb 27, 2021

When a categorical variable includes missing values, the wrong percentage is calculated.

See Parameter1:
[Image: Screenshot 2021-02-27 at 18 07 39]

Currently, percentages for categorical variables are calculated with the total being the number of non-missing values for that variable.

In my opinion, the percentage in cell E3, for example, should be 17.1, because you're interested in how many times Parameter1 category 1.0 occurred in the validation_set: 104/609.
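To make the two denominators concrete, here is a minimal sketch in plain pandas. The toy column is hypothetical, and this is not tableone's internal code; it only shows how excluding missing values from the total inflates each percentage.

```python
import pandas as pd

# Hypothetical toy column: 4 observed values and 2 missing ones.
s = pd.Series(["a", "a", "b", "c", None, None])

# value_counts() skips missing values by default.
counts = s.value_counts()

# Denominator = number of non-missing values (4): the behaviour reported above.
pct_non_missing = (counts / s.notna().sum() * 100).round(1)

# Denominator = all rows (6): the behaviour the issue asks for.
pct_all_rows = (counts / len(s) * 100).round(1)

print(pct_non_missing["a"], pct_all_rows["a"])
```

With the numbers from the issue, the requested figure is `round(104 / 609 * 100, 1)`, which gives 17.1.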

A quick workaround is to replace all NaN values in the categorical variables with some number and then drop the rows for that number from the table:

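# Replace NaN in the categorical columns with a sentinel value (199848 is arbitrary)
df[categorical] = df[categorical].replace(np.nan, 199848)
# After building the table, drop the rows that belong to the sentinel level
mytable = mytable.tableone[mytable.tableone.index.get_level_values(1) != "199848.0"]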

Great library though!

@tompollard
Owner

@rherman9 Thanks for picking this up. We'll take a look!

@tompollard
Owner

To reproduce this issue:

import pandas as pd
from tableone import tableone

df = pd.DataFrame(
    {'cats': ["1", "2", "3", "4", None, None],
    'set': ["train","train", "val", "val", "val", "val"]}
    )

t = tableone(df, groupby = "set")
print(t.tabulate(headers=None, tablefmt="github"))

Output:

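|             |   | Missing | Overall  | train    | val      |
|-------------|---|---------|----------|----------|----------|
| n           |   |         | 6        | 2        | 4        |
| cats, n (%) | 1 | 2       | 1 (25.0) | 1 (50.0) |          |
|             | 2 |         | 1 (25.0) | 1 (50.0) |          |
|             | 3 |         | 1 (25.0) |          | 1 (50.0) |
|             | 4 |         | 1 (25.0) |          | 1 (50.0) |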

Expected output:

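|             |   | Missing | Overall  | train    | val      |
|-------------|---|---------|----------|----------|----------|
| n           |   |         | 6        | 2        | 4        |
| cats, n (%) | 1 | 2       | 1 (25.0) | 1 (50.0) |          |
|             | 2 |         | 1 (25.0) | 1 (50.0) |          |
|             | 3 |         | 1 (25.0) |          | 1 (25.0) |
|             | 4 |         | 1 (25.0) |          | 1 (25.0) |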
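The difference between the two tables in the val column can be checked independently of tableone. The sketch below uses `pd.crosstab` on the same data and divides the counts by two different totals. It is an illustration of the arithmetic only, not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {"cats": ["1", "2", "3", "4", None, None],
     "set": ["train", "train", "val", "val", "val", "val"]}
)

# Counts per category and group; rows where "cats" is missing are dropped.
counts = pd.crosstab(df["cats"], df["set"])

# Denominator = non-missing values per group (train: 2, val: 2).
pct_reported = counts / counts.sum(axis=0) * 100

# Denominator = all rows per group (train: 2, val: 4).
pct_expected = counts / df["set"].value_counts() * 100

print(pct_reported.loc["3", "val"], pct_expected.loc["3", "val"])
```

Category 3 in val comes out as 50.0 with the first denominator and 25.0 with the second, matching the two tables above.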

The best fix for this might be to treat NaN/None etc as a category? @lbulgarelli any thoughts?

@lbulgarelli
Collaborator

It is a good idea to add missing as a category in its own right, especially because it makes it easy to compare missing values between groups.

That said, the number of missing alone is not very informative for non-categorical variables, so I'd also probably hide that information by default, with the option to display it.

@tompollard
Owner

> That said, the number of missing alone is not very informative for non-categorical variables, so I'd also probably hide that information by default, with the option to display it.

I feel like it's pretty important to know how many data points are missing, even for continuous variables. If you're reporting a summary statistic and it is based on a small proportion of your overall data, it feels like it would be good to know.
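As a small illustration of that concern, the sketch below summarises a hypothetical continuous column in plain pandas and reports how many values the mean is actually based on. The column name and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous variable where most values are missing.
age = pd.Series([34.0, np.nan, np.nan, np.nan, 51.0, np.nan], name="age")

summary = {
    "mean": float(age.mean()),              # NaN values are skipped by default
    "n_used": int(age.notna().sum()),       # values the mean is based on
    "n_missing": int(age.isna().sum()),
    "pct_missing": round(float(age.isna().mean()) * 100, 1),
}
print(summary)
```

Here the mean looks unremarkable on its own, but it rests on 2 of 6 rows, which is the kind of context a missing count provides.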

@jraffa any thoughts on this conversation? (how to handle missing values for categorical and continuous variables).

tompollard added a commit that referenced this issue Jun 14, 2024
Missing values are now treated as a category for categorical values.
tompollard added a commit that referenced this issue Jun 14, 2024
tompollard added a commit that referenced this issue Jun 14, 2024
Add include_null argument to handle nulls for categorical values. Ref #114.
@tompollard
Owner

Should be fixed in #175, which adds an include_null argument. When include_null=True (the default), missing values are treated as a level of the categorical variable.

! pip install git+https://github.com/tompollard/tableone.git@main

df = pd.DataFrame(
    {'cats': ["1", "2", "3", "4", None, None],
    'set': ["train","train", "val", "val", "val", "val"]}
    )

t = tableone(df, groupby = "set")
print(t.tabulate(headers=None, tablefmt="github"))

Outputs:

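|             |      | Missing | Overall  | train    | val      |
|-------------|------|---------|----------|----------|----------|
| n           |      |         | 6        | 2        | 4        |
| cats, n (%) | 1    |         | 1 (16.7) | 1 (50.0) |          |
|             | 2    |         | 1 (16.7) | 1 (50.0) |          |
|             | 3    |         | 1 (16.7) |          | 1 (25.0) |
|             | 4    |         | 1 (16.7) |          | 1 (25.0) |
|             | None |         | 2 (33.3) |          | 2 (50.0) |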
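The percentages in this output can be reproduced without tableone by treating missing values as a level of their own. The sketch below does that in plain pandas by filling missing values with the label "None" before counting. It is an assumption-based illustration of the idea, not how `include_null` is implemented.

```python
import pandas as pd

df = pd.DataFrame(
    {"cats": ["1", "2", "3", "4", None, None],
     "set": ["train", "train", "val", "val", "val", "val"]}
)

# Treat missing values as an explicit level named "None".
levels = df["cats"].fillna("None")

# Counts per level and group, plus an "Overall" column with the row totals.
by_group = pd.crosstab(levels, df["set"])
by_group.insert(0, "Overall", by_group.sum(axis=1))

# Each column now sums to the full group size, so missing rows count
# toward the denominator.
pct = (by_group / by_group.sum(axis=0) * 100).round(1)
print(pct)
```

Because the "None" level is counted like any other, it also shows directly that all missing values fall in the val group.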
