NAs causing unexpected special cause flags #156

Open
gdfiler opened this issue Aug 10, 2022 · 6 comments

@gdfiler

gdfiler commented Aug 10, 2022

Hi All
I'm new to GitHub and NHSRplotthedots, fairly new to NHSR Community, and not that experienced in R either, so please be forgiving if this is not appropriate or has been dealt with elsewhere. I came across an apparent issue today which I thought I would share.

I've been playing around with the NHSRplotthedots package and accidentally left an NA value in my dataset, which caused some points to be flagged as special cause when they should have been common cause.

Here is an example of the issue:

```r
library(NHSRplotthedots)
library(NHSRdatasets)
library(dplyr)
library(ggplot2)
library(scales)

data("ae_attendances")

stable_set <- ae_attendances %>%
  filter(
    org_code == "RRK",
    type == 1,
    period < as.Date("2018-04-01")
  )

ptd_spc(stable_set, value_field = breaches, date_field = period, improvement_direction = "decrease")

# Note: the last 6 data points show common cause variation
```

image

```r
# Now set the last data point to NA and rerun the SPC
stable_set$breaches[stable_set$period == as.Date("2018-03-01")] <- NA

ptd_spc(stable_set, value_field = breaches, date_field = period, improvement_direction = "decrease")

# Now the last 5 points show special cause variation
```

image

Thanks for all the great work you are doing and looking forward to future developments and more packages from the community!
Gary

@ThomUK
Collaborator

ThomUK commented Aug 10, 2022

Thanks for raising this issue, and excellent catch. We'll need to decide what the behaviour should be here. My preference would be to tolerate the NA, dropping the point from the x-axis (the real world is messy and does occasionally contain unavoidable holes with missing data).

Reproducing the case with no NAs:
image

Reproducing the case with an NA:
image

The point_type column is calculated in the code below, which looks at both the special_cause_flag and relative_to_mean columns. Both are NA for the row with the missing value, so line 29 is executed.

point_type = case_when(
  !special_cause_flag ~ "common_cause",
  improvement_direction == 0 ~ "special_cause_neutral",
  relative_to_mean == improvement_direction ~ "special_cause_improvement",
  TRUE ~ "special_cause_concern"
)
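
To illustrate how an NA row can end up in the final branch, here is a minimal sketch using a made-up three-row table rather than real `ptd_spc()` output. It assumes that an `improvement_direction` of "decrease" is coded as -1 inside the package (an assumption, not confirmed in this thread). `case_when()` treats an NA condition as not matched, so the NA row falls through to the `TRUE` fallback.

```r
library(dplyr)

# Assumption: "decrease" is represented as -1 internally
improvement_direction <- -1

# Made-up rows: a common cause point, a flagged point, and a row with missing data
toy <- tibble(
  special_cause_flag = c(FALSE, TRUE, NA),
  relative_to_mean   = c(1, -1, NA)
)

toy <- toy %>%
  mutate(
    point_type = case_when(
      !special_cause_flag ~ "common_cause",
      improvement_direction == 0 ~ "special_cause_neutral",
      relative_to_mean == improvement_direction ~ "special_cause_improvement",
      # An NA condition never matches, so the NA row lands here
      TRUE ~ "special_cause_concern"
    )
  )

print(toy)
```

In this toy table the third row has no data at all, yet it is labelled in the same way as a genuine point of concern.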

The values for special_cause_flag and relative_to_mean columns are set here, so this is likely where the code to tolerate NAs will need to be added:

# Identify any points which are outside the upper or lower process limits
outside_limits = (.data$y > .data$upl | .data$y < .data$lpl),
# Identify whether a point is above or below the mean
relative_to_mean = sign(.data$y - .data$mean),
# Identify if a point is between the near process limits and process limits
close_to_limits = !.data$outside_limits & (.data$y < .data$nlpl | .data$y > .data$nupl)
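
As a rough illustration of why these columns become NA, the sketch below applies the same comparisons to a small made-up vector in base R. The values and limits are invented for the example and are not taken from the `ae_attendances` data.

```r
# Made-up values: the second observation is missing
y <- c(120, NA, 95)

# Made-up centre line and process limits
mean_line <- 100
upl <- 130
lpl <- 70

# Comparisons against NA return NA rather than TRUE or FALSE
outside_limits <- (y > upl | y < lpl)

# sign() of an NA difference is also NA
relative_to_mean <- sign(y - mean_line)

print(outside_limits)
print(relative_to_mean)
```

Any later logic that consumes these columns needs to decide explicitly what an NA should mean, because the comparisons themselves do not.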

The next stage is probably for someone to write a failing test. It would be worth also checking for other bugs that might be created by the other NAs that appear in that row.

@ThomUK
Collaborator

ThomUK commented Aug 10, 2022

Just a note that, as a workaround, you can pre-filter the incoming dataframe to contain only dates with data. This gives a result with gaps in the x-axis, which makes the missing data more obvious while not affecting the SPC logic.

data("ae_attendances")

stable_set <- ae_attendances %>%
  filter(org_code == "RRK",
         type ==1,
         period < as.Date("2018-04-01"))

# remove some data
stable_set$breaches[stable_set$period==as.Date("2018-03-01")] <- NA

# removing more data in the middle of the plot, as an illustration
stable_set$breaches[stable_set$period==as.Date("2017-06-01")] <- NA

# filter to remove any dates with NAs
filtered_set <- stable_set %>% filter(!is.na(breaches))

ptd_spc(filtered_set, value_field = breaches, date_field = period, improvement_direction = "decrease")

image

@ThomUK
Collaborator

ThomUK commented Oct 26, 2022

I am leaning towards not implementing any changes to tolerate or work around NAs. Perhaps we should throw a warning to prompt the user to look more closely at their data? The user is in control of the data being passed in.

Open to thoughts from others...

@gdfiler
Author

gdfiler commented Oct 27, 2022

If I recall correctly, the documentation clearly states not to include NAs. I had read the documentation but still accidentally included them, so I like the idea of a warning when NAs are present to prompt a check of the data.

When NAs are included there is a risk that the users without experience of SPC may interpret the false special cause flags as real. If an error message will mitigate that risk then all good.

@tomjemmett
Member

I think the best solution here would be to have a check along the lines of stopifnot(all(!is.na(x_values))) [pseudocode]. The error message should suggest using tidyr::drop_na(x_values) explicitly.

I don't like the idea of implicitly dropping values: the logic would be a mess (if there are no NAs, do nothing; if there are, drop the NAs but give a warning). I think raising an error that tells the user how to fix the issue is the cleanest approach.
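
A minimal sketch of what such a check might look like is below. The helper name `check_no_missing_values()` and its arguments are hypothetical and not part of NHSRplotthedots; the sketch only shows the "fail loudly and say how to fix it" idea in base R.

```r
# Hypothetical helper, not part of NHSRplotthedots
check_no_missing_values <- function(values, field_name = "value_field") {
  n_missing <- sum(is.na(values))
  if (n_missing > 0) {
    stop(
      sprintf("%d missing value(s) found in `%s`. ", n_missing, field_name),
      sprintf("Remove them explicitly first, e.g. with tidyr::drop_na(%s).", field_name),
      call. = FALSE
    )
  }
  # Return the input invisibly so the check can sit inside a pipeline
  invisible(values)
}

# Complete data passes silently
check_no_missing_values(c(1, 2, 3), "breaches")

# Data with an NA raises an error; capture the message to inspect it
result <- tryCatch(
  check_no_missing_values(c(1, NA, 3), "breaches"),
  error = function(e) conditionMessage(e)
)
print(result)
```

Raising an error rather than silently dropping rows keeps the decision about missing data with the user, which is the trade-off argued for above.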

@jaspercain

I've just had a use case where I encountered similar behaviour, but with a different cause.
For context, I'm using the function to produce the special cause flag for all ward areas, which is then exported for use in a heatmap in our BI tool.
I have multiple cases where the value is 0 for the whole period (e.g., the number of harm falls could be zero throughout). These should not be counted as a special cause for concern.
However, where the entire period is 0, the function cannot calculate the lower or upper control limits, which produces an NA for "special_cause_flag" and, in turn, a point type of "special_cause_concern".

SPC object NA values

I have put a workaround in place for my workflow with a simple replace():
exceptions$point_type <- replace(exceptions$point_type, is.na(exceptions$special_cause_flag), "common_cause")

I'm not sure if this needs a fix as such since this is probably a niche usage, but potentially needs a warning where the control limits are NaN.
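
One possible shape for such a warning is sketched below. The helper `warn_if_limits_undefined()` is hypothetical and not part of NHSRplotthedots; it assumes the SPC output is a data frame with `upl` and `lpl` columns, as in the snippet quoted earlier in the thread, and it uses `is.na()`, which is TRUE for both NA and NaN.

```r
# Hypothetical helper, not part of NHSRplotthedots
warn_if_limits_undefined <- function(spc_data) {
  # is.na() returns TRUE for NaN as well as NA
  undefined <- is.na(spc_data$upl) | is.na(spc_data$lpl)
  if (any(undefined)) {
    warning(
      sprintf(
        "Process limits are undefined for %d row(s); point types for these rows may be misleading.",
        sum(undefined)
      ),
      call. = FALSE
    )
  }
  # Return the logical vector invisibly so callers can filter on it
  invisible(undefined)
}

# Made-up output where the limits could not be calculated for the first row
toy_spc <- data.frame(upl = c(NaN, 5), lpl = c(NaN, 1))

warning_text <- tryCatch(
  warn_if_limits_undefined(toy_spc),
  warning = function(w) conditionMessage(w)
)
print(warning_text)
```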
