project_kids_items #727

Closed
ben-domingue opened this issue Dec 10, 2024 · 14 comments

@ben-domingue
Owner

this one needs to be fixed. in particular, i want to uncouple the wave and person id but i also want the wave to be ordinal numeric values. so i think we need to go back to the code.
table: https://redivis.com/datasets/as2e-cv7jb41fd/tables/r4fk-ckdx7s2mr?tablesList-search=project_ki&tablesList-orderDirection=DESC&tablesList-orderBy=relevance&tablesList-tableToken=tablesList#cells

see also: https://domingue-lab.slack.com/archives/C05AKJGU3UJ/p1733849254844299?thread_ts=1733766196.979549&cid=C05AKJGU3UJ

@saviranadela & @KingArthur0205 any chance one of you could take a look at this?

@ben-domingue ben-domingue added the data fix fixing an existing dataset label Dec 10, 2024
@saviranadela saviranadela self-assigned this Dec 13, 2024
@saviranadela
Collaborator

hi @ben-domingue i have a question

i saw wave has these values
image
they distinguished the waves into "wave" and "grade", how would you want it to be transformed to ordinal numeric?

codebook:
Codebook_itemleveldata_PKKIDS.xlsx

@ben-domingue
Owner Author

@saviranadela let me dive in

@ben-domingue
Owner Author

I actually think that this just needs to be redone from the ground-up. The description reads:

Data on behavioral and achievement assessment items from elementary school children

These data were processed before we were as strict about 'different scales -> different tables'. But I think we can redo this somewhat efficiently using the original code:
https://github.com/ben-domingue/irw/blob/main/data/project_kids_items.R

@saviranadela
Collaborator

got it! will re-do from the ground-up

@saviranadela
Collaborator

@ben-domingue one more question: i see that a single item can have multiple types of waves:
image
image
i'm thinking of sticking with non-numeric waves, but happy to hear your suggestion!

@ben-domingue
Owner Author

let me see if i follow: i think the issue here is that they are discussing things in a confusing way. they talk about grades and waves and it isn't really clear what is going on. wave 3 presumably occurs after wave 1. but what does that have to do with grade 1/3 followup? it's kind of difficult to know. does this seem right?

my sense is that wave should be ordinal given that it captures a temporal element of data collection and time is always ordered. what if we split the data here into the chunk we can order by wave (so the first two told rows) and the chunk we can order by 'followup' (so the latter two)? or, failing that, maybe i should email the authors asking for clarity?
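
a minimal sketch of that split-then-order idea, assuming the raw labels look like `w1`..`w3` for waves and `g1`..`g3` for grade followups (the suffixes used in the column names). `recode_wave` and the `scheme` column are hypothetical names for illustration, not part of the original script:

```r
# Hypothetical helper: split a raw label such as "w2" or "g3" into the
# collection scheme ("wave" or "grade") and an integer order within it.
recode_wave <- function(raw) {
  prefix <- substr(raw, 1, 1)
  scheme <- ifelse(prefix == "w", "wave",
                   ifelse(prefix == "g", "grade", NA_character_))
  # Keep only the digits; labels without digits become NA.
  digits <- gsub("[^0-9]", "", raw)
  digits[which(digits == "")] <- NA_character_
  data.frame(scheme = scheme,
             wave = as.integer(digits),
             stringsAsFactors = FALSE)
}

# Toy labels, not taken from the real data.
toy <- c("w1", "w3", "g1", "g3")
out <- recode_wave(toy)
print(out)
```

ordering is only meaningful within one scheme here: the sketch does not claim that wave 3 and grade 3 are the same occasion. labels like `g1c` or `g2a` would need their own mapping, since keeping the digits alone would collapse `g2a` and `g2c`.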

@saviranadela
Collaborator

saviranadela commented Dec 14, 2024

@ben-domingue do you mean splitting them like this?
project_kids_items_split.zip
makes more sense to me as we split items also based on different "types" of waves, but please let me know if this isn't what you meant

@saviranadela
Collaborator

code:

library(tidyverse)
library(readr)

df_raw <- read_csv('PK_ItemLevelData.csv')

names(df_raw) <- tolower(names(df_raw))

# find variables that are entirely NA, or that hold a single value plus NA,
# and collect their names so they can be dropped
drop_vars <- c()

for (i in seq_along(df_raw)) {
  unique_vals <- unique(df_raw[[i]])
  unique_len <- length(unique_vals)
  
  # column is entirely NA (scalar test, so use && rather than &)
  if (unique_len == 1 && is.na(unique_vals[1])) {
    drop_vars <- append(drop_vars, names(df_raw)[i])
  }
  
  # column has exactly one observed value plus NA
  if (unique_len == 2 && anyNA(unique_vals)) {
    drop_vars <- append(drop_vars, names(df_raw)[i])
  }
}


df_raw <- df_raw |>
  # drop unneeded variables
  select(-all_of(drop_vars),
         -pk_id,
         -starts_with('ctrs'),
         -starts_with('swan'),
         -starts_with('ssrs'),
         -starts_with('tq')) |>
  # create participant ID
  mutate(id = row_number()) 

# transform tosrec assessment variables
tosrec <- df_raw |>
  select(id,
         starts_with('tosrec_g2c')) |>
  pivot_longer(cols = -id,
               names_to = c('pt1', 'wave', 'pt2', 'pt3'),
               names_sep = '_',
               values_to = 'resp',
               values_drop_na = T) |>
  mutate(wave = 'g2_end',
         item = paste0(pt1, '_', pt2, '_', pt3),
         wave_temp = '3') |>
  select(id, item, wave, wave_temp, resp) 

# transform variables with three underscores       
three <- df_raw |>
  select(id,
         starts_with('ctopp'),
         starts_with('told'),
         starts_with('wj_ak'),
         starts_with('wj_ap'),
         starts_with('wj_lw'),
         starts_with('wj_pc'),
         starts_with('wj_pv'),
         starts_with('wj_qc'),
         starts_with('wj_sa'),
         starts_with('wj_spell'),
         starts_with('wj_wa'),
         starts_with('wj_wf')) |>
  pivot_longer(cols = -id,
               names_to = c('pt1', 'pt2', 'pt3', 'wave'),
               names_sep = '_',
               values_to = 'resp',
               values_drop_na = T) |>
  mutate(item = paste0(pt1, '_', pt2, '_', pt3)) |>
  mutate(wave_temp = case_when(wave == 'g1' ~ '1',
                          wave == 'g2' ~ '2',
                          wave == 'g3' ~ '3',
                          wave == 'w1' ~ '1',
                          wave == 'w2' ~ '2',
                          wave == 'w3' ~ '3')) |>
  select(id, item, wave, wave_temp, resp) 

# transform kbit assessment variables
kbit <- df_raw |>
  select(id,
         starts_with('kbit')) |>
  pivot_longer(cols = -id,
               names_to = 'item',
               values_to = 'resp',
               values_drop_na = T) |>
  # kbit has no wave information; use character NA to match the other tables
  mutate(wave = NA_character_, wave_temp = NA_character_) |>
  select(id, item, wave, wave_temp, resp)

# transform variables with two underscores
# (swan columns were already dropped above, so only topel is selected here)
two <- df_raw |>
  select(id,
         starts_with('topel')) |>
  pivot_longer(cols = -id,
               names_to = c('pt1', 'pt2', 'wave'),
               names_sep = '_',
               values_to = 'resp',
               values_drop_na = T) |>
  mutate(item = paste0(pt1, '_', pt2)) |>
  mutate(wave_temp = case_when(wave == 'g1' ~ '1',
                               wave == 'g2' ~ '2',
                               wave == 'g3' ~ '3',
                               wave == 'w1' ~ '1',
                               wave == 'w2' ~ '2',
                               wave == 'w3' ~ '3')) |>
  select(id, item, wave, wave_temp, resp)

# transform the remaining tosrec variables (grade 1 'c' and grade 2 'a' forms)
tosrec2 <- df_raw |>
  select(id,
         starts_with('tosrec_g1c'),
         starts_with('tosrec_g2a')) |>
  pivot_longer(cols = -id,
               names_to = c('pt1', 'wave', 'pt2', 'pt3', 'pt4'),
               names_sep = '_',
               values_to = 'resp',
               values_drop_na = T) |>
  # mutate(wave = case_when(wave == 'g1c' ~ 'g1_end',
  #                         wave == 'g2a' ~ 'g2_beginning'),
  #        item = paste0(pt1, '_', pt2, '_', pt3, '_', pt4)) |>
  mutate(wave_temp = case_when(wave == 'g1c' ~ '1',
                          wave == 'g2a' ~ '2'),
         item = paste0(pt1, '_', pt2, '_', pt3, '_', pt4)) |>
  select(id, item, wave, wave_temp, resp)

# transform variables with four underscores
four <- df_raw |>
  select(id,
         starts_with('wj_mf')) |>
  pivot_longer(cols = -id,
               names_to = c('pt1', 'pt2', 'pt3', 'pt4', 'wave'),
               names_sep = '_',
               values_to = 'resp',
               values_drop_na = T) |>
  mutate(item = paste0(pt1, '_', pt2, '_', pt3, '_', pt4)) |>
  mutate(wave_temp = case_when(wave == 'g1' ~ '1',
                               wave == 'g2' ~ '2',
                               wave == 'g3' ~ '3',
                               wave == 'w1' ~ '1',
                               wave == 'w2' ~ '2',
                               wave == 'w3' ~ '3')) |>
  select(id, item, wave, wave_temp, resp) 

df <- rbind(four, kbit, three, tosrec, tosrec2, two)

df$check <- str_sub(df$item, 1, 5)

df_ctopp <- df %>%
  filter(grepl("ctopp",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_mf <- df %>%
  filter(grepl("wj_mf",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_kbit <- df %>%
  filter(grepl("kbit",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")


df_wj_lw_grade <- df %>%
  filter(grepl("wj_lw",df$check), grepl("g",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_lw_wave <- df %>%
  filter(grepl("wj_lw",df$check), grepl("w",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_pc_grade <- df %>%
  filter(grepl("wj_pc",df$check), grepl("g",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_pc_wave <- df %>%
  filter(grepl("wj_pc",df$check), grepl("w",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_pv_grade <- df %>%
  filter(grepl("wj_pv",df$check), grepl("g",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_pv_wave <- df %>%
  filter(grepl("wj_pv",df$check), grepl("w",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_ak_grade <- df %>%
  filter(grepl("wj_ak",df$check), grepl("g",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_ak_wave <- df %>%
  filter(grepl("wj_ak",df$check), grepl("w",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_sa <- df %>%
  filter(grepl("wj_sa",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_wa_grade <- df %>%
  filter(grepl("wj_wa",df$check), grepl("g",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_wa_wave <- df %>%
  filter(grepl("wj_wa",df$check), grepl("w",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_wf <- df %>%
  filter(grepl("wj_wf",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_ap <- df %>%
  filter(grepl("wj_ap",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_qc <- df %>%
  filter(grepl("wj_qc",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_told_grade <- df %>%
  filter(grepl("told",df$check), grepl("g",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_told_wave <- df %>%
  filter(grepl("told",df$check), grepl("w",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_spell_grade <- df %>%
  filter(grepl("wj_sp",df$check), grepl("g",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_wj_spell_wave <- df %>%
  filter(grepl("wj_sp",df$check), grepl("w",wave))%>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_tosrec <- df %>%
  filter(grepl("tosre",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

df_topel <- df %>%
  filter(grepl("topel",df$check)) %>%
  select(id, item, wave_temp, resp) %>%
  rename("wave" = "wave_temp")

write.csv(df_ctopp, "project_kids_ctopp.csv", row.names=FALSE)
write.csv(df_wj_mf, "project_kids_wj_mf.csv", row.names=FALSE)
write.csv(df_kbit, "project_kids_kbit.csv", row.names=FALSE)
write.csv(df_wj_lw_grade, "project_kids_wj_lwid_grade.csv", row.names=FALSE)
write.csv(df_wj_lw_wave, "project_kids_wj_lwid_wave.csv", row.names=FALSE)
write.csv(df_wj_pc_grade, "project_kids_wj_pc_grade.csv", row.names=FALSE)
write.csv(df_wj_pc_wave, "project_kids_wj_pc_wave.csv", row.names=FALSE)
write.csv(df_wj_pv_grade, "project_kids_wj_pv_grade.csv", row.names=FALSE)
write.csv(df_wj_pv_wave, "project_kids_wj_pv_wave.csv", row.names=FALSE)
write.csv(df_wj_ak_grade, "project_kids_wj_ak_grade.csv", row.names=FALSE)
write.csv(df_wj_ak_wave, "project_kids_wj_ak_wave.csv", row.names=FALSE)
write.csv(df_wj_sa, "project_kids_wj_sa.csv", row.names=FALSE)
write.csv(df_wj_wa_grade, "project_kids_wj_wa_grade.csv", row.names=FALSE)
write.csv(df_wj_wa_wave, "project_kids_wj_wa_wave.csv", row.names=FALSE)
write.csv(df_wj_wf, "project_kids_wj_wf.csv", row.names=FALSE)
write.csv(df_wj_ap, "project_kids_wj_ap.csv", row.names=FALSE)
write.csv(df_wj_qc, "project_kids_wj_qc.csv", row.names=FALSE)
write.csv(df_told_grade, "project_kids_told_grade.csv", row.names=FALSE)
write.csv(df_told_wave, "project_kids_told_wave.csv", row.names=FALSE)
write.csv(df_wj_spell_grade, "project_kids_wj_spell_grade.csv", row.names=FALSE)
write.csv(df_wj_spell_wave, "project_kids_wj_spell_wave.csv", row.names=FALSE)
write.csv(df_tosrec, "project_kids_tosrec.csv", row.names=FALSE)
write.csv(df_topel, "project_kids_topel.csv", row.names=FALSE)
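
the repeated filter/select/rename blocks above could be folded into one helper. this is only a sketch under assumptions: `split_scale` is a hypothetical name, the toy rows are made up, and it matches on the start of the string with `startsWith` where the script uses `grepl`, which matches anywhere in the string.

```r
# Hypothetical helper: keep the rows whose item name starts with `prefix`
# (optionally restricted to wave labels starting with `scheme`, "g" or "w"),
# then return the four output columns with wave_temp renamed to wave.
split_scale <- function(df, prefix, scheme = NULL) {
  keep <- startsWith(df$item, prefix)
  if (!is.null(scheme)) {
    keep <- keep & !is.na(df$wave) & startsWith(df$wave, scheme)
  }
  out <- df[keep, c("id", "item", "wave_temp", "resp")]
  names(out)[names(out) == "wave_temp"] <- "wave"
  rownames(out) <- NULL
  out
}

# Toy rows standing in for the combined `df`; the values are made up.
toy <- data.frame(
  id = c(1, 1, 2, 2),
  item = c("wj_aka_1s", "wj_aka_1s", "wj_aka_1s", "kbit_1"),
  wave = c("g1", "w1", "w3", NA),
  wave_temp = c("1", "1", "3", NA),
  resp = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

ak_grade <- split_scale(toy, "wj_ak", "g")
ak_wave <- split_scale(toy, "wj_ak", "w")
kbit <- split_scale(toy, "kbit")
print(ak_wave)
```

each result can then be written the same way as above, e.g. `write.csv(ak_grade, "project_kids_wj_ak_grade.csv", row.names=FALSE)`.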

@ben-domingue
Owner Author

woo boy. this data is tricky! let's focus on just one pair:

==> project_kids_wj_ak_grade.csv <==
"id","item","wave","resp"
2551,"wj_aka_10s","1",1
2551,"wj_aka_11s","1",1
2551,"wj_aka_12s","1",1
2551,"wj_aka_13s","1",1
2551,"wj_aka_14s","1",1
2551,"wj_aka_15s","1",1
2551,"wj_aka_16s","1",0
2551,"wj_aka_17s","1",1
2551,"wj_aka_18s","1",1

==> project_kids_wj_ak_wave.csv <==
"id","item","wave","resp"
2,"wj_aka_1s","1",1
2,"wj_aka_10s","1",1
2,"wj_aka_11s","1",1
2,"wj_aka_12s","1",1
2,"wj_aka_13s","1",1
2,"wj_aka_14s","1",0
2,"wj_aka_15s","1",0
2,"wj_aka_16s","1",0
2,"wj_aka_10s","3",1

from the index, these should be:

image

[going to followup below]

@ben-domingue
Owner Author

what i think is happening:

  • the _grade file is the one 'grade' row (row 4)
  • if you look at the original documentation you can see that the data come from two studies: "These data represent a combined dataset of Drs. Stephanie Al Otaiba and Carol Connor's RCT projects. Therefore, these data are not unique, in that the original investigators may have shared the individual datasets in other places that we are not aware of."
  • my guess is that 'grade' is one study and 'wave' is another.
    if that is the case, i think we are ok. but can someone double check my thinking? a very thorny puzzle....
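
one possible way to double check, sketched under an assumption: if 'grade' and 'wave' really are two separate studies, the two files should presumably share no participant ids (each `id` is one row of the combined raw file). `id_overlap` is a hypothetical helper, and the toy rows are modelled on the two excerpts above rather than read from the real files.

```r
# Hypothetical check: do the "grade" and "wave" tables share any participants?
id_overlap <- function(grade_df, wave_df) {
  grade_ids <- unique(grade_df$id)
  wave_ids <- unique(wave_df$id)
  shared <- intersect(grade_ids, wave_ids)
  list(n_grade = length(grade_ids),
       n_wave = length(wave_ids),
       n_shared = length(shared),
       shared = shared)
}

# Toy rows modelled on the excerpts above (ids 2551 and 2).
grade_df <- data.frame(id = c(2551, 2551),
                       item = c("wj_aka_10s", "wj_aka_11s"),
                       wave = c("1", "1"),
                       resp = c(1, 1))
wave_df <- data.frame(id = c(2, 2),
                      item = c("wj_aka_1s", "wj_aka_10s"),
                      wave = c("1", "3"),
                      resp = c(1, 1))

res <- id_overlap(grade_df, wave_df)
print(res$n_shared)
```

on the real data the two inputs would come from `read.csv("project_kids_wj_ak_grade.csv")` and `read.csv("project_kids_wj_ak_wave.csv")`. a nonzero `n_shared` would not settle the question on its own, but it would be a reason to email the authors.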

excellent work @saviranadela !

@saviranadela
Collaborator

my guess is that 'grade' is one study and 'wave' is another.

yes, it might be! agree with you. but still, we can’t be 100% sure what was going on just yet. either way, i think splitting those waves is already a good move. however, i am happy to email the project PI if you think it’s necessary!

@ben-domingue
Owner Author

i'm willing to live on the edge. i feel pretty confident that is the right thing! unless you remain dubious @saviranadela i would be ok processing the ones we have here :)

@saviranadela
Collaborator

sounds good!

PR: #746

@ben-domingue
Owner Author

thanks @saviranadela ! adding these now
