
Commit

Merge pull request #7 from unjournal/julia-analysis
Small changes to chapters and app
daaronr authored Sep 12, 2023
2 parents 5b5de96 + 4bb540a commit 9cc8cab
Showing 24 changed files with 1,333 additions and 935 deletions.
8 changes: 5 additions & 3 deletions _freeze/chapters/aggregation/execute-results/html.json
@@ -1,8 +1,10 @@
{
"hash": "7f66fb53fad0da9a8e9d43a42b42b3de",
"hash": "68440e6d131ccee0354bc095732ca0e2",
"result": {
"markdown": "# Aggregation of evaluators judgments (modeling)\n\n\n\n\n\n\n## Notes on sources and approaches\n\n\n::: {.callout-note collapse=\"true\"}\n\n## Hanea et al {-}\n(Consult, e.g., repliCATS/Hanea and others work; meta-science and meta-analysis approaches)\n\n`aggrecat` package\n\n> Although the accuracy, calibration, and informativeness of the majority of methods are very similar, a couple of the aggregation methods consistently distinguish themselves as among the best or worst. Moreover, the majority of methods outperform the usual benchmarks provided by the simple average or the median of estimates.\n\n[Hanea et al, 2021](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256919#sec007)\n\n However, these are in a different context. Most of those measures are designed to deal with probablistic forecasts for binary outcomes, where the predictor also gives a 'lower bound' and 'upper bound' for that probability. We could roughly compare that to our continuous metrics with 90% CI's (or imputations for these).\n\nFurthermore, many (all their successful measures?) use 'performance-based weights', accessing metrics from prior prediction performance of the same forecasters We do not have these, nor do we have a sensible proxy for this. \n:::\n\n\n::: {.callout-note collapse=\"true\"}\n## D Veen et al (2017)\n\n[link](https://www.researchgate.net/profile/Duco-Veen/publication/319662351_Using_the_Data_Agreement_Criterion_to_Rank_Experts'_Beliefs/links/5b73e2dc299bf14c6da6c663/Using-the-Data-Agreement-Criterion-to-Rank-Experts-Beliefs.pdf)\n\n... we show how experts can be ranked based on their knowledge and their level of (un)certainty. By letting experts specify their knowledge in the form of a probability distribution, we can assess how accurately they can predict new data, and how appropriate their level of (un)certainty is. The expert’s specified probability distribution can be seen as a prior in a Bayesian statistical setting. We evaluate these priors by extending an existing prior-data (dis)agreement measure, the Data Agreement Criterion, and compare this approach to using Bayes factors to assess prior specification. We compare experts with each other and the data to evaluate their appropriateness. Using this method, new research questions can be asked and answered, for instance: Which expert predicts the new data best? Is there agreement between my experts and the data? Which experts’ representation is more valid or useful? Can we reach convergence between expert judgement and data? We provided an empirical example ranking (regional) directors of a large financial institution based on their predictions of turnover. \n\nBe sure to consult the [correction made here](https://www.semanticscholar.org/paper/Correction%3A-Veen%2C-D.%3B-Stoel%2C-D.%3B-Schalken%2C-N.%3B-K.%3B-Veen-Stoel/a2882e0e8606ef876133f25a901771259e7033b1)\n\n::: \n\n\n::: {.callout-note collapse=\"true\"}\n## Also seems relevant:\n\nSee [Gsheet HERE](https://docs.google.com/spreadsheets/d/14japw6eLGpGjEWy1MjHNJXU1skZY_GAIc2uC2HIUalM/edit#gid=0), generated from an Elicit.org inquiry.\n\n\n::: \n\n\n\nIn spite of the caveats in the fold above, we construct some measures of aggregate beliefs using the `aggrecat` package. We will make (and explain) some ad-hoc choices here. We present these:\n\n1. For each paper\n2. For categories of papers and cross-paper categories of evaluations\n3. 
For the overall set of papers and evaluations\n\nWe can also hold onto these aggregated metrics for later use in modeling.\n\n\n- Simple averaging\n\n- Bayesian approaches \n\n- Best-performing approaches from elsewhere \n\n- Assumptions over unit-level random terms \n\n\n### Simple rating aggregation {-}\n\nBelow, we are preparing the data for the aggreCATS package.\n\n\n::: {.cell}\n\n:::\n\n\n\n\n\n### Explicit modeling of 'research quality' (for use in prizes, etc.) {-}\n\n- Use the above aggregation as the outcome of interest, or weight towards categories of greater interest?\n\n- Model with controls -- look for greatest positive residual? \n\n\n## Inter-rater reliability\n\n## Decomposing variation, dimension reduction, simple linear models\n\n\n## Later possiblities\n\n- Relation to evaluation text content (NLP?)\n\n- Relation/prediction of later outcomes (traditional publication, citations, replication)\n",
"supporting": [],
"markdown": "# Aggregation of evaluators judgments (modeling)\n\n\n\n\n\n\n## Notes on sources and approaches\n\n\n::: {.callout-note collapse=\"true\"}\n\n## Hanea et al {-}\n(Consult, e.g., repliCATS/Hanea and others work; meta-science and meta-analysis approaches)\n\n`aggrecat` package\n\n> Although the accuracy, calibration, and informativeness of the majority of methods are very similar, a couple of the aggregation methods consistently distinguish themselves as among the best or worst. Moreover, the majority of methods outperform the usual benchmarks provided by the simple average or the median of estimates.\n\n[Hanea et al, 2021](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256919#sec007)\n\n However, these are in a different context. Most of those measures are designed to deal with probablistic forecasts for binary outcomes, where the predictor also gives a 'lower bound' and 'upper bound' for that probability. We could roughly compare that to our continuous metrics with 90% CI's (or imputations for these).\n\nFurthermore, many (all their successful measures?) use 'performance-based weights', accessing metrics from prior prediction performance of the same forecasters We do not have these, nor do we have a sensible proxy for this. \n:::\n\n\n::: {.callout-note collapse=\"true\"}\n## D Veen et al (2017)\n\n[link](https://www.researchgate.net/profile/Duco-Veen/publication/319662351_Using_the_Data_Agreement_Criterion_to_Rank_Experts'_Beliefs/links/5b73e2dc299bf14c6da6c663/Using-the-Data-Agreement-Criterion-to-Rank-Experts-Beliefs.pdf)\n\n... we show how experts can be ranked based on their knowledge and their level of (un)certainty. By letting experts specify their knowledge in the form of a probability distribution, we can assess how accurately they can predict new data, and how appropriate their level of (un)certainty is. The expert’s specified probability distribution can be seen as a prior in a Bayesian statistical setting. We evaluate these priors by extending an existing prior-data (dis)agreement measure, the Data Agreement Criterion, and compare this approach to using Bayes factors to assess prior specification. We compare experts with each other and the data to evaluate their appropriateness. Using this method, new research questions can be asked and answered, for instance: Which expert predicts the new data best? Is there agreement between my experts and the data? Which experts’ representation is more valid or useful? Can we reach convergence between expert judgement and data? We provided an empirical example ranking (regional) directors of a large financial institution based on their predictions of turnover. \n\nBe sure to consult the [correction made here](https://www.semanticscholar.org/paper/Correction%3A-Veen%2C-D.%3B-Stoel%2C-D.%3B-Schalken%2C-N.%3B-K.%3B-Veen-Stoel/a2882e0e8606ef876133f25a901771259e7033b1)\n\n::: \n\n\n::: {.callout-note collapse=\"true\"}\n## Also seems relevant:\n\nSee [Gsheet HERE](https://docs.google.com/spreadsheets/d/14japw6eLGpGjEWy1MjHNJXU1skZY_GAIc2uC2HIUalM/edit#gid=0), generated from an Elicit.org inquiry.\n\n\n::: \n\n\n\nIn spite of the caveats in the fold above, we construct some measures of aggregate beliefs using the `aggrecat` package. We will make (and explain) some ad-hoc choices here. We present these:\n\n1. For each paper\n2. For categories of papers and cross-paper categories of evaluations\n3. 
For the overall set of papers and evaluations\n\nWe can also hold onto these aggregated metrics for later use in modeling.\n\n\n- Simple averaging\n\n- Bayesian approaches \n\n- Best-performing approaches from elsewhere \n\n- Assumptions over unit-level random terms \n\n\n### Simple rating aggregation {-}\n\nBelow, we are preparing the data for the aggreCATS package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# JB: This section is a work in progress, please do not edit\n\n# paper_ratings: one row per rating category and 'type' (score, upper, lower bound.)\nevals_pub %>% \n select(id, eval_name, paper_abbrev, \n overall, overall_lb_imp, overall_ub_imp,\n adv_knowledge, adv_knowledge_lb_imp, adv_knowledge_ub_imp,\n methods, methods_lb_imp, methods_ub_imp,\n logic_comms, logic_comms_lb_imp, logic_comms_ub_imp,\n real_world, real_world_lb_imp, real_world_ub_imp,\n gp_relevance, gp_relevance_lb_imp, gp_relevance_ub_imp,\n open_sci, open_sci_lb_imp, open_sci_ub_imp) %>% \n rename_with(function(x) paste0(x,\"_score\"), all_of(rating_cats)) %>%\n pivot_longer(cols = c(-id, -eval_name, -paper_abbrev),\n names_pattern = \"(.+)_(score|[ul]b_imp)\",\n names_to = c(\"criterion\",\"element\"),\n values_to = \"value\") -> paper_ratings\n\n# renaming to conform with aggreCATS expectations\npaper_ratings <- paper_ratings %>% \n rename(paper_id = paper_abbrev,\n user_name = eval_name) %>% \n mutate(round = \"round_1\",\n element = case_when(element == \"lb_imp\" ~ \"three_point_lower\",\n element == \"ub_imp\" ~ \"three_point_upper\",\n element == \"score\" ~ \"three_point_best\"))\n\n# filter only overall for now\npaper_ratings %>% \n filter(criterion == \"overall\") %>% \n group_by(user_name, paper_id) %>% \n filter(sum(is.na(value))==0) %>% \n ungroup() -> temp\n \n\nAverageWAgg(expert_judgements = temp, round_2_filter = FALSE, type = \"ArMean\")\n\nIntervalWAgg(expert_judgements = temp, round_2_filter = FALSE, type = \"IntWAgg\")\n\naggreCAT::DistributionWAgg(expert_judgements = temp, round_2_filter = FALSE, type = \"DistribArMean\", percent_toggle = T)\n\n# EXAMPLE CODE ===============================\n# data(data_ratings)\n# set.seed(1234)\n# \n# participant_subset <- data_ratings %>%\n# distinct(user_name) %>%\n# sample_n(5) %>%\n# mutate(participant_name = paste(\"participant\", rep(1:n())))\n# \n# single_claim <- data_ratings %>%\n# filter(paper_id == \"28\") %>%\n# right_join(participant_subset, by = \"user_name\") %>%\n# filter(grepl(x = element, pattern = \"three_.+\")) %>%\n# select(-group, -participant_name, -question)\n# \n# DistributionWAgg(expert_judgements = single_claim,\n# type = \"DistribArMean\", percent_toggle = T)\n# \n```\n:::\n\n\n\n\n\n### Explicit modeling of 'research quality' (for use in prizes, etc.) {-}\n\n- Use the above aggregation as the outcome of interest, or weight towards categories of greater interest?\n\n- Model with controls -- look for greatest positive residual? \n\n\n## Inter-rater reliability\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](aggregation_files/figure-html/unnamed-chunk-1-1.png){width=672}\n:::\n:::\n\n\n\n\n## Decomposing variation, dimension reduction, simple linear models\n\n\n## Later possiblities\n\n- Relation to evaluation text content (NLP?)\n\n- Relation/prediction of later outcomes (traditional publication, citations, replication)\n",
"supporting": [
"aggregation_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
[Diffs for several additional changed files are not rendered here: some are too large, others cannot be displayed.]
51 changes: 51 additions & 0 deletions chapters/aggregation.qmd
@@ -8,9 +8,14 @@
library(tidyverse)
library(aggreCAT)
library(here)
library(irr)
evals_pub <- read_rds(file = here("data", "evals.Rdata"))
# Lists of categories
rating_cats <- c("overall", "adv_knowledge", "methods", "logic_comms", "real_world", "gp_relevance", "open_sci")
pred_cats <- c("journal_predict", "merits_journal")
```


@@ -154,6 +159,52 @@ aggreCAT::DistributionWAgg(expert_judgements = temp, round_2_filter = FALSE, typ

## Inter-rater reliability

```{r}
#| echo: false
#| fig-height: 8
# function that returns the Krippendorff's alpha
# value for one paper's ratings. It checks that there
# is more than one rater first (to avoid errors) and
# converts the nested data into the raters-by-items
# matrix that irr::kripp.alpha() expects
mod_kripp_alpha <- function(dat) {
  dat = as.matrix(dat)
if(nrow(dat)>1) {
a = kripp.alpha(dat, method = "ratio")
res = a$value
} else {
res = NA_integer_
}
return(res)
}
# plot
evals_pub %>%
group_by(paper_abbrev) %>%
select(paper_abbrev, all_of(rating_cats)) %>%
nest(data = -paper_abbrev) %>%
mutate(KrippAlpha = map_dbl(.x = data, .f = mod_kripp_alpha)) %>%
unnest(data) %>%
group_by(KrippAlpha, .add = TRUE) %>%
summarize(Raters = n()) %>%
ungroup() %>%
ggplot(aes(x = reorder(paper_abbrev, KrippAlpha), y = KrippAlpha)) +
geom_point(aes(color = paper_abbrev, size = Raters),
stat = "identity", shape = 16, stroke = 1) +
coord_flip() +
labs(x = "Paper", y = "Krippendorff's Alpha") +
theme_bw() +
theme(text = element_text(size = 15)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20)) +
scale_size(breaks = c(2,3)) +
scale_y_continuous(limits = c(-1,1)) +
guides(color = "none")
```
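For reference, a minimal, self-contained sketch of the input that `irr::kripp.alpha()` expects (a raters-by-items matrix), using fabricated evaluators and scores purely for illustration:

```r
library(irr)

# Three hypothetical evaluators scoring the same four criteria (0-100 scale)
toy_ratings <- matrix(
  c(80, 70, 90, 60,   # evaluator 1
    75, 65, 85, 55,   # evaluator 2
    90, 80, 95, 70),  # evaluator 3
  nrow = 3, byrow = TRUE
)

# Same call pattern as mod_kripp_alpha() above: ratio-level scores, alpha value extracted
kripp.alpha(toy_ratings, method = "ratio")$value
```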



## Decomposing variation, dimension reduction, simple linear models


16 changes: 7 additions & 9 deletions chapters/evaluation_data_analysis.qmd
@@ -145,9 +145,6 @@ rename_dtstuff <- function(df){
# Need to find a way to control column width but it seems to be a problem with DT
# https://github.com/rstudio/DT/issues/29
# we didn't seem to be using all_evals_dt so I removed it to increase readability
evals_pub_df <- evals_pub %>%
# Arrange data
@@ -323,7 +320,8 @@ Todo: ^[Make interactive/dashboards of the elements below]

```{r all_categories}
evals_pub %>%
select(paper_abbrev, starts_with("cat_")) %>%
select(paper_abbrev, starts_with("cat_")) %>%
distinct() %>%
pivot_longer(cols = starts_with("cat_"), names_to = "CatNum", values_to = "Category") %>%
group_by(CatNum, Category) %>%
count() %>%
@@ -352,7 +350,9 @@ evals_pub %>%
rowwise() %>%
mutate(source_main = str_replace_all(string = source_main,
pattern = "-",
replace = " ") %>% str_to_title()) %>%
replace = " ") %>% str_to_title()) %>%
select(paper_abbrev, source_main) %>%
distinct() %>%
ggplot(aes(x = source_main)) +
geom_bar(position = "stack", stat = "count", color = "grey30", fill = "grey80") +
labs(x = "Source", y = "Count") +
@@ -363,7 +363,7 @@ evals_pub %>%
```

```{r data_clean_}
```{r data clean}
# JB: Most of these should probably be cleaned in data storage
library(RColorBrewer) # for color palettes
@@ -460,7 +460,7 @@ In future, we aim to build a dashboard allowing people to use the complete set o

```{=html}
<!-- Currently this is a random Shiny app -->
<iframe height="800" width="120%" frameborder="no" src="https://juliagb.shinyapps.io/DataExplorer/"> </iframe>
<iframe height="800" width="120%" frameborder="no" src="https://unjournal.shinyapps.io/DataExplorer/"> </iframe>
```

@@ -525,8 +525,6 @@ Next, look for systematic variation in the ratings
\




### Relationship among the ratings (and predictions) {-}

::: {.callout-note collapse="true"}
62 changes: 31 additions & 31 deletions chapters/evaluation_data_input.qmd
@@ -207,6 +207,16 @@ rm(evals)
```


```{r evals_pub to longer format}
evals_pub_long <- evals_pub %>%
pivot_longer(cols = -c(id, crucial_rsx, paper_abbrev, eval_name,
cat_1,cat_2, cat_3,source_main,author_agreement),
names_pattern = "(lb_|ub_|conf_)?(.+)",
names_to = c("value_type", "rating_type")) %>% # one line per rating type
mutate(value_type = if_else(value_type == "", "est_", value_type)) %>% #add main rating id
pivot_wider(names_from = value_type,
values_from = value)
```
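As a sanity check on the reshape above, here is a small fabricated example (toy column names and values; it assumes the bound and confidence columns carry `lb_`/`ub_`/`conf_` prefixes, as the regex implies) showing how the `names_pattern` split yields one row per rating type with `est_`, `lb_`, and `conf_` columns:

```r
library(tidyverse)

# Toy data: one point estimate, lower bound, and confidence per id (values invented)
toy <- tibble(
  id           = 1:2,
  overall      = c(80, 60),
  lb_overall   = c(70, 50),
  conf_overall = c(4, 3)
)

toy %>%
  pivot_longer(cols = -id,
               names_pattern = "(lb_|ub_|conf_)?(.+)",
               names_to = c("value_type", "rating_type")) %>%
  mutate(value_type = if_else(value_type == "", "est_", value_type)) %>%
  pivot_wider(names_from = value_type, values_from = value)
# -> one row per (id, rating_type), with est_, lb_ and conf_ columns
```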


<!-- need airtable API stuff -->
@@ -331,23 +341,10 @@ impute_bounds <- function(var_name, est, lb, ub, conf, bound_type) {
return(imp_bound)
}
# apply functions to evals_pub
# JB: I had to do a lot of gymnastics to make this data frame
# right for applying the functions, then turn it back into
# the data structure we are currently using. I didn't want to deal
# with how changing the data input would break the rest of the
# code. But we should prioritize saving this data frame in a long format
# so there isn't data in the variable names (i.e., each row should at least
# be one type of rating, not one article).
evals_pub %>%
pivot_longer(cols = -c(id, crucial_rsx, paper_abbrev, eval_name,
cat_1,cat_2, cat_3,source_main,author_agreement),
names_pattern = "(lb_|ub_|conf_)?(.+)",
names_to = c("value_type", "rating_type")) %>% # one line per rating type
mutate(value_type = if_else(value_type == "", "est_", value_type)) %>% #add main rating id
pivot_wider(names_from = value_type,
values_from = value) %>%
# apply functions to evals_pub_long
# where each row is one type of rating
# so each evaluation is 9 rows long
evals_pub_long <- evals_pub_long %>%
rowwise() %>% # apply function to each row
mutate(lb_imp_ = impute_bounds(var_name = rating_type,
est = est_,
Expand All @@ -356,7 +353,10 @@ evals_pub %>%
mutate(ub_imp_ = impute_bounds(var_name = rating_type,
est = est_,
lb = lb_, ub = ub_, conf = conf_,
bound_type = "upper")) %>%
bound_type = "upper"))
# Reshape evals_pub_long into evals_pub to add imputed bounds
evals_pub <- evals_pub_long %>%
pivot_wider(names_from = rating_type, # take the dataframe back to old format
values_from = c(est_, ub_, lb_, conf_, lb_imp_, ub_imp_),
names_sep = "") %>%
Expand All @@ -367,7 +367,13 @@ evals_pub %>%
dplyr::rename_with(.cols = starts_with("est_"),
.fn = gsub,
pattern = "est_(.+)",
replacement = "\\1") -> evals_pub
replacement = "\\1")
# Clean evals_pub_long names (remove _ at end)
evals_pub_long <- evals_pub_long %>%
rename_with(.cols = ends_with("_"),
.fn = str_remove,
pattern = "_$")
```
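The body of `impute_bounds()` is collapsed in this diff, so the following is only a hedged sketch of what a helper with this signature might do: keep a reported bound when present, otherwise widen the point estimate by an amount tied to the stated confidence. The 1-5 confidence scale, the width heuristic, and the 0-100 clipping are assumptions made for illustration, not the repository's implementation.

```r
# Hypothetical stand-in; NOT the actual impute_bounds() from the chapter.
impute_bounds_sketch <- function(var_name, est, lb, ub, conf, bound_type) {
  # invented heuristic: higher stated confidence (1-5) => narrower imputed interval
  # (the real function presumably also uses var_name to pick the scale limits)
  half_width <- (6 - dplyr::coalesce(conf, 3)) * 5

  if (bound_type == "lower") {
    imp_bound <- dplyr::coalesce(lb, pmax(est - half_width, 0))    # assumes a 0-100 scale
  } else {
    imp_bound <- dplyr::coalesce(ub, pmin(est + half_width, 100))
  }
  return(imp_bound)
}
```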

@@ -405,22 +411,13 @@ all_papers_p <- all_pub_records %>%
#| include: false
#|
paper_ratings <- evals_pub %>%
select(paper_abbrev, eval_name, one_of(rating_cats), one_of(pred_cats), ends_with("_imp")) %>% # rating vars
dplyr::rename_with(.cols = c(one_of(rating_cats), one_of(pred_cats)),
.fn = gsub,
pattern = "(.+)",
replacement = "\\1_best") %>%
pivot_longer(cols = -c(paper_abbrev, eval_name),
names_pattern = "(overall|adv_knowledge|methods|logic_comms|real_world|gp_relevance|open_sci|journal_predict|merits_journal)_(.+)",
names_to = c("rating_type", ".value"))# one line per rating type
write_rds(paper_ratings, file = here("shinyapp/DataExplorer", "shiny_explorer.rds"))
evals_pub_long %>%
write_rds(file = here("shinyapp/DataExplorer", "shiny_explorer.rds"))
```


```{r}
```{r save data}
#| label: savedata
#| code-summary: "save data for others' use"
Expand All @@ -431,6 +428,9 @@ all_papers_p %>% write_csv(file = here("data", "all_papers_p.csv"))
evals_pub %>% saveRDS(file = here("data", "evals.Rdata"))
evals_pub %>% write_csv(file = here("data", "evals.csv"))
evals_pub_long %>% write_rds(file = here("data", "evals_long.rds"))
evals_pub_long %>% write_csv(file = here("data", "evals_long.csv"))
#evals_pub %>% readRDS(file = here("data", "evals.Rdata"))
```
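A brief usage note: assuming paths are resolved from the project root via `here()`, as above, downstream chapters or outside users could read the saved long-format data back in along these lines:

```r
library(here)

# Read the long-format evaluations saved above (one row per rating type)
evals_pub_long <- readr::read_rds(here("data", "evals_long.rds"))
```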
