Commit

temp
daaronr committed Sep 13, 2023
1 parent 9cc8cab commit aa9866a
Showing 25 changed files with 1,083 additions and 2,315 deletions.
4 changes: 2 additions & 2 deletions _freeze/chapters/aggregation/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "68440e6d131ccee0354bc095732ca0e2",
"hash": "2726ad363da560877f7c0b4345be5343",
"result": {
"markdown": "# Aggregation of evaluators judgments (modeling)\n\n\n\n\n\n\n## Notes on sources and approaches\n\n\n::: {.callout-note collapse=\"true\"}\n\n## Hanea et al {-}\n(Consult, e.g., repliCATS/Hanea and others work; meta-science and meta-analysis approaches)\n\n`aggrecat` package\n\n> Although the accuracy, calibration, and informativeness of the majority of methods are very similar, a couple of the aggregation methods consistently distinguish themselves as among the best or worst. Moreover, the majority of methods outperform the usual benchmarks provided by the simple average or the median of estimates.\n\n[Hanea et al, 2021](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256919#sec007)\n\n However, these are in a different context. Most of those measures are designed to deal with probablistic forecasts for binary outcomes, where the predictor also gives a 'lower bound' and 'upper bound' for that probability. We could roughly compare that to our continuous metrics with 90% CI's (or imputations for these).\n\nFurthermore, many (all their successful measures?) use 'performance-based weights', accessing metrics from prior prediction performance of the same forecasters We do not have these, nor do we have a sensible proxy for this. \n:::\n\n\n::: {.callout-note collapse=\"true\"}\n## D Veen et al (2017)\n\n[link](https://www.researchgate.net/profile/Duco-Veen/publication/319662351_Using_the_Data_Agreement_Criterion_to_Rank_Experts'_Beliefs/links/5b73e2dc299bf14c6da6c663/Using-the-Data-Agreement-Criterion-to-Rank-Experts-Beliefs.pdf)\n\n... we show how experts can be ranked based on their knowledge and their level of (un)certainty. By letting experts specify their knowledge in the form of a probability distribution, we can assess how accurately they can predict new data, and how appropriate their level of (un)certainty is. The expert’s specified probability distribution can be seen as a prior in a Bayesian statistical setting. 
We evaluate these priors by extending an existing prior-data (dis)agreement measure, the Data Agreement Criterion, and compare this approach to using Bayes factors to assess prior specification. We compare experts with each other and the data to evaluate their appropriateness. Using this method, new research questions can be asked and answered, for instance: Which expert predicts the new data best? Is there agreement between my experts and the data? Which experts’ representation is more valid or useful? Can we reach convergence between expert judgement and data? We provided an empirical example ranking (regional) directors of a large financial institution based on their predictions of turnover. \n\nBe sure to consult the [correction made here](https://www.semanticscholar.org/paper/Correction%3A-Veen%2C-D.%3B-Stoel%2C-D.%3B-Schalken%2C-N.%3B-K.%3B-Veen-Stoel/a2882e0e8606ef876133f25a901771259e7033b1)\n\n::: \n\n\n::: {.callout-note collapse=\"true\"}\n## Also seems relevant:\n\nSee [Gsheet HERE](https://docs.google.com/spreadsheets/d/14japw6eLGpGjEWy1MjHNJXU1skZY_GAIc2uC2HIUalM/edit#gid=0), generated from an Elicit.org inquiry.\n\n\n::: \n\n\n\nIn spite of the caveats in the fold above, we construct some measures of aggregate beliefs using the `aggrecat` package. We will make (and explain) some ad-hoc choices here. We present these:\n\n1. For each paper\n2. For categories of papers and cross-paper categories of evaluations\n3. 
For the overall set of papers and evaluations\n\nWe can also hold onto these aggregated metrics for later use in modeling.\n\n\n- Simple averaging\n\n- Bayesian approaches \n\n- Best-performing approaches from elsewhere \n\n- Assumptions over unit-level random terms \n\n\n### Simple rating aggregation {-}\n\nBelow, we are preparing the data for the aggreCATS package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# JB: This section is a work in progress, please do not edit\n\n# paper_ratings: one row per rating category and 'type' (score, upper, lower bound.)\nevals_pub %>% \n select(id, eval_name, paper_abbrev, \n overall, overall_lb_imp, overall_ub_imp,\n adv_knowledge, adv_knowledge_lb_imp, adv_knowledge_ub_imp,\n methods, methods_lb_imp, methods_ub_imp,\n logic_comms, logic_comms_lb_imp, logic_comms_ub_imp,\n real_world, real_world_lb_imp, real_world_ub_imp,\n gp_relevance, gp_relevance_lb_imp, gp_relevance_ub_imp,\n open_sci, open_sci_lb_imp, open_sci_ub_imp) %>% \n rename_with(function(x) paste0(x,\"_score\"), all_of(rating_cats)) %>%\n pivot_longer(cols = c(-id, -eval_name, -paper_abbrev),\n names_pattern = \"(.+)_(score|[ul]b_imp)\",\n names_to = c(\"criterion\",\"element\"),\n values_to = \"value\") -> paper_ratings\n\n# renaming to conform with aggreCATS expectations\npaper_ratings <- paper_ratings %>% \n rename(paper_id = paper_abbrev,\n user_name = eval_name) %>% \n mutate(round = \"round_1\",\n element = case_when(element == \"lb_imp\" ~ \"three_point_lower\",\n element == \"ub_imp\" ~ \"three_point_upper\",\n element == \"score\" ~ \"three_point_best\"))\n\n# filter only overall for now\npaper_ratings %>% \n filter(criterion == \"overall\") %>% \n group_by(user_name, paper_id) %>% \n filter(sum(is.na(value))==0) %>% \n ungroup() -> temp\n \n\nAverageWAgg(expert_judgements = temp, round_2_filter = FALSE, type = \"ArMean\")\n\nIntervalWAgg(expert_judgements = temp, round_2_filter = FALSE, type = \"IntWAgg\")\n\naggreCAT::DistributionWAgg(expert_judgements = 
temp, round_2_filter = FALSE, type = \"DistribArMean\", percent_toggle = T)\n\n# EXAMPLE CODE ===============================\n# data(data_ratings)\n# set.seed(1234)\n# \n# participant_subset <- data_ratings %>%\n# distinct(user_name) %>%\n# sample_n(5) %>%\n# mutate(participant_name = paste(\"participant\", rep(1:n())))\n# \n# single_claim <- data_ratings %>%\n# filter(paper_id == \"28\") %>%\n# right_join(participant_subset, by = \"user_name\") %>%\n# filter(grepl(x = element, pattern = \"three_.+\")) %>%\n# select(-group, -participant_name, -question)\n# \n# DistributionWAgg(expert_judgements = single_claim,\n# type = \"DistribArMean\", percent_toggle = T)\n# \n```\n:::\n\n\n\n\n\n### Explicit modeling of 'research quality' (for use in prizes, etc.) {-}\n\n- Use the above aggregation as the outcome of interest, or weight towards categories of greater interest?\n\n- Model with controls -- look for greatest positive residual? \n\n\n## Inter-rater reliability\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](aggregation_files/figure-html/unnamed-chunk-1-1.png){width=672}\n:::\n:::\n\n\n\n\n## Decomposing variation, dimension reduction, simple linear models\n\n\n## Later possiblities\n\n- Relation to evaluation text content (NLP?)\n\n- Relation/prediction of later outcomes (traditional publication, citations, replication)\n",
"supporting": [
"aggregation_files"
],


10 changes: 9 additions & 1 deletion chapters/aggregation.qmd
@@ -6,7 +6,7 @@
#| include: false
library(tidyverse)
library(aggreCAT)
#library(aggreCAT)
library(here)
library(irr)
@@ -204,6 +204,14 @@ evals_pub %>%
```


<!-- TODO: Could you do a bit more to interpret the Krippendorff’s alpha IRR measure?
Also, wouldn’t there be some other ways of aggregating for this:
for each metric, across all papers (comparing the relative IRRs of these)
for all metrics across all papers
https://unjournalfriends.slack.com/archives/D05JMD2KQMP/p1694561973337029
-->
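As a rough sketch of what this TODO asks for (an editorial illustration, not part of the commit), Krippendorff's alpha could be computed per metric across all papers and the resulting alphas compared. This assumes the `evals_pub` data frame and `rating_cats` vector defined earlier in the chapter, with one rating per evaluator-paper pair:

```r
# Sketch only: Krippendorff's alpha separately for each rating metric,
# across all papers. Assumes `evals_pub` has one row per
# (eval_name, paper_abbrev) and `rating_cats` names the rating columns,
# as in the pipeline above.
library(tidyverse)
library(irr)

alpha_by_metric <- map_dfr(rating_cats, function(cat) {
  m <- evals_pub %>%
    select(eval_name, paper_abbrev, all_of(cat)) %>%
    pivot_wider(names_from = paper_abbrev, values_from = all_of(cat)) %>%
    column_to_rownames("eval_name") %>%
    as.matrix()
  # irr::kripp.alpha() expects a raters-by-subjects matrix
  tibble(criterion = cat,
         alpha = kripp.alpha(m, method = "interval")$value)
})

arrange(alpha_by_metric, desc(alpha))
```

Comparing these per-metric alphas, alongside a single pooled alpha over all metrics and papers, would address the two aggregation variants raised in the comment.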


## Decomposing variation, dimension reduction, simple linear models

Binary file modified data/all_papers_p.Rdata
3 changes: 1 addition & 2 deletions data/all_papers_p.csv
@@ -20,11 +20,11 @@ rec8CVePFXLK7bxWn,,0.53,NA,0.8,,"""Innovation, meta-science, and research""",NA,
rec8MONL7xFGq2BaM,,0.7,0.58,0.9,,Economic development & governance (LMICs),"""Other: Economics, growth, policy, global markets and population ""","Development, Development and Growth, Regional Economics, Regional and Urban Economics, Migration",Emmanuel Orkoh,Unpublished working paper,Follow-up email sent,seeking_(more)_evaluators,internal-NBER,not needed (Unjournal Direct),NA,2022-11-05T15:57:00.000Z
rec99VmJ2A7naPxuc,,0.95,NA,0.85,,"""Global health; """"Health & well-being in low-income countries""""""",NA,Health,Ryan Briggs ,NA,NA,seeking_(more)_evaluators,suggested - externally - NGO,not needed (Unjournal Direct),NA,2023-06-21T13:57:12.000Z
rec9sED8IjDgjICxH,,0.65,NA,0.65,,Empirical methods,"""Catastrophic and existential risks, the long-term future, forecasting""",Econometrics,NA,NA,NA,NA,internal-from-syllabus-agenda-policy-database,NA,NA,2022-04-26T20:23:52.000Z
recAIY1CCAN8PB417,,NA,NA,NA,,NA,NA,NA,NA,NA,NA,Not a paper/project,NA,NA,NA,2023-07-31T16:18:12.000Z
recBdPi9jTrdn6xhs,,0.65,NA,0.55,,"""Global health; """"Health & well-being in low-income countries""""""",NA,NA,NA,NA,Emailed,considering,suggested - internally,NA,NA,2022-09-24T15:18:34.000Z
recC20N4daHXFNmJJ,,0.6,NA,NA,,"""Communicable diseases, bio-security and pandemic preparedness, biological risks""",NA,"""Health, Education, and Welfare"", Health","David Reinstein, Sam Abbott",Unpublished working paper,NA,considering,internal-NBER,NA,NA,2022-11-05T22:40:56.000Z
recCblJLRgWmhYBcO,,0.52,NA,NA,,Economic development & governance (LMICs),NA,NA,NA,NA,NA,NA,internal-NBER,NA,NA,2023-06-08T23:29:52.000Z
recDKf292flMuBf7b,,0.57,NA,NA,,"""Catastrophic and existential risks, the long-term future, forecasting""",Economic development & governance (LMICs),NA,NA,NA,NA,NA,internal-NBER,NA,NA,2023-06-08T23:34:05.000Z
recDx0VLZQq5nckAO,,0.6,NA,NA,,"""Catastrophic and existential risks, the long-term future, forecasting""",Empirical methods,meta-analysis,NA,Unpublished,NA,NA,suggested - externally,NA,NA,2023-09-11T19:50:14.000Z
recEZssfl3wF37J1T,,NA,NA,NA,,NA,NA,NA,NA,NA,NA,NA,suggested - externally - NGO,NA,NA,2023-07-31T16:18:12.000Z
recEiYGtyDewEDl9T,,0.55,NA,0.8,,Emerging technologies: social and economic impacts (focus: AI),NA,"Development and Growth, Innovation and R&D, Economic Systems, Industrial Organization",Kris Gulati,Unpublished working paper,NA,NA,internal-NBER,NA,NA,2022-11-05T22:16:13.000Z
recFauZqPDBMVK28J,,NA,NA,NA,,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,2023-07-31T16:18:12.000Z
@@ -119,7 +119,6 @@ recsiQaf3ZTSkkXLV,,0.63,NA,NA,,"""Global health; """"Health & well-being in low-
recsii9l3QRQFerkU,,1,0.8,1,,"""Catastrophic and existential risks, the long-term future, forecasting""",Emerging technologies: social and economic impacts (focus: AI),NA,NA,"Published, ? journal",Agreed,published,submitted,not needed (submitted by authors),NA,2022-05-08T03:58:56.000Z
recsxlSRIz4Y1RHd3,,NA,NA,NA,,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,2023-07-31T16:18:12.000Z
rect8c6gbgVnvz6Zt,,NA,NA,NA,,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,2023-08-14T19:32:14.000Z
rectZXnEtaDibizPe,,NA,NA,NA,,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,2023-08-22T22:03:51.000Z
rectfSMcCGKrVVtuw,,0.63,NA,NA,,"""Global health; """"Health & well-being in low-income countries""""""",Economic development & governance (LMICs),"Public Economics, ""Health, Education, and Welfare"", Poverty and Wellbeing, Labor Economics, Demography and Aging, Labor Supply and Demand, Development and Growth, Development","Hansika Kapoor, Anirudh Tagat",NA,Emailed,published,internal-NBER,NA,NA,2022-11-23T01:51:58.000Z
rectim9KLJ6yQ1Goa,,0.56,NA,0.56,,"""Catastrophic and existential risks, the long-term future, forecasting""",NA,NA,NA,"Published, ~top journal",NA,NA,internal-from-syllabus-agenda-policy-database,NA,NA,2022-04-15T15:57:47.000Z
rectraxzNjDb0cDbU,,NA,NA,NA,,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,2023-08-22T20:29:40.000Z
Binary file modified data/evals.Rdata
