From e7e469445c1f049809f027f68d2f74c317c0cbcb Mon Sep 17 00:00:00 2001
From: Niema Moshiri <niemamoshiri@gmail.com>
Date: Sat, 13 Apr 2024 11:34:31 -0700
Subject: [PATCH] Added RYG distributions

---
 teach_online/academic_integrity.md | 72 ++++++++++++++++++++++++++----
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/teach_online/academic_integrity.md b/teach_online/academic_integrity.md
index 224f72e..7388acd 100644
--- a/teach_online/academic_integrity.md
+++ b/teach_online/academic_integrity.md
@@ -178,8 +178,11 @@ we would have *n*/2 cheating pairs of students out of *n*(*n*–1)/2 total pairs
 In a class of *n* = 100 students,
 we would have 100/2 = 50 cheating pairs and out of 100(99)/2 = 4,950 total pairs of students
 (just over 1% of all pairs of students).
-Thus, we can use the distribution of all pairwise MESS calculations as an approximation of the null distribution,
-and we can try to identify collaboration by looking at outliers of this distribution.
+Thus, we can use the distribution of all pairwise MESS calculations as an approximation of the null distribution
+(null hypothesis = "MESS score resulted from no collaboration"),
+and we can try to identify collaboration by looking at outliers of this distribution
+(e.g., perform one-sided tests of statistical significance,
+as well as multiple hypothesis test correction).
 
 ```{figure} ../images/mess_distribution.png
 ---
@@ -205,16 +208,69 @@ However, there are a handful of limitations of this method:
 * Thus, this method is *specific* (i.e., high MESS typically implies collaboration), but not *sensitive* (i.e., it can miss true cases of cheating)
 
 MESS gives us a way of looking at the *uniqueness* of shared incorrect responses,
-but we can actually gain interesting insights from the *number* of shared incorrect responses
+but we can also gain interesting insights from the *number* of shared incorrect responses
 in the context of all incorrect responses they submitted.
-TODO WRITE ABOUT RYG DISTRIBUTION
+Specifically,
+expanding on {cite:t}`moshiri_scalable_2022`,
+*while* performing all MESS calculations,
+we can also count the following for every pair of students
+(the color names are just labels, chosen so that a "scarier" color means "more suspicious"):
+
+* Red Count = The number of questions both students missed with the *exact same* wrong answer
+  * If students collaborate, we expect a disproportionately large number of identical wrong answers between them
+* Yellow Count = The number of questions both students missed, but with *different* wrong answers
+  * If students collaborate, we might expect them to give the same wrong answer,
+    so Yellow questions could be evidence against collaboration
+  * However, if students collaborate and are torn between two potential answers,
+    one might guess one answer, and one might guess another,
+    so Yellow questions could be evidence supporting collaboration
+  * Thus, overall, Yellow questions are semi-neutral
+* Green Count = The number of questions only one of the two students missed
+  * In other words, the number of questions one student got right and one student got wrong
+  * If students collaborate, we expect them to miss the same questions, so a high Green Count could be evidence against collaboration
+* Black Count = Red Count + Yellow Count + Green Count
+  * In other words, this is the total number of questions *at least* one student missed
+  * Why is this helpful? We'll discuss it a bit later in this section
+
+Recall from the earlier thought experiment that
+we can safely assume that the *vast majority* of pairwise comparisons are *not* cheating pairs.
+As a result, we can look at the distributions of the Red, Yellow, and Green counts
+across all pairs of students in the class as approximations of their null distributions
+(null hypothesis = "Red, Yellow, and Green Counts resulted from no collaboration"),
+and we can try to identify collaboration by looking at outliers of these distributions.
+The range of possible Red, Yellow, and Green Counts for a given pair of students is bounded by their Black Count
+(Black = Red + Yellow + Green),
+so we can do the following:
+
+* Plot the 2D distributions of Red, Yellow, and Green Counts (vertical axis) vs. Black Count (horizontal axis)
+  * In other words, each pair of students defines 3 points: (Black, Red), (Black, Yellow), and (Black, Green)
+* Plot a given pair's (Black, Red), (Black, Yellow), and (Black, Green) points
+* Check if the pair's Red, Yellow, and Green Counts deviate significantly from what is expected at that Black Count
+  * Estimate expected values based on the null distributions at that Black Count
+  * Perform a statistical test to check for significance
+    (e.g., [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test)
+    or [χ² test](https://en.wikipedia.org/wiki/Chi-squared_test) with 2 degrees of freedom)
+
+```{figure} ../images/ryg_distributions.png
+---
+height: 500px
+name: ryg_distributions
+---
+Distributions of all pairwise Red, Yellow, and Green Counts vs. Black Counts in a 500-person Advanced Data Structures course (log-scale).
+2D [Kernel Density Estimates (KDEs)](https://en.wikipedia.org/wiki/Kernel_density_estimation) are shown as colored contours,
+and best-fit lines are shown for each distribution.
+A single pair of students with suspiciously outlying Red (9), Yellow (0), and Green (0) Counts for their given Black Count (9)
+is shown as a black vertical line with colored dots.
+```
 
-We wrote a Python program to perform all pairwise MESS calculations,
+We wrote a suite of Python programs to perform all pairwise
+Red Count, Yellow Count, Green Count, and MESS calculations,
 calculate a best-fit [Exponential distribution](https://en.wikipedia.org/wiki/Exponential_distribution),
-plot the distribution,
-and perform other downstream analyses on [GitHub](https://github.com/niemasd/MESS).
+plot the distributions,
+and perform other downstream analyses,
+which is available as an open-source project on [GitHub](https://github.com/niemasd/MESS).
 The tools in this repository support exams with multiple choice, short answer, math, Parsons, etc. problems:
-they simply perform string equality comparisons between responses.
+they simply compare responses for string equality.
 
 ```{glossary}
 Detection