diff --git a/README.md b/README.md index 6a8799c..c958205 100644 --- a/README.md +++ b/README.md @@ -98,6 +98,10 @@ Once an initial release has been created for a new workshop, create subsequent t Note: I do not follow [Semantic Versioning](https://semver.org/) for this project. For the first digit (in semver, `major`), I use the year of the target workshop, and for the last (in semver `patch`), I increment when a chunk of work is done towards giving the workshop. The middle digit (in semver, `minor`) stays on 0 until I give the workshop, when it bumps to 1. Fixes to the given workshop get reflected in the patch versions `yyyy.1.`. +### v2024.1.3 + +* Add output and commentary on do vs non-do probabilities to the `causal-models-exercises` notebook. + ### v2024.1.2 * Add concluding slide to introductory presentation diff --git a/notebooks/causal-models-exercises.ipynb b/notebooks/causal-models-exercises.ipynb index 0d421c7..016b597 100644 --- a/notebooks/causal-models-exercises.ipynb +++ b/notebooks/causal-models-exercises.ipynb @@ -28,6 +28,77 @@ "from brent import DAG, Query" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Expanded proof of quantities unchanged between $G$ and $G' = G_{\\underline{\\textrm{days}}}$\n", + "\n", + "From the slides \"Correlation and Causality\".\n", + "\n", + "The \"given-days\" graph $G$:\n", + "\n", + "![Graph of hit-rate for conditional calculations](../slides/correlation-causality/graphics/given_days.png)\n", + "\n", + "The \"do-days\" graph $G'$:\n", + "\n", + "![Graph of hit-rate for do calculations](../slides/correlation-causality/graphics/do_days.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Proof of $P_{G'}(\\textrm{producttype} = p, \\textrm{rating} = r) = P_G(\\textrm{producttype} = p, \\textrm{rating} = r)$\n", + "\n", + "We change notation slightly for this proof, writing $Pr$ for probability, and $P$ for $\\textrm{producttype}$, $R$ for 
$\\textrm{rating}$, etc.\n", + "\n", + "By the definition of graphical models and the decomposition of conditional probabilities\n", + "\n", + "$$\n", + "Pr_G(P, R, D, H) = Pr(P) \\cdot Pr_G(R|P) \\cdot {\\color{red}{Pr_G(D|P)}} \\cdot Pr_G(H | D, R)\n", + "$$\n", + "\n", + "Note how we have dropped the subscript from $Pr(P)$, as the marginal probabilities of each vertex variable are independent of the underlying graph. For $G'$ we have\n", + "\n", + "$$\n", + "Pr_{G'}(P, R, D, H) = Pr(P) \\cdot Pr_G(R|P) \\cdot {\\color{red}{Pr(D)}} \\cdot Pr_G(H | D, R)\n", + "$$\n", + "\n", + "where we've highlighted in $\\color{red}{\\textrm{red}}$ the terms that differ between the two expressions.\n", + "\n", + "Note further that, by definition of the graph surgery we do in going from $G$ to $G'$, all conditional probabilities that aren't affected by the \"do\"-surgery are preserved.\n", + "\n", + "Now consider $P_{G'}(\\textrm{producttype} = p, \\textrm{rating} = r)$. Using the conditioning trick\n", + "\n", + "\\begin{align*}\n", + "Pr_{G'}(P, R) & = \\sum_{d, h} Pr_{G'}(P=p, R=r, D=d, H=h) \\\\\n", + "& = Pr(P) \\cdot Pr_{G'}(R|P) \\cdot \\sum_{d, h} {\\color{red}{Pr(D)}} \\cdot Pr_G(H | D, R) \\textrm{, by the graph property} \\\\\n", + "& = Pr(P) \\cdot Pr_{G'}(R|P) \\cdot \\sum_{d} {\\color{red}{Pr(D)}} \\sum_h Pr_G(H | D, R) \\textrm{, since $Pr(D)$ does not depend on $h$} \\\\\n", + "& = Pr(P) \\cdot Pr_{G'}(R|P) \\cdot \\sum_{d} {\\color{red}{Pr(D)}} \\cdot 1 \\textrm{, by the law of total probability} \\\\\n", + "& = Pr(P) \\cdot Pr_{G'}(R|P) \\cdot 1 \\textrm{, by the law of total probability for $D$}\n", + "\\end{align*}\n", + "\n", + "Similarly\n", + "\n", + "\\begin{align*}\n", + "Pr_{G}(P, R) & = \\sum_{d, h} Pr_{G}(P=p, R=r, D=d, H=h) \\\\\n", + "& = Pr(P) \\cdot Pr_G(R|P) \\cdot \\sum_{d, h} {\\color{red}{Pr(D | P)}} \\cdot Pr_G(H | D, R) \\textrm{, by the graph property} \\\\\n", + "& = Pr(P) \\cdot Pr_G(R|P) \\cdot \\sum_{d} {\\color{red}{Pr(D | P)}} 
\\sum_h Pr_G(H | D, R) \\textrm{, since $Pr(D | P)$ does not depend on $h$} \\\\\n", + "& = Pr(P) \\cdot Pr_G(R|P) \\cdot \\sum_{d} {\\color{red}{Pr(D | P)}} \\cdot 1 \\textrm{, by the law of total probability} \\\\\n", + "& = Pr(P) \\cdot Pr_G(R|P) \\cdot 1 \\textrm{, by the law of total probability for $D$ given $P$}\n", + "\\end{align*}\n", + "\n", + "Since the \"do\" graph surgery preserves every conditional probability it does not touch, we have $Pr_{G'}(R|P) = Pr_G(R|P)$, and the result follows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, @@ -156,79 +227,72 @@ "execution_count": 4, "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning: Could not load \"/usr/local/Cellar/graphviz/2.47.0/lib/graphviz/libgvplugin_pango.6.dylib\" - It was found, so perhaps one of its dependents was not. Try ldd.\n" - ] - }, { "data": { "image/svg+xml": [ "\n", "\n", - "\n", "\n", - "\n", - "\n", - "\n", + "\n", + "\n", + "\n", "\n", "\n", "producttype\n", - "\n", - "producttype\n", + "\n", + "producttype\n", "\n", "\n", "\n", "days\n", - "\n", - "days\n", + "\n", + "days\n", "\n", "\n", "\n", "producttype->days\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "rating\n", - "\n", - "rating\n", + "\n", + "rating\n", "\n", "\n", "\n", "producttype->rating\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "hit\n", - "\n", - "hit\n", + "\n", + "hit\n", "\n", "\n", "\n", "days->hit\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "rating->hit\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n" ], "text/plain": [ - "" + "" ] }, "execution_count": 4, @@ -247,80 +311,73 @@ "execution_count": 5, "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning: Could not load \"/usr/local/Cellar/graphviz/2.47.0/lib/graphviz/libgvplugin_pango.6.dylib\" - It was found, so perhaps 
one of its dependents was not. Try ldd.\n" - ] - }, { "data": { "image/svg+xml": [ "\n", "\n", - "\n", "\n", - "\n", - "\n", - "\n", + "\n", + "\n", + "\n", "\n", "\n", "producttype\n", - "\n", - "producttype\n", + "\n", + "producttype\n", "\n", "\n", "\n", "days\n", - "\n", - "\n", - "days\n", + "\n", + "\n", + "days\n", "\n", "\n", "\n", "producttype->days\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "rating\n", - "\n", - "rating\n", + "\n", + "rating\n", "\n", "\n", "\n", "producttype->rating\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "hit\n", - "\n", - "hit\n", + "\n", + "hit\n", "\n", "\n", "\n", "days->hit\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "rating->hit\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n" ], "text/plain": [ - "" + "" ] }, "execution_count": 5, @@ -338,91 +395,84 @@ "execution_count": 6, "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning: Could not load \"/usr/local/Cellar/graphviz/2.47.0/lib/graphviz/libgvplugin_pango.6.dylib\" - It was found, so perhaps one of its dependents was not. 
Try ldd.\n" - ] - }, { "data": { "image/svg+xml": [ "\n", "\n", - "\n", "\n", - "\n", - "\n", - "\n", + "\n", + "\n", + "\n", "\n", "\n", "producttype\n", - "\n", - "producttype\n", + "\n", + "producttype\n", "\n", "\n", "\n", "days\n", - "\n", - "\n", - "days\n", + "\n", + "\n", + "days\n", "\n", "\n", "\n", "producttype->days\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "rating\n", - "\n", - "rating\n", + "\n", + "rating\n", "\n", "\n", "\n", "producttype->rating\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "hit\n", - "\n", - "hit\n", + "\n", + "hit\n", "\n", "\n", "\n", "days->hit\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", " \n", - " \n", + " \n", "\n", "\n", "\n", " ->days\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "rating->hit\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n" ], "text/plain": [ - "" + "" ] }, "execution_count": 6, @@ -522,10 +572,69 @@ "p_pr_given_ds.name = 'hit'" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the red terms in the above equations were identical, then the values for each group of rows with fixed $D=d$ would be identical copies of the above variable `P_Gpr`." 
+ ] + }, { "cell_type": "code", "execution_count": 9, "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "producttype rating days\n", + "financial 0 0 0.091294\n", + " 1 0 0.374588\n", + "liability 0 0 0.343059\n", + " 1 0 0.143529\n", + "property 0 0 0.016941\n", + " 1 0 0.030588\n", + "financial 0 1 0.017527\n", + " 1 1 0.079844\n", + "liability 0 1 0.510224\n", + " 1 1 0.252191\n", + "property 0 1 0.062317\n", + " 1 1 0.077897\n", + "financial 0 2 0.002413\n", + " 1 2 0.015682\n", + "liability 0 2 0.533172\n", + " 1 2 0.225573\n", + "property 0 2 0.095296\n", + " 1 2 0.127865\n", + "financial 0 3 0.002841\n", + " 1 3 0.016335\n", + "liability 0 3 0.094460\n", + " 1 3 0.040483\n", + "property 0 3 0.332386\n", + " 1 3 0.513494\n", + "Name: hit, dtype: float64" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "p_pr_given_ds" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The probabilities do indeed differ across different $D=d$ values, so the \"do\" probabilities from $G'$ are different from the non-do probabilities from $G$." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, "outputs": [], "source": [ "# Sanity check on P_G(P=p, R=r | D=d) and P_G(P=p, R=r) calculations\n", @@ -558,7 +667,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -572,7 +681,7 @@ "Name: prob, dtype: float64" ] }, - "execution_count": 10, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -611,7 +720,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -629,7 +738,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ @@ -652,7 +761,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -715,7 +824,7 @@ "2 2 3 0.473538 0.102707" ] }, - "execution_count": 13, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } @@ -746,7 +855,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -864,73 +973,66 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 16, "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning: Could not load \"/usr/local/Cellar/graphviz/2.47.0/lib/graphviz/libgvplugin_pango.6.dylib\" - It was found, so perhaps one of its dependents was not. 
Try ldd.\n" - ] - }, { "data": { "image/svg+xml": [ "\n", "\n", - "\n", "\n", "\n", "\n", - "\n", + "\n", "\n", "\n", "K\n", "\n", - "K\n", + "K\n", "\n", "\n", "\n", "H\n", "\n", - "H\n", + "H\n", "\n", "\n", "\n", "K->H\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "W\n", "\n", - "W\n", + "W\n", "\n", "\n", "\n", "K->W\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n", "H->W\n", - "\n", - "\n", + "\n", + "\n", "\n", "\n", "\n" ], "text/plain": [ - "" + "" ] }, - "execution_count": 15, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } @@ -985,9 +1087,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.5" + "version": "3.10.10" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 }