{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Learning LDA model via Gibbs sampling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 1\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 3.14.04 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 2\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 3.14.28 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 3\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.11.56 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 4\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.53.54 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"\n",
"- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/AK3N4kI8EeaXyw5hjmsWew\n",
"- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/L5yUeFA4EearRRKGx4XuoQ\n",
"- Understand notation:\n",
" - $n_{i,k}$: count of topics (1's and 2's) in the document after you decrement the target word \"manager\". If the target word is for topic 1 then you won't need to decrement since the manager = topic 2. So you count the 1's. If the target word \"manager\" is for topic 2 then you have to decrement the single count of topic 2 which then makes your $n_{i,k}$ = 0.\n",
" - $N_i$: count of words in doc i\n",
" - V: total count of vocabulary, which is 10\n",
" - $m_{manager,k}$: total count of word, \"manager\" in the corpus assigned to topic k\n",
" - $\\sum_{w} m_{w,k}$: Sum of count of all words in the corpus assigned to topic k"
]
},
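{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting this notation together: in the collapsed Gibbs sampling step, the new topic $k$ for the target word \"manager\" is drawn with unnormalized probability equal to a document factor times a word factor (the same formula written out in Question 10 below):\n",
"\n",
"$$p(z_{i,\\text{manager}} = k \\mid \\text{rest}) \\propto \\frac{n_{i,k} + \\alpha}{N_i - 1 + K \\alpha} \\cdot \\frac{m_{\\text{manager},k} + \\gamma}{\\sum_w m_{w,k} + V \\gamma}$$"
]
},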
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 5\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.06 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $\\sum_w m_{w, 1}$: total number of words in the corpus of topic 1, which is the sum of all words assigned to topic 1\n",
"- 52 + 15 + 9 + 9 + 20 + 17 + 1 = 123"
]
},
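{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the sum above (the per-word topic-1 counts are copied from the quiz table; the variable name is just illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Per-word counts assigned to topic 1, copied from the answer above\n",
"topic_1_counts = [52, 15, 9, 9, 20, 17, 1]\n",
"print(sum(topic_1_counts))  # expected: 123"
]
},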
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 6\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.09 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $n_{i, 1}$: # current assignments to topic 1 in doc i, which is how many times topic 1 appears in document i.\n",
" - clearly 3 times, for baseball + ticket + owner = 1 + 1 + 1 = 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 7\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.13 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- When we remove the assignment of manager to topic 2 in document i, we only have 1 assignment of topic 2 which is the word \"price\"\n",
" - $n_{i, 2}$: # current assignments to topic 2 in doc i, which is 1"
]
},
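{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the per-document count bookkeeping behind Questions 6 and 7, assuming document $i$ is represented as a list of (word, topic) pairs; the representation and variable names are illustrative, not the course's implementation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Toy representation of document i: (word, current topic assignment) pairs,\n",
"# using the assignments referenced in the answers to Questions 6 and 7.\n",
"doc_i = [(\"baseball\", 1), (\"ticket\", 1), (\"owner\", 1), (\"price\", 2), (\"manager\", 2)]\n",
"target_word = \"manager\"\n",
"\n",
"# n_{i,k}: topic counts in document i after removing the target word's assignment\n",
"n_i = {1: 0, 2: 0}\n",
"for word, topic in doc_i:\n",
"    if word != target_word:\n",
"        n_i[topic] += 1\n",
"\n",
"print(n_i)  # expected: {1: 3, 2: 1}\n",
"print(len(doc_i))  # N_i = 5"
]
},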
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 8\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.18 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- When we remove the assignment of \"manager\" to topic 2 in document i -> manager: 0\n",
" - The total counts of manager in topic 2 in the corpus: \n",
" - number of assignments corpus-wide of word \"manager\" to topic 2 - number of assignment of word \"manager\" to topic 2 in document i\n",
" - $\\large m_{\\text{manager,2}} - z_{\\text{i,manager}}$ = 37 - 1 = 36"
]
},
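{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same decrement written as a tiny code step (the count 37 is taken from the quiz table; the dictionary is an illustrative stand-in for the corpus-wide word-topic count matrix):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Corpus-wide count of \"manager\" assigned to topic 2, before removing\n",
"# the assignment in document i (value taken from the quiz table)\n",
"m = {(\"manager\", 2): 37}\n",
"m[(\"manager\", 2)] -= 1  # remove document i's assignment of \"manager\" to topic 2\n",
"print(m[(\"manager\", 2)])  # expected: 36"
]
},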
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 9\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.12.06 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $\\sum_w m_{w, 2}$: total number of words in the corpus of topic 2, which is the sum of all words assigned to topic 2 after we decrement the associated counts\n",
"- 2 + 25 + 36 + 32 + 23 + 75 + 19 + 29 = 241"
]
},
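{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the sum above (the per-word topic-2 counts are copied from the quiz table, with the \"manager\" entry already decremented to 36):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Per-word counts assigned to topic 2, after decrementing the \"manager\" count\n",
"topic_2_counts = [2, 25, 36, 32, 23, 75, 19, 29]\n",
"print(sum(topic_2_counts))  # expected: 241"
]
},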
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 10\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.17.14 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"\n",
"As discussed in the slides, the unnormalized probability of assigning to topic 1 is\n",
"\n",
"- $p_1 = \\frac{n_{i, 1} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 1} + \\gamma}{\\sum_w m_{w, 1} + V \\gamma}$\n",
"\n",
"where V is the total size of the vocabulary.\n",
"\n",
"Similarly the unnormalized probability of assigning to topic 2 is\n",
"- $p_2 = \\frac{n_{i, 2} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 2} + \\gamma}{\\sum_w m_{w, 2} + V \\gamma}$\n",
"\n",
"Using the above equations and the results computed in previous questions, compute the probability of assigning the word “manager” to topic 1.\n",
"\n",
"- Left Formula = (# times topic_1 appears in doc + alpha) / (# words in doc - 1 + K * alpha)\n",
"- Right Formula = (# of corpus-wide assignment of 'manager' to topic 1 + gamma) / (Sum of all topic 1 word counts + V * gamma)\n",
"- Prob 1 = (Left Formula) * (Right Formula)\n",
"\n",
"Example: \n",
"- calculate prob 1\n",
" - Left Formula = (3 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.5417\n",
" - Right Formula = (20 + 0.1) / (123 + (10 * 0.1)) = 0.162\n",
" - Prob 1 = 0.5417 * 0.162 = 0.0877554\n",
"- calculate prob 2\n",
" - Left Formula = (1 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.4583\n",
" - Right Formula = (36 + 0.1) / (241 + (10 * 0.1)) = 0.1492\n",
" - Prob 1 = 0.4583 * 0.1492 = 0.06837836\n",
"- normalize prob 1\n",
" - normalize prob 1 = Prob 1/(Prob 1 + Prob 2) = 0.0877554/(0.0877554 + 0.06837836) = 0.562"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0878024193548\n",
"0.0683712121212\n",
"0.562210268949\n"
]
}
],
"source": [
"def calculate_unnorm_prob(n_i, alpha, N_i, K, m_word, gamma, sum_of_m, V):\n",
" \"\"\" Calculate unnormalized probability of assigning to topic\n",
" \"\"\"\n",
" left_formula = (n_i + alpha)/(N_i - 1 + (K * alpha))\n",
" right_formula = (m_word + gamma)/(sum_of_m + (V * gamma))\n",
" prob = left_formula * right_formula\n",
" return prob\n",
"\n",
"\n",
"unnorm_prob_1 = calculate_unnorm_prob(3, 10.0, 5, 2, 20, 0.1, 123, 10)\n",
"unnorm_prob_2 = calculate_unnorm_prob(1, 10.0, 5, 2, 36, 0.1, 241, 10)\n",
"print unnorm_prob_1\n",
"print unnorm_prob_2\n",
"prob_1 = unnorm_prob_1/(unnorm_prob_1 + unnorm_prob_2)\n",
"print prob_1"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
},
"toc": {
"toc_cell": false,
"toc_number_sections": false,
"toc_threshold": "8",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 0
}