diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.04 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.04 PM.png
new file mode 100644
index 0000000..10bdbea
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.04 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.28 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.28 PM.png
new file mode 100644
index 0000000..37ead55
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.28 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.53.54 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.53.54 PM.png
new file mode 100644
index 0000000..df8fcc0
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.53.54 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.06 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.06 PM.png
new file mode 100644
index 0000000..b3803bb
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.06 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.09 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.09 PM.png
new file mode 100644
index 0000000..976b031
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.09 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.13 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.13 PM.png
new file mode 100644
index 0000000..2a59e58
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.13 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.18 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.18 PM.png
new file mode 100644
index 0000000..d2f66b2
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.18 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.27 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.27 PM.png
new file mode 100644
index 0000000..e7fae81
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.27 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.11.56 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.11.56 PM.png
new file mode 100644
index 0000000..e3fb281
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.11.56 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.12.06 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.12.06 PM.png
new file mode 100644
index 0000000..14e16bc
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.12.06 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.17.14 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.17.14 PM.png
new file mode 100644
index 0000000..6a70771
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.17.14 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/quiz-Learning LDA model via Gibbs sampling.ipynb b/machine_learning/4_clustering_and_retrieval/lecture/week5/quiz-Learning LDA model via Gibbs sampling.ipynb
new file mode 100644
index 0000000..50f8bae
--- /dev/null
+++ b/machine_learning/4_clustering_and_retrieval/lecture/week5/quiz-Learning LDA model via Gibbs sampling.ipynb
@@ -0,0 +1,309 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Learning LDA model via Gibbs sampling"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 1\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 2\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 3\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 4\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+ "\n",
+ "- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/AK3N4kI8EeaXyw5hjmsWew\n",
+ "- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/L5yUeFA4EearRRKGx4XuoQ\n",
+ "- Understand notation:\n",
+    "    - $n_{i,k}$: count of topic-$k$ assignments (the 1's and 2's) in the document after decrementing the assignment of the target word \"manager\". Since \"manager\" is currently assigned to topic 2, counting the 1's for $n_{i,1}$ needs no decrement; for $n_{i,2}$ you must first remove manager's own assignment, so if topic 2 appeared only once, $n_{i,2}$ becomes 0.\n",
+    "    - $N_i$: number of words in document $i$\n",
+    "    - $V$: size of the vocabulary, which is 10\n",
+    "    - $m_{\\text{manager},k}$: total count of the word \"manager\" assigned to topic $k$ across the corpus\n",
+    "    - $\\sum_{w} m_{w,k}$: sum of the counts of all words in the corpus assigned to topic $k$"
+ ]
+ },
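+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of these counts, assuming the per-word topic assignments of document i reconstructed from the answers to Questions 6 and 7 below (baseball, ticket, owner -> topic 1; price, manager -> topic 2). The variable names are illustrative, not from the quiz:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from collections import Counter\n",
+    "\n",
+    "# Per-word topic assignments in document i (reconstructed, see note above).\n",
+    "doc_assignments = {'baseball': 1, 'ticket': 1, 'owner': 1, 'price': 2, 'manager': 2}\n",
+    "target_word = 'manager'\n",
+    "\n",
+    "N_i = len(doc_assignments)  # count of words in doc i\n",
+    "\n",
+    "# Decrement: drop the target word's current assignment before counting topics.\n",
+    "n_i = Counter(t for w, t in doc_assignments.items() if w != target_word)\n",
+    "print(N_i)     # 5\n",
+    "print(n_i[1])  # n_{i,1} = 3\n",
+    "print(n_i[2])  # n_{i,2} = 1"
+   ]
+  },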
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 5\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+    "- $\\sum_w m_{w, 1}$: total number of words in the corpus assigned to topic 1, i.e., the sum of the topic-1 counts over all vocabulary words\n",
+ "- 52 + 15 + 9 + 9 + 20 + 17 + 1 = 123"
+ ]
+ },
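+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The same total in code; the list below simply transcribes the topic 1 counts summed in the answer above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Per-word counts assigned to topic 1, transcribed from the answer above.\n",
+    "topic_1_counts = [52, 15, 9, 9, 20, 17, 1]\n",
+    "print(sum(topic_1_counts))  # 123"
+   ]
+  },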
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 6\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+    "- $n_{i, 1}$: # current assignments to topic 1 in doc i, i.e., how many times topic 1 appears in document i.\n",
+    "    - 3 times: \"baseball\", \"ticket\", and \"owner\" are each assigned to topic 1, so $n_{i, 1}$ = 1 + 1 + 1 = 3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 7\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+    "- When we remove the assignment of \"manager\" to topic 2 in document i, only 1 assignment to topic 2 remains: the word \"price\".\n",
+    "    - $n_{i, 2}$: # current assignments to topic 2 in doc i, which is 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 8\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+    "- When we remove the assignment of \"manager\" to topic 2 in document i, its count in this document drops to 0.\n",
+    "    - The corpus-wide count of \"manager\" in topic 2 then becomes:\n",
+    "        - (number of corpus-wide assignments of \"manager\" to topic 2) - (number of assignments of \"manager\" to topic 2 in document i)\n",
+    "        - $m_{\\text{manager},2} - 1 = 37 - 1 = 36$"
+ ]
+ },
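+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The same decrement in code; `m_manager_2` is an illustrative name, not notation from the quiz:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "m_manager_2 = 37  # corpus-wide assignments of 'manager' to topic 2, before removal\n",
+    "m_manager_2 -= 1  # remove the single assignment being resampled in document i\n",
+    "print(m_manager_2)  # 36"
+   ]
+  },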
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 9\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+    "- $\\sum_w m_{w, 2}$: total number of words in the corpus assigned to topic 2, summed after decrementing the associated counts (\"manager\": 37 -> 36)\n",
+ "- 2 + 25 + 36 + 32 + 23 + 75 + 19 + 29 = 241"
+ ]
+ },
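+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And the corresponding sum for topic 2, using the decremented count for \"manager\" (36 instead of 37); the list transcribes the counts from the answer above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Per-word counts assigned to topic 2 after decrementing 'manager' (37 -> 36).\n",
+    "topic_2_counts = [2, 25, 36, 32, 23, 75, 19, 29]\n",
+    "print(sum(topic_2_counts))  # 241"
+   ]
+  },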
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 10\n",
+ "\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+ "\n",
+ "As discussed in the slides, the unnormalized probability of assigning to topic 1 is\n",
+ "\n",
+ "- $p_1 = \\frac{n_{i, 1} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 1} + \\gamma}{\\sum_w m_{w, 1} + V \\gamma}$\n",
+ "\n",
+ "where V is the total size of the vocabulary.\n",
+ "\n",
+    "Similarly, the unnormalized probability of assigning to topic 2 is\n",
+ "- $p_2 = \\frac{n_{i, 2} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 2} + \\gamma}{\\sum_w m_{w, 2} + V \\gamma}$\n",
+ "\n",
+ "Using the above equations and the results computed in previous questions, compute the probability of assigning the word “manager” to topic 1.\n",
+ "\n",
+ "- Left Formula = (# times topic_1 appears in doc + alpha) / (# words in doc - 1 + K * alpha)\n",
+ "- Right Formula = (# of corpus-wide assignment of 'manager' to topic 1 + gamma) / (Sum of all topic 1 word counts + V * gamma)\n",
+ "- Prob 1 = (Left Formula) * (Right Formula)\n",
+ "\n",
+    "Example: \n",
+    "- calculate Prob 1\n",
+    "    - Left Formula = (3 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.5417\n",
+    "    - Right Formula = (20 + 0.1) / (123 + (10 * 0.1)) = 0.1621\n",
+    "    - Prob 1 = 0.5417 * 0.1621 = 0.0878\n",
+    "- calculate Prob 2\n",
+    "    - Left Formula = (1 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.4583\n",
+    "    - Right Formula = (36 + 0.1) / (241 + (10 * 0.1)) = 0.1492\n",
+    "    - Prob 2 = 0.4583 * 0.1492 = 0.0684\n",
+    "- normalize Prob 1\n",
+    "    - normalized Prob 1 = Prob 1 / (Prob 1 + Prob 2) = 0.0878 / (0.0878 + 0.0684) = 0.562"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.0878024193548\n",
+ "0.0683712121212\n",
+ "0.562210268949\n"
+ ]
+ }
+ ],
+ "source": [
+    "def calculate_unnorm_prob(n_i, alpha, N_i, K, m_word, gamma, sum_of_m, V):\n",
+    "    \"\"\"Unnormalized probability of assigning the target word to a topic.\n",
+    "\n",
+    "    n_i      -- # assignments to this topic in doc i (after decrementing)\n",
+    "    N_i      -- # words in doc i;  K -- # topics;  V -- vocabulary size\n",
+    "    m_word   -- corpus-wide # assignments of the target word to this topic\n",
+    "    sum_of_m -- corpus-wide # words assigned to this topic\n",
+    "    \"\"\"\n",
+    "    # Document part: how prevalent the topic is in document i.\n",
+    "    left_formula = (n_i + alpha)/(N_i - 1 + (K * alpha))\n",
+    "    # Word part: how strongly the topic is associated with the word corpus-wide.\n",
+    "    right_formula = (m_word + gamma)/(sum_of_m + (V * gamma))\n",
+    "    return left_formula * right_formula\n",
+    "\n",
+    "\n",
+    "unnorm_prob_1 = calculate_unnorm_prob(3, 10.0, 5, 2, 20, 0.1, 123, 10)\n",
+    "unnorm_prob_2 = calculate_unnorm_prob(1, 10.0, 5, 2, 36, 0.1, 241, 10)\n",
+    "print(unnorm_prob_1)\n",
+    "print(unnorm_prob_2)\n",
+    "# Normalize so the two topic probabilities sum to 1.\n",
+    "prob_1 = unnorm_prob_1/(unnorm_prob_1 + unnorm_prob_2)\n",
+    "print(prob_1)"
+ ]
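+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To finish the Gibbs step, the new topic for \"manager\" would be drawn from the normalized probabilities. A minimal sketch, assuming the `prob_1` computed in the cell above (the sampling itself is illustrative and not asked for in the quiz):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import random\n",
+    "\n",
+    "# Draw the new assignment: topic 1 with probability prob_1, otherwise topic 2.\n",
+    "new_topic = 1 if random.random() < prob_1 else 2\n",
+    "print(new_topic)"
+   ]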
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 2",
+ "language": "python",
+ "name": "python2"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 2
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython2",
+ "version": "2.7.12"
+ },
+ "toc": {
+ "toc_cell": false,
+ "toc_number_sections": false,
+ "toc_threshold": "8",
+ "toc_window_display": false
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}