diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.04 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.04 PM.png
new file mode 100644
index 0000000..10bdbea
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.04 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.28 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.28 PM.png
new file mode 100644
index 0000000..37ead55
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.28 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.53.54 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.53.54 PM.png
new file mode 100644
index 0000000..df8fcc0
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.53.54 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.06 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.06 PM.png
new file mode 100644
index 0000000..b3803bb
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.06 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.09 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.09 PM.png
new file mode 100644
index 0000000..976b031
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.09 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.13 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.13 PM.png
new file mode 100644
index 0000000..2a59e58
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.13 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.18 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.18 PM.png
new file mode 100644
index 0000000..d2f66b2
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.18 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.27 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.27 PM.png
new file mode 100644
index 0000000..e7fae81
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.27 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.11.56 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.11.56 PM.png
new file mode 100644
index 0000000..e3fb281
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.11.56 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.12.06 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.12.06 PM.png
new file mode 100644
index 0000000..14e16bc
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.12.06 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.17.14 PM.png b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.17.14 PM.png
new file mode 100644
index 0000000..6a70771
Binary files /dev/null and b/machine_learning/4_clustering_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.17.14 PM.png differ
diff --git a/machine_learning/4_clustering_and_retrieval/lecture/week5/quiz-Learning LDA model via Gibbs sampling.ipynb b/machine_learning/4_clustering_and_retrieval/lecture/week5/quiz-Learning LDA model via Gibbs sampling.ipynb
new file mode 100644
index 0000000..50f8bae
--- /dev/null
+++ b/machine_learning/4_clustering_and_retrieval/lecture/week5/quiz-Learning LDA model via Gibbs sampling.ipynb
@@ -0,0 +1,309 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Learning LDA model via Gibbs sampling"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 1\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 2\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 3\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 4\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+ "\n",
+ "- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/AK3N4kI8EeaXyw5hjmsWew\n",
+ "- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/L5yUeFA4EearRRKGx4XuoQ\n",
+ "- Understand the notation:\n",
+ "  - $n_{i,k}$: number of words in document i currently assigned to topic k, counted after removing the assignment of the target word \"manager\". Since \"manager\" is currently assigned to topic 2, $n_{i,1}$ needs no decrement, while $n_{i,2}$ is decremented by one.\n",
+ "  - $N_i$: number of words in document i\n",
+ "  - $V$: size of the vocabulary, which is 10\n",
+ "  - $m_{manager,k}$: number of corpus-wide assignments of the word \"manager\" to topic k\n",
+ "  - $\\sum_{w} m_{w,k}$: total number of words in the corpus assigned to topic k"
+ ]
+ },
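+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of the decrement step described above, assuming the document-i assignments implied by Questions 6 and 7 (baseball, ticket, owner -> topic 1; manager, price -> topic 2). The `doc_i` dictionary below is a hypothetical illustration, not data read from the quiz screenshots."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Hypothetical topic assignments for document i, inferred from Questions 6 and 7\n",
+ "doc_i = {'baseball': 1, 'ticket': 1, 'owner': 1, 'manager': 2, 'price': 2}\n",
+ "\n",
+ "target_word = 'manager'\n",
+ "\n",
+ "# n_{i,k}: count the assignments to each topic in document i\n",
+ "n_i = {1: 0, 2: 0}\n",
+ "for word, topic in doc_i.items():\n",
+ "    n_i[topic] += 1\n",
+ "\n",
+ "# Remove the target word's current assignment before resampling it\n",
+ "n_i[doc_i[target_word]] -= 1\n",
+ "\n",
+ "print(n_i)  # {1: 3, 2: 1}, matching Questions 6 and 7"
+ ]
+ },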
If the target word \"manager\" is for topic 2 then you have to decrement the single count of topic 2 which then makes your $n_{i,k}$ = 0.\n", + " - $N_i$: count of words in doc i\n", + " - V: total count of vocabulary, which is 10\n", + " - $m_{manager,k}$: total count of word, \"manager\" in the corpus assigned to topic k\n", + " - $\\sum_{w} m_{w,k}$: Sum of count of all words in the corpus assigned to topic k" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Question 5\n", + "\n", + "\n", + "\n", + "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Answer**\n", + "- $\\sum_w m_{w, 1}$: total number of words in the corpus of topic 1, which is the sum of all words assigned to topic 1\n", + "- 52 + 15 + 9 + 9 + 20 + 17 + 1 = 123" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Question 6\n", + "\n", + "\n", + "\n", + "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Answer**\n", + "- $n_{i, 1}$: # current assignments to topic 1 in doc i, which is how many times topic 1 appears in document i.\n", + " - clearly 3 times, for baseball + ticket + owner = 1 + 1 + 1 = 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Question 7\n", + "\n", + "\n", + "\n", + "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Answer**\n", + "- $\\sum_w m_{w, 1}$: total number of words in the corpus of topic 1, which is the sum of all words assigned to topic 1\n", + "- 52 + 15 + 9 + 9 + 20 + 17 + 1 = 123" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Answer**\n", + "- When we remove the assignment of manager to topic 2 in document i, we only have 1 assignment of topic 2 which is the word \"price\"\n", + " - $n_{i, 2}$: # current assignments to topic 2 in doc i, which is 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Question 8\n", + "\n", + "\n", + "\n", + "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Answer**\n", + "- When we remove the assignment of \"manager\" to topic 2 in document i -> manager: 0\n", + " - The total counts of manager in topic 2 in the corpus: \n", + " - number of assignments corpus-wide of word \"manager\" to topic 2 - number of assignment of word \"manager\" to topic 2 in document i\n", + " - $\\large m_{\\text{manager,2}} - z_{\\text{i,manager}}$ = 37 - 1 = 36" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Question 9\n", + "\n", + "\n", + "\n", + "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Answer**\n", + "- $\\sum_w m_{w, 2}$: total number of words in the corpus of topic 2, which 
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Question 10\n",
+ "\n",
+ "\n",
+ "*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Answer**\n",
+ "\n",
+ "As discussed in the slides, the unnormalized probability of assigning to topic 1 is\n",
+ "\n",
+ "- $p_1 = \\frac{n_{i, 1} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 1} + \\gamma}{\\sum_w m_{w, 1} + V \\gamma}$\n",
+ "\n",
+ "where V is the total size of the vocabulary.\n",
+ "\n",
+ "Similarly, the unnormalized probability of assigning to topic 2 is\n",
+ "\n",
+ "- $p_2 = \\frac{n_{i, 2} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 2} + \\gamma}{\\sum_w m_{w, 2} + V \\gamma}$\n",
+ "\n",
+ "Using the above equations and the results computed in the previous questions, compute the probability of assigning the word “manager” to topic 1.\n",
+ "\n",
+ "- Left Formula = (# times topic 1 appears in doc + alpha) / (# words in doc - 1 + K * alpha)\n",
+ "- Right Formula = (# of corpus-wide assignments of 'manager' to topic 1 + gamma) / (sum of all topic-1 word counts + V * gamma)\n",
+ "- Prob 1 = (Left Formula) * (Right Formula)\n",
+ "\n",
+ "Example:\n",
+ "- calculate prob 1\n",
+ "  - Left Formula = (3 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.541667\n",
+ "  - Right Formula = (20 + 0.1) / (123 + (10 * 0.1)) = 0.162097\n",
+ "  - Prob 1 = 0.541667 * 0.162097 = 0.087802\n",
+ "- calculate prob 2\n",
+ "  - Left Formula = (1 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.458333\n",
+ "  - Right Formula = (36 + 0.1) / (241 + (10 * 0.1)) = 0.149174\n",
+ "  - Prob 2 = 0.458333 * 0.149174 = 0.068371\n",
+ "- normalize prob 1\n",
+ "  - normalized prob 1 = Prob 1 / (Prob 1 + Prob 2) = 0.087802 / (0.087802 + 0.068371) = 0.562"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.0878024193548\n",
+ "0.0683712121212\n",
+ "0.562210268949\n"
+ ]
+ }
+ ],
+ "source": [
+ "def calculate_unnorm_prob(n_i, alpha, N_i, K, m_word, gamma, sum_of_m, V):\n",
+ "    \"\"\" Calculate the unnormalized probability of assigning the target word to a topic\n",
+ "    \"\"\"\n",
+ "    # Document part: (n_{i,k} + alpha) / (N_i - 1 + K * alpha)\n",
+ "    left_formula = (n_i + alpha)/(N_i - 1 + (K * alpha))\n",
+ "    # Topic-word part: (m_{word,k} + gamma) / (sum_w m_{w,k} + V * gamma)\n",
+ "    right_formula = (m_word + gamma)/(sum_of_m + (V * gamma))\n",
+ "    prob = left_formula * right_formula\n",
+ "    return prob\n",
+ "\n",
+ "\n",
+ "# Arguments: n_{i,k}, alpha, N_i, K, m_{manager,k}, gamma, sum_w m_{w,k}, V\n",
+ "unnorm_prob_1 = calculate_unnorm_prob(3, 10.0, 5, 2, 20, 0.1, 123, 10)\n",
+ "unnorm_prob_2 = calculate_unnorm_prob(1, 10.0, 5, 2, 36, 0.1, 241, 10)\n",
+ "print unnorm_prob_1\n",
+ "print unnorm_prob_2\n",
+ "# Normalize to get the probability of assigning \"manager\" to topic 1\n",
+ "prob_1 = unnorm_prob_1/(unnorm_prob_1 + unnorm_prob_2)\n",
+ "print prob_1"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 2",
+ "language": "python",
+ "name": "python2"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 2
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython2",
+ "version": "2.7.12"
+ },
+ "toc": {
+ "toc_cell": false,
+ "toc_number_sections": false,
+ "toc_threshold": "8",
+ "toc_window_display": false
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}