Learning LDA model via Gibbs sampling
Showing 13 changed files with 618 additions and 0 deletions.
309 changes: 309 additions & 0 deletions
...ture/week5/.ipynb_checkpoints/quiz-Learning LDA model via Gibbs sampling-checkpoint.ipynb
@@ -0,0 +1,309 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Learning LDA model via Gibbs sampling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 1\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 3.14.04 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 2\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 3.14.28 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 3\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.11.56 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 4\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.53.54 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"\n",
"- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/AK3N4kI8EeaXyw5hjmsWew\n",
"- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/L5yUeFA4EearRRKGx4XuoQ\n",
"- Understand the notation (a worked sketch follows this list):\n",
"    - $n_{i,k}$: count of words in document i currently assigned to topic k (the 1's and 2's), after decrementing the assignment of the target word \"manager\". When computing $n_{i,1}$ no decrement is needed, since \"manager\" is assigned to topic 2; you simply count the 1's. When computing $n_{i,2}$ you must decrement by one, so if \"manager\" were the only topic-2 word in the document, the decrement would leave $n_{i,2} = 0$.\n",
"    - $N_i$: number of words in document i\n",
"    - $V$: size of the vocabulary, which is 10 here\n",
"    - $m_{\\text{manager},k}$: corpus-wide count of the word \"manager\" assigned to topic k\n",
"    - $\\sum_{w} m_{w,k}$: sum of the counts of all words in the corpus assigned to topic k"
]
},
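{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the notation concrete, here is a minimal sketch. The word/topic assignments for document i are reconstructed from the answers to Questions 6 and 7 below, so treat the exact word list as an assumption."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Assumed current topic assignments in document i (reconstructed from the\n",
"# answers below: baseball, ticket, owner -> topic 1; manager, price -> topic 2)\n",
"doc_i = [('baseball', 1), ('ticket', 1), ('owner', 1), ('manager', 2), ('price', 2)]\n",
"\n",
"# N_i: number of words in document i\n",
"N_i = len(doc_i)\n",
"\n",
"def n_ik(doc, k, target_word):\n",
"    \"\"\"n_{i,k}: words in the doc assigned to topic k, excluding the target\n",
"    word's own assignment (the decrement step; assumes the target word\n",
"    appears once in the document).\"\"\"\n",
"    return sum(1 for w, z in doc if z == k and w != target_word)\n",
"\n",
"print(N_i)                          # 5\n",
"print(n_ik(doc_i, 1, 'manager'))    # n_{i,1} = 3\n",
"print(n_ik(doc_i, 2, 'manager'))    # n_{i,2} = 1"
]
},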
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 5\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.06 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $\\sum_w m_{w, 1}$: total number of words in the corpus assigned to topic 1, i.e. the sum of the per-word topic-1 counts (verified in code below)\n",
"- 52 + 15 + 9 + 9 + 20 + 17 + 1 = 123"
]
},
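{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick arithmetic check of the sum, using the per-word topic-1 counts listed above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Per-word topic-1 counts from the question's table; their sum is sum_w m_{w,1}\n",
"m_w1 = [52, 15, 9, 9, 20, 17, 1]\n",
"print(sum(m_w1))  # 123"
]
},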
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 6\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.09 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $n_{i, 1}$: number of current assignments to topic 1 in doc i, i.e. how many times topic 1 appears in document i.\n",
"    - here it appears 3 times: baseball + ticket + owner = 1 + 1 + 1 = 3 (see the check below)"
]
},
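{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same count via `collections.Counter`, assuming the assignments reconstructed earlier; no decrement is needed here, since \"manager\" is assigned to topic 2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Assumed assignments in document i (as reconstructed above)\n",
"assignments = {'baseball': 1, 'ticket': 1, 'owner': 1, 'manager': 2, 'price': 2}\n",
"\n",
"topic_counts = Counter(assignments.values())\n",
"print(topic_counts[1])  # n_{i,1} = 3"
]
},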
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 7\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.13 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- When we remove the assignment of \"manager\" to topic 2 in document i, only one assignment to topic 2 remains: the word \"price\" (checked in code below)\n",
"    - $n_{i, 2}$: number of current assignments to topic 2 in doc i, which is 1"
]
},
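{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small check, again assuming the reconstructed assignments for document i: after removing \"manager\"'s assignment, only \"price\" remains in topic 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Assumed assignments in document i (as reconstructed above)\n",
"assignments = {'baseball': 1, 'ticket': 1, 'owner': 1, 'manager': 2, 'price': 2}\n",
"\n",
"# n_{i,2} after decrementing the target word's own assignment\n",
"n_i2 = sum(1 for w, z in assignments.items() if z == 2 and w != 'manager')\n",
"print(n_i2)  # 1"
]
},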
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 8\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.18 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- When we remove the assignment of \"manager\" to topic 2 in document i, its count in this document drops to 0.\n",
"    - The corpus-wide count of \"manager\" in topic 2 then becomes:\n",
"        - the number of corpus-wide assignments of \"manager\" to topic 2, minus the number of assignments of \"manager\" to topic 2 in document i (see the code check below)\n",
"        - $\\large m_{\\text{manager},2} - z_{i,\\text{manager}} = 37 - 1 = 36$"
]
},
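{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same decrement in code, using the counts given in the question:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"m_manager_2 = 37  # corpus-wide assignments of 'manager' to topic 2\n",
"z_i_manager = 1   # 'manager' is currently assigned to topic 2 in doc i\n",
"print(m_manager_2 - z_i_manager)  # 36"
]
},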
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 9\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.12.06 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $\\sum_w m_{w, 2}$: total number of words in the corpus assigned to topic 2, i.e. the sum of the per-word topic-2 counts after we decrement the associated counts (verified in code below)\n",
"- 2 + 25 + 36 + 32 + 23 + 75 + 19 + 29 = 241"
]
},
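{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the topic-2 sum, with \"manager\" already decremented to 36:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Per-word topic-2 counts after decrementing (from the answer above)\n",
"m_w2 = [2, 25, 36, 32, 23, 75, 19, 29]\n",
"print(sum(m_w2))  # 241"
]
},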
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 10\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.17.14 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"\n",
"As discussed in the slides, the unnormalized probability of assigning to topic 1 is\n",
"\n",
"- $p_1 = \\frac{n_{i, 1} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 1} + \\gamma}{\\sum_w m_{w, 1} + V \\gamma}$\n",
"\n",
"where $V$ is the total size of the vocabulary.\n",
"\n",
"Similarly, the unnormalized probability of assigning to topic 2 is\n",
"\n",
"- $p_2 = \\frac{n_{i, 2} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 2} + \\gamma}{\\sum_w m_{w, 2} + V \\gamma}$\n",
"\n",
"Using these equations and the results computed in the previous questions, we can compute the probability of assigning the word \"manager\" to topic 1:\n",
"\n",
"- Left Formula = (# times topic 1 appears in doc + alpha) / (# words in doc - 1 + K * alpha)\n",
"- Right Formula = (# corpus-wide assignments of 'manager' to topic 1 + gamma) / (sum of all topic-1 word counts + V * gamma)\n",
"- Prob 1 = (Left Formula) * (Right Formula)\n",
"\n",
"Worked example:\n",
"- calculate Prob 1\n",
"    - Left Formula = (3 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.5417\n",
"    - Right Formula = (20 + 0.1) / (123 + (10 * 0.1)) = 0.1621\n",
"    - Prob 1 = 0.5417 * 0.1621 = 0.0878\n",
"- calculate Prob 2\n",
"    - Left Formula = (1 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.4583\n",
"    - Right Formula = (36 + 0.1) / (241 + (10 * 0.1)) = 0.1492\n",
"    - Prob 2 = 0.4583 * 0.1492 = 0.0684\n",
"- normalize Prob 1\n",
"    - normalized Prob 1 = Prob 1 / (Prob 1 + Prob 2) = 0.0878 / (0.0878 + 0.0684) = 0.562"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0878024193548\n",
"0.0683712121212\n",
"0.562210268949\n"
]
}
],
"source": [
"def calculate_unnorm_prob(n_i, alpha, N_i, K, m_word, gamma, sum_of_m, V):\n",
"    \"\"\"Calculate the unnormalized probability of assigning a word to a topic.\n",
"\n",
"    n_i: current assignments to the topic in doc i (after decrementing)\n",
"    alpha, gamma: Dirichlet hyperparameters over topics and words\n",
"    N_i: number of words in doc i\n",
"    K: number of topics; V: vocabulary size\n",
"    m_word: corpus-wide assignments of the target word to the topic\n",
"    sum_of_m: total corpus-wide word count for the topic\n",
"    \"\"\"\n",
"    left_formula = (n_i + alpha)/(N_i - 1 + (K * alpha))\n",
"    right_formula = (m_word + gamma)/(sum_of_m + (V * gamma))\n",
"    prob = left_formula * right_formula\n",
"    return prob\n",
"\n",
"\n",
"unnorm_prob_1 = calculate_unnorm_prob(3, 10.0, 5, 2, 20, 0.1, 123, 10)\n",
"unnorm_prob_2 = calculate_unnorm_prob(1, 10.0, 5, 2, 36, 0.1, 241, 10)\n",
"print unnorm_prob_1\n",
"print unnorm_prob_2\n",
"prob_1 = unnorm_prob_1/(unnorm_prob_1 + unnorm_prob_2)\n",
"print prob_1"
]
},
},
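{
"cell_type": "markdown",
"metadata": {},
"source": [
"The quiz stops at the normalized probability, but a full collapsed Gibbs step would then draw a new topic from this distribution. Below is a minimal sketch of that step, reusing `calculate_unnorm_prob`; the helper `resample_topic` and its argument layout are illustrative assumptions, not part of the course material."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import random\n",
"\n",
"def resample_topic(n_i, m_word, sum_m, N_i, K, V, alpha, gamma):\n",
"    \"\"\"One collapsed Gibbs step for a single word: score each topic with\n",
"    calculate_unnorm_prob, normalize, and sample a new assignment.\n",
"    All counts are assumed to be already decremented.\"\"\"\n",
"    probs = [calculate_unnorm_prob(n_i[k], alpha, N_i, K, m_word[k], gamma, sum_m[k], V)\n",
"             for k in range(K)]\n",
"    total = sum(probs)\n",
"    probs = [p / total for p in probs]\n",
"    # inverse-CDF sampling from the normalized distribution\n",
"    r, cum = random.random(), 0.0\n",
"    for k, p in enumerate(probs):\n",
"        cum += p\n",
"        if r < cum:\n",
"            return k + 1, probs  # topics are numbered from 1 in the quiz\n",
"    return K, probs\n",
"\n",
"# Counts for the word 'manager' after decrementing, as computed above\n",
"new_topic, probs = resample_topic(n_i=[3, 1], m_word=[20, 36], sum_m=[123, 241],\n",
"                                  N_i=5, K=2, V=10, alpha=10.0, gamma=0.1)\n",
"print(probs)   # [0.562..., 0.437...]\n",
"print(new_topic)"
]
}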
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
},
"toc": {
"toc_cell": false,
"toc_number_sections": false,
"toc_threshold": "8",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Binary file added: BIN +22.1 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.04 PM.png
Binary file added: BIN +36.3 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 3.14.28 PM.png
Binary file added: BIN +125 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.53.54 PM.png
Binary file added: BIN +21.2 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.06 PM.png
Binary file added: BIN +27 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.09 PM.png
Binary file added: BIN +27.7 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.13 PM.png
Binary file added: BIN +29.9 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.18 PM.png
Binary file added: BIN +76.8 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 4.54.27 PM.png
Binary file added: BIN +39.8 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.11.56 PM.png
Binary file added: BIN +29.8 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.12.06 PM.png
Binary file added: BIN +77.1 KB ...ing_and_retrieval/lecture/week5/images/Screen Shot 2016-07-29 at 5.17.14 PM.png