{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Learning LDA model via Gibbs sampling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 1\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 3.14.04 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 2\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 3.14.28 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 3\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.11.56 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 4\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.53.54 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"\n",
"- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/AK3N4kI8EeaXyw5hjmsWew\n",
"- https://www.coursera.org/learn/ml-clustering-and-retrieval/discussions/weeks/5/threads/L5yUeFA4EearRRKGx4XuoQ\n",
"- Understand notation:\n",
" - $n_{i,k}$: count of topics (1's and 2's) in the document after you decrement the target word \"manager\". If the target word is for topic 1 then you won't need to decrement since the manager = topic 2. So you count the 1's. If the target word \"manager\" is for topic 2 then you have to decrement the single count of topic 2 which then makes your $n_{i,k}$ = 0.\n",
" - $N_i$: count of words in doc i\n",
" - V: total count of vocabulary, which is 10\n",
" - $m_{manager,k}$: total count of word, \"manager\" in the corpus assigned to topic k\n",
" - $\\sum_{w} m_{w,k}$: Sum of count of all words in the corpus assigned to topic k"
]
},
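{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting this notation together: in the collapsed Gibbs sampling step, the new topic $k$ for the target word \"manager\" is drawn with unnormalized probability equal to a document factor times a word factor (the same formula written out in Question 10 below):\n",
"\n",
"$$p(z_{i,\\text{manager}} = k \\mid \\text{rest}) \\propto \\frac{n_{i,k} + \\alpha}{N_i - 1 + K \\alpha} \\cdot \\frac{m_{\\text{manager},k} + \\gamma}{\\sum_w m_{w,k} + V \\gamma}$$"
]
},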
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 5\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.06 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $\\sum_w m_{w, 1}$: total number of words in the corpus of topic 1, which is the sum of all words assigned to topic 1\n",
"- 52 + 15 + 9 + 9 + 20 + 17 + 1 = 123"
]
},
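{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the sum above (the per-word topic-1 counts are copied from the quiz table; the variable name is just illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Per-word counts assigned to topic 1, copied from the answer above\n",
"topic_1_counts = [52, 15, 9, 9, 20, 17, 1]\n",
"print(sum(topic_1_counts))  # expected: 123"
]
},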
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 6\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.09 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $n_{i, 1}$: # current assignments to topic 1 in doc i, which is how many times topic 1 appears in document i.\n",
" - clearly 3 times, for baseball + ticket + owner = 1 + 1 + 1 = 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 7\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.13 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- When we remove the assignment of manager to topic 2 in document i, we only have 1 assignment of topic 2 which is the word \"price\"\n",
" - $n_{i, 2}$: # current assignments to topic 2 in doc i, which is 1"
]
},
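{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the per-document count bookkeeping behind Questions 6 and 7, assuming document $i$ is represented as a list of (word, topic) pairs; the representation and variable names are illustrative, not the course's implementation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Toy representation of document i: (word, current topic assignment) pairs,\n",
"# using the assignments referenced in the answers to Questions 6 and 7.\n",
"doc_i = [(\"baseball\", 1), (\"ticket\", 1), (\"owner\", 1), (\"price\", 2), (\"manager\", 2)]\n",
"target_word = \"manager\"\n",
"\n",
"# n_{i,k}: topic counts in document i after removing the target word's assignment\n",
"n_i = {1: 0, 2: 0}\n",
"for word, topic in doc_i:\n",
"    if word != target_word:\n",
"        n_i[topic] += 1\n",
"\n",
"print(n_i)  # expected: {1: 3, 2: 1}\n",
"print(len(doc_i))  # N_i = 5"
]
},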
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 8\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 4.54.18 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- When we remove the assignment of \"manager\" to topic 2 in document i -> manager: 0\n",
" - The total counts of manager in topic 2 in the corpus: \n",
" - number of assignments corpus-wide of word \"manager\" to topic 2 - number of assignment of word \"manager\" to topic 2 in document i\n",
" - $\\large m_{\\text{manager,2}} - z_{\\text{i,manager}}$ = 37 - 1 = 36"
]
},
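{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same decrement written as a tiny code step (the count 37 is taken from the quiz table; the dictionary is an illustrative stand-in for the corpus-wide word-topic count matrix):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Corpus-wide count of \"manager\" assigned to topic 2, before removing\n",
"# the assignment in document i (value taken from the quiz table)\n",
"m = {(\"manager\", 2): 37}\n",
"m[(\"manager\", 2)] -= 1  # remove document i's assignment of \"manager\" to topic 2\n",
"print(m[(\"manager\", 2)])  # expected: 36"
]
},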
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 9\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.12.06 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"- $\\sum_w m_{w, 2}$: total number of words in the corpus of topic 2, which is the sum of all words assigned to topic 2 after we decrement the associated counts\n",
"- 2 + 25 + 36 + 32 + 23 + 75 + 19 + 29 = 241"
]
},
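{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the sum above (the per-word topic-2 counts are copied from the quiz table, with the \"manager\" entry already decremented to 36):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Per-word counts assigned to topic 2, after decrementing the \"manager\" count\n",
"topic_2_counts = [2, 25, 36, 32, 23, 75, 19, 29]\n",
"print(sum(topic_2_counts))  # expected: 241"
]
},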
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 10\n",
"\n",
"<img src=\"images/Screen Shot 2016-07-29 at 5.17.14 PM.png\">\n",
"\n",
"*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-clustering-and-retrieval/exam/6ieZu/learning-lda-model-via-gibbs-sampling)*\n",
"\n",
"<!--TEASER_END-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Answer**\n",
"\n",
"As discussed in the slides, the unnormalized probability of assigning to topic 1 is\n",
"\n",
"- $p_1 = \\frac{n_{i, 1} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 1} + \\gamma}{\\sum_w m_{w, 1} + V \\gamma}$\n",
"\n",
"where V is the total size of the vocabulary.\n",
"\n",
"Similarly the unnormalized probability of assigning to topic 2 is\n",
"- $p_2 = \\frac{n_{i, 2} + \\alpha}{N_i - 1 + K \\alpha}\\frac{m_{\\text{manager}, 2} + \\gamma}{\\sum_w m_{w, 2} + V \\gamma}$\n",
"\n",
"Using the above equations and the results computed in previous questions, compute the probability of assigning the word “manager” to topic 1.\n",
"\n",
"- Left Formula = (# times topic_1 appears in doc + alpha) / (# words in doc - 1 + K * alpha)\n",
"- Right Formula = (# of corpus-wide assignment of 'manager' to topic 1 + gamma) / (Sum of all topic 1 word counts + V * gamma)\n",
"- Prob 1 = (Left Formula) * (Right Formula)\n",
"\n",
"Example: \n",
"- calculate prob 1\n",
" - Left Formula = (3 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.5417\n",
" - Right Formula = (20 + 0.1) / (123 + (10 * 0.1)) = 0.162\n",
" - Prob 1 = 0.5417 * 0.162 = 0.0877554\n",
"- calculate prob 2\n",
" - Left Formula = (1 + 10.0) / (5 - 1 + (2 * 10.0)) = 0.4583\n",
" - Right Formula = (36 + 0.1) / (241 + (10 * 0.1)) = 0.1492\n",
" - Prob 1 = 0.4583 * 0.1492 = 0.06837836\n",
"- normalize prob 1\n",
" - normalize prob 1 = Prob 1/(Prob 1 + Prob 2) = 0.0877554/(0.0877554 + 0.06837836) = 0.562"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0878024193548\n",
"0.0683712121212\n",
"0.562210268949\n"
]
}
],
"source": [
"def calculate_unnorm_prob(n_i, alpha, N_i, K, m_word, gamma, sum_of_m, V):\n",
" \"\"\" Calculate unnormalized probability of assigning to topic\n",
" \"\"\"\n",
" left_formula = (n_i + alpha)/(N_i - 1 + (K * alpha))\n",
" right_formula = (m_word + gamma)/(sum_of_m + (V * gamma))\n",
" prob = left_formula * right_formula\n",
" return prob\n",
"\n",
"\n",
"unnorm_prob_1 = calculate_unnorm_prob(3, 10.0, 5, 2, 20, 0.1, 123, 10)\n",
"unnorm_prob_2 = calculate_unnorm_prob(1, 10.0, 5, 2, 36, 0.1, 241, 10)\n",
"print unnorm_prob_1\n",
"print unnorm_prob_2\n",
"prob_1 = unnorm_prob_1/(unnorm_prob_1 + unnorm_prob_2)\n",
"print prob_1"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
},
"toc": {
"toc_cell": false,
"toc_number_sections": false,
"toc_threshold": "8",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 0
}