
Gutenberg - Nick Damiano #68

Open · wants to merge 4 commits into base: gutenberg
Conversation

NickDamiano

Training: 4.45 seconds
Prediction: 4.72 seconds
Accuracy: 96%

My algorithm to identify the category for an unknown book is:

  1. Count the occurrences of the words (similar to our win-loss prework exercise) and store them in a hash with each word as the key and its frequency as the value.
  2. Sort the hash by frequency and grab the top 2500 most frequently occurring words, storing them in a two-dimensional array that looks something like `[["nick", 10], ["rad", 9], ["is", 2]]`.
  3. Run `.map` on that array, grabbing only the first element of each inner array and storing the result in a variable, so we now have an array of just the top 2500 words.
  4. Repeat the above steps for each set of tokens passed in during prediction.
  5. Subtract the token word-list array from the known subject's array; what we get is an array containing all of the words that didn't match between the two.
  6. Count its elements: the smaller the number, the more successful matches were made, and the more likely the subject being matched against is the correct one.
  7. Test each differences-array count against the smallest one seen so far. If the new count is smaller, replace the variables storing the most-likely-so-far subject and the previous smallest count with the new subject and the new count.
  8. Return the most likely subject.
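The steps above can be sketched in Ruby roughly as follows. This is a minimal illustration under my own naming assumptions (`top_words`, `predict`, and the `subject_word_lists` hash are hypothetical names, not the actual PR code):

```ruby
TOP_N = 2500

# Steps 1-3: count word frequencies, sort descending, keep the top N words.
def top_words(tokens, n = TOP_N)
  counts = Hash.new(0)
  tokens.each { |word| counts[word] += 1 }
  counts.sort_by { |_word, freq| -freq } # 2-D array like [["nick", 10], ...]
        .first(n)
        .map { |word, _freq| word }      # keep only the words themselves
end

# Steps 4-8: build the unknown book's top-word list, subtract each known
# subject's list from it, and pick the subject with the fewest leftovers.
def predict(unknown_tokens, subject_word_lists)
  unknown_words = top_words(unknown_tokens)
  best_subject  = nil
  best_count    = Float::INFINITY

  subject_word_lists.each do |subject, words|
    differences = unknown_words - words  # words that didn't match
    if differences.size < best_count     # smaller = more matches
      best_count   = differences.size
      best_subject = subject
    end
  end

  best_subject
end
```

Ruby's `Array#-` does the set-style subtraction described in step 5, so the leftover count directly measures how many of the unknown book's top words failed to appear in the subject's list.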

…ly used words and compare that to the unknown books to see the category. Right now I have two arrays of words, and I'm going to subtract them and then see which count is the smallest. The smallest count is the one with the most matching words and the right category.
…over 6 characters with a 73% accuracy, prediction time of 5.08 seconds, and a training time of 4.75 seconds
… Changed the comparison arrays so that the known subject array top words is 2500 instead of 100. 96%