
Gutenberg - Nick Damiano #68

Open · wants to merge 4 commits into base: gutenberg
Conversation

NickDamiano

Training: 4.45 seconds
Prediction: 4.72 seconds
Accuracy: 96%

My algorithm to identify the category for an unknown book is:

  1. Count the occurrences of the words (similar to our win-loss prework exercise) and store them in a hash with each word as the key and its frequency as the value.
  2. Sort the hash by frequency and grab the top 2500 most frequently occurring words, storing them in a two-dimensional array that looks something like `[["nick", 10], ["rad", 9], ["is", 2]]`.
  3. Run `.map` on that array, grabbing only the first element of each inner array and storing the result in a variable, so we now have an array of just the top 2500 words.
  4. Repeat the above steps for each set of tokens passed in during prediction.
  5. Subtract the token word-list array from the known subject's array; what we get is an array containing all of the words that didn't match between the two.
  6. Count its elements: the smaller the number, the more successful matches were made, and the more likely the subject being matched against is the correct one.
  7. Test each differences-array count against the smallest one seen so far. If the new count is smaller, replace the variables storing the most-likely-so-far subject and the previous smallest count with the new subject and the new count.
  8. Return the most likely subject.
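The steps above can be sketched in Ruby roughly as follows. This is a minimal illustration under my own naming assumptions (`top_words`, `predict`, and the `subject_word_lists` hash are hypothetical names, not the actual PR code):

```ruby
TOP_N = 2500

# Steps 1-3: count word frequencies, sort descending, keep the top N words.
def top_words(tokens, n = TOP_N)
  counts = Hash.new(0)
  tokens.each { |word| counts[word] += 1 }
  counts.sort_by { |_word, freq| -freq } # 2-D array like [["nick", 10], ...]
        .first(n)
        .map { |word, _freq| word }      # keep only the words themselves
end

# Steps 4-8: build the unknown book's top-word list, subtract each known
# subject's list from it, and pick the subject with the fewest leftovers.
def predict(unknown_tokens, subject_word_lists)
  unknown_words = top_words(unknown_tokens)
  best_subject  = nil
  best_count    = Float::INFINITY

  subject_word_lists.each do |subject, words|
    differences = unknown_words - words  # words that didn't match
    if differences.size < best_count     # smaller = more matches
      best_count   = differences.size
      best_subject = subject
    end
  end

  best_subject
end
```

Ruby's `Array#-` does the set-style subtraction described in step 5, so the leftover count directly measures how many of the unknown book's top words failed to appear in the subject's list.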

…ly used words and compare that to the unknown books to see the category. Right now I have two arrays of words, and I'm going to subtract them and then see which count is the smallest. The smallest count is the one with the most matching words and the right category.
…over 6 characters with a 73% accuracy, prediction time of 5.08 seconds, and a training time of 4.75 seconds
… Changed the comparison arrays so that the known subject array top words is 2500 instead of 100. 96%