Groupie is a simple way to group texts and classify new texts as likely members of one of the defined groups. Think of Bayesian spam filters.
The eventual goal is to have Groupie work as a sort of Bayesian spam filter, where you feed it spam and ham (non-spam) and ask it to classify new texts as spam or ham. Applications for this are e-mail spam filtering and blog spam filtering. Other sorts of categorizing might be interesting as well, such as finding suitable tags for a blog post or bookmark.
Groupie was started and forgotten in 2009 as a short-lived experiment. In 2010 it got new features when I started using it in an RSS reader project that classified news items into "Interesting" and "Not interesting" categories.
Current functionality includes:
- Tokenize an input text to prepare it for grouping.
- Strip XML and HTML tags.
- Keep certain infix characters, such as period and comma.
- Add texts (as an Array of Strings) to any number of groups.
- Classify a single word to check the likelihood it belongs to each group.
- Do classification for complete (tokenized) texts.
- Pick a classification strategy to weigh repeat words differently (weigh by sum, square root or log10 of words in group).
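To give a rough idea of what the tokenizing steps in the list above involve, here is a minimal sketch. The helper simple_tokenize is hypothetical and is not Groupie's actual tokenizer (use Groupie.tokenize for that). It assumes plain ASCII input, lowercases everything, strips tags, and keeps periods and commas only when they sit between word characters.

```ruby
# Hypothetical helper, not part of Groupie: a simplified tokenizer sketch.
# Assumption: ASCII text only; real-world input likely needs more care.
def simple_tokenize(text)
  text
    .downcase
    .gsub(/<[^>]*>/, ' ')                  # strip XML and HTML tags
    .scan(/[a-z0-9]+(?:[.,][a-z0-9]+)*/)   # words, keeping infix periods and commas
end

simple_tokenize('Please give me your password so I can haxx0r you!')
# => ["please", "give", "me", "your", "password", "so", "i", "can", "haxx0r", "you"]

simple_tokenize('<p>Visit example.com today.</p>')
# => ["visit", "example.com", "today"]
```

The trailing period in the second example is dropped because it is not an infix character, while the one inside example.com is kept.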
Add this line to your application's Gemfile:
gem 'groupie'
You can also have Bundler do this for you:
bundle add groupie
And then execute:
bundle install
Or install it system-wide via:
gem install groupie
Here is an annotated console session that shows off the features available in Groupie.
# Instantiate a new Groupie instance
groupie = Groupie.new
# Groups are defined as you use them, so let's get started by adding some pre-tokenized words
groupie[:spam].add(%w[this is obvious spam please buy our product])
groupie[:spam].add(%w[hello friend this is rich prince i have awesome bitcoin for you])
groupie[:ham].add(%w[you are invited to my awesome party just click the link to rsvp])
# Is your data less than clean? We've got a tokenizer for that!
tokens = Groupie.tokenize('Please give me your password so I can haxx0r you!')
# => ["please", "give", "me", "your", "password", "so", "i", "can", "haxx0r", "you"]
groupie[:spam].add(tokens)
# So, now let's attempt to classify a text and see if it's spam or ham:
test_tokens = %w[please click the link to reset your password for our awesome product]
groupie.classify_text(test_tokens)
# => {:spam=>0.5909090909090909, :ham=>0.4090909090909091}
# As you can see, this password reset email looks a little dodgy...
# We have multiple strategies for drawing conclusions about what group it belongs to.
# The default you saw above is :sum, which weighs each word by the total sum of occurrences.
# Let's see if it looks less bad by using different classification strategies.
# Log reduces the weight of each word to the log10 of its occurrence count:
# - Count 1 is weight 0
# - Count 10 is weight 1
# - Count 100 is weight 2
groupie.classify_text(test_tokens, :log)
# => {:spam=>0.5, :ham=>0.5}
# This is even more even, most likely because it ignores all single-count words...
# The square root algorithm is less harsh: it reduces the weight of each word to the square root of the count:
# - Count 1 is weight 1
# - Count 4 is weight 2
# - Count 9 is weight 3
groupie.classify_text(test_tokens, :sqrt)
# => {:spam=>0.5909090909090909, :ham=>0.4090909090909091}
# This seems to result in the same value as :sum
# Unique uses the same weighting algorithm as the square root, but it modifies the word dictionary:
# it discards the 25% most common words, so less common words gain higher predictive power.
groupie.classify_text(test_tokens, :unique)
# => {:spam=>0.625, :ham=>0.375}
# This looks even worse for our poor password reset email.
# In case you're curious, the ignored words in this case are:
test_tokens - (test_tokens & groupie.unique_words)
# => ["please", "to", "reset", "awesome"]
# If you were classifying email, you could assume that common email headers would get ignored this way.
# If you're just starting out, your incomplete data could lead to dramatic misrepresentations of the data.
# To balance against this, you can enable smart weight:
groupie.smart_weight = true
# You could also set it during initialization via Groupie.new(smart_weight: true)
# What's so useful about it? It adds a default weight to _all_ words, even the ones you haven't
# seen yet, which counter-acts the data you have. This shines in low data situations,
# reducing the impact of the few words you have seen before.
groupie.default_weight
# => 1.2285714285714286
# Classifying the same text as before should consider all words, and add this default weight to all words.
# It basically gives all groups the likelihood of "claiming" a word,
# unless there is strong data to suggest otherwise.
groupie.classify_text(test_tokens)
# => {:spam=>0.5241046831955923, :ham=>0.4758953168044077}
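The weighting rules described in the session above (:sum, :sqrt and :log) can be sketched without Groupie itself. The helpers word_weight and word_shares below are hypothetical names and are not Groupie's API. They only illustrate how a raw count turns into a weight and how the weights for a single word could be normalized across groups; how Groupie combines words into a full text classification is not covered here.

```ruby
# Hypothetical helper, not Groupie's API: maps a raw word count to a weight.
def word_weight(count, strategy = :sum)
  return 0.0 if count <= 0 # assumption: unseen words carry no weight in this sketch

  case strategy
  when :sum  then count.to_f
  when :sqrt then Math.sqrt(count)
  when :log  then Math.log10(count)
  else raise ArgumentError, "unknown strategy: #{strategy.inspect}"
  end
end

# Hypothetical helper: normalizes the per-group weights of one word into shares that sum to 1.
def word_shares(counts_by_group, strategy = :sum)
  weights = counts_by_group.transform_values { |count| word_weight(count, strategy) }
  total = weights.values.sum
  return weights.transform_values { 0.0 } if total.zero?

  weights.transform_values { |weight| weight / total }
end

counts = { spam: 9, ham: 1 }
word_shares(counts, :sum)  # spam 0.9,  ham 0.1
word_shares(counts, :sqrt) # spam 0.75, ham 0.25
word_shares(counts, :log)  # spam 1.0,  ham 0.0 (a count of 1 has weight 0)
```

The last line shows why :log can behave surprisingly on small data sets: any word seen only once contributes nothing.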
Persistence can be naively done by using YAML:
# Instantiate a new Groupie instance
groupie = Groupie.new
groupie[:spam].add(%w[assume you have a lot of data you care about])
require 'yaml'
yaml = YAML.dump(groupie)
loaded = YAML.safe_load(yaml, permitted_classes: [Groupie, Groupie::Group, Symbol])
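To show why the permitted_classes list matters, here is a self-contained sketch that does not need Groupie installed. TinyClassifier is a hypothetical stand-in class, not part of Groupie. The sketch assumes a Psych version that supports the permitted_classes keyword, as the snippet above does.

```ruby
require 'yaml'

# TinyClassifier is a hypothetical stand-in for a classifier object; it is not part of Groupie.
class TinyClassifier
  attr_reader :counts

  def initialize
    @counts = {}
  end

  # Counts each word under the given group and returns self for chaining.
  def add(group, words)
    bucket = (@counts[group] ||= {})
    words.each { |word| bucket[word] = bucket.fetch(word, 0) + 1 }
    self
  end
end

classifier = TinyClassifier.new.add(:spam, %w[buy buy now])
yaml = YAML.dump(classifier)

# Without an allowlist, safe_load refuses to build custom objects.
rejected =
  begin
    YAML.safe_load(yaml)
    false
  rescue Psych::DisallowedClass
    true
  end

# Allowlisting the custom class and Symbol (used for the group key) makes the round trip work.
restored = YAML.safe_load(yaml, permitted_classes: [TinyClassifier, Symbol])
restored.counts # the same nested Hash of counts as classifier.counts
```

Keeping the allowlist as short as possible is the point of safe_load: only the classes you name can be instantiated from the YAML document.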
I'm still experimenting with Groupie in Infinity Feed, so persistence is a Future Problem for me there. In development, I'm building (low data count) classifiers in memory and discarding them after use.
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment. Rubocop is available via bin/rubocop with some friendly default settings.
To install this gem onto your local machine, run bundle exec rake install.
To release a new version (obviously only for the maintainer):
- Decide on the new version number. We follow Semantic Versioning, so breaking changes bump the major version, new features bump the minor, bugfixes bump the patch. At version 0.x, treat the minor version as if it were a major version (i.e. breaking changes bump the minor).
- Update the Changelog to add a version header with the current date. Please preserve "Unreleased changes" and give it a newline to breathe a bit.
- Update the version number in version.rb to reference the same version as the Changelog.
- Commit the changes with a clear name (such as "Release version X.Y.Z").
- Run bundle exec rake release, which will build the gem, create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org. It requires 2FA. For obvious reasons, only the project maintainer can do this.
- Create a new Release on GitHub. Copy/paste the Changelog body as minimum release notes. Release name is simply "Version X.Y.Z". Add :rotating_light: and :loudspeaker: before Breaking and Feature changes. Double check relevant changes reference their PR.
Bug reports and pull requests are welcome on GitHub at https://github.com/Narnach/groupie.
The gem is available as open source under the terms of the MIT License.