
Analyzing Text in Ruby

We're going to write a set of simple command-line tools to display basic statistics about a text file or a set of text files. Some basic statistics include the following (there's a rough Ruby sketch right after the list):

  1. Character count, word count, and sentence count
  2. Letter frequency
  3. Word frequency, e.g., most common and least common words
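
As a rough illustration of what computing these statistics might look like, here is a minimal Ruby sketch. The file name, the regular expressions, and the sentence-splitting rule are just assumptions for the sake of the example, not the project's required approach.

    # A minimal sketch of the core statistics.
    # "sample.txt" is a placeholder file name; the regexes are simple approximations.
    text = File.read("sample.txt")

    char_count     = text.length
    word_count     = text.split(/\s+/).reject(&:empty?).length
    sentence_count = text.split(/[.!?]+/).reject { |s| s.strip.empty? }.length

    # Letter frequency: count each alphabetic character, ignoring case
    letter_counts = Hash.new(0)
    text.downcase.scan(/[a-z]/) { |letter| letter_counts[letter] += 1 }

    # Word frequency: count each downcased word
    word_counts = Hash.new(0)
    text.downcase.scan(/[a-z']+/) { |word| word_counts[word] += 1 }

    puts "Characters: #{char_count}"
    puts "Words:      #{word_count}"
    puts "Sentences:  #{sentence_count}"
    puts "Most common word: #{word_counts.max_by { |_, count| count }.first}"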

We'll also work towards adding the ability to do the following (the first two items are sketched just after the list):

  1. Download data from an arbitrary URL
  2. Extract the text from a web page for analysis
  3. Display the results in different formats, e.g., charts, histograms, etc.
  4. Export the results to a spreadsheet
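
As a very rough sketch of the first two items, here is one way the download-and-extract step might look. It assumes the open-uri standard library and the third-party Nokogiri gem, and the URL is just a placeholder.

    require 'open-uri'
    require 'nokogiri'   # third-party gem: gem install nokogiri

    url  = "http://example.com/some-article"   # placeholder URL
    html = URI.open(url).read                  # on older Rubies this was open(url).read

    # Nokogiri parses the HTML; calling .text strips the tags and leaves the visible text
    text = Nokogiri::HTML(html).text

    puts text[0, 500]   # print the first 500 characters as a sanity check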

Here's a screenshot of a program that downloads the entire text of Moby Dick from Project Gutenberg and prints out a histogram of the letter frequencies.

[Screenshot: letter-frequency histogram for Moby Dick]

It turns out that the letter "t" makes up 9.25% of all the letters in Moby Dick. The more you know!
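
For what it's worth, a text histogram like the one in the screenshot can be printed with very little code. Here is one rough way to do it, assuming a letter_counts hash like the one in the sketch above; the bar scaling is an arbitrary choice.

    # Assumes letter_counts maps letters to counts, e.g. { "a" => 120, "b" => 14, ... }
    total = letter_counts.values.reduce(0, :+).to_f

    ("a".."z").each do |letter|
      percentage = 100.0 * letter_counts[letter].to_i / total
      bar        = "#" * (percentage * 2).round   # arbitrary scale: two characters per percent
      puts format("%s  %5.2f%%  %s", letter, percentage, bar)
    end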

Getting Started

Look at the Iterations page to see how to get started.

Why Are We Doing This?

If you can believe it, the program above that downloads text from a URL, computes the letter frequencies, and displays a histogram is less than 50 lines of Ruby. Think about the questions you'd need to be able to answer in order to make it work, though:

  1. How do I open and read data contained in a file on my computer?

  2. How do I download and read data on the web?

  3. How do I pass in an arbitrary URL to my Ruby program? That is, how do I make something like this work? (See the sketch after this list.)

    # This downloads and analyzes the text file at the supplied URL
    ruby textalyze.rb http://some-website.com/books/moby-dick.txt

  4. Once I have the data (from a file or a URL), how do I go about calculating the relevant statistics?

  5. Once I've calculated the relevant statistics, how do I display them in a user-friendly way?
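
To make question 3 concrete, here is a minimal sketch of how textalyze.rb might accept either a local file path or a URL on the command line. The file-vs-URL check and the open-uri usage are just one way to do it, not a prescribed solution.

    # textalyze.rb -- a sketch of reading from a file path or a URL given on the command line
    require 'open-uri'

    source = ARGV[0]

    if source.nil?
      puts "Usage: ruby textalyze.rb <file-or-url>"
      exit 1
    end

    # Crude check: treat anything starting with http:// or https:// as a URL
    text = if source.start_with?("http://", "https://")
             URI.open(source).read   # older Rubies used open(source).read
           else
             File.read(source)
           end

    puts "Read #{text.length} characters from #{source}"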

These questions run the gamut from nitty-gritty Ruby to user experience, while also starting us down the path of becoming comfortable with how the web works.

Some Practical Scenarios

To start, think of any time you've wanted to get data off some website for your own purposes. Or think of a time when someone had the data you needed, but not in a format that was useful to you, and you sat around twiddling your thumbs until they prioritized your problem. This project can serve as a springboard for building software that does all those things for yourself.

On the user experience side, we're forced to think about how to display the results of our analysis. If we were building a standalone webpage, we might want to produce beautiful charts and graphs. What should those charts be and why? What questions are we trying to answer?

If we were working with a marketing or operations team, we might want to produce a spreadsheet for further analysis. What should that spreadsheet contain and why? How much raw data and how much of our analysis should it contain?

If we were building this to be used by other people in real-time, the overall time it took to compute the statistics would suddenly become part of the user experience. Who wants to wait around while the computer crunches numbers?