Jesse Farmer edited this page May 31, 2014 · 10 revisions

Analyzing Text in Ruby

We're going to write a set of simple command-line tools that display basic statistics about a text file or a set of text files. These statistics include:

  1. Character count, word count, and sentence count
  2. Letter frequency
  3. Word frequency, e.g., most common and least common words
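As a rough sketch of what these statistics could look like in code, the helpers below compute each one for a string. The method names (`basic_counts`, `count_occurrences`, `letter_frequencies`, `word_frequencies`, `most_common_words`) are hypothetical, and the definitions of "word" and "sentence" are simplifying assumptions: words are runs of letters and apostrophes, and a sentence ends at a run of `.`, `!`, or `?`.

```ruby
# Count characters, words, and sentences in a string.
# Assumption: a "sentence" ends at a run of ., !, or ? characters.
def basic_counts(text)
  {
    characters: text.length,
    words: text.split.length,
    sentences: text.scan(/[.!?]+/).length
  }
end

# Count how many times each item appears in a list.
def count_occurrences(items)
  items.each_with_object(Hash.new(0)) { |item, counts| counts[item] += 1 }
end

# Letter frequency, ignoring case and anything outside a-z.
def letter_frequencies(text)
  count_occurrences(text.downcase.scan(/[a-z]/))
end

# Word frequency, ignoring case.
# Assumption: a word is a run of letters and apostrophes.
def word_frequencies(text)
  count_occurrences(text.downcase.scan(/[a-z']+/))
end

# The n most common words as [word, count] pairs, most frequent first.
# Ties are broken alphabetically so the result is predictable.
def most_common_words(text, n = 5)
  word_frequencies(text).sort_by { |word, count| [-count, word] }.first(n)
end

text = "Call me Ishmael. Some years ago, never mind how long precisely."
p basic_counts(text)
p most_common_words(text, 3)
```

The least common words can be found the same way by sorting on `count` instead of `-count`. Real text will expose the limits of these assumptions quickly (abbreviations like "Mr." count as sentence endings, for instance), which is part of what makes the project interesting.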

We'll also work toward adding the ability to:

  1. Download data from an arbitrary URL
  2. Extract the text from a web page for analysis
  3. Display the results in different formats, e.g., charts, histograms, etc.
  4. Export the results to a spreadsheet

Here's a screenshot of a program that downloads the entire text of Moby Dick from Project Gutenberg and prints out a histogram of the letter frequencies.

[Screenshot: letter frequencies in Moby Dick]
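A text histogram along these lines can be sketched in a few lines of Ruby. This is not the program from the screenshot, only one possible approach: the method name `histogram_lines`, the `#` bar character, and the default width of 50 columns are all assumptions made for illustration.

```ruby
# Turn a hash of { letter => count } into printable histogram rows.
# Each bar is scaled so the most frequent letter fills `width` columns.
def histogram_lines(frequencies, width = 50)
  return [] if frequencies.empty?

  # Guard against dividing by zero if every count happens to be 0.
  max = [frequencies.values.max, 1].max

  frequencies.sort.map do |letter, count|
    bar = "#" * (count * width / max)
    format("%s %6d %s", letter, count, bar)
  end
end

# Illustrative counts only, not real data from any book.
sample = { "e" => 12, "t" => 9, "a" => 8, "z" => 1 }
puts histogram_lines(sample, 24)
```

Because the bars use integer division, very rare letters may end up with an empty bar; the count column still shows that they occurred.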

Why are we doing this?

If you can believe it, the program above, which downloads text from a URL, computes the letter frequencies, and displays a histogram, is fewer than 50 lines of Ruby. Think about the questions you'd need to answer to make it work, though:

  1. How do I open and read data contained in a file on my computer?
  2. How do I download and read data from the web?
  3. How do I pass in an arbitrary URL to my Ruby program? That is, how do I make something like this work?
    # This downloads and analyzes the text file at the supplied URL
    ruby textalyze.rb http://some-website.com/books/moby-dick.txt
  4. Once I have the data (from a file or a URL), how do I go about calculating the relevant statistics?
  5. Once I've calculated the relevant statistics, how do I display them in a user-friendly way?
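To make the first three questions concrete, here is a minimal sketch of one way to approach them, assuming Ruby 2.5 or later. Command-line arguments arrive in the built-in `ARGV` array, `File.read` reads a local file, and the standard library's `open-uri` provides `URI.open` for downloading. The helper names `url?` and `load_text` are hypothetical, and deciding "URL or file?" by checking for an `http://` or `https://` prefix is a simplifying assumption.

```ruby
require "open-uri"

# True when the argument looks like an http(s) URL rather than a file path.
def url?(source)
  source.match?(%r{\Ahttps?://}i)
end

# Return the full text found at a URL or in a local file.
def load_text(source)
  if url?(source)
    URI.open(source, &:read) # download via open-uri
  else
    File.read(source)        # read from disk
  end
end

# Only run when this file is executed directly, e.g.
#   ruby textalyze.rb http://some-website.com/books/moby-dick.txt
if __FILE__ == $PROGRAM_NAME
  source = ARGV.first
  if source
    text = load_text(source)
    puts "Read #{text.length} characters from #{source}"
  else
    warn "Usage: ruby textalyze.rb <file-or-url>"
  end
end
```

Keeping the loading step in its own method means the analysis code never needs to know whether the text came from a file or the web; it just receives a string.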

Some Practical Scenarios

These questions run the gamut from nitty-gritty Ruby to user experience, while also starting us down the path of becoming comfortable with how the web works. From a purely practical perspective, think of any time you've wanted to get data off some website for your own purposes. This project can serve as a springboard for you to answer your own questions.

On the user experience side, we're forced to think about how to display the results of our analysis. If we were building a standalone webpage, we might want to produce beautiful charts and graphs. What should those charts be and why? What questions are we trying to answer?

If we were working with a marketing or operations team, we might want to produce a spreadsheet for further analysis. What should that spreadsheet contain and why? How much raw data and how much of our analysis should it contain?
