Home
We're going to write a set of simple command-line tools to display basic statistics about a text file or a set of text files. These statistics include:
- Character count, word count, and sentence count
- Letter frequency
- Word frequency, e.g., most common and least common words
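To make these statistics concrete, here is a minimal sketch of how they might be computed on a string. The method names (`basic_counts`, `letter_frequencies`, `word_frequencies`, `most_and_least_common`) are made up for illustration and are not part of the project. The sketch also makes some simplifying assumptions: a "word" is a run of letters or apostrophes, a "sentence" ends at `.`, `!`, or `?`, and only the letters a-z are counted. It uses `Enumerable#tally`, so it needs Ruby 2.7 or newer.

```ruby
# Hypothetical helpers for illustration; the definitions of "word" and
# "sentence" here are deliberately simple assumptions.

# Character, word, and sentence counts for a string.
def basic_counts(text)
  {
    characters: text.length,
    words: text.split.length,                  # whitespace-separated tokens
    sentences: text.scan(/[.!?]+/).length      # runs of ., !, or ? end a sentence
  }
end

# How often each letter a-z appears, ignoring case.
def letter_frequencies(text)
  text.downcase.scan(/[a-z]/).tally
end

# How often each word appears, ignoring case.
def word_frequencies(text)
  text.downcase.scan(/[a-z']+/).tally
end

# The n most common and n least common entries of a frequency hash.
# Ties are broken alphabetically so the result is predictable.
def most_and_least_common(frequencies, n = 3)
  sorted = frequencies.sort_by { |item, count| [-count, item] }
  { most: sorted.first(n), least: sorted.last(n) }
end
```

For example, `basic_counts("The cat sat. The dog ran!")` reports 25 characters, 6 words, and 2 sentences under these definitions.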
We'll also work towards adding the ability to...
- Download data from an arbitrary URL
- Extract the text from a web page for analysis
- Display the results in different formats, e.g., charts, histograms, etc.
- Export the results to a spreadsheet
Here's a screenshot of a program that downloads the entire text of Moby Dick from Project Gutenberg and prints out a histogram of the letter frequencies.
It turns out that the letter "t" makes up 9.25% of all the letters in Moby Dick. The more you know!
Believe it or not, the program above, which downloads text from a URL, computes the letter frequencies, and displays a histogram, is fewer than 50 lines of Ruby. Think about the questions you'd need to answer to make it work, though:
- How do I open and read data contained in a file on my computer?
- How do I download and read data on the web?
- How do I pass in an arbitrary URL to my Ruby program? That is, how do I make something like this work?

      # This downloads and analyzes the text file at the supplied URL
      ruby textalyze.rb http://some-website.com/books/moby-dick.txt
- Once I have the data (from a file or a URL), how do I go about calculating the relevant statistics?
- Once I've calculated the relevant statistics, how do I display them in a user-friendly way?
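One possible way to answer these questions is sketched below. It is not the program from the screenshot, just a rough outline under a few assumptions: the method names (`read_source`, `letter_percentages`, `histogram_lines`) are hypothetical, anything starting with `http://` or `https://` is treated as a URL, and only the letters a-z are counted. It relies on `URI.open` from the standard `open-uri` library (Ruby 2.5+) and on `Enumerable#tally` (Ruby 2.7+).

```ruby
require "open-uri"

# Return the text at `source`, which may be a URL or a local file path.
# Assumption: anything starting with http:// or https:// is a URL.
def read_source(source)
  if source.match?(%r{\Ahttps?://})
    URI.open(source, &:read)   # download the body as a string
  else
    File.read(source)          # read a file from disk
  end
end

# Map each letter a-z to its share of all letters, as a percentage.
def letter_percentages(text)
  letters = text.downcase.scan(/[a-z]/)
  return {} if letters.empty?

  letters.tally.transform_values { |count| 100.0 * count / letters.length }
end

# Build one text line per letter, with a bar scaled so that the most
# frequent letter gets a bar of `width` characters.
def histogram_lines(percentages, width: 50)
  return [] if percentages.empty?

  max = percentages.values.max
  percentages.sort.map do |letter, pct|
    bar = "#" * (pct / max * width).round
    format("%s %6.2f%% %s", letter, pct, bar)
  end
end

# ARGV holds the command-line arguments, so ARGV.first is the file or URL.
if __FILE__ == $PROGRAM_NAME
  if ARGV.empty?
    warn "usage: ruby textalyze.rb <file-or-url>"
  else
    puts histogram_lines(letter_percentages(read_source(ARGV.first)))
  end
end
```

Keeping the reading, counting, and display steps in separate methods means each question in the list can be worked on and tested independently.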
These questions run the gamut from nitty-gritty Ruby to user experience, while also starting us down the path of becoming comfortable with how the web works. From a purely practical perspective, think of any time you've wanted to get data off some website for your own purposes. This project can serve as a springboard for you to answer your own questions.
On the user experience side, we're forced to think about how to display the results of our analysis. If we were building a standalone webpage, we might want to produce beautiful charts and graphs. What should those charts be and why? What questions are we trying to answer?
If we were working with a marketing or operations team, we might want to produce a spreadsheet for further analysis. What should that spreadsheet contain and why? How much raw data and how much of our analysis should it contain?
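As a starting point for the spreadsheet question, here is a minimal sketch that turns a frequency hash into CSV text, which most spreadsheet programs can open. The method name `frequencies_to_csv` and the column names `item`, `count`, and `percent` are assumptions chosen for illustration; deciding what the spreadsheet should really contain is still up to you. It uses Ruby's `csv` library, which ships with most Ruby installations but may need to be installed as a gem on some setups.

```ruby
require "csv"

# Convert a hash like { "the" => 3, "cat" => 1 } into CSV text with one
# row per item: the item, its raw count, and its share of the total.
# Rows are sorted from most to least common, ties broken alphabetically.
def frequencies_to_csv(frequencies)
  total = frequencies.values.sum
  CSV.generate do |csv|
    csv << %w[item count percent]   # header row (column names are assumptions)
    frequencies.sort_by { |item, count| [-count, item] }.each do |item, count|
      csv << [item, count, (100.0 * count / total).round(2)]
    end
  end
end
```

Writing the result to disk is then a single call, e.g. `File.write("frequencies.csv", frequencies_to_csv(counts))`, where `counts` is whatever frequency hash you computed. Including both the raw count and the percentage lets the reader redo the analysis without access to the original text.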