Iterations

This project is structured as a sequence of iterations, each of which builds on previous iterations. We do not expect students to complete each and every iteration. Rather, they serve three important roles:

Models for good engineering and product management, i.e., what do we build, in what order, and why?
Natural checkpoints to ask for a code review or other feedback
The ability to accommodate students with different interests, skill levels, and time constraints.

Code Reviews & Feedback

Remember, the absolute, tip-top, #1 priority is asking for and receiving feedback on your code. It's better to "fall short" of an iteration and ask for feedback on an incomplete version than it is to get stuck. It's better to ask for feedback on a hacked-together-but-working version than worry about whether it's "polished enough."

Indeed, even if you know your code is unpolished or incomplete, you may as well ask for feedback so that we can be working on that feedback in parallel while you're polishing or completing your code. The worst that could possibly happen is that we give you feedback you are already aware of.

[v0.1] Basic Count Statistics

Using hard-coded examples, write a method that takes an Array containing arbitrary and possibly duplicated items as input and returns a Hash containing item/count pairs. Print out those pairs in a sensible way.

That is, if the input has 100 entries and 20 of them are the letter "a", then the resultant Hash should contain

{'a' => 20}

"Sensible" is up to you to define, but here's a suggested format, pretending we hard-coded the input as ["a", "a", "a", "b", "b", "c"].

user@host text-analysis $ ruby textalyzer.rb
The counts for ["a", "a", "a", "b", "b", "c"] are...
a   3
b   2
c   1
user@host text-analysis $
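If you'd like something to compare your own attempt against, here is a minimal sketch of one possible approach. The method names item_counts and print_counts are hypothetical (the assignment leaves naming up to you), and the output format simply mirrors the suggestion above.

```ruby
# Hypothetical method names; pick whatever reads best in your program.

# Takes an Array and returns a Hash of item => count pairs.
def item_counts(items)
  counts = Hash.new(0) # missing keys start at 0, so += 1 always works
  items.each { |item| counts[item] += 1 }
  counts
end

# Prints each item/count pair on its own line.
def print_counts(counts)
  counts.each { |item, count| puts "#{item}   #{count}" }
end

sample = ["a", "a", "a", "b", "b", "c"]
puts "The counts for #{sample.inspect} are..."
print_counts(item_counts(sample))
```

Ruby 2.7 and later also ship with Array#tally, which does the same counting for you. Writing the loop by hand first is still a worthwhile exercise.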

[v0.2] String to Characters

Using hard-coded examples, write a method that takes an arbitrary String as input and returns an Array of all the characters in the string, including spaces and punctuation.

Feed this into the array-counting method from the previous iteration to get a Hash containing letter/count pairs. Print out those pairs in a sensible way.
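As a rough sketch, and assuming you're happy to lean on Ruby's built-in String#chars, the string-splitting step might look like the following. The method name characters is hypothetical.

```ruby
# Hypothetical method name. String#chars returns every character,
# including spaces and punctuation, as an Array of one-character Strings.
def characters(string)
  string.chars
end

p characters("Hi, you!")
```

The Array this returns can be passed straight into whatever counting method you wrote for v0.1.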

[v0.3] Basic String Sanitizing

Write a method called sanitize that takes an arbitrary String — perhaps containing spaces, punctuation, line breaks, etc. — and returns a "sanitized" string that replaces all upper-case letters with their lower-case equivalent. This will ensure that the letters 'A' and 'a' are not treated as two distinct letters when we analyze our text. We'll handle punctuation and other bits in a later iteration.

It should work like this

sanitize("This is a sentence.")        # => "this is a sentence."
sanitize("WHY AM I YELLING?")          # => "why am i yelling?"
sanitize("HEY: ThIs Is hArD tO rEaD!") # => "hey: this is hard to read!"

Lucky for us, Ruby comes with a built-in method to help us: String#downcase.
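A minimal sketch of sanitize built on String#downcase could be as short as this. It only handles letter case, which is all this iteration asks for.

```ruby
# Returns a copy of the string with upper-case letters lower-cased.
# Spaces and punctuation are left alone for now.
def sanitize(string)
  string.downcase
end

puts sanitize("HEY: ThIs Is hArD tO rEaD!")
```

Note that String#downcase returns a new String rather than modifying the original.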

Integrate this method into your current program so that the Hash of results contains, e.g.,

{'a' => 25}

instead of

{'a' => 19, 'A' => 6}

Some Notes on String Sanitizing

Oftentimes the data we want isn't in a format that makes it easy to analyze. The process of taking poorly-formatted data and transforming it into something we can make use of is called sanitizing our data.

What counts as "sanitizing" varies depending on the underlying data and our needs. For example, if we wanted to look at all the text in an HTML document, we wouldn't want to be counting all the HTML tags. Conversely, if we wanted a report on the most commonly-used tags in an HTML document, we'd want to keep the tags but remove the text.

In our case, we've designed our program such that it treats upper-case letters and lower-case letters as distinct letters, i.e., our results Hash might contain

{'a' => 20, 'A' => 5}

but we'd probably rather it just contain

{'a' => 25}

Likewise, we probably don't care about punctuation, although this is harder to deal with than differences between upper-case and lower-case letters.

[v0.4] Reading From a Hard-Coded File

The base repository contains a directory called sample_data that contains a handful of text files. Hard-code the name of one of these files into your program and read the contents of that file into a string. Pass that string into your current program, so that it now prints out the letter-count statistics for that specific file instead of the hard-coded strings you had in the previous iteration.

To read the contents of a file into a string, use File.read.

Note: Look at the examples/file_read.rb file in the ruby-examples repository to see an example of File.read in action.
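For reference, here is a small sketch of reading a hard-coded file. The file name sample_data/example.txt is a placeholder assumption, not a real file from the repository, and read_text is a hypothetical helper name. Substitute a file that actually exists in sample_data.

```ruby
# Placeholder path: replace with a real file name from sample_data.
path = "sample_data/example.txt"

# Hypothetical helper: returns the entire contents of the file as one String.
def read_text(path)
  File.read(path)
end

if File.exist?(path)
  text = read_text(path)
  puts "Read #{text.length} characters from #{path}"
else
  puts "No file at #{path}; pick a real file name from sample_data."
end
```

The File.exist? check is only there so the sketch fails politely. Without it, File.read raises an error when the file is missing.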

If you're feeling adventurous, look at the Ruby documentation for File and IO.read. The IO class is a more abstract class encompassing all kinds of input/output, including reading from and writing to network connections, and it is where the general read method is defined. Don't worry if the documentation is overwhelming or doesn't make sense. Reading technical documentation is a skill in its own right and, like any skill, one starts improving by becoming comfortable with what it looks like.

[v1.0] Reading From a User-Supplied File

We don't want to edit our Ruby code every time we need to change the file from which we're reading data. Let's change it so that the user running the program can pass in the name of the file from which to read. We'll do this using command-line arguments.

This iteration marks v1.0 of our program. As it stands, our program — although limited — is self-contained enough that you could give it to another person and they could use it as you intended without having to know how to edit Ruby code.

Congrats!

Command-Line Arguments

Consider the following command, run from the Terminal:

ruby some-program.rb first_argument second_argument banana

The command-line arguments are first_argument, second_argument, and banana, with a space denoting the separation between each argument. first_argument is the first command-line argument and banana is the third command-line argument.

Look at the examples/args.rb file in the ruby-examples repository for a demonstration of how command-line arguments work.
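As a quick illustration (separate from the examples/args.rb file, whose contents aren't reproduced here), Ruby exposes command-line arguments through the built-in ARGV Array. The helper name describe_arguments is hypothetical.

```ruby
# ARGV is Ruby's built-in Array of command-line arguments, in order.
# Hypothetical helper: labels each argument with its position.
def describe_arguments(args)
  args.each_with_index.map { |arg, index| "Argument #{index + 1}: #{arg}" }
end

puts describe_arguments(ARGV)
```

Saved as some-program.rb and run with the command above, this would label first_argument as argument 1 and banana as argument 3. For this iteration, ARGV.first (or ARGV[0]) is the value you would hand to File.read. It is nil when no argument is given, so it's worth checking for that.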

[v1.1] Basic Frequency Statistics

Using hard-coded examples, write a method that takes an Array containing arbitrary and possibly duplicated entries as input and returns a Hash containing item/frequency pairs. Print out those pairs in a sensible way.

That is, if the input has 100 entries and 20 of them are the letter "a", then the resultant Hash should contain

{'a' => 0.20}

Stretch Approach

You've already written a method that takes an Array and returns a Hash containing entry/count pairs and you'll need these counts (one way or another) in order to calculate the overall frequency. If you want to stretch yourself, try writing your "frequency statistics" method in a way that makes use of your "counting statistics" method, so that you don't have to duplicate as much code or work in your program.

This is a "stretch approach," which means it's absolutely not necessary for you to write your program this way. Like we've been saying, it's much better to write something and get feedback on it than get stuck while trying to puzzle out a "better", "faster", "more elegant", etc. approach.
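If you do want to try it, one way the stretch approach might look is sketched below. Both method names are hypothetical, and item_counts stands in for whatever counting method you already wrote. Hash#transform_values requires Ruby 2.4 or later.

```ruby
# Stand-in for your existing counting method from v0.1.
def item_counts(items)
  items.each_with_object(Hash.new(0)) { |item, counts| counts[item] += 1 }
end

# Reuses item_counts, then divides each count by the total number of items.
def item_frequencies(items)
  total = items.length.to_f # to_f avoids integer division (20 / 100 == 0)
  item_counts(items).transform_values { |count| count / total }
end

p item_frequencies(["a", "b", "b", "b"])
```

The conversion to a Float matters: dividing two Integers in Ruby discards the fractional part, so every frequency below 1 would come out as 0.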

[v1.2] Pretty Histograms

Print out a histogram of letter frequencies, a la the original screenshot. Hint: You can use the frequency for each item as a way to scale the length of the histogram.
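To make the hint concrete, here is a rough sketch that scales each bar by its frequency. The screenshot's exact layout isn't reproduced here, so the line format, the "=" bar character, the default width of 50, and the name histogram_lines are all assumptions.

```ruby
# Hypothetical method: turns a Hash of item => frequency pairs
# (frequencies between 0 and 1) into printable histogram lines.
def histogram_lines(frequencies, width = 50)
  frequencies.sort_by { |item, _frequency| item }.map do |item, frequency|
    bar = "=" * (frequency * width).round # bar length is proportional to frequency
    format("%s [%5.2f%%] %s", item, frequency * 100, bar)
  end
end

puts histogram_lines({ "a" => 0.5, "b" => 0.25, "c" => 0.25 }, 20)
```

With a width of 20, an item with frequency 0.5 gets a bar of 10 characters. Very rare items may round down to an empty bar, which you may or may not want.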
