- Code review implementations of higher order Markov chains
- Review the structures used to build a Markov chain and discuss scalability
- Lecture and discussion following regular expressions slides
- Build and test regular expressions with RegExr and visualize them with RegExper
After completing this class session and the associated tutorial challenges, students will be able to ...
- Use regular expressions to clean up and remove junk text from a corpus
- Use regular expressions to create a more intelligent word tokenizer
- Watch Make School's regular expressions lecture
- Review Make School's regular expressions slides
- Use Cheatography's regular expressions cheat sheet as a reference guide
- Solve interactive challenges in UBC's regular expressions lab webpage
- Use RegExr or RegEx Pal to build and test regular expression patterns on text samples
- Use RegExper to visualize railroad diagrams of regular expression patterns
- Read StackOverflow answers to questions about using regular expressions to parse HTML: first some comedic relief and then an explanation of why you shouldn't
These challenges are the baseline required to complete the project and course. Be sure to complete these before next class session and before starting on the stretch challenges below.
- Page 13: Parsing Text and Clean Up
- Remove unwanted junk text (e.g., chapter titles in books, character names in scripts)
- Remove unwanted punctuation (e.g., `_` or `*` characters around words)
- Convert HTML character codes to ASCII equivalents (e.g., `&mdash;` to `—`)
- Normalize punctuation characters (e.g., convert both types of quotes – `‘’` and `“”` – to regular quotes – `''` and `""`)
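The clean-up steps above can be sketched with Python's `re` module. This is a minimal example, not the project's required implementation; the function name and the exact substitution patterns are illustrative choices:

```python
import re

def clean_text(text):
    """Sketch of a regex-based clean-up pass (names and patterns are illustrative)."""
    # Convert the &mdash; HTML character code to a plain double hyphen
    text = text.replace('&mdash;', '--')
    # Normalize curly quotes to regular straight quotes
    text = re.sub(r'[‘’]', "'", text)
    text = re.sub(r'[“”]', '"', text)
    # Strip _ or * characters wrapped around single words
    text = re.sub(r'[_*]+(\w+)[_*]+', r'\1', text)
    return text
```

Try it on a small sample first (e.g., `clean_text("_word_")` returns `word`) before running it over the whole corpus.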
- Page 14: Tokenization
- Handle special characters (e.g., underscores, dashes, brackets, `$`, `%`, `•`, etc.)
- Handle punctuation and hyphens (e.g., `Dr.`, `U.S.`, `can't`, `on-demand`, etc.)
- Handle letter casing and capitalization (e.g., `turkey` and `Turkey`)
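One way to approach these tokenization cases is a single verbose regex with ordered alternatives. This is only a sketch of one possible strategy, and the abbreviation list is an illustrative assumption, not part of the assignment:

```python
import re

def tokenize(text):
    """Sketch of a regex word tokenizer; the pattern is one possible design."""
    pattern = re.compile(r"""
        [A-Z](?:\.[A-Z])+\.?        # acronyms like U.S. or U.S.A.
      | (?:Dr|Mr|Mrs|Ms|St)\.      # common abbreviations kept with their period
      | \w+(?:[-']\w+)*            # words, contractions (can't), hyphens (on-demand)
      | [$%•]                      # special characters kept as their own tokens
    """, re.VERBOSE)
    return pattern.findall(text)
```

Whether to lowercase tokens (so `turkey` and `Turkey` collapse into one word) is a separate design decision: lowercasing shrinks your vocabulary but loses the distinction between the bird and the country.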
These challenges are more difficult and help you push your skills and understanding to the next level.
- Page 13: Parsing Text and Clean Up
- Make your parser code readable, then improve its organization and modularity so that it's easy to modify in the future
- Modify your parser so that it can be used both as a module (imported by another script) and as a stand-alone executable script that, when invoked from the command line with a file argument, prints the cleaned-up version (which can then be redirected into a file)
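The dual module/script behavior is typically achieved with Python's `if __name__ == '__main__'` guard. A minimal sketch of a hypothetical `parser.py` (the `clean_text` body here is a placeholder for your actual clean-up logic):

```python
#!/usr/bin/env python3
"""Hypothetical parser.py: importable as a module, runnable as a script."""
import sys

def clean_text(text):
    """Placeholder for the real clean-up logic."""
    return text.strip()

# Runs only when invoked directly with a file argument,
# e.g.: python parser.py corpus.txt > clean.txt
if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as file:
        print(clean_text(file.read()))
```

When another script does `from parser import clean_text`, the guard is false, so nothing is printed and no file is opened.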
- Page 14: Tokenization
- Make your tokenizer code readable, then improve its organization and modularity so that it's easy to modify in the future
- Write tests to ensure that you're getting the results you've designed for, then run your tests with controlled input data
- Come up with at least one other tokenization strategy and compare performance against your original strategy, then find ways to make your tokenizer more efficient
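A quick way to compare two tokenization strategies is the standard-library `timeit` module. The two strategies and the sample text below are illustrative assumptions, not required choices:

```python
import re
import timeit

def tokenize_split(text):
    # Baseline strategy: split on whitespace only
    return text.split()

def tokenize_regex(text):
    # Alternative strategy: match words, contractions, and hyphenated words
    return re.findall(r"\w+(?:[-']\w+)*", text)

sample = "the quick-thinking fox can't jump " * 1000

for fn in (tokenize_split, tokenize_regex):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print(f"{fn.__name__}: {seconds:.3f}s for 100 runs")
```

Note that the two strategies produce different tokens (e.g., the split version keeps trailing punctuation attached), so compare output quality alongside speed.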