This is the public posting of the assignment. See Blackboard for the invite link to create your own submission repository in the class organization.
Due: Tuesday, September 22, 2020 by 11:59pm
The goal of this week's assignment is to gain experience using OpenRefine for data cleaning.
This assignment assumes that you have already downloaded and installed OpenRefine and worked through the tutorial from Week 2 of CS 625.
Create a new project in OpenRefine and load the PetNames.tsv dataset available from https://github.com/jgolbeck/petnames (read the README.txt in that repo for more information on the dataset). If you view the raw version of the data file in GitHub, you can copy that URL directly into OpenRefine to load the data without downloading it separately.
Note: The class sessions will likely not cover everything you need to know to complete the assignment. I expect that you will watch the tutorials and read the documentation, including the documentation on GREL and its regular expression support.
Use OpenRefine to clean the dataset of pet names so that you can answer the questions in Part 2. Make sure to keep track of all operations you perform. As much as you can, use OpenRefine facets and GREL transforms to clean the data rather than manual editing (though, some cleaning will need to be done manually).
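For intuition only, the Python sketch below mimics the effect of a common cleaning transform: trimming, collapsing repeated whitespace, and normalizing case. It is roughly analogous to chaining GREL's trim(), replace(), and toTitlecase() functions. The sample values are invented, not taken from PetNames.tsv, and the cleaning itself should still be done in OpenRefine.

```python
import re


def normalize(value: str) -> str:
    """Trim, collapse runs of whitespace to one space, and title-case a cell value."""
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.title()


# Invented examples of inconsistent spellings for the same category.
samples = ["  golden   retriever", "CAT ", "guinea pig"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
```

After a transform like this, variants that differed only in spacing or capitalization fall into the same facet bucket, which is what makes the counts in Part 2 meaningful.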
There are a couple of entries where multiple pets are listed in the same entry. Make a decision on how to handle these cases and document it in your report.
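One possible decision is to split such an entry into one record per pet (in OpenRefine, the "Split multi-valued cells" operation does something comparable). The Python sketch below shows the idea on made-up names. The separator pattern is an assumption, so inspect the real entries before settling on one.

```python
import re

# Hypothetical separators: comma, ampersand, slash, or the word "and".
SEPARATORS = re.compile(r"\s*(?:,|&|/|\band\b)\s*", flags=re.IGNORECASE)


def split_names(cell: str) -> list[str]:
    """Split a cell holding several pet names into a list of individual names."""
    parts = SEPARATORS.split(cell.strip())
    # Drop empty strings produced by adjacent separators such as ", and".
    return [part for part in parts if part]


print(split_names("Max and Bella"))
print(split_names("Rex, Spot & Fido"))
print(split_names("Sandy"))  # "and" inside a word is not treated as a separator
```

Whichever approach you choose (splitting, keeping the entry as is, or dropping it), the choice changes the name and pet counts, so it belongs in the report.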
When you are done cleaning the file:
- Export the file as a new CSV and save it in your repo as HW2-petnames.csv.
- Extract the JSON script containing all of the operations you performed on the file and save it in your repo as HW2-petnames.json. (Select Extract at the top of the Undo/Redo tab. Then copy and paste the JSON script into a new file.)
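Before committing the JSON file, it can help to confirm that the pasted script is complete and parses. The Python sketch below assumes the extracted history is a JSON array of objects with "op" and "description" keys. The two operations, the column name Kind, and the expression are invented stand-ins for the real file contents.

```python
import json

# Stand-in for the text pasted into HW2-petnames.json (illustrative only).
extracted = """
[
  {
    "op": "core/text-transform",
    "columnName": "Kind",
    "expression": "value.trim()",
    "description": "Text transform on cells in column Kind using expression value.trim()"
  },
  {
    "op": "core/mass-edit",
    "columnName": "Kind",
    "description": "Mass edit cells in column Kind"
  }
]
"""


def summarize_operations(text: str) -> list[str]:
    """Parse an operation history and return one summary line per operation."""
    operations = json.loads(text)  # raises json.JSONDecodeError on a truncated paste
    if not isinstance(operations, list):
        raise ValueError("expected a JSON array of operations")
    return [
        f"{number}. {op.get('op', '?')}: {op.get('description', '')}"
        for number, op in enumerate(operations, start=1)
    ]


for line in summarize_operations(extracted):
    print(line)
```

To check the real file, read its contents with open("HW2-petnames.json").read() and pass the text to summarize_operations. The numbered summary can also serve as an outline for the cleaning steps described in the report.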
In your report, answer the following questions using the cleaned data:
- How many types (kinds) of pets are there?
- How many dogs?
- How many breeds of dogs?
- What's the most popular dog breed?
- What's the age range of the dogs?
- What's the age range of the guinea pigs?
- What is the oldest pet?
- Which are more popular, betta fish or goldfish? How many of each?
- What's the most popular everyday name for a cat?
- What's the most popular full name for a dog?
I do not expect everyone to have the exact same answers. Some of these will depend upon decisions you make while cleaning the data. Make sure to note any decisions you make that could have an impact on your answers.
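As an optional sanity check on the exported CSV, the counting questions can be reproduced outside OpenRefine. The Python sketch below runs on a few invented rows. The column headers Kind, Breed, and Age are assumptions and may not match the real file. The report should still explain how each answer was obtained in OpenRefine.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for HW2-petnames.csv.
sample_csv = """Kind,Breed,Age
Dog,Beagle,3
Dog,Poodle,7
Dog,Beagle,5
Cat,Siamese,2
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

kind_counts = Counter(row["Kind"] for row in rows)
dog_rows = [row for row in rows if row["Kind"] == "Dog"]
breed_counts = Counter(row["Breed"] for row in dog_rows)
dog_ages = [int(row["Age"]) for row in dog_rows]

print("kinds of pets:", len(kind_counts))
print("dogs:", kind_counts["Dog"])
print("most common dog breed:", breed_counts.most_common(1))
print("dog age range:", min(dog_ages), "to", max(dog_ages))
```

These counts correspond to what a text facet shows in OpenRefine: the number of distinct choices and the count beside each one. If the numbers disagree, some variants probably remain unmerged in the cleaned data.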
For this project, since you'll just be describing the actions you took using OpenRefine without including R code, you can either write directly in Markdown in report.md or you can use R Markdown in report.Rmd and the Knit process to generate your report.md. (In any case, report.md is the file that I will use for grading.)
In your report, explain the steps you took to clean the data. Make sure to include and explain all GREL functions that you used. If you did any manual cleaning, note that and explain why you did this manually.
In answering the questions, also explain how you arrived at the answer using OpenRefine.
Important: Your report is the most important part of this assignment. You need to include enough detail so that I am convinced that you understand how to use OpenRefine. I have not provided a template, but I expect your report to include your name, CS625-HW2, date, and appropriate headings and Markdown markup for clarity and neatness. In addition, you will lose points if there are many spelling or grammatical errors.
It is not sufficient just to provide the answers to the questions. You must first describe in detail how you cleaned the data. Include screenshots, GREL statements, etc. as needed to clearly document what you did. Make sure that your report is clear and easily readable.
Include links to any examples that you used in completing this assignment, including the tutorial examples that have been provided.
Your GitHub repository should contain the following files (in addition to any assignment files that were provided):
- report.md - your report
- HW2-petnames.csv - cleaned CSV
- HW2-petnames.json - operations used to clean the data in JSON format
Submit the URL of your report (not the URL of your repo) in Blackboard. Make sure that you have committed and pushed your local repo to GitHub. Include "Ready to grade @weiglemc" in your final commit message.
- Click on HW2 under Week 2 in Blackboard
- Under "Assignment Submission", click the "Write Submission" button.
- Copy/paste the URL of your report.md file into the edit box (should be something like https://github.com/cs625-datavis-fall20/hw2-cleaning-username/blob/master/report.md)
- Make sure to "Submit" your assignment.