Commit: Exercise updates

Skipper Seabold committed Nov 1, 2017
1 parent 360630c commit 1f77921
Showing 19 changed files with 1,766 additions and 203 deletions.
167 changes: 120 additions & 47 deletions 1 - Reading Data.ipynb
@@ -34,14 +34,14 @@
"metadata": {},
"outputs": [],
"source": [
"csv_file = open(\"data/health_inspection_sample.csv\")"
"csv_file = open(\"data/health_inspection_chi_sample.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"File objects are lazy **iterators**. Lazy means that they only do things, in this case read data, when you ask them to. You can call **next** on iterator objects to explicitly get the next item."
"File objects are lazy **iterators** (here, *stream objects*). Lazy means that they only do things, in this case read data, when you ask them to. You can call **next** on iterator objects to explicitly get the next item."
]
},
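For example, here is a minimal sketch of pulling lines one at a time with `next` (assuming the sample CSV from above is present):

```python
# Each call to next reads exactly one more line -- nothing else is read yet.
csv_file = open("data/health_inspection_chi_sample.csv")

first_line = next(csv_file)
second_line = next(csv_file)

print(first_line)
print(second_line)

csv_file.close()
```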
{
@@ -119,7 +119,7 @@
"metadata": {},
"outputs": [],
"source": [
"with open(\"data/health_inspection_sample.csv\") as csv_file:\n",
"with open(\"data/health_inspection_chi_sample.csv\") as csv_file:\n",
" for line in csv_file:\n",
" pass"
]
@@ -131,6 +131,33 @@
"By using the `open` function as a context manager, we get an automatic call to close the open file when we exit the context (determined by the indentation level). When working with files non-interactively, you'll almost always want to use open as a context manager."
]
},
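A small sketch of that automatic close, using the file object's `closed` attribute:

```python
with open("data/health_inspection_chi_sample.csv") as csv_file:
    print(csv_file.closed)  # False: the file is open inside the block

# Exiting the with-block closed the file for us.
print(csv_file.closed)      # True
```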
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Write some code that iterates through the file `data/health_inspection_chi_sample.json` twice. Call `open` only once, however, and close the file when you are done. Can you find out, programmatically, how many characters are in the file?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load solutions/read_json_twice.py"
]
},
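One possible solution sketch, using `seek(0)` to rewind the file between passes (the bundled `solutions/read_json_twice.py` may take a different approach):

```python
json_file = open("data/health_inspection_chi_sample.json")

for line in json_file:    # first pass through the file
    pass

json_file.seek(0)         # rewind to the beginning

n_chars = 0
for line in json_file:    # second pass, counting characters
    n_chars += len(line)

json_file.close()
print(n_chars)
```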
{
"cell_type": "markdown",
"metadata": {},
@@ -168,7 +195,7 @@
"metadata": {},
"outputs": [],
"source": [
"csv_file = open(\"data/health_inspection_sample.csv\")\n",
"csv_file = open(\"data/health_inspection_chi_sample.csv\")\n",
"\n",
"reader = csv.reader(csv_file)"
]
@@ -222,7 +249,7 @@
"source": [
"The biggest difference between using `csv.reader` and iterating through the file directly is that `csv.reader` automatically splits each line of the csv on commas and returns it as a list of fields.\n",
"\n",
"You can control this behavior through a `Dialect` object. By default, `csv.reader` uses a Dialect object called \"excel.\" "
"You can control this behavior through a `Dialect` object. By default, `csv.reader` uses a Dialect object called \"excel.\" Let's take a look at the attributes of the excel dialect below. Don't worry too much about the code used to display them; we'll come back to it later."
]
},
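One way to display those attributes, sketched here with `csv.get_dialect` (the notebook's own cell may do this differently):

```python
import csv

dialect = csv.get_dialect("excel")

# Print the formatting parameters that make up the "excel" dialect.
for attr in ("delimiter", "quotechar", "doublequote",
             "skipinitialspace", "lineterminator", "quoting"):
    print(attr, "=", repr(getattr(dialect, attr)))
```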
{
@@ -292,7 +319,7 @@
"metadata": {},
"outputs": [],
"source": [
"file_name = \"data/health_inspection_sample.csv\"\n",
"file_name = \"data/health_inspection_chi_sample.csv\"\n",
"\n",
"with open(file_name) as csv_file:\n",
" \n",
@@ -365,7 +392,7 @@
"source": [
"The final thing to note in the block above is the use of `print` to provide some information about what went wrong. Logging is another really good habit to get into, and print statements are the dead simplest way to log the behavior of your code.\n",
"\n",
"In practice, you probably don't want to use `print`. You want to use the logging module (TODO: link)."
"In practice, you probably don't want to use `print`. You want to use the [logging](https://docs.python.org/3/library/logging.html) module, but we won't cover logging best practices today."
]
},
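A minimal sketch of swapping `print` for the `logging` module:

```python
import logging

# Configure a basic root logger once, then log instead of printing.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("reading_data")

try:
    float("not a number")
except ValueError:
    logger.warning("could not convert field to float; skipping")
```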
{
@@ -386,7 +413,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Each line in the file `data/health_inspection_sample.json` is a single json object that represents the same data above. "
"Each line in the file `data/health_inspection_chi_sample.json` is a single json object that represents the same data above. "
]
},
{
@@ -418,7 +445,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since each line is a json object here, we need to iterate over the file and parse each line. We use the `json.loads` function here for \"load string.\" The function `json.load` will take a file-like object."
"Since each line is a json object here, we need to iterate over the file and parse each line. We use the `json.loads` function here for \"load string.\" The similar function `json.load` takes a file-like object."
]
},
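A quick illustration of the difference, using `io.StringIO` to stand in for a real file:

```python
import io
import json

# json.loads parses a string...
record = json.loads('{"name": "ACME CAFE", "results": null}')

# ...while json.load parses a file-like object.
record_too = json.load(io.StringIO('{"name": "ACME CAFE", "results": null}'))

assert record == record_too
```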
{
@@ -443,48 +470,30 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`json.loads` places each json object into a Python dictionary, helpfully filling in `None` for `null` for missing values and otherwise preserving types. It also, works recursively as we see in the `location` field.\n",
"\n",
"We can take further control over how the data is read in by using the `object_hook` argument. Say we wanted to remove the `location` field above. We don't need the `geoJSON` formatted information. We could do so with the `object_hook`."
"`json.loads` places each json object into a Python dictionary, helpfully filling in `None` for `null` for missing values and otherwise preserving types. It also works recursively, as we see in the `location` field."
]
},
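For instance, with a made-up record in the same shape as the inspection data:

```python
import json

raw = ('{"name": "ACME CAFE", "results": null, '
       '"location": {"latitude": "41.89", "longitude": "-87.63"}}')
record = json.loads(raw)

print(record["results"])   # None -- null was filled in for us
print(record["location"])  # a nested dict, parsed recursively
```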
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"def remove_entry(record):\n",
" try:\n",
" del record['location']\n",
" # this is called recursively on objects so not all have it\n",
" except KeyError:\n",
" pass\n",
" \n",
" return record\n",
"\n",
"\n",
"def parse_json(record):\n",
" return json.loads(record, object_hook=remove_entry)"
"## Aside: List Comprehensions"
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"with open(\"data/health_inspection_chi_sample.json\") as json_file:\n",
" dta = [parse_json(line) for line in json_file]\n",
" \n",
"pprint(dta[0])"
"Let's take a look at another Pythonic concept, introduced a bit above, called a **list comprehension**. This is what's called *syntactic sugar*. It's a concise way to create a list."
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"You'll notice two things in the above code. First is the line within the context manager. This is another Pythonic concept called a **list comprehension**. This is what's called *syntactic sugar*. It's a concise way to create a list."
"[i for i in range(1, 6)]"
]
},
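The comprehension above is exactly equivalent to this explicit loop:

```python
result = []
for i in range(1, 6):
    result.append(i)

print(result)  # [1, 2, 3, 4, 5]
```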
{
@@ -570,6 +579,33 @@
"{key: value for key, value in pairs}"
]
},
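For example, with `pairs` spelled out explicitly:

```python
pairs = [("a", 1), ("b", 2), ("c", 3)]

# Build a dict from an iterable of (key, value) tuples.
{key: value for key, value in pairs}  # {'a': 1, 'b': 2, 'c': 3}
```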
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Returning to the code introduced above, we can take further control over how a file of json objects is read in by using the `object_hook` argument. Say we wanted to remove the `location` field, since we don't need the `geoJSON`-formatted information. Write a function called `remove_entry` that uses `object_hook` to remove the `'location'` field from each record in the `'data/health_inspection_chi_sample.json'` file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load solutions/object_hook_json.py"
]
},
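For reference, the cell removed elsewhere in this commit sketches one solution; the bundled `solutions/object_hook_json.py` may differ:

```python
import json
from pprint import pprint

def remove_entry(record):
    try:
        del record['location']
    # object_hook is called recursively, so not every object has the key
    except KeyError:
        pass
    return record

def parse_json(record):
    return json.loads(record, object_hook=remove_entry)

with open("data/health_inspection_chi_sample.json") as json_file:
    dta = [parse_json(line) for line in json_file]

pprint(dta[0])
```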
{
"cell_type": "markdown",
"metadata": {},
@@ -590,7 +626,7 @@
"source": [
"#### Introducing Pandas\n",
"\n",
"First, a few words of introduction for **pandas**. Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data both. It is a high-level tool for doing practical, real world data analysis in Python.\n",
"First, a few words of introduction for **pandas**. Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a high-level tool for doing practical, real world data analysis in Python.\n",
"\n",
"You reach for pandas when you have:\n",
"\n",
@@ -677,17 +713,31 @@
"The JSON counterpart to `read_csv` is `read_json`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Use `pd.read_json` to read in the Chicago health inspections json sample in the `data` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.read_json(\n",
"    \"data/health_inspection_chi_sample.json\", \n",
"    orient=\"records\",\n",
"    lines=True,\n",
")"
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load solutions/read_json.py"
]
},
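One solution, following the call that this commit moved out of the notebook; `lines=True` tells pandas the file contains one json object per line:

```python
import pandas as pd

dta = pd.read_json(
    "data/health_inspection_chi_sample.json",
    orient="records",
    lines=True,
)
```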
{
@@ -703,7 +753,7 @@
"source": [
"So far, we've seen some ways that we can read data from disk. As Data Scientists, we often need to go out and grab data from the Internet.\n",
"\n",
"Generally Python is \"batteries included\" and reading data from the Internet is no exception, but there are some *great* packages out there. [requests]() is one of them for making HTTP requests. Use it. (TODO: link)\n",
"Generally Python is \"batteries included\" and reading data from the Internet is no exception, but there are some *great* packages out there. [requests](http://docs.python-requests.org/en/master/) is one of them for making HTTP requests.\n",
"\n",
"Let's look at how we can use the [Chicago Data Portal](https://data.cityofchicago.org/) API to get this data in the first place. (I originally used San Francisco for this, but the data was just too clean to be terribly interesting.)"
]
@@ -736,6 +786,13 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Requests returns a [Response](http://docs.python-requests.org/en/master/api/#requests.Response) object with many helpful methods and attributes."
]
},
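A sketch of poking at a few of them, using the same Chicago endpoint that appears later in this notebook:

```python
import requests

url = "https://data.cityofchicago.org/resource/cwig-ma7x.json"
response = requests.get(url, params={"$limit": 5})

print(response.status_code)             # 200 on success
print(response.headers["Content-Type"]) # what the server sent back
records = response.json()               # parse the json body into Python objects
```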
{
"cell_type": "code",
"execution_count": null,
@@ -788,15 +845,31 @@
"Of course, pandas can also load data directly from a URL, but I encourage you to reach for `requests` as often as you need it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Try passing the URL above to `pd.read_json`. What happens?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = ('https://data.cityofchicago.org/'\n",
" 'resource/cwig-ma7x.json?$limit=5')\n",
"pd.read_json(url, orient='records')"
"%load solutions/read_url_json.py"
]
},
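Following the cell this commit moved into `solutions/read_url_json.py`, one answer looks like this; `pd.read_json` happily fetches the URL itself:

```python
import pandas as pd

url = ('https://data.cityofchicago.org/'
       'resource/cwig-ma7x.json?$limit=5')
pd.read_json(url, orient='records')
```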
{
@@ -858,7 +931,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes we need to be resourceful in order to get data. Knowing how to scrape the web can really come in handy. We're not going to go into details here, but you'll likely find libraries like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), [lxml](http://lxml.de/), and [mechanize](https://mechanize.readthedocs.io/en/latest/) to be helpful."
"Sometimes we need to be resourceful in order to get data. Knowing how to scrape the web can really come in handy. We're not going to go into details today, but you'll likely find libraries like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), [lxml](http://lxml.de/), and [mechanize](https://mechanize.readthedocs.io/en/latest/) to be helpful. There's also a `read_html` function in pandas that will quickly scrape HTML tables for you and put them into a DataFrame. "
]
}
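A hypothetical sketch of `pd.read_html` (the URL is a placeholder, and the parser requires `lxml` or `html5lib` to be installed):

```python
import pandas as pd

# read_html returns a list of DataFrames, one per <table> in the page.
tables = pd.read_html("https://example.com/page-with-tables.html")
first_table = tables[0]
```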
],