Commit: Exercise updates

Skipper Seabold committed Nov 1, 2017
1 parent 360630c commit 1f77921
Showing 19 changed files with 1,766 additions and 203 deletions.
167 changes: 120 additions & 47 deletions 1 - Reading Data.ipynb
@@ -34,14 +34,14 @@
"metadata": {},
"outputs": [],
"source": [
"csv_file = open(\"data/health_inspection_sample.csv\")"
"csv_file = open(\"data/health_inspection_chi_sample.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"File objects are lazy **iterators**. Lazy means that they only do things, in this case read data, when you ask them to. You can call **next** on iterator objects to explicitly get the next item."
"File objects are lazy **iterators** (here, *stream objects*). Lazy means that they only do things, in this case read data, when you ask them to. You can call **next** on iterator objects to explicitly get the next item."
]
},
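For example, here is a minimal sketch of pulling lines one at a time with `next` (assuming the sample CSV from above is present):

```python
# Each call to next reads exactly one more line -- nothing else is read yet.
csv_file = open("data/health_inspection_chi_sample.csv")

first_line = next(csv_file)
second_line = next(csv_file)

print(first_line)
print(second_line)

csv_file.close()
```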
{
@@ -119,7 +119,7 @@
"metadata": {},
"outputs": [],
"source": [
"with open(\"data/health_inspection_sample.csv\") as csv_file:\n",
"with open(\"data/health_inspection_chi_sample.csv\") as csv_file:\n",
" for line in csv_file:\n",
" pass"
]
@@ -131,6 +131,33 @@
"By using the `open` function as a context manager, we get an automatic call to close the open file when we exit the context (determined by the indentation level). When working with files non-interactively, you'll almost always want to use open as a context manager."
]
},
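A small sketch of that automatic close, using the file object's `closed` attribute:

```python
with open("data/health_inspection_chi_sample.csv") as csv_file:
    print(csv_file.closed)  # False: the file is open inside the block

# Exiting the with-block closed the file for us.
print(csv_file.closed)      # True
```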
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Write some code that iterates through the file `data/health_inspection_chi_sample.json` twice. Call `open` only once, however, and close the file when you are done. Can you find out, programmatically, how many characters are in the file?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load solutions/read_json_twice.py"
]
},
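One possible solution sketch, using `seek(0)` to rewind the file between passes (the bundled `solutions/read_json_twice.py` may take a different approach):

```python
json_file = open("data/health_inspection_chi_sample.json")

for line in json_file:    # first pass through the file
    pass

json_file.seek(0)         # rewind to the beginning

n_chars = 0
for line in json_file:    # second pass, counting characters
    n_chars += len(line)

json_file.close()
print(n_chars)
```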
{
"cell_type": "markdown",
"metadata": {},
@@ -168,7 +195,7 @@
"metadata": {},
"outputs": [],
"source": [
"csv_file = open(\"data/health_inspection_sample.csv\")\n",
"csv_file = open(\"data/health_inspection_chi_sample.csv\")\n",
"\n",
"reader = csv.reader(csv_file)"
]
@@ -222,7 +249,7 @@
"source": [
"The biggest difference between using `csv.reader` and iterating through the file directly is that `csv.reader` automatically splits each line of the csv on commas and returns it as a list of fields.\n",
"\n",
"You can control this behavior through a `Dialect` object. By default, `csv.reader` uses a Dialect object called \"excel.\" "
"You can control this behavior through a `Dialect` object. By default, `csv.reader` uses a Dialect object called \"excel.\" Let's take a look at the attributes of the excel dialect below. Don't worry too much about the code used to display them; we'll come back to it later."
]
},
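One way to display those attributes, sketched here with `csv.get_dialect` (the notebook's own cell may do this differently):

```python
import csv

dialect = csv.get_dialect("excel")

# Print the formatting parameters that make up the "excel" dialect.
for attr in ("delimiter", "quotechar", "doublequote",
             "skipinitialspace", "lineterminator", "quoting"):
    print(attr, "=", repr(getattr(dialect, attr)))
```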
{
@@ -292,7 +319,7 @@
"metadata": {},
"outputs": [],
"source": [
"file_name = \"data/health_inspection_sample.csv\"\n",
"file_name = \"data/health_inspection_chi_sample.csv\"\n",
"\n",
"with open(file_name) as csv_file:\n",
" \n",
@@ -365,7 +392,7 @@
"source": [
"The final thing to note in the block above is the use of `print` to provide some information about what went wrong. Logging is another really good habit to get into, and print statements are the dead simplest way to log the behavior of your code.\n",
"\n",
"In practice, you probably don't want to use `print`. You want to use the logging module (TODO: link)."
"In practice, you probably don't want to use `print`. You want to use the [logging](https://docs.python.org/3/library/logging.html) module, but we won't cover logging best practices today."
]
},
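A minimal sketch of swapping `print` for the `logging` module:

```python
import logging

# Configure a basic root logger once, then log instead of printing.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("reading_data")

try:
    float("not a number")
except ValueError:
    logger.warning("could not convert field to float; skipping")
```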
{
@@ -386,7 +413,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Each line in the file `data/health_inspection_sample.json` is a single json object that represents the same data above. "
"Each line in the file `data/health_inspection_chi_sample.json` is a single json object that represents the same data above. "
]
},
{
@@ -418,7 +445,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since each line is a json object here, we need to iterate over the file and parse each line. We use the `json.loads` function here for \"load string.\" The function `json.load` will take a file-like object."
"Since each line is a json object here, we need to iterate over the file and parse each line. We use the `json.loads` function here for \"load string.\" The similar function `json.load` takes a file-like object."
]
},
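A quick illustration of the difference, using `io.StringIO` to stand in for a real file:

```python
import io
import json

# json.loads parses a string...
record = json.loads('{"name": "ACME CAFE", "results": null}')

# ...while json.load parses a file-like object.
record_too = json.load(io.StringIO('{"name": "ACME CAFE", "results": null}'))

assert record == record_too
```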
{
@@ -443,48 +470,30 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`json.loads` places each json object into a Python dictionary, helpfully filling in `None` for `null` for missing values and otherwise preserving types. It also, works recursively as we see in the `location` field.\n",
"\n",
"We can take further control over how the data is read in by using the `object_hook` argument. Say we wanted to remove the `location` field above. We don't need the `geoJSON` formatted information. We could do so with the `object_hook`."
"`json.loads` places each json object into a Python dictionary, helpfully filling in `None` for `null` for missing values and otherwise preserving types. It also works recursively, as we see in the `location` field."
]
},
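For instance, with a made-up record in the same shape as the inspection data:

```python
import json

raw = ('{"name": "ACME CAFE", "results": null, '
       '"location": {"latitude": "41.89", "longitude": "-87.63"}}')
record = json.loads(raw)

print(record["results"])   # None -- null was filled in for us
print(record["location"])  # a nested dict, parsed recursively
```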
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"def remove_entry(record):\n",
" try:\n",
" del record['location']\n",
" # this is called recursively on objects so not all have it\n",
" except KeyError:\n",
" pass\n",
" \n",
" return record\n",
"\n",
"\n",
"def parse_json(record):\n",
" return json.loads(record, object_hook=remove_entry)"
"## Aside: List Comprehensions"
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"with open(\"data/health_inspection_chi_sample.json\") as json_file:\n",
" dta = [parse_json(line) for line in json_file]\n",
" \n",
"pprint(dta[0])"
"Let's take a look at another Pythonic concept, introduced a bit above, called a **list comprehension**. This is what's called *syntactic sugar*. It's a concise way to create a list."
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"You'll notice two things in the above code. First is the line within the context manager. This is another Pythonic concept called a **list comprehension**. This is what's called *syntactic sugar*. It's a concise way to create a list."
"[i for i in range(1, 6)]"
]
},
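The comprehension above is exactly equivalent to this explicit loop:

```python
result = []
for i in range(1, 6):
    result.append(i)

print(result)  # [1, 2, 3, 4, 5]
```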
{
@@ -570,6 +579,33 @@
"{key: value for key, value in pairs}"
]
},
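For example, with `pairs` spelled out explicitly:

```python
pairs = [("a", 1), ("b", 2), ("c", 3)]

# Build a dict from an iterable of (key, value) tuples.
{key: value for key, value in pairs}  # {'a': 1, 'b': 2, 'c': 3}
```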
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Returning to the code introduced above, we can take further control over how a file of json objects is read in by using the `object_hook` argument. Say we wanted to remove the `location` field, since we don't need the `geoJSON`-formatted information. Write a function called `remove_entry` that uses `object_hook` to remove the `'location'` field from each record in the `'data/health_inspection_chi_sample.json'` file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load solutions/object_hook_json.py"
]
},
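For reference, the cell removed elsewhere in this commit sketches one solution; the bundled `solutions/object_hook_json.py` may differ:

```python
import json
from pprint import pprint

def remove_entry(record):
    try:
        del record['location']
    # object_hook is called recursively, so not every object has the key
    except KeyError:
        pass
    return record

def parse_json(record):
    return json.loads(record, object_hook=remove_entry)

with open("data/health_inspection_chi_sample.json") as json_file:
    dta = [parse_json(line) for line in json_file]

pprint(dta[0])
```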
{
"cell_type": "markdown",
"metadata": {},
@@ -590,7 +626,7 @@
"source": [
"#### Introducing Pandas\n",
"\n",
"First, a few words of introduction for **pandas**. Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data both. It is a high-level tool for doing practical, real world data analysis in Python.\n",
"First, a few words of introduction for **pandas**. Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a high-level tool for doing practical, real world data analysis in Python.\n",
"\n",
"You reach for pandas when you have:\n",
"\n",
@@ -677,17 +713,31 @@
"The JSON counterpart to `read_csv` is `read_json`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Use `pd.read_json` to read in the Chicago health inspections json sample in the `data` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.read_json(\n",
"    \"data/health_inspection_chi_sample.json\", \n",
"    orient=\"records\",\n",
"    lines=True,\n",
")"
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load solutions/read_json.py"
]
},
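One solution, following the call that this commit moved out of the notebook; `lines=True` tells pandas the file contains one json object per line:

```python
import pandas as pd

dta = pd.read_json(
    "data/health_inspection_chi_sample.json",
    orient="records",
    lines=True,
)
```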
{
@@ -703,7 +753,7 @@
"source": [
"So far, we've seen some ways that we can read data from disk. As Data Scientists, we often need to go out and grab data from the Internet.\n",
"\n",
"Generally Python is \"batteries included\" and reading data from the Internet is no exception, but there are some *great* packages out there. [requests]() is one of them for making HTTP requests. Use it. (TODO: link)\n",
"Generally Python is \"batteries included\" and reading data from the Internet is no exception, but there are some *great* packages out there. [requests](http://docs.python-requests.org/en/master/) is one of them for making HTTP requests.\n",
"\n",
"Let's look at how we can use the [Chicago Data Portal](https://data.cityofchicago.org/) API to get this data in the first place. (I originally used San Francisco for this, but the data was just too clean to be terribly interesting.)"
]
@@ -736,6 +786,13 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Requests returns a [Response](http://docs.python-requests.org/en/master/api/#requests.Response) object with many helpful methods and attributes."
]
},
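A sketch of poking at a few of them, using the same Chicago endpoint that appears later in this notebook:

```python
import requests

url = "https://data.cityofchicago.org/resource/cwig-ma7x.json"
response = requests.get(url, params={"$limit": 5})

print(response.status_code)             # 200 on success
print(response.headers["Content-Type"]) # what the server sent back
records = response.json()               # parse the json body into Python objects
```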
{
"cell_type": "code",
"execution_count": null,
@@ -788,15 +845,31 @@
"Of course, pandas can also load data directly from a URL, but I encourage you to reach for `requests` as often as you need it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"\n",
"Try passing the URL above to `pd.read_json`. What happens?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Type your solution here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = ('https://data.cityofchicago.org/'\n",
" 'resource/cwig-ma7x.json?$limit=5')\n",
"pd.read_json(url, orient='records')"
"%load solutions/read_url_json.py"
]
},
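Following the cell this commit moved into `solutions/read_url_json.py`, one answer looks like this; `pd.read_json` happily fetches the URL itself:

```python
import pandas as pd

url = ('https://data.cityofchicago.org/'
       'resource/cwig-ma7x.json?$limit=5')
pd.read_json(url, orient='records')
```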
{
@@ -858,7 +931,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes we need to be resourceful in order to get data. Knowing how to scrape the web can really come in handy. We're not going to go into details here, but you'll likely find libraries like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), [lxml](http://lxml.de/), and [mechanize](https://mechanize.readthedocs.io/en/latest/) to be helpful."
"Sometimes we need to be resourceful in order to get data. Knowing how to scrape the web can really come in handy. We're not going to go into details today, but you'll likely find libraries like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), [lxml](http://lxml.de/), and [mechanize](https://mechanize.readthedocs.io/en/latest/) to be helpful. There's also a `read_html` function in pandas that will quickly scrape HTML tables for you and put them into a DataFrame. "
]
}
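A hypothetical sketch of `pd.read_html` (the URL is a placeholder, and the parser requires `lxml` or `html5lib` to be installed):

```python
import pandas as pd

# read_html returns a list of DataFrames, one per <table> in the page.
tables = pd.read_html("https://example.com/page-with-tables.html")
first_table = tables[0]
```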
],