- Instructor: Meredith Franklin
- Email: [email protected]. Please put "JSC370" in the subject line.
- Teaching Assistants: Evelyn Pan ([email protected]); Sejal Bhalla ([email protected])
- Time: Mondays (Lecture) and Wednesdays (Lab), 1-3pm
- Location: SF 2202 (Mondays), GB 304 (Wednesdays)
- Office hours: By Appointment
- Course Forum: Quercus
- Lab materials

Week | Topics/Weekly Activities | Labs Due Wednesdays 11:59pm, HW Due Fridays 11:59pm |
---|---|---|
Week 1: January 6 lecture, January 8 lab | Introduction to Data Science tools: R, markdown | Lab 1 |
Week 2: January 13 lecture, January 15 lab | Version Control & Reproducible Research, Git | Lab 2 |
Week 3: January 20 lecture, January 22 lab | Exploratory Data Analysis | Lab 3 |
Week 4: January 27 lecture, January 29 lab | Data visualization | HW1, Lab 4 |
Week 5: February 3 lecture, February 5 lab | Data cleaning and wrangling; ML 1 (gam) | Lab 5 |
Week 6: February 10 lecture, February 12 lab | Regular Expressions, Data scraping, using APIs | HW2, Lab 6 |
Week 7: February 17 | Reading Week | |
Week 8: February 24 lecture, February 26 lab | Text mining | HW3, Lab 8 |
Week 9: March 3 lecture, March 5 lab | High performance computing, cloud computing | Midterm, Lab 9 |
Week 10: March 10 lecture, March 12 lab | ML 2 (trees, rf, xgboost) | Lab 10 |
Week 11: March 17 lecture, March 19 lab | Interactive visualization and effective data communication I | HW4, Lab 11 |
Week 12: March 24 lecture, March 26 lab | Interactive visualization and effective data communication II | Lab 12 |
Week 13: March 31, April 2 | Final Project Workshop | HW5 |
Week 15: April 28 | Final Project | |

Task | % of Grade |
---|---|
Labs (including attendance) | 10 |
Homework (5) | 25 |
Midterm report | 30 |
Final project | 35 |
- R Programming for Data Science, 2022. Roger Peng.
- R for Data Science (2e), 2023. Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
- Exploratory Data Analysis with R, 2020. Roger Peng.
- Mastering Software Development in R, 2020. Roger Peng, Sean Kross, Brooke Anderson.
- The Plain Person’s Guide to Plain Text Social Science: Why you should write data-based reports using plain-text tools.
- Markdown tutorial: An interactive tutorial to practice using Markdown.
- Markdown cheatsheet: A useful one-page Markdown reference.
- RStudio Cheatsheets. Other quick guides, including a more comprehensive RMarkdown reference, information about using RStudio's IDE, and some of the main tools in R.
- R Style Guide. Write readable code.
- Jenny Bryan's Stat 545. Notes and tutorials for a Data Analysis course taught by Jennifer Bryan at the University of British Columbia. Lots of useful material.
- knitr demos. Documentation and examples for `knitr` by its author, Yihui Xie. There is also a knitr book covering the same ground in more detail.
- Rmarkdown documentation from the makers of RStudio. Lots of good examples.
- Plain Person's Guide. The git repository for this project.
- Karl Broman's Tutorials and Guides. Accurate and concise guides to many of the tools and topics described here, including getting started with reproducible research, using git and GitHub, and working with knitr.
- Makefiles for OCR and converting Shapefiles. Some further examples of `Makefiles` in the data-analysis pipeline, by Lincoln Mullen.
- Apple's Developer Tools. Unix toolchain. Install directly with `xcode-select --install`, or just try to use e.g. `git` from the terminal and have OS X prompt you to install the tools.
- Homebrew package manager. A convenient way to install several of the tools here, including Emacs and Pandoc.
- R. A platform for statistical computing.
- knitr. Reproducible plain-text documents from within R.
- Python and SciPy. Python is a general-purpose programming language increasingly used in data manipulation and analysis.
- RStudio. An IDE for R. The most straightforward way to get into using R and RMarkdown.
- TeX and LaTeX. A typesetting and document preparation system. You can write files in `.tex` format directly, but it is more useful to just have it available in the background for other tools to use. The MacTeX Distribution is the one to install for macOS.
- Pandoc. Converts plain-text documents to and from a wide variety of formats. Can be installed with Homebrew. Be sure to also install `pandoc-citeproc` for processing citations and bibliographies, and `pandoc-crossref` for producing cross-references and labels.
- Git. Version control system. Installs with Apple's Developer Tools, or get the latest version via Homebrew.
- GNU Make. You tell `make` what the steps are to create the pieces of a document or program. As you edit and change the various pieces, it automatically figures out which pieces need to be updated and recompiled, and issues the commands to do that. See Karl Broman's Minimal Make for a short introduction. Make will be installed automatically with Apple's developer tools.
- lintr and flycheck. Tools that nudge you to write neater code.
- Backblaze. Secure off-site backup.
- GitHub. Host public Git repositories for free. Pay to host private ones. Also a source for publicly available code (e.g. R packages and utilities) written by other people.
- Marked 2. Live HTML previewing of Markdown documents. Mac OS X only.
- Sublime Text. Python-based text editor.
- Zotero, Mendeley, and Papers are citation managers that incorporate PDF storage, annotation and other features. Zotero is free to use. Mendeley has a premium tier. Papers is a paid application after a trial period. I don't use these tools much, but that's not for any strong principled reason, mostly just inertia. If you use one and want to integrate with the material here, just make sure it can export to BibTeX/BibLaTeX files. Papers, which I've used most recently, can handily output citation keys in pandoc's format amongst several others.

Many of these websites provide an API for downloading data. We recommend using APIs to get the data.
- Canada GIS Data
- Canada Census Data
- University of Toronto Library Geospatial Data
- Toronto Open Data
- Toronto Police Department
- British Columbia Open Data
- Ontario Data Catalogue
- Public Health Ontario Open Data
- US Environmental Protection Agency EPA
- Weather and Climate Data NOAA
- North American Climate Model Data NCAR
- Natural Resources Canada Geospatial Data
- US Coastal Data Oceans
- Great Lakes Bathymetry
- Energy Data EIA
- UN Food and Agriculture Organization FAO
- UN Geospatial Hub
- NASA SEDAC
- NASA EarthData
- USA Data.gov Geospatial Data
- US Census Data National Historical Geographic Information System (NHGIS)
- NYU Geospatial Data Repository
- Google Earth Engine
- Google Dataset Search
- FiveThirtyEight Open Data
- World Bank Open Data
- Los Angeles City Data
- Los Angeles Crime Data
- NIH Cancer Surveillance
- World Health Organization WHO data
- US Centers for Disease Control and Prevention Data
- California Health and Human Services Open Data Portal
- Canada Covid Data CovidTracker
- UniProt Protein Data
- The Gene Ontology Project
- Twitter Developers API
- GitHub Developers API
- Instagram Developers API
- LinkedIn Developers API
- Zillow Developers API
- Spotify Developers API
- Figshare data Repository
- Zenodo data Repository
- Harvard Dataverse
- Elsevier Developers API
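
As a rough illustration of what "using an API to get data" can look like, here is a minimal Python sketch using only the standard library. The endpoint `https://example.org/api/records`, its `year` and `limit` parameters, and the helper names `build_url`, `fetch_json`, and `extract_fields` are all hypothetical placeholders; each site above documents its own URLs, parameters, and authentication, so check those docs before adapting this.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, params):
    """Append URL-encoded query parameters to an API endpoint."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Download a URL and decode its JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_fields(records, fields):
    """Keep only the requested fields from each record; missing ones become None."""
    return [{f: rec.get(f) for f in fields} for rec in records]


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://example.org/api/records", {"year": 2025, "limit": 100})

# Stand-in for what fetch_json(url) might return from a real API.
sample = [{"id": 1, "value": 3.2, "note": "a"}, {"id": 2, "value": 4.8}]
rows = extract_fields(sample, ["id", "value"])
```

Keeping URL construction, downloading, and field selection in separate functions makes the request reproducible: the exact query is recorded in code rather than in a sequence of manual clicks on a download page.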