Tips and Tricks
Based on previous students' experiences - please add yours!
How to get help from Udacity Staff regarding technical matters (tuition, grading, submissions): [email protected]
If you feel like you are being thrown in the deep end with data wrangling and exploratory analysis, we recommend you seriously consider taking a step back and pursuing the Data Analyst Nanodegree (DAND). If you aren’t drowning, but struggling, then consider the following individual courses:
- Intro to Data Analysis
- Rated beginner level, this course gives intuition about how to ask questions of your data and explore it, as well as NumPy and Pandas experience
- Intro to Data Science
- Rated intermediate level, this course takes you through the entire data science process: statistical analysis, data wrangling, exploratory and explanatory visualization, machine learning (basic), and even some MapReduce
- A note on these two courses: Intro to Data Analysis was created to replace Intro to Data Science in the DAND, with easier material and a more open-ended project in terms of the dataset. One suggestion is to select one of the two based on your comfort with data science (Intro to Data Analysis being the easier). Of course, you can always take both if you have the time
- Data Wrangling with MongoDB
- This incredibly practical course will get you comfortable writing data parsing scripts to handle data in CSV, JSON, XML, and HTML formats, as well as give you experience with MongoDB syntax for storing and querying data. Highly recommended, as this is one of the most practical skills you can have -- as they say, 80% of data science is cleaning the data. For reference, it took Nash (@nash) 4 months to clean his capstone dataset
We believe these courses will get you relatively up-to-speed on the data science process.
Because the video material for this Nanodegree (ND) is taken from multiple existing Udacity courses, it can seem jarring at times, or feel like material is missing. One great option is to simply watch each video series in its original format, as a Udacity course:
- Intro to Machine Learning - Katie and Sebastian give practical advice on sklearn; this is very useful for Projects 1-3.
- GA Tech Machine Learning - Charles and Michael talk theory, theory, and more theory; it's OK if you don't understand everything they say (really!). Useful for Projects 2-4.
- GA Tech Reinforcement Learning - deeper theoretical material (e.g. convergence, multi-agent game theory), useful for Project 4.
Remember in math class when you were first introduced to something like finding a derivative, and you had to do every single derivative by hand? Then you were shown the general, faster method that everyone uses - and having done it by hand helped you understand that method. This project is essentially implementing a decision tree by hand. Not in a machine learning sense - you are not coding the CART algorithm - but in the sense that you create a binary decision tree by programming a series of rules on a dataset, as in the sketch below. If you've never seen Jupyter before, it's a great introduction.
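To make that concrete, here is a minimal sketch of what "programming a series of rules" can look like. The toy table, the column names (Sex, Age, Survived), and the rules themselves are only illustrative stand-ins for the project's actual dataset:

```python
import pandas as pd

# A toy passenger table so the sketch runs on its own; in the project
# the data comes from a CSV instead.
data = pd.DataFrame({
    'Sex': ['female', 'male', 'male', 'female'],
    'Age': [29, 5, 40, 60],
    'Survived': [1, 1, 0, 1],
})

def predictions(data):
    """Hand-coded decision rules, applied row by row.

    Each if/else is a split you chose by eyeballing the data,
    not one learned by an algorithm like CART.
    """
    results = []
    for _, passenger in data.iterrows():
        if passenger['Sex'] == 'female':
            results.append(1)    # predict: survived
        elif passenger['Age'] < 10:
            results.append(1)    # predict: survived (young child)
        else:
            results.append(0)    # predict: did not survive
    return pd.Series(results)

pred = predictions(data)
print('accuracy on the toy table:', (pred == data['Survived']).mean())
```

Each extra elif you nest is another branch of the tree; the project's notebook asks you to keep refining the rules until the accuracy passes a threshold.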
This project is meant as a sort of warm-up: it teaches you best practices for model evaluation. Instead of first teaching you about algorithms, the course instructors decided to first teach the correct workflow for a machine learning problem. So if you don't understand how the algorithm works, don't worry; you aren't supposed to yet. You are supposed to learn correct model validation and setup. That means splitting your data into training and testing sets, looking at learning curves (to check for overfitting/underfitting), and finally doing cross-validation, as in the sketch below. The project may be easier than you think it should be, so if you are finding it too plain and simple and think you are missing something - you aren't.
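As a rough sketch of that workflow, the snippet below uses synthetic data as a stand-in for the project's housing dataset, and imports from scikit-learn's current model_selection module (older course videos import the same helpers from sklearn.cross_validation):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features.
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

# 1. Hold out a test set so the final evaluation uses unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# 2. Compare train vs. test score: a large gap suggests overfitting,
#    while two low scores suggest underfitting.
print('train R^2:', model.score(X_train, y_train))
print('test  R^2:', model.score(X_test, y_test))

# 3. Cross-validation averages several splits for a steadier estimate.
scores = cross_val_score(model, X_train, y_train, cv=5)
print('5-fold CV mean R^2:', scores.mean())
```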
In this project you need to implement different algorithms and test them against each other. In its current form (8/16/2016) the dataset is quite small, which makes truly meaningful comparisons difficult. Still, you can get value out of the process and learn to compare and contrast algorithms.
- People struggle to get an “intuition” as to what algorithms to choose and why
- This is totally normal and expected. Don’t sweat it, just pick something.
- The dataset is so small that re-running randomized train-test splits will cause big variations in your results. It's highly recommended to set random_state=0 (or any integer - just keep it the same) so you can test with the same data split for comparisons; see the sketch after this list.
- Alternatively, make multiple runs (>5) with each classifier and average the results
- People struggle to answer this question: What makes this model a good candidate for the problem, given what you know about the data?
- Most people find it very difficult to know anything about the data. This question makes it sound like there is an easy way to know something; it's really not that simple, but do your best.
- Udacity forum thread: selecting models based on data
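Here is a minimal sketch of both suggestions, with synthetic data and a simple classifier standing in for the project's dataset and whichever models you pick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the (small) student intervention dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = GaussianNB()

# Option 1: fix random_state so every rerun sees the identical split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf.fit(X_train, y_train)
print('score on the fixed split:', clf.score(X_test, y_test))

# Option 2: average over several randomized splits to smooth out the
# variance a small dataset causes.
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))
print('mean score over 10 splits:', np.mean(scores))
```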
If you are looking to extract the core information from the lessons, without getting bogged down in high level mathematical theory, we recommend the following:
Watch the supervised learning section of "Intro to Machine Learning" (units 1-4) first to get an overview, then the supervised learning section of "GA Tech Machine Learning" for the in-depth theory of what you're actually doing. You'll have an added advantage if you watch the videos from weeks 1, 2, 3, and 7 of Coursera's Machine Learning course taught by Andrew Ng.
Finally, the PDFs found under the resources section are all really good!
This project covers unsupervised learning, including clustering, PCA, and ICA. The material in the unsupervised learning section of "GA Tech Machine Learning" should be sufficient. The randomized optimization material won't come up, so it's not necessary for the project; neither is information theory. Week 8 in Andrew Ng's course will help you understand the math better. A bare-bones sketch of the PCA-and-clustering workflow follows the note below.
- The renders.py library is a custom file; you can read the code if you like (it's in the zip), but there is no documentation
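For orientation, here is that bare-bones sketch of the PCA-then-cluster workflow; the blob data, the two components, and the three clusters are placeholder choices that the project asks you to make (and justify) on the real customer data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer spending data.
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=0)

# Scale first: PCA directions are otherwise dominated by whichever
# features happen to have the largest variance.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to two components and check how much variance survives.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print('explained variance ratio:', pca.explained_variance_ratio_)

# Cluster in the reduced space.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)
print('cluster sizes:', np.bincount(labels))
```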
This project is quite different from the others in that it’s not in an IPython Notebook and it’s got a lot of code, which is quite confusing at first. It’s a challenge for a lot of students to figure out “what to do”, so we highly suggest reading up in the forums. A lot of questions have been asked and answered there.
Here is a basic overview:
You are going to implement a basic Q-learning algorithm using tables. The goal of the Q-learning implementation is to create an agent that follows the rules of the road. The project should be called "Training a smartcab to drive safely", because you don't need to give the cab instructions on where to go! That's handled by the project's "planner".
The most important unit in "GA Tech Machine Learning" for this section is, by far, "Reinforcement Learning, Lesson 2: Reinforcement Learning". Game theory is not involved (but it's still excellent material). Use the papers in the reading material section to get an idea of how you could implement Q-learning; a bare-bones sketch also follows below.
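If the update rule itself is the sticking point, here is a self-contained toy: tabular Q-learning on a single traffic light. The states, actions, and rewards are made up; in the project they come from the simulator (light and traffic for the state, (None, 'forward', 'left', 'right') for the actions):

```python
import random
from collections import defaultdict

actions = ['stay', 'go']
Q = defaultdict(float)                 # Q[(state, action)] -> value estimate
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

def choose_action(state):
    """Epsilon-greedy: mostly exploit the table, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def learn(state, action, reward, next_state):
    """The core Q-learning update."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def step(state, action):
    """Toy environment: going on green and staying on red are rewarded."""
    good = (state, action) in [('green', 'go'), ('red', 'stay')]
    return (1.0 if good else -1.0), random.choice(['red', 'green'])

state = 'green'
for _ in range(1000):
    action = choose_action(state)
    reward, next_state = step(state, action)
    learn(state, action, reward, next_state)
    state = next_state

print({k: round(v, 2) for k, v in Q.items()})
```

After a thousand steps, the table should clearly prefer 'go' on green and 'stay' on red; the project version is the same loop with a richer state and the simulator's rewards.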
See if any of the following seem like something you'd want to try; not everyone learns the same way.
- If you are having trouble following along, watch all the videos twice: the first time at 1.5-2x speed, just to pick up the key points and highlights, then all the way through again at normal speed, taking thorough notes
- If a question seems easy, it is! Don’t stress about it.
- You might have some trouble getting pygame working; if that's the case, search the forums and ping the Slack group if you are stuck.
- Get Anaconda; it's just way easier (though pygame might still be an issue)
- You have unlimited resubmissions, so don't sweat some feedback.
- The forums are a great resource, use them!
- If you can't find an answer on the forums or you want to talk to someone in real-time, don’t be shy to ask on Slack. We’ll have a real-time conversation, and chances are you’ll get a bunch of people involved (which is good, cause it’s fun).
Excellent Material on how to organize your data-science project
- Introduction to Statistical Learning, Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, available for free here
- Intended as a companion piece to The Elements of Statistical Learning, Hastie, Tibshirani, Friedman
- Read Intro first. Even if you have a PhD in statistics.
- Machine Learning, Tom Mitchell website here
- Artificial Intelligence, Stuart Russell and Peter Norvig, website here
- Linear Algebra Review and Reference, Zico Kolter here
- Stanford CS 229 course, Machine Learning, here
- Machine Learning for Audio, Image, and Video Analysis, Camastra, Vinciarelli
- Pattern Recognition and Machine Learning, Bishop
- Python 3 Text Processing with NLTK 3 Cookbook, Perkins
- Thoughtful Machine Learning, Kirk
- A free source for math textbooks
- An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, Shewchuk
- Computational Science and Engineering, Gilbert Strang
- Concrete Mathematics, Graham, Knuth, Patashnik
- Introduction to Linear Algebra, Gilbert Strang
- My favorite book of all time, @joshuacook (seconded, @nash)
- Ask @joshuacook about why linear algebra is the bee’s knees and how to study it
- Introduction to Mathematical Statistics, Hogg
- Linear Algebra Done Right, Axler
- Linear Algebra Done Wrong, Treil
- Linearity, Symmetry, and Prediction in the Hydrogen Atom, Singer
- Pearls in Graph Theory, Hartsfield, Ringel
- A Guide to NumPy, Travis Oliphant, free here
- Travis Oliphant wrote NumPy and now runs Continuum Analytics
- All the Mathematics You Missed But Need to Know for Graduate School, Garrity
- Becoming a Better Programmer, Pete Goodliffe
- Data Structures and Algorithms in Python, Goodrich, Tamassia, Goldwasser
- Effective Computation in Physics, Scopatz, Huff
- Expert Python Programming, Ziade
- Flask Web Development, Grinberg
- Fluent Python, Ramalho
- Functional Python Programming, Lott
- Introduction to Computing Using Python, Perkovic
- IPython Interactive Computing and Visualization Cookbook, Rossant
- IPython Notebook is now called Jupyter. Still a great reference.
- Learn Python the Hard Way, Zed Shaw
- Numerical Python, Johansson
- NumPy Cookbook, Idris
- Python Scripting for Computational Science, Langtangen
- Python 3 Object Oriented Programming, Phillips
- SciPy and NumPy, Bressert
- scikit-learn Cookbook, Hauck
- Structure and Interpretation of Computer Programs, Abelson, Sussman
- The Pragmatic Programmer, Hunt, Thomas
- The TeXbook, Knuth
A series of books by Allen Downey: an excellent introduction to and exploration of Python topics, covering both the Python data model and mathematical Python.
- Think Python (free)
- Think Bayes
- Think Stats
- Think Complexity
- Think DSP
Jupyter is extremely well documented. Their installation instructions are here.
If you are feeling adventurous, running Jupyter via Docker is not only very well supported, it might even be easier in the long run. Here is how to run Jupyter in Docker:
- Install Docker - instructions for Mac, Windows, and Linux.
- From the command line, type:
$ docker run -p 8888:8888 jupyter/scipy-notebook
- When the above command finishes downloading the jupyter/scipy-notebook image from Docker Hub, the notebook server should start and be running (usually at localhost:8888). Enter that URL in your browser. Note the -p 8888:8888 flag: it publishes the container's port so that localhost:8888 actually reaches the server.
Please add your name below if you added material to this document.
- Nash Taylor
- Devon Muraoka
- Bharat Ramanathan
- Joshua Cook
- Gilad Gressel