Trees and forests, robust estimators for the 99%.
This is an interactive tutorial that will take about 60 minutes. By the end you will know:
- the basics of scikit-learn
- how to use Decision Trees and Random Forests
- how to use cross-validation to measure performance
- that there are many metrics by which to measure performance
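As a rough preview of what those points look like in code, here is a minimal sketch (not taken from the notebooks). It assumes a recent scikit-learn where cross-validation lives in `sklearn.model_selection`, and it uses a synthetic dataset from `make_classification` as a stand-in for the data used in the tutorial. The model settings (`max_depth=3`, `n_estimators=50`) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the notebooks use their own datasets).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# One single tree and one forest; hyperparameters are illustrative only.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score each model with 5-fold cross-validation under two different metrics.
scores = {}
for name, model in models.items():
    for metric in ("accuracy", "roc_auc"):
        cv_scores = cross_val_score(model, X, y, cv=5, scoring=metric)
        scores[(name, metric)] = cv_scores
        print(f"{name:>13} {metric:>8}: "
              f"{cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The same model can rank differently depending on the metric, which is why the tutorial looks at more than one.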
Shown at PyZurich July 2016.
You can either install Python on your computer and run these notebooks, or run them in the cloud by clicking the "binder" button below:
(The service is free, so it is sometimes unavailable, for example during maintenance.)
Anaconda is a Python distribution that is easy to install and contains a large number of commonly used libraries. Download Anaconda, clone this repository, and then from this directory run:
conda env create -n forests-intro -f environment.yml
This will create an environment with all the dependencies for these examples.
After setting up the dependencies, activate your conda environment with
source activate forests-intro
To run the examples, simply run
jupyter notebook
from a terminal in this directory.
Two very nice (and pretty) explanations of how decision trees and neural networks work:
How to get unbiased performance estimates: read this to find out why you need to keep some of your data secret and use it only once.
Gilles Louppe's well-written PhD thesis on Understanding Random Forests. Much more precise and formal than my descriptions.
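The advice about keeping some data secret and using it only once can be sketched roughly as follows. This is an illustration under assumptions, not code from the linked article or the notebooks: the data is synthetic, the 25% test fraction and the candidate `max_depth` values are arbitrary, and it assumes a recent scikit-learn with `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption, for illustration only).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=1
)

# Lock away 25% of the data. It is not touched during model selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Choose a hyperparameter using cross-validation on the development set only.
best_depth, best_cv = None, -1.0
for depth in (2, 4, 8):
    model = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=1
    )
    cv_accuracy = cross_val_score(model, X_dev, y_dev, cv=5).mean()
    if cv_accuracy > best_cv:
        best_depth, best_cv = depth, cv_accuracy

# Refit the chosen model and look at the secret test set exactly once.
final_model = RandomForestClassifier(
    n_estimators=50, max_depth=best_depth, random_state=1
)
final_model.fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(f"chosen max_depth={best_depth}, CV accuracy={best_cv:.3f}, "
      f"held-out accuracy={test_accuracy:.3f}")
```

The cross-validation score was used to pick `max_depth`, so it tends to be optimistic; the held-out score was not used for any decision, which is what makes it an unbiased estimate.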
Geneva's Humanitarian Big Data by Tim Head is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.