class: middle, center, title-slide
Lecture 1: Introduction and measuring performance
class: middle, center
(Chapter 1)
- Intro and Model Performance
- Tree-based models
- Neural Networks
- Feature engineering
- Embeddings and pretrained networks
- Interpretability
- Six lectures.
- 15 March, 22 March, 12 April, 19 April, 26 April, 3 May
- Always in this room.
- Six exercises, 1hr after each lecture.
- We will discuss the questions from the previous week’s lecture.
- One final project.
- Show that you learned something new.
- 10% attendance
- if you can't make it, excuse yourself by email by 8.50am
- 10% per homework
- contribution to the discussion
- does it run?
- did you make an effort?
- 30% final project
We will use Python.
- We will use today’s exercise to get you set up with what you need.
We will use GitHub.
- Create an account and a repository for the course if you don’t already have one; we can help you during the exercise today.
- Don’t end up with:
final_project_v3_reviewed_final_iteration3.pdf
You will be able to use modern machine-learning methods in your work.
You will understand the main ideas behind each technique and be ready to dive into the mathematical details.
.footnote[Josh Tenenbaum]
class: middle, center
class: middle, center
.larger[
- Supervised
- Unsupervised
- Reinforcement ]
.larger[
$$ (x_i, y_i) \sim p(x, y) \text{ i.i.d.}$$
$$ x_i \in \mathbb{R}^p$$
$$ y_i \in \mathbb{R}$$
]
.larger[
$$ x_i \sim p(x) \text{ i.i.d.}$$
]
Learn about the structure of $p(x)$.
We won't talk about this.
I will use $X$ for the features:
 | Temperature | Humidity | Precipitation | Comfortable |
---|---|---|---|---|
0 | 12°C | 74% | 3mm | yes |
1 | 24°C | 67% | 0mm | yes |
2 | 4°C | 91% | 13mm | no |
(...) | (...) | (...) | (...) | (...) |
Each row is a sample and each column a feature.
I will use $y$ for the labels:
 | Comfortable? |
---|---|
0 | yes |
1 | yes |
2 | no |
(...) | (...) |
For each sample we have one label.
For regression we will use a continuous target:
 | Sunshine minutes |
---|---|
0 | 14 |
1 | 28 |
2 | 2 |
(...) | (...) |
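As a concrete sketch of this layout in Python (the use of pandas and the variable names are assumptions; the values are taken from the tables above):

```python
import pandas as pd

# each row is a sample, each column a feature
X = pd.DataFrame({
    "Temperature": [12, 24, 4],    # °C
    "Humidity": [74, 67, 91],      # %
    "Precipitation": [3, 0, 13],   # mm
})

# one label per sample (classification) ...
y_class = pd.Series(["yes", "yes", "no"], name="Comfortable")
# ... or one continuous target per sample (regression)
y_reg = pd.Series([14, 28, 2], name="Sunshine minutes")
```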
class: middle, center
(Chapter 2)
What is the accuracy of this classifier?
We are interested in how well the classifier performs on data from the future.
Split off a fraction of the dataset at the start, use it to simulate the future.
Training data:
 | Feature 1 | Feature 2 |
---|---|---|
0 | 2.14 | 5.234 |
1 | 1.124 | 0.32 |
2 | -2.24 | 2.32 |
3 | -1.24 | 3.23 |
Testing data:
 | Feature 1 | Feature 2 |
---|---|---|
4 | 5.34 | 6.34 |
5 | 2.24 | -5.23 |
What is the accuracy of this classifier?
 | Train | Test |
---|---|---|
1 | 1.0 | 0.85 |
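A minimal sketch of how numbers like these could be computed (the classifier choice and the data names `X`, `y` are placeholders):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold out part of the data to simulate "data from the future"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier().fit(X_train, y_train)
print(knn.score(X_train, y_train))  # accuracy on the data used for fitting
print(knn.score(X_test, y_test))    # accuracy on the held-out data
```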
class: middle, center
Pro: easy, fast
Con: high variance, "wastes" data
Pro: stable estimate, better use of data
Con: slower
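The slower but more stable option here is cross-validation; a minimal sketch using scikit-learn's cross_val_score (classifier choice and data names are placeholders):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# fit and evaluate on five different train/test partitions
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean(), scores.std())
```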
class: middle, center
There are many strategies for splitting your data. Here are a few.
You can find several more in the scikit-learn documentation: http://scikit-learn.org/stable/modules/classes.html#splitter-classes
Makes sure class fractions in the full dataset are correctly represented in each split.
All samples from the same group are always in the same split (see the sketch below).
- Medical patients
- Satellite images
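A sketch of the two splitters above, assuming arrays `X`, `y` and a hypothetical `groups` array (one group label, e.g. a patient ID, per sample):

```python
from sklearn.model_selection import StratifiedKFold, GroupKFold

# keep the class fractions of y in every train/test split
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    ...

# never put samples from the same group in both train and test
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    ...
```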
What fraction of samples did the model predict correctly?
How many selected items are relevant?
How many relevant items are selected?
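In terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) for the positive class:

$$ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
$$ \text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} $$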
Special offer: Tim's credit card fraud detector. 99.99% accurate!
Special offer: Tim's credit card fraud detector. 99.99% accurate!
If your classes are (very) imbalanced then accuracy is not a good measure.
Tim's favourite baseline: the DummyClassifier
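A minimal sketch of such a baseline, assuming a train/test split into X_train, X_test, y_train, y_test (the strategy choice is illustrative):

```python
from sklearn.dummy import DummyClassifier

# always predicts the most frequent class seen in the training data
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # accuracy of the trivial baseline
```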
Works well for unbalanced classes and shows you what trade-offs you can make.
By default scikit-learn classifiers assign classes based on the midpoint of the possible classifier output (0.5). You can change this threshold to make different trade-offs.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y)
knn = KNeighborsClassifier().fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
precision recall f1-score support
False 0.86 0.86 0.86 51
True 0.86 0.86 0.86 49
avg / total 0.86 0.86 0.86 100
By default scikit-learn classifiers assign classes based on the midpoint of the possible classifier output (0.5). You can change this threshold to make different trade-offs.
X_train, X_test, y_train, y_test = train_test_split(X, y)
knn = KNeighborsClassifier().fit(X_train, y_train)
*y_pred = knn.predict_proba(X_test)[:, 1] > 0.8
print(classification_report(y_test, y_pred))
precision recall f1-score support
False 0.70 0.94 0.80 51
True 0.90 0.57 0.70 49
avg / total 0.80 0.76 0.75 100
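Rather than picking thresholds one at a time, scikit-learn's precision_recall_curve computes the precision/recall trade-off for every possible threshold; a minimal sketch reusing the fitted knn from above:

```python
from sklearn.metrics import precision_recall_curve

# precision and recall for every decision threshold
scores = knn.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)
```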
class: middle, center