https://www.coursera.org/specializations/machine-learning
Data -> ML Method -> Intelligence
Use various ML methods in different case studies:
- Regression (Case Study 1): Predicting house prices
- Classification (Case Study 2): Sentiment analysis
- Clustering (Case Study 3): Document retrieval
- Matrix Factorization (Case Study 4): Product recommendation
- Deep Learning (Case Study 5): Visual product recommender
Wk 2 Regression
- Case study: Predicting house prices
- Models
- linear regression
- Regularization: Ridge (L2), Lasso (L1)
- Algorithms
- Gradient descent
- Coordinate descent
- Concepts
- Loss functions
- bias-variance tradeoff
- cross-validation
- sparsity
- overfitting
- model selection
Wk 3 Classification
- Case study: Analyzing sentiment
- Models:
- Linear classifiers (logistic regression, SVMs, perceptron)
- Kernels
- Decision trees
- Algorithms
- Stochastic gradient descent
- Boosting
- Concepts
- Decision boundaries
- maximum likelihood estimation (MLE), ensemble methods
- random forests
- CART
- online learning
Wk 4 Clustering & Retrieval
- Case study: Finding documents
- Models
- Nearest neighbors
- Clustering, mixtures of Gaussians
- Latent Dirichlet allocation (LDA)
- Algorithms
- KD-trees, locality-sensitive hashing (LSH)
- K-means
- Expectation-maximization (EM)
- Concepts
- Distance metrics
- approximation algorithms
- hashing
- sampling algorithms
- scaling up with map-reduce
Wk 5 Matrix Factorization & Dimensionality Reduction
- Case study: Recommending Products
- Models:
- Collaborative filtering
- Matrix factorization
- PCA
- Algorithms
- Coordinate descent
- Eigen decomposition
- SVD Algorithms
- Concepts
- Matrix completion
- eigenvalues
- random projections
- cold-start problem
- diversity
- scaling up
Wk 6 Capstone: An intelligent application using deep learning
Install Anaconda, GraphLab Create, and IPython Notebook.
To run notebooks once the above is installed:

```
source activate gl-env
conda update pip
jupyter-notebook --no-browser
```
[First Notebook](week_1/Getting started with iPython Notebook.html)
- [source](week_1/Getting started with iPython Notebook.ipynb)
Basic types:
```python
i = 4       # int
f = 4.1     # float
b = True    # boolean
type(i)     # int
```
Advanced types:
```python
l = [3, 1, 2]                               # list
d = {'foo': 1, 'bar': 2.3, 's': 'string'}   # dictionary
print d['foo']
n = None    # null
type(n)     # NoneType
```
Advanced printing:
```python
print "Our float value is %s. Our int value is %s." % (f, i)
```
Conditional statements:
```python
if i == 1 and f > 4:
    print "i == 1 and f > 4"
elif i > 4 or f > 4:
    print "(i or f) > 4"
else:
    print "(i and f) <= 4"
```
Loops:
```python
print l
for e in l:
    print e

counter = 6
while counter < 10:
    print counter
    counter += 1
```
Functions:
```python
def add2(x):
    return x + 2
```
Lambdas:
```python
square = lambda x: x * x
square(3)   # 9
```
Machine learning library: scikit-learn. Data manipulation tool: Pandas.
The tools above have a learning curve, so this course uses GraphLab Create, which includes SFrame.
```python
import graphlab

sf = graphlab.SFrame('people-example.csv')   # load a tabular data set
sf.tail()                                    # view the end of the table
sf.show()                                    # visualize any data structure in GraphLab Create
sf['age'].show(view='Categorical')           # categorical view
```
Some simple columnar operations:
```python
sf['age'].mean()
sf['age'].max()
```
Create new columns in our SFrame:
```python
sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']
```
Use the apply function to do an advanced transformation of our data:
```python
def transform_country(country):
    if country == 'USA':
        return 'United States'
    else:
        return country

transform_country('USA')
sf['Country'].apply(transform_country)
```
This week you will build your first intelligent application that makes predictions of house prices from data.
Create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,...).
Want to list my house
- Compare houses in neighborhood
- Focus on similar houses: sq ft, bedrooms, etc
Plot a graph
- X = sq. ft.
- Y = Price
Terminology:
- x = feature, covariate, or predictor
- y = observation or response
Note that no house on the graph will have exactly the same sq. ft. as yours; and if you only include similar houses, you discard the rest of the data on the graph.
Fit a line through the data: f(x) = w0 + w1 x
- w0 = intercept
- w1 = slope
- f(x) is parameterized by (w0, w1)
RSS := Residual sum of squares
- draw a line and sum the squared distances of the points from the line:
RSS(w0, w1) = ($house1 - [w0 + w1 sq.ft.(house1)])^2 + ($house2 - [w0 + w1 sq.ft.(house2)])^2 + ($house3 - [w0 + w1 sq.ft.(house3)])^2 + ...
The best line is the one that minimizes this cost over all possible (w0, w1).
Ŵ = (ŵ0, ŵ1) ("W hat", the minimizing parameters)
fŵ(x) = ŵ0 + ŵ1 x
Best guess of your house price: ŷ = ŵ0 + ŵ1 (sq. ft. of your house); a small sketch of this fit follows.
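A minimal numpy sketch of this: it computes (ŵ0, ŵ1) in closed form by minimizing RSS; the (sq. ft., price) pairs are made up purely for illustration, not from the course data.

```python
import numpy as np

# made-up (sq. ft., price) pairs, purely for illustration
sqft = np.array([1000., 1500., 2000., 2500., 3000.])
price = np.array([300e3, 410e3, 510e3, 590e3, 720e3])

# closed-form least squares for f(x) = w0 + w1 * x (minimizes RSS)
w1 = ((sqft - sqft.mean()) * (price - price.mean())).sum() / ((sqft - sqft.mean()) ** 2).sum()
w0 = price.mean() - w1 * sqft.mean()

rss = ((price - (w0 + w1 * sqft)) ** 2).sum()
yhat = w0 + w1 * 2200   # best guess for a 2200 sq. ft. house
print("w0=%.1f, w1=%.2f, RSS=%.3g, prediction=%.0f" % (w0, w1, rss, yhat))
```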
But what if it's not a linear relationship? It could be quadratic:
fw(x) = w0 + w1 x + w2 x^2
Note that we square x, but not w.
We can apply even higher order polynomials to reduce RSS further
An example of an even higher order polynomial that may not be what you want :)
Based on the last example, we can overfit to the point that it's not generalizable to new data
We want good predictions, but can't observe the future. We can simulate prediction:
- test set: remove some houses
- training set: fit the model on the remaining houses
- predict the test set
Training error(ŵ) = ($train1 - fŵ(sq.ft. train1))^2 + ($train2 - fŵ(sq.ft. train2))^2 + ($train3 - fŵ(sq.ft. train3))^2 + ...
Test error(ŵ) = ($test1 - fŵ(sq.ft. test1))^2 + ($test2 - fŵ(sq.ft. test2))^2 + ($test3 - fŵ(sq.ft. test3))^2 + ...
Training error decreases with increasing model order.
Test error decreases up to a point, but then starts increasing. A small simulation of this is sketched below.
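A small synthetic simulation (data and model orders are made up for illustration): training error typically keeps falling as the polynomial order grows, while test error eventually turns back up.

```python
import numpy as np

np.random.seed(0)
x = np.random.uniform(0, 1, 30)
y = np.sin(4 * x) + np.random.normal(0, 0.3, 30)   # synthetic "prices"
x_train, y_train = x[:20], y[:20]                  # fit on these houses
x_test,  y_test  = x[20:], y[20:]                  # held-out houses

for order in (1, 3, 5, 10):
    w = np.polyfit(x_train, y_train, order)        # minimize training RSS
    train_err = ((np.polyval(w, x_train) - y_train) ** 2).sum()
    test_err  = ((np.polyval(w, x_test)  - y_test)  ** 2).sum()
    print("order %2d: training error %.3f, test error %.3f" % (order, train_err, test_err))
```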
What if we need to add additional variables, e.g. # bathrooms?
Each new variable is a new dimension, so adding bathrooms makes it a 3D graph.
- calculate a hyperplane in the cube
More multiple-regression examples:
- Salary after the ML specialization: ŷ = ŵ0 + ŵ1 performance + ŵ2 capstone + ŵ3 forum
- Stock prediction: depends on recent prices, news events, related commodities
- Tweet popularity: # followers, # followers of followers, popularity of hashtag
- Describe the input features and output real-valued predictions of a regression model
- Calculate a goodness-of-fit metric (e.g., RSS)
- Estimate model parameters by minimizing RSS (algorithms to come...)
- Exploit the estimated model to form predictions
- Perform a training/test split of the data
- Analyze performance of various regression models in terms of test error
- Use test error to avoid overfitting when selecting amongst candidate models
- Describe a regression model using multiple features
Quiz answers: 1, 2, 3, 3, (x 2,3) 4, 4, (x 2,3) 4, 3, 2
See Explore the Quadratic Equation to see the effect of the coefficients
Generate a scatter plot:
```python
sales.show(view="Scatter Plot", x="sqft_living", y="price")
```
- you can hover over individual points to explore further
Splitting the data into training and test sets:
```python
train_data, test_data = sales.random_split(.8, seed=0)
```
- `random_split` with 0.8 gives an 80% training / 20% test split
- `seed` is fixed to 0 here for reproducibility; normally use a random seed or let GraphLab pick one for you
Learning a simple regression model to predict house prices from house size:
```python
sqft_model = graphlab.linear_regression.create(train_data,
                                               target='price',
                                               features=['sqft_living'],
                                               validation_set=None)
```
- note the default algorithm used is Newton's method

Evaluating error (RMSE) of the simple model:
```python
print sqft_model.evaluate(test_data)
```
- `max_error` reflects the worst outlier; the RMSE is also shown
Visualizing predictions of the simple model with Matplotlib:
```python
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')
```
- the blue `.` points are the actual data (scatter plot)
- the green `-` line is the value predicted from test_data

Inspecting the learned model coefficients:
```python
sqft_model.get('coefficients')
```
- (intercept) = where the line crosses the y axis
- sqft_living ≈ the average cost of a house per sq. ft. according to this regression model
View other features of a house we might be interested in:
```python
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
sales[my_features].show()
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')
```
- the `BoxWhisker Plot` view shows the data split by a feature (zipcode in this case)

Based on the data, we see that other features such as zip code and # bedrooms make a difference in the estimated price of a home.
```python
my_features_model = graphlab.linear_regression.create(train_data,
                                                      target='price',
                                                      features=my_features,
                                                      validation_set=None)
print my_features   # view which features are included
```
Compare the original model to the expanded-features model:
```python
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
```
- note that the RMSE is lower with the additional features
Applying learned models to predict the price of an average house:
```python
house1 = sales[sales['id'] == '5309101200']   # find a particular house by id
house1                                        # view the data for this house
print house1['price']                         # the actual price from the data
print sqft_model.predict(house1)              # the price our sqft model predicted
print my_features_model.predict(house1)      # prediction with the expanded features
```
:caution: the sqft-only model was more accurate than the expanded-feature model in this case
Applying learned models to predict the price of two fancy houses:
```python
house2 = sales[sales['id'] == '1925069082']
```
- this house was not predicted well by our model because of an uncaptured feature: it is on the waterfront
- here the expanded-feature prediction is closer than the original sqft model's

The final example is Bill Gates' house. We don't have data on the actual price, but the model gives us a prediction, which is probably on the low end.
- Selection and summary statistics: In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price.
Ans:
```python
houses = sales[sales['zipcode'] == '98039']
houses['price'].mean()   # 2160606.5999999996
```
- Filtering data: One of the key features we used in our model was the number of square feet of living space (sqft_living) in the house. For this part, we are going to use the idea of filtering (selecting) data. In particular, we are going to use logical filters to select rows of an SFrame. You can find more info in the Logical Filter section of this documentation. Using such filters, first select the houses that have sqft_living higher than 2000 sqft but no larger than 4000 sqft. What fraction of all houses have sqft_living in this range? Save this result to answer the quiz at the end.
Ans:
```python
100.0 * sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)].num_rows() / sales.num_rows()
# 42.187572294452416
```
- Building a regression model with several more features: In the sample notebook, we built two regression models to predict house prices, one using just sqft_living and the other using a few more features, which we called:
```python
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
```
Now, going back to the original dataset, you will build a model using the following features:
```python
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',        # condition of house
    'grade',            # measure of quality of construction
    'waterfront',       # waterfront property
    'view',             # type of view
    'sqft_above',       # square feet above ground
    'sqft_basement',    # square feet in basement
    'yr_built',         # the year built
    'yr_renovated',     # the year renovated
    'lat', 'long',      # the lat-long of the parcel
    'sqft_living15',    # average sq. ft. of 15 nearest neighbors
    'sqft_lot15',       # average lot size of 15 nearest neighbors
]
```
Compute the RMSE (root mean squared error) on the test_data for the model using just my_features, and for the one using advanced_features.
Ans:
```python
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition', 'grade', 'waterfront', 'view', 'sqft_above', 'sqft_basement',
    'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15',
]

train_data, test_data = sales.random_split(.8, seed=0)
f_model = graphlab.linear_regression.create(train_data, target='price',
                                            features=my_features, validation_set=None)
af_model = graphlab.linear_regression.create(train_data, target='price',
                                             features=advanced_features, validation_set=None)
print "RMSE diff = " + str(f_model.evaluate(test_data)['rmse'] - af_model.evaluate(test_data)['rmse'])
# RMSE diff = 22711.3165108
```
Rating with stars is too simple
Understand aspects of restaurant review
- build a restaurant review app
- categories for review e.g. experience, ramen, sushi
- Break all reviews into sentences in a Sentence Sentiment Classifier
- Average the predictions
- Display the most positive or negative reviews
Input (sentence) -> Classifier -> Output (predicted rating)
Examples
- Webpage classification by category, e.g. education, finance, technology
- Spam filtering: checks sender, text, IP address, etc.
- Image classification
- Personalized medical diagnosis
- Reading your mind via fMRI
simple threshold classifier
- list of +ve words: great, awesome, etc
- list of -ve words: bad, terrible, etc
if # +ve words > # -ve words => ŷ = +ve, else ŷ = -ve
- e.g. "Great sushi, awesome food, but terrible service" => 2 +ve words, 1 -ve word => ŷ = +ve
Problems with threshold classifier
- populating initial +ve and -ve word lists
- words have degrees of sentiment: e.g. "great" > "good"
- single words are not enough: "good" vs "not good"
The first two can be addressed by learning a classifier.
The third issue can be addressed by more elaborate features.
Word | Weight |
---|---|
great | 1.5 |
awesome | 1.2 |
bad | -1.0 |
terrible | -2.1 |
awful | -3.3 |
restaurant, the, we, where, ... | 0.0 |
e.g. "Sushi was great, the food was awesome, but the service was terrible"
score = 1.2 (great) + 1.7 (awesome) - 2.1 (terrible) = 0.8
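A minimal sketch of this weighted scoring in plain Python; the earlier threshold classifier is the special case where every listed word has weight +1 or -1.

```python
# weights from the table above; any word not listed gets weight 0.0
weights = {'great': 1.5, 'awesome': 1.2, 'bad': -1.0,
           'terrible': -2.1, 'awful': -3.3}

def score(sentence):
    # sum the learned weight of every word in the sentence
    return sum(weights.get(w, 0.0) for w in sentence.lower().replace(',', '').split())

s = score("Sushi was great, the food was awesome, but the service was terrible")
print(s)   # 1.5 + 1.2 - 2.1 = 0.6 -> predict y = +ve since score > 0
```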
We can plot sentences by their word counts, e.g. with weights awesome = 1.0 and awful = -1.5. (Sketch: x axis = # awesome, y axis = # awful; the sentence "this ambiance is awesome but service is awful and food is awful" sits at (1, 2), and the decision boundary 1.0 #awesome - 1.5 #awful = 0 is a line through this plane.)
For linear classifiers, the decision boundary is:
- a line when 2 weights are non-zero
- a plane when 3 weights are non-zero
- a hyperplane when many weights are non-zero
Training a classifier = learning the weights
- split the data into training and test sets
- the training set is passed to the learning algorithm to learn the weights of words
- the test set is used to evaluate error & accuracy

Test example: "Sushi was great"
- feed the sentence to the learned classifier
- it predicts ŷ = +ve
- pass in multiple test cases and count correct predictions vs. mistakes
Classification error & accuracy
Error = (# mistakes) / (total # sentences)
- best possible value = 0.0
Accuracy = (# correct) / (total # sentences)
- best possible value = 1.0
error = 1 - accuracy
Purely random guessing on binary classification gives accuracy = 0.5.
For k classes, random guessing gives accuracy = 1/k.
At a minimum, we should beat random guessing.
This can be counterintuitive e.g.
- "90% email is spam"
- if we predict 100% email is spam we get 90% accuracy
Questions to ask:
- Is there class imbalance?
- How does it compare to a simple, baseline approach?
- Random guessing
- Majority class
- ...
- Most importantly: what accuracy does my application need?
- What is good enough for my users' experience?
- What is the impact of the mistakes we make?
A confusion matrix (rows = true label, columns = predicted label):

| True \ Predicted | +ve | -ve |
|---|---|---|
| +ve | True +ve | False -ve |
| -ve | False +ve | True -ve |
- True +ve & True -ve is good, we got it right
- FN (False -ve) & FP have different impacts
Examples across domains:

| | Spam filtering | Medical diagnosis |
|---|---|---|
| False -ve | Annoying | Disease untreated |
| False +ve | Email lost | Wasteful treatment |
Given 100 test examples, a possible confusion matrix for spam filtering (rows = true label, columns = predicted):

| | +ve | -ve |
|---|---|---|
| +ve | (50) | 10 |
| -ve | 5 | (35) |

Accuracy = 85/100 = 0.85; note there are more false -ves (10) than false +ves (5)
Multiclass classification example with 100 test examples (rows = true label with totals, columns = predicted):

| | Healthy | Cold | Flu |
|---|---|---|---|
| Healthy (70) | (60) | 8 | 2 |
| Cold (20) | 4 | (12) | 4 |
| Flu (10) | 2 | 0 | (8) |

Accuracy = (60 + 12 + 8)/100 = 80/100 = 0.8
More data is good but data quality is more important
Theoretical techniques sometimes bound how much data is needed
- provide guidance but not as practical
In practice
- more complex models require more data
- empirical analysis can provide guidance
Learning curves: generally, the more training data we have, the lower the test error.
(Sketch: test error on the y axis falls as the amount of training data on the x axis grows, flattening toward a floor.)
Bias of the model: even with infinite data, the curve never reaches 0.
More complex models tend to have less bias.
- a sentiment classifier on single words does OK
- but some sentences are just too hard for it, e.g. "The sushi was not good"
- we can then score on pairs of words (a bigram model)
Even bigram models have bias; the curve looks better (drops faster) but never hits 0.
Classifiers can provide a confidence level P(y|x)
- e.g. "the sushi & everything else were awesome"
P(y=+|x) = 0.99
- "The sushi was good, the service was OK"
P(y=+|x) = 0.55
- less confident that this is a +ve review
Sentiment-classifier ML block diagram:
- Training data: sentences with sentiment labels
- x = word counts
- ŷ = predicted sentiment
- ŵ = weights for each word
- y = sentiment label
- compare y with ŷ to get a quality metric that is fed into the ML algorithm
- the ML algorithm updates ŵ
- Identify a classification problem and some common applications
- Describe decision boundaries and linear classifiers
- Train a classifier
- Measure its error
- Some rules of thumb for good accuracy
- Interpret the types of error associated with classification
- Describe the tradeoffs between model bias and data set size
- Use class probability to express degree of confidence in prediction
1 The simple threshold classifier for sentiment analysis described in the video (check all that apply):
- Must have pre-defined positive and negative attributes
- Must either count attributes equally or pre-define weights on attributes
- Defines a possibly non-linear decision boundary
Ans: 1
2 For a linear classifier classifying between positive and negative sentiment in a review x, Score(x) = 0 implies (check all that apply):
- The review is very clearly negative
- We are uncertain whether the review is positive or negative
- We need to retrain our classifier because an error has occurred
Ans: 2
3 For which of the following datasets would a linear classifier perform perfectly?
x 1,2,3
4 True or false: High classification accuracy always indicates a good classifier.
false
5 True or false: For a classifier classifying between 5 classes, there always exists a classifier with accuracy greater than 0.18.
true
6 True or false: A false negative is always worse than a false positive.
false
7 Which of the following statements are true? (Check all that apply)
- Test error tends to decrease with more training data until a point, and then does not change (i.e., curve flattens out)
- Test error always goes to 0 with an unboundedly large training dataset
- Test error is never a function of the amount of training data
Ans: 1
Analyze the text:
```python
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
```
Train a classifier using logistic regression:
```python
train_data, test_data = products.random_split(.8, seed=0)
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target='sentiment',
                                                      features=['word_count'],
                                                      validation_set=test_data)
```
Evaluate the sentiment model using an ROC curve:
```python
sentiment_model.evaluate(test_data, metric='roc_curve')
sentiment_model.show(view='Evaluation')
```
- Groups of related articles: clusters
Given an article, how to find similar articles?
- measure similarity
- search over similar articles
Most popular model: bag of words model
- ignore order of words
- count # of instances of each word in vocabulary
Create a count vector, e.g. for "Carlos calls the sport futbol" vs "Emily calls the sport soccer" (a minimal sketch of building such counts follows the table):
word | count |
---|---|
carlos | 1 |
the | 2 |
tree | 0 |
calls | 2 |
sport | 2 |
cat | 0 |
futbol | 1 |
dog | 0 |
soccer | 1 |
emily | 1 |
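A tiny sketch of building such counts in plain Python (not the course's GraphLab count_words, just an illustration with collections.Counter):

```python
from collections import Counter

def bag_of_words(doc):
    # ignore word order; count instances of each word
    return Counter(doc.lower().split())

print(bag_of_words("Carlos calls the sport futbol"))
# -> a Counter with carlos, calls, the, sport, futbol each counted once
```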
Measuring similarity: compare the following 2
- article on messi: 1 0 0 0 5 3 0 0 1 0 0 0 0
- article on pele: 3 0 0 0 2 0 0 1 0 1 0 0 0
- similarity = (1 * 3) + (5 * 2) = 13
Another example
- article on messi: 1 0 0 0 5 3 0 0 1 0 0 0 0
- article on africa: 0 0 1 0 0 0 9 0 0 6 0 4 0
- similarity = 0
Issue, doc length: the longer the documents, the higher the similarity score
- messi: 1 0 0 0 5 3 0 0 1 0 0 0 0 doubled 2 0 0 0 10 6 0 0 2 0 0 0 0
- pele: 3 0 0 0 2 0 0 1 0 1 0 0 0 doubled 6 0 0 0 4 0 0 2 0 2 0 0 0
- original similarity = 13 vs doubled = 52
Solution: normalize each vector by its norm, sqrt(sum(count_i^2))
- messi: 1 0 0 0 5 3 0 0 1 0 0 0 0
- norm = sqrt(1^2 + 5^2 + 3^2 + 1^2) = sqrt(36) = 6
- normalized messi = each count divided by 6: 1/6 0 0 0 5/6 3/6 0 0 1/6 0 0 0 0
- common words in a doc dominate: the, player, field, goal
- these dominate rare words like: futbol, Messi
Fix: discount word w based on the # of docs containing w in the corpus.
But we don't want only rare words to dominate either.
What characterizes an important word?
- appears frequently in the document (common locally)
- appears rarely in the corpus (rare globally)
Trade off local frequency against global rarity.
TF-IDF := Term frequency * Inverse document frequency
- tf: word counts within the document
- idf: log(# docs / (1 + # docs using word))
- a word in many docs: log(large / (1 + large)) ≈ log 1 = 0
- a rare word: log(large # / (1 + small #)) => a large number
e.g. a doc with "the" showing up 1000 times and "messi" showing up 5 times:
- assuming 64 docs, with "the" showing up in 63 of them
- and "messi" showing up in 3 docs
- TF: {the: 1000, messi: 5}
- IDF (log base 2): {the: log(64/(1+63)) = 0, messi: log(64/(1+3)) = log(16) = 4}
- TF-IDF (TF * IDF) "the" = 1000 * 0 = 0
- TF-IDF (TF * IDF) "messi" = 5 * 4 = 20 (a small sketch of this computation follows)
Document retrieval:
- query article: the article currently being read
- corpus: the entire library
- specify: a distance metric
- output: the set of most similar articles
Algorithm (1-nearest neighbor):
- search over each article in the corpus
- compute s = similarity(query, article)
- if s > Best_s, record this article as the new best

k-nearest neighbors:
- return a list of the k most similar articles (a brute-force sketch is below)
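A brute-force sketch of the search above, assuming each article is already represented as a (normalized) tf-idf vector and similarity is the dot product:

```python
import numpy as np

def k_nearest(query_vec, corpus, k=5):
    # corpus: dict of article id -> (normalized) tf-idf vector
    # similarity: dot product, as in the examples above
    sims = [(doc_id, float(np.dot(query_vec, vec))) for doc_id, vec in corpus.items()]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:k]   # the k most similar articles
```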
Discover groups (clusters) of related articles.
- Structure documents by topic: e.g. sports, world news, etc.
- assuming we are provided with labels
Multiclass classification problem:
- how do I classify a new article based on clusters?
- this is an example of a supervised learning problem
Clustering: no labels provided & want to uncover cluster structure
- Input: docs as vectors [w1, w2]
- Output: cluster labels
- once articles are clustered, you can post-facto provide a meaningful label e.g. "sport"
Defining a cluster
- center
- shape or spread
To assign a new article check the shape of the cluster and center
Another approach is to only look at distance from center
Assuming similarity metric is distance from center
0. Initialize cluster centers
1. Assign observations to the closest cluster center (partitioning the space into a Voronoi tessellation of regions)
2. Revise cluster centers as the mean of the assigned observations
3. Repeat 1-2 until convergence (see the sketch below)
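A compact numpy sketch of these steps (it uses random initialization and ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    # 0. initialize centers as k randomly chosen observations
    centers = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # 1. assign each observation to its closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. revise each center as the mean of its assigned observations
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. repeat until the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```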
Clustering images: by ocean, dog, clouds, etc
Clustering patients by medical condition: by subpopulations and diseases
- e.g. patients with seizures
- record seizure activity over time
- cluster the different types of seizures
Clustering products on amazon:
- discover product categories from purchase histories
- discover groups of users
- e.g. person buying crib also purchased baby seat => crib is "baby product", not "furniture"
Structuring web search results
- e.g. "cardinal" can have multiple meanings: bird, baseball team, Catholic cardinal
- use clustering to structure the output
Discovering similar neighborhoods: estimate price at a small region
- cluster regions with similar sales trends
Discovering similar neighborhoods: forecast violent crimes
- cluster regions
Clustering ML block diagram:
- Training data: doc id, document text
- x: tf-idf
- ML model: clustering
- ŷ: estimated cluster label
- y: does not exist, because this is unsupervised learning
- quality metric: computed from x & ŵ, i.e. the distances of observations to their assigned cluster centers
- ŵ: cluster centers
- ML algorithm: k-means
- Describe ways to represent a document (e.g., raw word counts, tf-idf ,...)
- Measure the similarity between two documents
- Discuss issues related to using raw word counts
- Normalize counts to adjust for document length
- Emphasize important words using tf-idf
- Implement a nearest neighbor search for document retrieval
- Describe the input (unlabeled observations) and output (labels) of a clustering algorithm
- Determine whether a task is supervised or unsupervised
- Cluster documents using k-means (algorithmic details to come...)
- Describe other applications of clustering
1 A country, called Simpleland, has a language with a small vocabulary of just the, on, and, go, round, bus, and wheels. For a word count vector with indices ordered as the words appear above, what is the word count vector for a document that simply says "the wheels on the bus go round and round".
Please enter the vector of counts as follows: If the counts were ["the"=1, on=3, "and"=2, "go"=1, "round"=2, "bus"=1, "wheels"=1], enter 1321211.
Tally: the = 2, on = 1, and = 1, go = 1, round = 2, bus = 1, wheels = 1
Ans: 2111211
2 In Simpleland, a reader is enjoying a document with a representation: [1 3 2 1 2 1 1]. Which of the following articles would you recommend to this reader next?
- [7 0 2 1 0 0 1] * [1 3 2 1 2 1 1] = 7 + 0 + 4 + 1 + 0 + 0 + 1 = 13
- [1 7 0 0 2 0 1] * [1 3 2 1 2 1 1] = 1 + 21 + 0 + 0 + 4 + 0 + 1 = 27
- [1 0 0 0 7 1 2] * [1 3 2 1 2 1 1] = 1 + 0 + 0 + 0 + 14 + 1 + 2 = 18
- [0 2 0 0 7 1 1] * [1 3 2 1 2 1 1] = 0 + 6 + 0 + 0 + 14 + 1 + 1 = 22
Ans: the second article (similarity 27)
3 A corpus in Simpleland has 99 articles. If you pick one article and perform 1-nearest neighbor search to find the closest article to this query article, how many times must you compute the similarity between two articles?
- Y 98
- 98 * 2 = 196
- 98/2 = 49
- (98)^2
- 99
4 For the TF-IDF representation, does the relative importance of words in a document depend on the base of the logarithm used? For example, take the words "bus" and "wheels" in a particular document. Is the ratio between the TF-IDF values for "bus" and "wheels" different when computed using log base 2 versus log base 10?
No
5 Which of the following statements are true? (Check all that apply):
- Y Deciding whether an email is spam or not spam using the text of the email and some spam / not spam labels is a supervised learning problem.
- Dividing emails into two groups based on the text of each email is a supervised learning problem.
- Y If we are performing clustering, we typically assume we either do not have or do not use class labels in training the model.
6 Which of the following pictures represents the best k-means solution? (Squares represent observations, plus signs are cluster centers, and colors indicate assignments of observations to cluster centers.)
2
Using text_analytics:
```python
obama = people[people['name'] == 'Barack Obama']
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
print obama['word_count']
```
Extract words and counts using stack:
```python
obama_word_count_table = obama[['word_count']] \
    .stack('word_count', new_column_name=['word', 'count']) \
    .sort('count', ascending=False)
```
```python
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])
obama = people[people['name'] == 'Barack Obama']
obama[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)
```
Manually computing distance:
```python
clinton = people[people['name'] == 'Bill Clinton']
graphlab.distances.cosine(obama['tfidf'][0], clinton['tfidf'][0])
```
Using nearest_neighbors:
```python
knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')
knn_model.query(obama, radius=0.84, k=10)
```
Use ML on the past history of your own and others' purchases.
Where we see recommender systems in action
- youtube, netflix movies
- amazon products: includes global and session interests, e.g. last year the user was buying newborn products
- facebook friend recommendations
- drug target interactions: drug d has been studied for target t. Can we discover new applications of d?
Building a recommender system via classification.
Level 0: rank by global popularity
- con: no personalization
Level 1: classification model. Inputs (user info, purchase history, product info, other info) feed a classifier that outputs yes/no: will this user buy this product?
- con: features may not be available, e.g. age, gender
- con: doesn't perform as well as collaborative filtering
Level 2: collaborative filtering
- e.g. a person buying diapers might also be interested in baby wipes, because others bought both
Co-occurrence matrix C (# items x # items), where C[i][j] = # users who bought both items i and j:
- symmetric matrix: C[i][j] = C[j][i] (reflective across the diagonal)
- when someone buys diapers, look at the row for diapers and find the other products people bought most frequently along with diapers
- e.g. row diapers: [..., dvd: 0, ..., pacifier: 4, ..., baby wipes: 100, ...]
(Sophie the Giraffe is a popular baby product.)
Can weight by my purchased items to refine, e.g.
Score(me, baby wipes) = 1/2 [S(wipes, diapers) + S(wipes, milk)]
Cons: does not utilize
- context: e.g. time of day
- user features: e.g. age
- product features: e.g. baby vs electronics
- cold start problem: what if this is a new product?
Also, very popular items drown out the rest of a row, e.g. row Sophie: [..., dvd: 0, ..., pacifier: 4, ..., baby wipes: 100, ..., diapers: 1M, ...]
Diapers is really popular, so it drowns out everything else in the domain.
Normalize with Jaccard similarity (similar in spirit to TF-IDF):
J(i, j) = (# who purchased i AND j) / (# who purchased i OR j), i.e. |i ∩ j| / |i ∪ j| in Venn notation (sketched below)
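A minimal sketch of Jaccard similarity over purchase sets (the user ids are illustrative):

```python
def jaccard(buyers_i, buyers_j):
    # buyers_i / buyers_j: sets of users who purchased item i / item j
    both = len(buyers_i & buyers_j)     # purchased i AND j
    either = len(buyers_i | buyers_j)   # purchased i OR j
    return float(both) / either if either else 0.0

print(jaccard({'u1', 'u2', 'u3'}, {'u2', 'u3', 'u4'}))   # 2 / 4 = 0.5
```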
An alternative similarity metric is cosine similarity
Cons
- does not utilize context (e.g. time of day), user features (e.g. age), product features (e.g. baby vs electronics)
- Cold start problem: no history for a new product or user
User | Movie | Rating |
---|---|---|
U1 | M1 | 3 |
U1 | M2 | 5 |
U1 | M3 | 2 |
U2 | M4 | 2 |
U2 | M5 | 4 |
U3 | M2 | 4 |
U3 | M4 | 4 |
U3 | M5 | 5 |
U3 | M6 | 4 |
The ratings can be arranged into a users x movies matrix with an X wherever a rating is known.
- Rating(u,v) is known for the X cells
- Rating(u,v) is unknown for the blank cells
The goal is to fill in the missing data.
- Describe movie v with topics Rv: (action, romance, drama, ...) = [0.3, 0.01, 1.5, ...]
- Describe user u with topics Lu: (action, romance, drama, ...) = [2.5, 0, 0.8, ...]
Estimate R̂ating(u,v) for an unrated movie as the dot product <Lu, Rv>:
[0.3, 0.01, 1.5] · [2.5, 0, 0.8] = 0.75 + 0 + 1.2 = 1.95
This score is not literally what the user would rate the movie, but the higher it is, the more likely they are to like it.
Rating(u,v) = <Lu, Rv>
We can fill in the whole matrix this way from all the Lu and Rv vectors.
Discovering hidden structure by matrix factorization.
Use the residual sum of squares, RSS (introduced earlier), over the observed ratings:
RSS(L, R) = Σ over known (u,v) of (Rating(u,v) - <Lu, Rv>)^2
The matrix factorization model approximates the ratings matrix with this factorization; a toy sketch is below.
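A toy sketch of learning L and R by shrinking RSS on the observed cells. Note it uses simple stochastic gradient steps rather than the coordinate descent named in the course outline, and all sizes and ratings are made up:

```python
import numpy as np

def factorize(ratings, n_users, n_movies, n_topics=3, steps=500, lr=0.01):
    # ratings: list of (u, v, rating) for the observed (X) cells only
    rng = np.random.RandomState(0)
    L = rng.rand(n_users, n_topics)    # user topic vectors Lu
    R = rng.rand(n_movies, n_topics)   # movie topic vectors Rv
    for _ in range(steps):
        for u, v, r in ratings:
            err = r - L[u].dot(R[v])   # residual on this observed rating
            L[u] += lr * err * R[v]    # small gradient steps that shrink RSS
            R[v] += lr * err * L[u]
    return L, R

# toy usage: 3 users x 4 movies with five observed ratings
L, R = factorize([(0, 0, 3), (0, 1, 5), (1, 3, 2), (2, 1, 4), (2, 2, 5)], 3, 4)
print(L.dot(R.T))   # predicted rating for every (user, movie) cell
```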
Limitations:
- cold start problem: a brand-new movie or user has no ratings
Featurized matrix factorization:
- features capture context
- matrix factorization captures groups of users
We can combine the two to mitigate the cold-start problem:
- rate for a new user from features only
- as we gather data on the user, shift toward the matrix factorization topics
Netflix Challenge 2006-2009
- winning team blended over 100 models
We don't want to use classification accuracy (liked vs. not liked):
- we're not interested in what a person does not like
- we want to quickly discover the relatively few liked items
- this is an imbalanced class problem
Users have a short attention span, so we want to recommend few items:
- there is a higher cost to missing a liked item
Recall = (# liked & shown) / (# liked), e.g. 3/5
Precision = (# liked & shown) / (# shown), e.g. 3/11 (a small sketch of both is below)
How to maximize recall? Recommend everything: every liked item is then shown, so recall = 1
- but precision becomes tiny, e.g. 1000 total products of which I like 6 => precision = 6/1000
The optimal recommender shows all the items I like and nothing else:
- precision = recall = 1
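A minimal sketch of both metrics, with made-up item ids that reproduce the 3/11 and 3/5 examples above:

```python
def precision_recall(shown, liked):
    shown_and_liked = len(set(shown) & set(liked))
    precision = float(shown_and_liked) / len(shown)
    recall = float(shown_and_liked) / len(liked)
    return precision, recall

# 11 items shown, 5 items liked, 3 of the liked items were shown
shown = ['i%d' % n for n in range(11)]
liked = ['i0', 'i1', 'i2', 'x1', 'x2']
print(precision_recall(shown, liked))   # (3/11 ~= 0.27, 3/5 = 0.6)
```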
- precision & recall = 1
- Input: specific recommender system
- Oupture: algoritm-specific precision-recall curve
A possible metric is AUC (Area under Curve). e.g. compare the blue curve to green curve below
Another metric is to set desired recall and maximize precision at k e.g.
- only want to display x items to fit on a single page
- assuming users will ignore anything more than x items
Recommender systems ML block diagram
- Training data = user, products, ratings
- x: user_id, product_id
- ML Model: matrix factorization
- ŵ: {L̂u, R̂v} + other weight parameters
- ŷ: predicted rating
- y = actual ratings (training data set)
- quality metric: RSS as well as others
1 Recommending items based on global popularity can (check all that apply):
- provide personalization
- capture context (e.g., time of day)
- none of the above
2 Recommending items using a classification approach can (check all that apply):
- provide personalization
- capture context (e.g., time of day)
- none of the above
3 Recommending items using a simple count based co-occurrence matrix can (check all that apply):
- provide personalization
- capture context (e.g., time of day)
- none of the above
4 Recommending items using featurized matrix factorization can (check all that apply):
- provide personalization
- capture context (e.g., time of day)
- none of the above
5 Normalizing co-occurrence matrices is used primarily to account for:
- people who purchased many items
- items purchased by many people
- eliminating rare products
- none of the above
6 A store has 3 customers and 3 products. Below are the learned feature vectors for each user and product. Based on this estimated model, which product would you recommend most highly to User #2?
| User ID | Feature vector |
|---|---|
| 1 | (1.73, 0.01, 5.22) |
| 2 | (0.03, 4.41, 2.05) |
| 3 | (1.13, 0.89, 3.76) |

| Product ID | Feature vector |
|---|---|
| 1 | (3.29, 3.44, 3.67) |
| 2 | (0.82, 9.71, 3.88) |
| 3 | (8.34, 1.72, 0.02) |
- Product #1
- Product #2
- Product #3
7 For the liked and recommended items displayed below, calculate the recall and round to 2 decimal points. (As in the lesson, green squares indicate recommended items, magenta squares are liked items. Items not recommended are grayed out for clarity.) Note: enter your answer in American decimal format (e.g. enter 0.98, not 0,98)
8 For the liked and recommended items displayed below, calculate the precision and round to 2 decimal points. (As in the lesson, green squares indicate recommended items, magenta squares are liked items. Items not recommended are grayed out for clarity.) Note: enter your answer in American decimal format (e.g. enter 0.98, not 0,98)
9 Based on the precision-recall curves in the figure below, which recommender would you use?
- RecSys #1
- RecSys #2
- RecSys #3

Ans: 1) 3; 2) 1,2; 3) 1; 4) 1,2; 5) 2; 6) 2 (Product #2); 7) 1/3 = 0.33; 8) 1/4 = 0.25; 9) 1 (RecSys #1)
Song data columns: [user_id, song_id, listen_count, title, artist, song]
```python
song_data['song'].show()
```
Show the unique users:
```python
users = song_data['user_id'].unique()
```
Set up train/test data:
```python
train_data, test_data = song_data.random_split(.8, seed=0)
```
Popularity-based recommender (popularity_recommender):
```python
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                          user_id='user_id',
                                                          item_id='song')
```
Note: we're using song as the item_id for the recommender above.
The recommendations for all users should be the same:
```python
popularity_model.recommend(users=[users[0]])
popularity_model.recommend(users=[users[1]])
```
Personalized recommender (item_similarity_recommender):
```python
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                 user_id='user_id',
                                                                 item_id='song')
```
Can recommend by user:
```python
personalized_model.recommend(users=[users[0]])
```
Or by song:
```python
personalized_model.get_similar_items(['With Or Without You - U2'])
```
Can compare() the two models on test_data and generate a graph:
```python
model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
graphlab.show_comparison(model_performance, [popularity_model, personalized_model])
```
Some useful papers on computer vision:
- SIFT - Lowe '99
- Spin Images - Johnson & Herbert '99
- Textons - Malik et al. '99
- RIFT - Lazebnik '04
- GLOH - Mikolajczyk & Schmid '05
- HoG - Dalal & Triggs '05
- SURF - Bay et al. '06
- ImageNet - Krizhevsky '12
Want to search for a product but not sure what keywords to use e.g. dress
- "dress floral" shows varied options
- Can click on "similar items" once we find something we like
Features are key to machine learning
Deep learning based on neural networks
- learning very non-linear features
Linear classifiers create a boundary between +ve and -ve scores
With neural networks, we use graphs to represent the score function:
Score(x) = w0 + Σ(wi × xi)
QN: If the perceptron takes an input of exactly 0, what should it output?
- ANS: implementation defined; an input of zero is an edge case.
What can a linear classifier represent?
- x1 OR x2
- x1 AND x2
What can't a linear classifier represent? XOR.
A simple one-hidden-layer neural network can represent XOR, as in the sketch below.
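A minimal sketch of such a network: two hidden linear classifiers (OR and NOT-AND) feed an AND output unit, and together they compute XOR.

```python
def step(score):
    # threshold activation: fire if the linear score is positive
    return 1 if score > 0 else 0

def xor(x1, x2):
    # hidden layer: two linear classifiers
    h1 = step(-0.5 + x1 + x2)    # x1 OR x2
    h2 = step(1.5 - x1 - x2)     # NOT (x1 AND x2)
    # output layer: h1 AND h2
    return step(-1.5 + h1 + h2)

for a in (0, 1):
    for b in (0, 1):
        print("%d XOR %d = %d" % (a, b, xor(a, b)))
```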
Neural Networks
- require lots of data
- require high performance systems
Image features are combined to make a prediction.
- e.g. nose + eye + eye + mouth => face
- in practice the detectors are much lower level: they don't find an "eye" but various local features
- the aim is to find "unique" features
An example hand-crafted feature algorithm is SIFT.
Deep learning instead implicitly learns the features.
Examples of image recognition using deep neural networks
- 99.5% German traffic sign
- 97.8% house number recognition
- ImageNet 2012 competition
- SuperVision team got huge gain over competitors with new algorithms and GPU implementation
Scene parsing: label parts of an image
Pros
- Enables learning of features rather than hand tuning
- Impressive performance gains
- Computer vision
- Speech recognition
- Some text analysis
- Potential for more impact
Cons
- Requires a lot of data for high accuracy
- Computationally really expensive
- Extremely hard to tune
- Choice of architecture
- Parameter types
- Hyperparameters
- Learning algorithm
- etc
Computationally expensive and hard to tune
Deep features allow us to build neural networks from smaller seed data (transfer learning).
In a neural net:
- the early layers are more generic and can be reused as a feature extractor
- the later layers are task specific
We do this by:
- keeping the weights fixed for the early layers
- using a simple classifier on top for the new task
The workflow looks like the sketch below.
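A sketch of that workflow in GraphLab Create, following the course's notebooks; the model name 'imagenet_model', the image_train SFrame, and its 'label' column are assumptions for illustration.

```python
import graphlab

# assumed: a pretrained network saved as 'imagenet_model' (trained on ImageNet)
deep_model = graphlab.load_model('imagenet_model')

# early layers stay fixed: the network is used only as a feature extractor
# (assumes the loaded model exposes extract_features, as in the course notebook)
image_train['deep_features'] = deep_model.extract_features(image_train)

# later-task specialization: a simple classifier trained on the deep features
classifier = graphlab.logistic_classifier.create(image_train,
                                                 target='label',
                                                 features=['deep_features'])
```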
Real-world use: Compology adds cameras to trash cans to detect how full they are.
- training data: image, label
- x: deep features
- ML model: logistic regression
- ŷ: predicted labels
- y: true labels
- quality metric: classification accuracy
- ŵ: weights of features
Summary
- Describe multi-layer neural network models
- Interpret the role of features as local detectors in computer vision
- Relate neural networks to hand-crafted image features
- Describe some settings where deep learning achieves significant performance boosts
- State the pros & cons of deep learning model
- Apply the notion of transfer learning
- Use neural network models trained in one domain as features for building a model in another domain
- Build an image retrieval tool using deep features
1 Which of the following statements are true? (Check all that apply)
- Linear classifiers are never useful, because they cannot represent XOR.
- Linear classifiers are useful, because, with enough data, they can represent anything.
- Having good non-linear features can allow us to learn very accurate linear classifiers.
- none of the above
2 A simple linear classifier can represent which of the following functions? (Check all that apply) Hint: If you are stuck, see https://www.coursera.org/learn/ml-foundations/module/nqC1t/discussions/AAIUurrtEeWGphLhfbPAyQ
- x1 OR x2 OR NOT x3
- x1 AND x2 AND NOT x3
- x1 OR (x2 AND NOT x3)
- none of the above
| a | b | c | a OR b OR NOT c |
|---|---|---|---|
| T | T | T | T |
| T | T | F | T |
| T | F | T | T |
| T | F | F | T |
| F | T | T | T |
| F | T | F | T |
| F | F | T | F |
| F | F | F | T |

| a | b | c | a AND b AND NOT c |
|---|---|---|---|
| T | T | T | F |
| T | T | F | T |
| T | F | T | F |
| T | F | F | F |
| F | T | T | F |
| F | T | F | F |
| F | F | T | F |
| F | F | F | F |

| a | b | c | a OR (b AND NOT c) |
|---|---|---|---|
| T | T | T | T |
| T | T | F | T |
| T | F | T | T |
| T | F | F | T |
| F | T | T | F |
| F | T | F | T |
| F | F | T | F |
| F | F | F | F |
3 Which of the the following neural networks can represent the following function? Select all that apply. (x1 AND x2) OR (NOT x1 AND NOT x2)
Hint: If you are stuck, see https://www.coursera.org/learn/ml-foundations/module/nqC1t/discussions/AAIUurrtEeWGphLhfbPAyQ
4 Which of the following statements is true? (Check all that apply)
- Features in computer vision act like local detectors.
- Deep learning has had impact in computer vision, because it’s used to combine all the different hand-created features that already exist.
- By learning non-linear features, neural networks have allowed us to automatically learn detectors for computer vision.
- none of the above
5 If you have lots of images of different types of plankton labeled with their species name, and lots of computational resources, what would you expect to perform better predictions:
- a deep neural network trained on this data.
- a simple classifier trained on this data, using deep features as input, which were trained using ImageNet data.
6 If you have a few images of different types of plankton labeled with their species name, what would you expect to perform better predictions:
- a deep neural network trained on this data.
- a simple classifier trained on this data, using deep features as input, which were trained using ImageNet data.

Quiz answers: 1) 3; 2) x 1,2,3; 3) 3 (x 3,5; x 3,4; x 5; x 1,2,5); 4) 1,2,3; 5) 1; 6) 2
Notebook quiz answers: bird; 6; 4; 35-37; 37-39; cat; 60-70
What is Production?
- Deployment: serve live predictions
- Evaluation: measure quality of deployed models
- Management: choose between deployed models
- Monitoring: track model quality & operations
These form a cycle: Deployment -> Evaluation -> Monitoring -> Management -> back to Deployment.
The pipeline has a batch side and a real-time side: historical data is used in batch to train a model; the deployed system serves real-time predictions on live data; feedback on those predictions flows back into training.
Use feedback to learn a Model 2
- run both and compare to Model 1
- continuous evaluation and testing
Evaluation = predictions + metrics
- track the data collected from users
- decide which metrics are used to evaluate
Offline evaluation: when do we update the model?
- e.g. SSE (sum squared error)
Online evaluation: choosing between models
- e.g. User engagement
Can use A/B Testing to choose between ML Models
Other production considerations
- A/B testing caveats
- versioning
- provenance
- dashboards
- reports
- etc
Choice of Model to use
- Lots of models are available
- e.g. classifier vs matrix factorization
Data representation
- e.g. how do I represent text: bag-of-words counts or tf-idf?
Scaling: Data is getting big
- e.g. too many social networks to get data from
- e.g. wearable devices are gathering a lot of data
- e.g. IoT
- Need methods that scale up to the amount of data available
Scaling: models are getting big
- e.g. clustering models applied on brain activity gets complicated
CPUs stopped getting faster
- marginal increases in last decade
Parallel architectures
- can solve for CPU limitations e.g. using GPUs or clusters
- concurrent system design and implementation is hard
- Self-driving cars
- specialized medicine
- combining real-time data gathering: e.g. localization, cameras, etc
- scaling up with amount of data
See last section in PDF