One of the most popular parametric linear models is logistic regression. Although, strictly speaking, it is a regression model producing continuous values in [0, 1], a small trick (interpreting those values as posterior class probabilities) makes it one of the most useful tools available to CV/ML engineers for building a strong baseline before delving into deep learning.
Let’s assume we are interested in using logistic regression to classify a set of observations into two classes (binary classification), e.g., whether an email is spam or not. For this exercise we use the Breast Cancer Dataset, which you can easily load from the scikit-learn Python package as follows.
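A minimal version (the held-out split and its parameters here are illustrative, not a requirement):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer Dataset: 569 samples, 30 real-valued features,
# binary labels (malignant vs. benign).
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test split so we can check generalization on unseen data later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```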
For our coding challenge, we are interested in learning the parameters of a logistic regression model on the Breast Cancer Dataset. Alongside this doc you’ll find our bare-bones implementation of logistic regression. In our implementation, we intend to train the model with Stochastic Gradient Descent + log-loss and get performance comparable to its better-known public implementation.
To test the correctness of your implementation we use the publicly available SGDClassifier as a strong baseline to provide guidelines on the expected accuracy. NOTE: We don’t expect your implementation to outperform SGDClassifier (we won’t complain if it does 😄).
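For reference, a baseline along these lines might be configured as follows. Treat this only as a sketch: choosing the right arguments is part of the exercise, and the feature scaling and hyperparameters shown here are our assumptions, not requirements.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A linear model trained with SGD on log-loss, i.e. logistic regression.
# (On scikit-learn < 1.1 the loss is spelled "log" rather than "log_loss".)
baseline = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="log_loss", max_iter=1000, random_state=42),
)
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))
```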
- Modify the arguments for `SGDClassifier` to fully support a linear model. These modifications will depend on your implementation of `__compute_loss`.
- Derive the gradient updates for the weights and bias of the model from scratch. You can do it on a sheet of paper and send us a photo (a checkpoint for the result is given right after this list).
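As a checkpoint for that derivation, the standard result for a single sample $(x, y)$ with $y \in \{0, 1\}$ is:

$$\hat{y} = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

$$\mathcal{L}(y, \hat{y}) = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big].$$

Using $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, the chain rule gives

$$\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y)\, x, \qquad \frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y.$$

Averaging these over a mini-batch yields the SGD update direction.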
As a coding task we’d like you to implement the following function(s) (a minimal end-to-end sketch follows this list):

- `clean_data`
  - Remove any noise in the training data with heuristics
- `fit`
  - Given features and ground-truth labels:
    - Loop over the data for epochs / iterations
    - Build a random mini-batch
    - Compute the log-loss using `__compute_loss` below
    - Compute the gradients given the log-loss and mini-batch
    - Update the weights (`self.w`) and bias (`self.b`) of the model
- `predict`
  - Given samples, predict the labels with the trained weights and bias
  - Currently we have set `predict` to assign random labels
- `__compute_loss`
  - Compute the loss over a batch
  - Currently we set it to a hard-coded value (0.0)
- `__compute_gradient`
  - Compute the gradient given the loss / batch
  - Currently set to zero (no updates)
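To make the expected shapes concrete, here is a minimal end-to-end sketch. Only `fit`, `predict`, `__compute_loss`, `__compute_gradient`, `self.w`, and `self.b` come from the provided skeleton; everything else (the class name, `lr`, `batch_size`, `n_epochs`, the seeded generator) is illustrative, and note that for log-loss the gradient can be computed directly from the batch without using the loss value itself.

```python
import numpy as np

class LogisticRegressionSGD:
    def __init__(self, n_features, lr=0.01, batch_size=32, n_epochs=100, seed=42):
        self.rng = np.random.default_rng(seed)
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr
        self.batch_size = batch_size
        self.n_epochs = n_epochs
        self.losses = []  # track mini-batch losses for the convergence check

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def __compute_loss(self, X, y):
        # Mean log-loss over the batch; eps guards against log(0).
        eps = 1e-12
        p = np.clip(self._sigmoid(X @ self.w + self.b), eps, 1 - eps)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def __compute_gradient(self, X, y):
        # Gradients of the mean log-loss w.r.t. w and b (see derivation above).
        p = self._sigmoid(X @ self.w + self.b)
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        return grad_w, grad_b

    def fit(self, X, y):
        for epoch in range(self.n_epochs):
            # Build random mini-batches by shuffling indices each epoch.
            idx = self.rng.permutation(len(y))
            for start in range(0, len(y), self.batch_size):
                batch = idx[start:start + self.batch_size]
                self.losses.append(self.__compute_loss(X[batch], y[batch]))
                grad_w, grad_b = self.__compute_gradient(X[batch], y[batch])
                self.w -= self.lr * grad_w
                self.b -= self.lr * grad_b
        return self

    def predict(self, X):
        # Threshold the predicted probability at 0.5.
        return (self._sigmoid(X @ self.w + self.b) >= 0.5).astype(int)
```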
We will evaluate your submission on the following criteria:

- Correctness: Does your code do the right thing?
- Objective: Comment on your choice of loss function
- Convergence: Does the loss decrease with the number of iterations?
- Blind baseline: Is your classifier better than a random classifier?
A few questions to answer along with your submission:

- Is minimizing the loss the best criterion to perform early stopping?
- Does the model guarantee performance on an unseen dataset?
- How do `lr` and `batch_size` affect convergence?
- Bonus questions:
  a. Is it possible to modify the training data and learn just the weight vector?
  b. Add a function `__dropout`, which randomly sets some of the feature values to zero during training (one possible form is sketched below). How will you incorporate it during `fit` / `predict`?
  c. Does `__dropout` help with convergence / overfitting?
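For orientation, one common form of such a `__dropout` is "inverted" dropout; the rate and the seeded `self.rng` generator here are our assumptions, and how you wire it into training is up to you:

```python
def __dropout(self, X, rate=0.2):
    # Zero out each feature value independently with probability `rate`,
    # rescaling the survivors by 1/(1 - rate) so that the expected input
    # magnitude is unchanged. Apply this only inside fit; predict should
    # see the raw, unmasked features.
    mask = self.rng.random(X.shape) >= rate
    return X * mask / (1.0 - rate)
```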