Introduction:

The workshop series is designed with a focus on the practical aspects of machine learning. We will be working in Python and using real-world datasets from Kaggle, the machine learning platform most suited for the “learn-by-doing” philosophy. The series is targeted towards complete beginners familiar with Python, but it is also designed adaptively so that you will be challenged even if you have some familiarity with machine learning tools.

The four-session workshop is going to be very hands-on and will focus on how to work with datasets. Instead of comprehensively covering every tool and concept, you will learn the minimal but most useful tools and concepts quickly and learn how to find resources to explore further.

Timeline:
Session 1: 5:30-7:30 pm on Thursday March 28, 2019 at Aviation Room, HMC
Session 2: 5:30-7:30 pm on Thursday April 4, 2019 at Shan 2454, HMC
Session 3: 5:30-7:30 pm on Thursday April 11, 2019 at Aviation Room, HMC
Session 4: 5:30-7:30 pm on Thursday April 18, 2019 at Shan 2454, HMC

This series is a precursor to a future Deep Learning workshop series.

General structure of each two-hour session in the workshop series:

  • Guided session
  • Hands-on exercise
  • Project work

Four sessions are planned in the series with the following time allocations:

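| Session | Guided session (min) | Hands-on exercise (min) | Total time (min) |
|---------|----------------------|-------------------------|------------------|
| 1       | 50                   | 70                      | 120              |
| 2       | 30                   | 90                      | 120              |
| 3       | 40                   | 80                      | 120              |
| 4       | 90                   | 30                      | 120              |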

Topics covered in the guided sessions and hands-on exercises:

Session 1: Exploratory Data Analysis and Feature Engineering using Pandas - 1

  • Pandas dataframes as the data structure for datasets
  • Converting csv files to dataframes
  • Slicing and indexing dataframes using conditionals as well as the iloc and loc indexers
  • Statistical summary and exploration of dataframes
  • Detecting and filling missing values in the dataframes
  • Regular expressions for data extraction
  • Feature engineering such as creating new features
  • Basic statistical plots using matplotlib and seaborn
  • Correlation among features
  • Basic operations such as dropping rows/columns, setting index, replacing values of a column using a dictionary, etc.
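
As a rough illustration of the Session 1 topics above, here is a minimal sketch, assuming pandas is installed. The toy data, column names, and the title-extraction pattern are hypothetical and are not taken from the workshop notebooks.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a Kaggle csv file.
csv_text = """Name,Age,Fare
"Smith, Mr. John",22,7.25
"Doe, Mrs. Jane",,71.28
"Lee, Miss. Ann",30,8.05
"""

# Convert csv content to a dataframe.
df = pd.read_csv(io.StringIO(csv_text))

# Conditional (boolean) indexing, then label-based and position-based access.
cheap = df[df["Fare"] < 10]
first_age = df.loc[0, "Age"]
last_fare = df.iloc[-1, 2]

# Detect missing values, then fill them with the column median.
n_missing = int(df["Age"].isna().sum())
df["Age"] = df["Age"].fillna(df["Age"].median())

# Regular expression that extracts a new feature (the title) from the name.
df["Title"] = df["Name"].str.extract(r",\s*(\w+)\.", expand=False)

print(df)
```

The median is only one possible fill strategy; the right choice depends on the dataset.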

Session 2: Exploratory Data Analysis and Feature Engineering using Pandas - 2

  • Split-apply-combine operations by grouping rows of a dataframe
  • Encoding categorical variables
  • Concatenating and merging dataframes
  • More operations such as sorting rows, creating a dataframe from scratch, etc.
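
The Session 2 operations listed above could look roughly like the following sketch, assuming pandas is installed. The dataframes, column names, and values are made up for illustration.

```python
import pandas as pd

# Create dataframes from scratch (hypothetical toy data).
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "item": ["pen", "ink", "pen", "ink"],
    "units": [10, 4, 7, 9],
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Claremont", "Pomona"]})

# Split-apply-combine: group rows by store and sum the units in each group.
units_per_store = sales.groupby("store")["units"].sum()

# Encode a categorical column as one-hot indicator columns.
encoded = pd.get_dummies(sales, columns=["item"])

# Merge on the shared key, then sort rows by units in descending order.
merged = sales.merge(stores, on="store", how="left").sort_values(
    "units", ascending=False
)

# Concatenate dataframes row-wise, renumbering the index.
stacked = pd.concat([sales, sales], ignore_index=True)

print(units_per_store)
print(merged)
```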

Session 3: Model Building, Tuning and Validation using Scikit-learn - 1

  • Overfitting and underfitting of models
  • Regression algorithms
    • Linear Regression
    • Polynomial Regression
    • Ridge Regression
    • Lasso Regression
  • Model Validation
  • Tuning the regularization parameter
  • Evaluation metrics for regression - R-squared and Root Mean-Squared Error (RMSE)
  • Normalization and scaling of features
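
To make the Session 3 topics concrete, here is a minimal sketch, assuming scikit-learn and NumPy are installed. It scales features, fits Ridge regression for a few candidate values of the regularization parameter, and compares R-squared and RMSE on a held-out validation set. The synthetic data and the candidate `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical): five features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Hold out a validation set for model validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Try several regularization strengths and record (R-squared, RMSE) for each.
results = {}
for alpha in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
    results[alpha] = (r2_score(y_val, pred), rmse)

# Pick the alpha with the lowest validation RMSE.
best_alpha = min(results, key=lambda a: results[a][1])
print(best_alpha, results[best_alpha])
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking validation data into the scaling step.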

Session 4: Model Building, Tuning and Validation using Scikit-learn - 2

  • Classification algorithms
    • Logistic Regression
    • Decision Trees
    • k-Nearest Neighbors
    • Support Vector Machines
    • Random Forests
  • Evaluation metrics for classification
    • Classification accuracy
    • Confusion matrix
    • Decision Threshold
    • Precision and Recall
    • F1 score
    • Area Under ROC curve
  • Dimensionality reduction (Optional)
    • Principal Component Analysis (PCA)
  • k-fold Cross-validation
  • Maximum Voting Classifiers
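
A few of the Session 4 topics above can be sketched as follows, assuming scikit-learn is installed. The synthetic dataset is hypothetical. The sketch also assumes that "maximum voting" refers to hard (majority) voting, which scikit-learn exposes as `VotingClassifier(voting="hard")`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit one classifier and compute several evaluation metrics on the test set.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)
f1 = f1_score(y_test, pred)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Hard-voting ensemble evaluated with 5-fold cross-validation.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
cv_scores = cross_val_score(voter, X, y, cv=5)

print(cm)
print(f1, auc, cv_scores.mean())
```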

Pre-requisites:

  • Python programming basics (HMC CS-5 or equivalent should suffice)
  • Some familiarity with common statistical concepts (HMC MATH-35 or equivalent should suffice)

Learning materials:

The learning material is shared in the GitHub repository. You can download the entire repository and run the notebooks on your own system by installing Jupyter Notebook through the Anaconda distribution for Python 3. Alternatively, you can fork the notebooks from the following links and run them using Kaggle Kernels, a cloud computing environment that does not require any installation.

The solutions for the guided sessions and exercise notebooks are available in the GitHub repository but not on Kaggle. The material is designed to be self-sufficient, so it remains useful if you miss a session.

Team:

Instructor: Aashita Kesarwani
TAs: Rex Asabor, Ben Langton and Qualan Woodard

Seats are limited, so please register using this link. To get the most out of the series, it is important that you attend all four sessions.