The broad curriculum elements for the workshop are listed below. We will showcase some of them in this workshop.
- Introduction - “I think, therefore I am”
- What is data analysis?
- What type of questions can be answered?
- Frame/Acquire/Refine/Explore/Model/Insight framework
- Acquire - "Data is the new oil"
- Sources of Data - Downloaded from an internal system, obtained from a client or other 3rd party, extracted from a web-based API, scraped from websites / PDFs, or gathered manually and recorded
- Acquire data from a csv file or a database
- Acquire data from a 3rd party client (e.g. Twitter)
- Refine - "Data is messy"
- Concept of Tidy Data - Why is it important?
- Missing e.g. check for missing or incomplete data
- Quality e.g. check for duplicates, accuracy, unusual data
- Parse e.g. extract year from date
- Merge e.g. first name and surname into full name
- Convert e.g. free text to coded value
- Derive e.g. gender from title
- Calculate e.g. percentages, proportions
- Remove e.g. remove redundant data
- Aggregate e.g. rollup by year, cluster by area
- Filter e.g. exclude based on location
- Sample e.g. extract a representative sample of the data
- Summary e.g. show summary stats like mean
- Basic statistics: variance, standard deviation, covariance, correlation
- Explore - "I don't know what I don't know"
- Why do visual exploration?
- Understand Data Structure & Types
- Explore single variable graphs - Quantitative, Categorical
- Explore dual variable graphs - Q & Q, Q & C, C & C
- Explore multi-dimensional variable graphs
- Model - "All models are wrong, but some are useful"
- Introduction to Machine Learning
- The power and limits of models
- Tradeoff between Prediction Accuracy and Model Interpretability
- Assessing Model Accuracy
- For Regression problems - RMSE
- For Classification problems - Precision, Recall, ROC/AUC, F-Score, Misclassification rate
- Bias-Variance tradeoff
- Overfitting
- Linear Regression
- Logistic Regression
- L1- and L2-regularized Linear & Logistic Regression
- Regularization
- Classification model
- Decision Trees
- Visualizing decision trees
- Insight - “The goal is to turn data into insight”
- Why do we need to communicate insight?
- Types of communication - Exploration vs. Explanation
- Explanation: Telling a story with data
- Exploration: Building an interface for people to find stories
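
To give a flavour of the "Assessing Model Accuracy" items in the Model section, here is a minimal sketch in plain Python of RMSE, precision, recall, F-score and misclassification rate. The function names (`rmse`, `classification_scores`) and the toy labels are hypothetical and chosen only for illustration; the workshop itself may well use a library such as scikit-learn instead.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a regression problem."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def classification_scores(y_true, y_pred, positive=1):
    """Precision, recall, F-score and misclassification rate for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    errors = sum(1 for t, p in pairs if t != p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "misclassification_rate": errors / len(pairs),
    }


# Toy data, made up purely for illustration.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 0]
scores = classification_scores(actual, predicted)
regression_error = rmse([0.0, 0.0], [3.0, 4.0])
```

The F-score here is the balanced F1 variant (the harmonic mean of precision and recall); AUC/ROC is left out because it needs predicted scores rather than hard labels.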