Commit

Update documentation
kvarada committed Dec 7, 2023
1 parent b7d3e47 commit 2ec9cca
Showing 24 changed files with 2,060 additions and 72 deletions.
2 changes: 2 additions & 0 deletions README.html
@@ -179,6 +179,8 @@
<li class="toctree-l1"><a class="reference internal" href="lectures/20_survival-analysis.html">Lecture 20: Survival analysis</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/21_communication.html">Lecture 21: Communication</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/23_deployment-conclusion.html">Lecture 23: Deployment and conclusion</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/final-review.html">Final review guiding questions</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/A-Quick-intro-to-LLMs.html">Bonus: A high-level quick introduction to LLMs</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Demos</span></p>
<ul class="nav bd-sidenav">
Binary file added _images/Markov-bigram-probs.png
Binary file added _images/baby-chatGPT-ex.png
Binary file added _images/model-sizes.png
Binary file added _images/smart-compose.gif
Binary file added _images/voice-assistant-ex.png
751 changes: 751 additions & 0 deletions _sources/lectures/A-Quick-intro-to-LLMs.ipynb

Large diffs are not rendered by default.

173 changes: 140 additions & 33 deletions _sources/lectures/final-review.ipynb
@@ -42,9 +42,9 @@
"metadata": {},
"source": [
"### Clustering \n",
"- Why clustering?\n",
"- Clustering methods\n",
"- Clustering evaluation "
"- Why clustering and what is the problem of clustering?\n",
"- Compare and contrast different clustering methods.\n",
"- What’s the difficulty in evaluation of clustering? How do we evaluate clusters?"
]
},
{
@@ -55,73 +55,180 @@
"| Scenario | Which clustering method? | \n",
"|------------------------|--------------------------|\n",
"| Well-separated spherical clusters | |\n",
"| Large Datasets | | \n",
"| Large datasets | | \n",
"| Flexibility with cluster shapes | | \n",
"| Small to medium datasets | | \n",
"| Prior knowledge on how many clusters | | \n",
"| Clusters are roughly of equal size | | \n",
"| Irregularly shaped clusters | | \n",
"| Noise and outliers | | \n",
"| Unknown number of clusters | | \n",
"| Clusters with different densities | | \n",
"| Datasets with hierarchical relationships | | \n",
"| No prior knowledge on number of clusters | | \n",
"| Flexibility with cluster shapes | | \n",
"| Small to medium datasets | | \n"
"| Noise and outliers | | "
]
},
{
"cell_type": "markdown",
"id": "f618f63b-cfff-4b98-ab5a-03a4b5d088a9",
"metadata": {},
"source": []
"source": [
"- Which clustering method would you use in each of the scenarios below? Why?\n",
"- How would you represent the data in each case? \n",
" - Scenario 1: Customer segmentation in retail\n",
" - Scenario 2: An environmental study aiming to identify clusters of a rare plant species\n",
" - Scenario 3: Clustering furniture items for inventory management and customer recommendations"
]
},
{
"cell_type": "markdown",
"id": "268ce357-8124-404d-9b14-3896b3d0bc63",
"metadata": {},
"source": [
"- How do we decide on the number of clusters? \n",
"- What’s the difficulty in evaluation of clustering? How do we evaluate clusters?"
]
},
{
"cell_type": "markdown",
"id": "fe978d2b-fced-40b2-ad35-b9a358bc03f9",
"metadata": {},
"source": [
"### Recommender systems \n",
"- Problem of recommender systems\n",
"- Baselines\n",
"- "
"- What’s the utility matrix?\n",
"- How do we evaluate recommender systems?\n",
"- What are the baseline models we talked about?\n",
" - Global average\n",
" - Per user average\n",
" - Per item average\n",
"- Evaluation of recommender systems\n",
"- Compare and contrast KNN Imputer, collaborative filtering, and content-based filtering \n",
"- Ethical issues associated with recommender systems "
]
},
{
"cell_type": "markdown",
"id": "e6c94f08-f636-4cff-ad83-7f385a15a322",
"id": "ec9fa25f-946f-4f20-b2b6-95736fa48445",
"metadata": {},
"source": [
"- Which clustering method in what scenario?"
"### Introduction to NLP \n",
"\n",
"- Embeddings\n",
" - What are different document and word representations we talked about?\n",
" - Why do we care about creating different representations?\n",
"  - What are pre-trained models? What are the benefits of using them?\n",
"- Topic modeling \n",
" - What is topic modeling? What are the inputs and outputs of topic modeling?\n",
"  - How is it different from clustering documents using a clustering model, say KMeans?\n",
"- Text Preprocessing\n"
]
},
{
"cell_type": "markdown",
"id": "3514acb8-3fde-4ef5-8a6c-81c1ec2b6237",
"id": "5f40b693-aa61-4f69-ab41-1418fccc99cc",
"metadata": {},
"source": [
"\n",
"\n",
"- Why clustering and what is the problem of clustering?\n",
"- What are the three methods of clustering we talked about?\n",
"- What’s the difficulty in evaluation of clustering? How do we evaluate clusters?\n",
"- What’s the problem of recommender systems?\n",
"- What’s the utility matrix?\n",
"- How do we evaluate recommender systems?\n",
"- What are the baseline models we talked about?\n",
"- What are the two recommender systems methods we talked about?\n",
"- What are different document and word representations we talked about?\n",
"- Why do we care about creating different representations?\n",
"- What are pre-trained models? What are the benefits of using them?\n",
"- What is topic modeling? What are the inputs and outputs of topic modeling?\n",
"### Multiclass classification and computer vision \n",
"- What’s the difference between OVR and OVO?\n",
"- What are the methods we saw to use pre-trained image classification models for our image classification tasks?\n",
"- What is time series?\n",
"- What’s wrong with using our usual `train_test_split` on time-series data?\n",
"- What are lag features?\n",
"- How can we forecast into the future?\n",
"- What’s wrong with using binary classification models on right censored data?\n",
" - Out of the box\n",
" - Using pre-trained models as feature extractors\n",
" - Fine-tuning pre-trained models for our task (only mentioned) "
]
},
{
"cell_type": "markdown",
"id": "ef27986e-521a-487b-84c3-e62f029e3b30",
"metadata": {},
"source": [
"How would you use a pre-trained model in each case below? \n",
"- Imagine you want to quickly develop a prototype for an app that can identify different cat breeds from photos. \n",
"- Suppose you're working on a project to predict which Canadian city a photo was taken in, based on the landmarks it shows, a task for which there's limited training data available.\n",
"- Suppose you're developing a system to diagnose specific types of tumors from MRI scans. "
]
},
{
"cell_type": "markdown",
"id": "7c3c8d4b-76f2-4278-9232-0f6a28501df0",
"metadata": {},
"source": [
"### Time series\n",
"\n",
"- When is time series analysis appropriate? \n",
" - Time series analysis is used when there is a temporal aspect in the data.\n",
"- **Data splitting**: Data should be split based on time to avoid future data leaking into the training set.\n",
"- **Essential questions for Exploratory Data Analysis (EDA)**:\n",
" - What is the frequency of data collection (e.g., hourly, daily)?\n",
" - How many time series are present within the dataset?\n",
" - Are there any gaps or missing values in the data?\n",
"- **Feature engineering**\n",
"    - Derive new features from the date/time column.\n",
"    - Encode features appropriately for the chosen model.\n",
"    - Create lag features to incorporate past values for prediction.\n",
"- **Baseline model approach**: Employ a simple model, such as using today's target value to predict tomorrow's, as a starting point for comparison.\n",
"- **Cross-Validation Method for Time Series**: In `sklearn`, use `TimeSeriesSplit` as the `cv` parameter in functions like `cross_validate` or `cross_val_score` for time-appropriate validation.\n",
"- **Strategies for long-term forecasting**:\n",
" - Generate forecasts for sequential time steps by assuming the predictions for the previous steps are accurate. \n",
"- **Trends** \n",
" - A 'days since' feature to capture the trend over time"
]
},
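Note on the time series cell above: lag features, the "today predicts tomorrow" baseline, and time-ordered splitting can be sketched without any libraries. The series values and both helper functions are hypothetical; `expanding_window_splits` is only similar in spirit to `sklearn`'s `TimeSeriesSplit` and is not a substitute for it.

```python
def make_lag_features(values, n_lags):
    """Return (X, y): each row of X holds the n_lags values that precede its target."""
    X, y = [], []
    for t in range(n_lags, len(values)):
        X.append(values[t - n_lags:t])  # past values only, oldest first
        y.append(values[t])
    return X, y


def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs in which training rows always precede test rows."""
    test_size = n_samples // (n_splits + 1)
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Made-up daily measurements, ordered in time.
series = [10, 12, 13, 15, 18, 21, 25, 30]
X, y = make_lag_features(series, n_lags=2)

# Baseline: predict each target with the most recent lag (today's value).
baseline_errors = [abs(row[-1] - target) for row, target in zip(X, y)]
baseline_mae = sum(baseline_errors) / len(baseline_errors)

splits = list(expanding_window_splits(len(X), n_splits=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
print("baseline MAE:", baseline_mae)
```

Each fold trains on a longer prefix of the series and tests on the rows that follow it, so no future observation leaks into training; a shuffled split would break exactly this ordering.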
{
"cell_type": "markdown",
"id": "ba774c9b-7865-4ecc-b2ab-2e1276edf6ea",
"metadata": {},
"source": [
"### Survival analysis \n",
"- What is right-censored data?\n",
"- What happens when we treat right-censored data the same as \"regular\" data?\n",
" - Predicting churn vs. no churn\n",
" - Predicting tenure\n",
" - Throw away people who haven't churned\n",
" - Assume everyone churns today\n",
"- Survival analysis covers both churn and tenure prediction, handles censoring, and can make rich and useful predictions!\n",
" - We can get survival curves which show the probability of survival over time.\n",
" - KM model $\\rightarrow$ doesn't look at features\n",
" - CPH model $\\rightarrow$ like linear regression, does look at the features and provides coefficients associated with each feature"
]
},
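Note on the survival analysis cell above: a Kaplan-Meier survival curve can be computed by hand on a toy dataset to see how censored rows still contribute. This is a minimal sketch, not the `lifelines` implementation; the tenures, the `observed` flags, and the `kaplan_meier` function name are made up for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time where an event was observed."""
    survival = 1.0
    curve = []
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    for t in event_times:
        # Everyone whose duration is at least t is still at risk, censored or not.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up tenures in months; observed=False means the customer has not churned yet
# (right-censored), so we only know their tenure is at least this long.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]

curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(f"month {time}: P(still a customer) = {prob:.2f}")

# The two naive alternatives from the cell above, for comparison.
mean_if_censored_dropped = sum(d for d, o in zip(durations, observed) if o) / sum(observed)
mean_if_all_churn_today = sum(durations) / len(durations)
```

Censored customers stay in the at-risk count until their last known month, which is what both naive options (throwing them away, or assuming they churn today) get wrong. Like the KM model in the cell, this estimator ignores features entirely.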
{
"cell_type": "markdown",
"id": "3514acb8-3fde-4ef5-8a6c-81c1ec2b6237",
"metadata": {},
"source": [
"### Communication \n",
"- Why is communication important in ML and Data Science? \n",
"- What are the principles of a good explanation?\n",
"- What should we watch out for when producing or consuming visualizations?"
]
},
{
"cell_type": "markdown",
"id": "95151429-493e-4891-8e1a-a0174a260cee",
"metadata": {},
"source": [
"### Ethics\n",
"- Bias and fairness"
]
},
{
"cell_type": "markdown",
"id": "53f61364-0bf2-4e74-a85a-1c222b123bf7",
"metadata": {},
"source": [
"### Deployment\n",
"\n",
"- Deploying a model as a web app\n",
"- Deploying a model as a REST API"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70a077f4-2362-482b-b65d-5ed815fb6dba",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
2 changes: 2 additions & 0 deletions genindex.html
@@ -175,6 +175,8 @@
<li class="toctree-l1"><a class="reference internal" href="lectures/20_survival-analysis.html">Lecture 20: Survival analysis</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/21_communication.html">Lecture 21: Communication</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/23_deployment-conclusion.html">Lecture 23: Deployment and conclusion</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/final-review.html">Final review guiding questions</a></li>
<li class="toctree-l1"><a class="reference internal" href="lectures/A-Quick-intro-to-LLMs.html">Bonus: A high-level quick introduction to LLMs</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Demos</span></p>
<ul class="nav bd-sidenav">
4 changes: 3 additions & 1 deletion lectures/17_natural-language-processing.html
@@ -180,6 +180,8 @@
<li class="toctree-l1"><a class="reference internal" href="20_survival-analysis.html">Lecture 20: Survival analysis</a></li>
<li class="toctree-l1"><a class="reference internal" href="21_communication.html">Lecture 21: Communication</a></li>
<li class="toctree-l1"><a class="reference internal" href="23_deployment-conclusion.html">Lecture 23: Deployment and conclusion</a></li>
<li class="toctree-l1"><a class="reference internal" href="final-review.html">Final review guiding questions</a></li>
<li class="toctree-l1"><a class="reference internal" href="A-Quick-intro-to-LLMs.html">Bonus: A high-level quick introduction to LLMs</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Demos</span></p>
<ul class="nav bd-sidenav">
@@ -503,7 +505,7 @@ <h2>Imports<a class="headerlink" href="#imports" title="Permalink to this headin
Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
</pre></div>
</div>
<img alt="../_images/13f1dcc7bbd51e751f1492477095ac4c5be28e5882c481cbbeec55b7e4404ade.png" src="../_images/13f1dcc7bbd51e751f1492477095ac4c5be28e5882c481cbbeec55b7e4404ade.png" />
<img alt="../_images/a705a88012e43ad177c2422afb7dc054441241fbb59c080ab2a8b1e736a59e7e.png" src="../_images/a705a88012e43ad177c2422afb7dc054441241fbb59c080ab2a8b1e736a59e7e.png" />
</div>
</div>
<div class="cell docutils container">