Update documentation

UBC-CS · Dec 7, 2023 · 2ec9cca
1 parent b7d3e47
commit 2ec9cca
Showing 24 changed files with 2,060 additions and 72 deletions.
diff --git a/README.html b/README.html
@@ -179,6 +179,8 @@
 <li class="toctree-l1"><a class="reference internal" href="lectures/20_survival-analysis.html">Lecture 20: Survival analysis</a></li>
 <li class="toctree-l1"><a class="reference internal" href="lectures/21_communication.html">Lecture 21: Communication</a></li>
 <li class="toctree-l1"><a class="reference internal" href="lectures/23_deployment-conclusion.html">Lecture 23: Deployment and conclusion</a></li>
+<li class="toctree-l1"><a class="reference internal" href="lectures/final-review.html">Final review guiding questions</a></li>
+<li class="toctree-l1"><a class="reference internal" href="lectures/A-Quick-intro-to-LLMs.html">Bonus: A high-level quick introduction to LLMs</a></li>
 </ul>
 <p aria-level="2" class="caption" role="heading"><span class="caption-text">Demos</span></p>
 <ul class="nav bd-sidenav">

diff --git a/_images/746252d950f7f2ec7c8d53896b6c0eefbfb9c76ed6a5bd2e7a18c09b58d8b993.png b/_images/746252d950f7f2ec7c8d53896b6c0eefbfb9c76ed6a5bd2e7a18c09b58d8b993.png
diff --git a/_images/Markov-bigram-probs.png b/_images/Markov-bigram-probs.png
diff --git a/_images/a705a88012e43ad177c2422afb7dc054441241fbb59c080ab2a8b1e736a59e7e.png b/_images/a705a88012e43ad177c2422afb7dc054441241fbb59c080ab2a8b1e736a59e7e.png
diff --git a/_images/baby-chatGPT-ex.png b/_images/baby-chatGPT-ex.png
diff --git a/_images/e6b68d1a072ea4cc94b3c08f9b6fd9fada94253e717feaba1a1ed7fb8684d9b0.png b/_images/e6b68d1a072ea4cc94b3c08f9b6fd9fada94253e717feaba1a1ed7fb8684d9b0.png
diff --git a/_images/f1b36c60738d71124c03e96b6131d27689387250f6b81d27a4eaf4fa963b8667.png b/_images/f1b36c60738d71124c03e96b6131d27689387250f6b81d27a4eaf4fa963b8667.png
diff --git a/_images/model-sizes.png b/_images/model-sizes.png
diff --git a/_images/smart-compose.gif b/_images/smart-compose.gif
diff --git a/_images/voice-assistant-ex.png b/_images/voice-assistant-ex.png
diff --git a/_sources/lectures/A-Quick-intro-to-LLMs.ipynb b/_sources/lectures/A-Quick-intro-to-LLMs.ipynb
diff --git a/_sources/lectures/final-review.ipynb b/_sources/lectures/final-review.ipynb
@@ -42,9 +42,9 @@
    "metadata": {},
    "source": [
     "### Clustering \n",
-    "- Why clustering?\n",
-    "- Clustering methods\n",
-    "- Clustering evaluation "
+    "- Why clustering and what is the problem of clustering?\n",
+    "- Compare and contrast different clustering methods.\n",
+    "- What’s the difficulty in evaluation of clustering? How do we evaluate clusters?"
    ]
   },
   {
@@ -55,73 +55,180 @@
     "|     Scenario           | Which clustering method? | \n",
     "|------------------------|--------------------------|\n",
     "| Well-separated spherical clusters  |  |\n",
-    "| Large Datasets    |  | \n",
+    "| Large datasets    |  | \n",
+    "| Flexibility with cluster shapes  |  | \n",
+    "| Small to medium datasets  |  | \n",
     "| Prior knowlege on how many clusters   |  | \n",
     "| Clusters are roughly of equal size   |  | \n",
     "| Irregularly shaped clusters   |  | \n",
-    "| Noise and outliers   |  | \n",
     "| Unknown number of clusters   |  | \n",
     "| Clusters with different densities  |  | \n",
     "| Datasets with hierarchical relationships  |  | \n",
     "| No prior knowledge on number of clusters  |  | \n",
-    "| Flexibility with cluster shapes  |  | \n",
-    "| Small to medium datasets  |  | \n"
+    "| Noise and outliers   |  | "
    ]
   },
   {
    "cell_type": "markdown",
    "id": "f618f63b-cfff-4b98-ab5a-03a4b5d088a9",
    "metadata": {},
-   "source": []
+   "source": [
+    "- Which clustering method would you use in each of the scenarios below? Why?\n",
+    "- How would you represent the data in each case? \n",
+    "    - Scenario 1: Customer segmentation in retail\n",
+    "    - Scenario 2: An environmental study aiming to identify clusters of a rare plant species\n",
+    "    - Scenario 3: Clustering furniture items for inventory management and customer recommendations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "268ce357-8124-404d-9b14-3896b3d0bc63",
+   "metadata": {},
+   "source": [
+    "- How do we decide the number of clusters? \n",
+    "- What’s the difficulty in evaluation of clustering? How do we evaluate clusters?"
+   ]
   },
   {
    "cell_type": "markdown",
    "id": "fe978d2b-fced-40b2-ad35-b9a358bc03f9",
    "metadata": {},
    "source": [
     "### Recommender systems \n",
-    "- Problem of recommender systems\n",
-    "- Baselines\n",
-    "- "
+    "- What’s the utility matrix?\n",
+    "- How do we evaluate recommender systems?\n",
+    "- What are the baseline models we talked about?\n",
+    "    - Global average\n",
+    "    - Per user average\n",
+    "    - Per item average\n",
+    "- Evaluation of recommender systems\n",
+    "- Compare and contrast KNN Imputer, collaborative filtering, and content-based filtering \n",
+    "- Ethical issues associated with recommender systems "
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "e6c94f08-f636-4cff-ad83-7f385a15a322",
+   "id": "ec9fa25f-946f-4f20-b2b6-95736fa48445",
    "metadata": {},
    "source": [
-    "- Which clustering method in what scenario?"
+    "### Introduction to NLP \n",
+    "\n",
+    "- Embeddings\n",
+    "    - What are different document and word representations we talked about?\n",
+    "    - Why do we care about creating different representations?\n",
+    "    - What are pre-trained models? What are the benefits of using them?\n",
+    "- Topic modeling \n",
+    "    - What is topic modeling? What are the inputs and outputs of topic modeling?\n",
+    "    - How is it different from clustering documents using a clustering model, say KMeans?\n",
+    "- Text preprocessing\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "3514acb8-3fde-4ef5-8a6c-81c1ec2b6237",
+   "id": "5f40b693-aa61-4f69-ab41-1418fccc99cc",
    "metadata": {},
    "source": [
-    "\n",
-    "\n",
-    "- Why clustering and what is the problem of clustering?\n",
-    "- What are the three methods of clustering we talked about?\n",
-    "- What’s the difficulty in evaluation of clustering? How do we evaluate clusters?\n",
-    "- What’s the problem of recommender systems?\n",
-    "- What’s the utility matrix?\n",
-    "- How do we evaluate recommender systems?\n",
-    "- What are the baseline models we talked about?\n",
-    "- What are the two recommender systems methods we talked about?\n",
-    "- What are different document and word representations we talked about?\n",
-    "- What do we care about creating different representations?\n",
-    "- What are pre-trained models? Why are the benefits of using them?\n",
-    "- What is topic modeling? What are the inputs and outputs of topic modeling?\n",
+    "### Multiclass classification and computer vision \n",
     "- What’s the difference between OVR and OVO?\n",
     "- What are the methods we saw to use pre-trained image classification models for our image classification tasks?\n",
-    "- What is time series?\n",
-    "- What’s wrong with using our usual `train_train_split` on time-series data?\n",
-    "- What are lag features?\n",
-    "- How can we forecast into the future?\n",
-    "- What’s wrong with using binary classification models on right censored data?\n",
+    "    - Out of the box\n",
+    "    - Using pre-trained models as feature extractors\n",
+    "    - Fine-tuning pre-trained models for our task (only mentioned) "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ef27986e-521a-487b-84c3-e62f029e3b30",
+   "metadata": {},
+   "source": [
+    "How would you use a pre-trained model in each case below? \n",
+    "- Imagine you want to quickly develop a prototype for an app that can identify different cat breeds from photos. \n",
+    "- Suppose you're working on a project to predict which city in Canada a photo was taken in, based on the landmarks it shows, a task for which there's limited training data available.\n",
+    "- Suppose you're developing a system to diagnose specific types of tumors from MRI scans. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7c3c8d4b-76f2-4278-9232-0f6a28501df0",
+   "metadata": {},
+   "source": [
+    "### Time series\n",
+    "\n",
+    "- When is time series analysis appropriate? \n",
+    "    - Time series analysis is used when there is a temporal aspect in the data.\n",
+    "- **Data splitting**: Data should be split based on time to avoid future data leaking into the training set.\n",
+    "- **Essential questions for Exploratory Data Analysis (EDA)**:\n",
+    "    - What is the frequency of data collection (e.g., hourly, daily)?\n",
+    "    - How many time series are present within the dataset?\n",
+    "    - Are there any gaps or missing values in the data?\n",
+    "- **Feature engineering**\n",
+    "    - Derive new features from the date/time column.\n",
+    "    - Encode features appropriately for the chosen model.\n",
+    "    - Create lag features to incorporate past values for prediction.\n",
+    "- **Baseline model approach**: Employ a simple model, such as using today's target value to predict tomorrow's, as a starting point for comparison.\n",
+    "- **Cross-validation method for time series**: In `sklearn`, pass `TimeSeriesSplit` as the `cv` argument to functions like `cross_validate` or `cross_val_score` for time-appropriate validation.\n",
+    "- **Strategies for long-term forecasting**:\n",
+    "    - Generate forecasts for sequential time steps by assuming the predictions for the previous steps are accurate. \n",
+    "- **Trends** \n",
+    "    - A 'days since' feature to capture the trend over time"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba774c9b-7865-4ecc-b2ab-2e1276edf6ea",
+   "metadata": {},
+   "source": [
+    "### Survival analysis \n",
+    "- What is right-censored data?\n",
+    "- What happens when we treat right-censored data the same as \"regular\" data?\n",
+    "    - Predicting churn vs. no churn\n",
+    "    - Predicting tenure\n",
+    "        - Throw away people who haven't churned\n",
+    "        - Assume everyone churns today\n",
+    "- Survival analysis covers predicting both churn and tenure, handles censoring, and can make rich and useful predictions!\n",
+    "    - We can get survival curves which show the probability of survival over time.\n",
+    "    - Kaplan-Meier (KM) model $\\rightarrow$ doesn't look at features\n",
+    "    - Cox proportional hazards (CPH) model $\\rightarrow$ like linear regression, looks at the features and provides a coefficient for each feature"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3514acb8-3fde-4ef5-8a6c-81c1ec2b6237",
+   "metadata": {},
+   "source": [
+    "### Communication \n",
+    "- Why is communication important in ML and Data Science? \n",
     "- What are different principles of good explanation?\n",
     "- What to watch out for when producing or consuming visualizations?"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95151429-493e-4891-8e1a-a0174a260cee",
+   "metadata": {},
+   "source": [
+    "### Ethics\n",
+    "- Bias and fairness"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53f61364-0bf2-4e74-a85a-1c222b123bf7",
+   "metadata": {},
+   "source": [
+    "### Deployment\n",
+    "\n",
+    "- Deploying a model as a web app\n",
+    "- Deploying a model as a REST API"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70a077f4-2362-482b-b65d-5ed815fb6dba",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {

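The time series cell added to `final-review.ipynb` above lists lag features, a "days since" trend feature, a last-value baseline, and `TimeSeriesSplit` as the `cv` argument. A minimal sketch of how those pieces might fit together follows. The synthetic series, the choice of `Ridge`, the three lags, and all variable names are assumptions made for illustration; none of them come from the course notebooks.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Hypothetical daily series: a linear trend plus noise (made-up data).
rng = np.random.default_rng(0)
n_days = 200
y_full = 0.05 * np.arange(n_days) + rng.normal(scale=1.0, size=n_days)

# Lag features: describe day t by the values on days t-1, t-2 and t-3.
n_lags = 3
lags = np.column_stack(
    [y_full[n_lags - k : n_days - k] for k in range(1, n_lags + 1)]
)
y = y_full[n_lags:]

# "Days since start" feature to capture the trend over time.
days_since = np.arange(n_lags, n_days).reshape(-1, 1)
X = np.hstack([lags, days_since])

# Each fold trains on an initial stretch and tests on the stretch right
# after it, so no future rows leak into training.
tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

scores = cross_validate(
    Ridge(), X, y, cv=tscv, scoring="neg_mean_absolute_error"
)
model_mae = -scores["test_score"]

# Baseline: the previous day's value (the lag-1 column) predicts today's.
baseline_mae = np.array(
    [np.mean(np.abs(y[test] - X[test, 0])) for _, test in splits]
)

print("model MAE per fold:   ", np.round(model_mae, 3))
print("baseline MAE per fold:", np.round(baseline_mae, 3))
```

Comparing the two arrays fold by fold shows whether the fitted model beats the last-value baseline on this particular synthetic series; on real data the comparison could go either way.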
diff --git a/genindex.html b/genindex.html
@@ -175,6 +175,8 @@
 <li class="toctree-l1"><a class="reference internal" href="lectures/20_survival-analysis.html">Lecture 20: Survival analysis</a></li>
 <li class="toctree-l1"><a class="reference internal" href="lectures/21_communication.html">Lecture 21: Communication</a></li>
 <li class="toctree-l1"><a class="reference internal" href="lectures/23_deployment-conclusion.html">Lecture 23: Deployment and conclusion</a></li>
+<li class="toctree-l1"><a class="reference internal" href="lectures/final-review.html">Final review guiding questions</a></li>
+<li class="toctree-l1"><a class="reference internal" href="lectures/A-Quick-intro-to-LLMs.html">Bonus: A high-level quick introduction to LLMs</a></li>
 </ul>
 <p aria-level="2" class="caption" role="heading"><span class="caption-text">Demos</span></p>
 <ul class="nav bd-sidenav">

diff --git a/lectures/17_natural-language-processing.html b/lectures/17_natural-language-processing.html
@@ -180,6 +180,8 @@
 <li class="toctree-l1"><a class="reference internal" href="20_survival-analysis.html">Lecture 20: Survival analysis</a></li>
 <li class="toctree-l1"><a class="reference internal" href="21_communication.html">Lecture 21: Communication</a></li>
 <li class="toctree-l1"><a class="reference internal" href="23_deployment-conclusion.html">Lecture 23: Deployment and conclusion</a></li>
+<li class="toctree-l1"><a class="reference internal" href="final-review.html">Final review guiding questions</a></li>
+<li class="toctree-l1"><a class="reference internal" href="A-Quick-intro-to-LLMs.html">Bonus: A high-level quick introduction to LLMs</a></li>
 </ul>
 <p aria-level="2" class="caption" role="heading"><span class="caption-text">Demos</span></p>
 <ul class="nav bd-sidenav">
@@ -503,7 +505,7 @@ <h2>Imports<a class="headerlink" href="#imports" title="Permalink to this headin
 Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
 </pre></div>
 </div>
-<img alt="../_images/13f1dcc7bbd51e751f1492477095ac4c5be28e5882c481cbbeec55b7e4404ade.png" src="../_images/13f1dcc7bbd51e751f1492477095ac4c5be28e5882c481cbbeec55b7e4404ade.png" />
+<img alt="../_images/a705a88012e43ad177c2422afb7dc054441241fbb59c080ab2a8b1e736a59e7e.png" src="../_images/a705a88012e43ad177c2422afb7dc054441241fbb59c080ab2a8b1e736a59e7e.png" />
 </div>
 </div>
 <div class="cell docutils container">