diff --git a/source/acknowledgements-python.md b/source/acknowledgements-python.md
index dc687718..b3518de4 100644
--- a/source/acknowledgements-python.md
+++ b/source/acknowledgements-python.md
@@ -15,11 +15,11 @@ kernelspec:
# Acknowledgments for the Python Edition
-We'd like to thank everyone that has contributed to the development of
+We'd like to thank everyone that has contributed to the development of
[*Data Science: A First Introduction (Python Edition)*](https://ubc-dsci.github.io/introduction-to-datascience-python/).
This is an open source Python translation of the original [*Data Science: A First Introduction*](https://datasciencebook.ca);
-the original focused on the R programming language. Both of these books are
-used to teach DSCI 100, a new introductory data science course
+the original focused on the R programming language. Both of these books are
+used to teach DSCI 100, a new introductory data science course
at the University of British Columbia (UBC).
We will finalize this acknowledgements section after the book is complete!
diff --git a/source/acknowledgements.md b/source/acknowledgements.md
index 82ecc5c7..873a8813 100644
--- a/source/acknowledgements.md
+++ b/source/acknowledgements.md
@@ -15,20 +15,20 @@ kernelspec:
# Acknowledgments
-We'd like to thank everyone that has contributed to the development of
+We'd like to thank everyone that has contributed to the development of
[*Data Science: A First Introduction*](https://datasciencebook.ca).
This is an open source textbook that began as a collection of course readings
-for DSCI 100, a new introductory data science course
+for DSCI 100, a new introductory data science course
at the University of British Columbia (UBC).
-Several faculty members in the UBC Department of Statistics
-were pivotal in shaping the direction of that course,
-and as such, contributed greatly to the broad structure and
+Several faculty members in the UBC Department of Statistics
+were pivotal in shaping the direction of that course,
+and as such, contributed greatly to the broad structure and
list of topics in this book. We would especially like to thank Matías
Salibián-Barrera for his mentorship during the initial development and roll-out
of both DSCI 100 and this book. His door was always open when
we needed to chat about how to best introduce and teach data science to our first-year students.
-We would also like to thank all those who contributed to the process of
+We would also like to thank all those who contributed to the process of
publishing this book. In particular, we would like to thank all of our reviewers for their feedback and suggestions:
Rohan Alexander, Isabella Ghement, Virgilio Gómez Rubio, Albert Kim, Adam Loy, Maria Prokofieva, Emily Riederer, and Greg Wilson.
The book was improved substantially by their insights.
@@ -37,8 +37,8 @@ for his support and encouragement throughout the process, and to
Roger Peng for graciously offering to write the Foreword.
Finally, we owe a debt of gratitude to all of the students of DSCI 100 over the past
-few years. They provided invaluable feedback on the book and worksheets;
-they found bugs for us (and stood by very patiently in class while
+few years. They provided invaluable feedback on the book and worksheets;
+they found bugs for us (and stood by very patiently in class while
we frantically fixed those bugs); and they brought a level of enthusiasm to the class
that sustained us during the hard work of creating a new course and writing a textbook.
Our interactions with them taught us how to teach data science, and that learning
diff --git a/source/appendixA.md b/source/appendixA.md
index 7e57bf72..3909d161 100644
--- a/source/appendixA.md
+++ b/source/appendixA.md
@@ -13,14 +13,14 @@ kernelspec:
name: python3
---
-# Downloading files from JupyterHub
+# Downloading files from JupyterHub
This section will help you
-save your work from a JupyterHub web-based platform to your own computer.
+save your work from a JupyterHub web-based platform to your own computer.
Let's say you want to download everything inside a folder called `your_folder`
in your home directory.
-First open a terminal \index{JupyterHub!file download} by clicking "terminal" in the Launcher tab.
-Next, type the following in the terminal to create a
+First open a terminal \index{JupyterHub!file download} by clicking "terminal" in the Launcher tab.
+Next, type the following in the terminal to create a
compressed `.zip` archive for the work you are interested in downloading:
```
@@ -29,6 +29,6 @@ zip -r hub_folder.zip your_folder
After the compressing process is complete, right-click on `hub_folder.zip`
in the JupyterHub file browser
-and click "Download". After the download is complete, you should be
+and click "Download". After the download is complete, you should be
able to find the `hub_folder.zip` file on your own computer,
and unzip the file (typically by double-clicking on it).
diff --git a/source/clustering.md b/source/clustering.md
index ee64e5de..4d843f4e 100644
--- a/source/clustering.md
+++ b/source/clustering.md
@@ -36,19 +36,19 @@ warnings.filterwarnings("ignore")
```
-## Overview
+## Overview
As part of exploratory data analysis, it is often helpful to see if there are
-meaningful subgroups (or *clusters*) in the data.
-This grouping can be used for many purposes,
-such as generating new questions or improving predictive analyses.
-This chapter provides an introduction to clustering
+meaningful subgroups (or *clusters*) in the data.
+This grouping can be used for many purposes,
+such as generating new questions or improving predictive analyses.
+This chapter provides an introduction to clustering
using the K-means algorithm,
including techniques to choose the number of clusters.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
-* Describe a case where clustering is appropriate,
+* Describe a case where clustering is appropriate,
and what insight it might extract from the data.
* Explain the K-means clustering algorithm.
* Interpret the output of a K-means analysis.
@@ -64,8 +64,8 @@ and what insight it might extract from the data.
```{index} clustering
```
-Clustering is a data analysis task
-involving separating a data set into subgroups of related data.
+Clustering is a data analysis task
+involving separating a data set into subgroups of related data.
For example, we might use clustering to separate a
data set of documents into groups that correspond to topics, a data set of
human genetic information into groups that correspond to ancestral
@@ -73,72 +73,72 @@ subpopulations, or a data set of online customers into groups that correspond
to purchasing behaviors. Once the data are separated, we can, for example,
use the subgroups to generate new questions about the data and follow up with a
predictive modeling exercise. In this book, clustering will be used only for
-exploratory analysis, i.e., uncovering patterns in the data.
+exploratory analysis, i.e., uncovering patterns in the data.
```{index} classification, regression, supervised, unsupervised
```
-Note that clustering is a fundamentally different kind of task
-than classification or regression.
-In particular, both classification and regression are *supervised tasks*
-where there is a *response variable* (a category label or value),
-and we have examples of past data with labels/values
-that help us predict those of future data.
-By contrast, clustering is an *unsupervised task*,
-as we are trying to understand
-and examine the structure of data without any response variable labels
-or values to help us.
-This approach has both advantages and disadvantages.
-Clustering requires no additional annotation or input on the data.
-For example, it would be nearly impossible to annotate
-all the articles on Wikipedia with human-made topic labels.
-However, we can still cluster the articles without this information
-to find groupings corresponding to topics automatically.
+Note that clustering is a fundamentally different kind of task
+than classification or regression.
+In particular, both classification and regression are *supervised tasks*
+where there is a *response variable* (a category label or value),
+and we have examples of past data with labels/values
+that help us predict those of future data.
+By contrast, clustering is an *unsupervised task*,
+as we are trying to understand
+and examine the structure of data without any response variable labels
+or values to help us.
+This approach has both advantages and disadvantages.
+Clustering requires no additional annotation or input on the data.
+For example, it would be nearly impossible to annotate
+all the articles on Wikipedia with human-made topic labels.
+However, we can still cluster the articles without this information
+to find groupings corresponding to topics automatically.
Given that there is no response variable, it is not as easy to evaluate
the "quality" of a clustering. With classification, we can use a test data set
to assess prediction performance. In clustering, there is not a single good
choice for evaluation. In this book, we will use visualization to ascertain the
quality of a clustering, and leave rigorous evaluation for more advanced
-courses.
+courses.
```{index} K-means
```
-As in the case of classification,
-there are many possible methods that we could use to cluster our observations
-to look for subgroups.
-In this book, we will focus on the widely used K-means algorithm {cite:p}`kmeans`.
+As in the case of classification,
+there are many possible methods that we could use to cluster our observations
+to look for subgroups.
+In this book, we will focus on the widely used K-means algorithm {cite:p}`kmeans`.
In your future studies, you might encounter hierarchical clustering,
-principal component analysis, multidimensional scaling, and more;
-see the additional resources section at the end of this chapter
+principal component analysis, multidimensional scaling, and more;
+see the additional resources section at the end of this chapter
for where to begin learning more about these other methods.
```{index} semisupervised
```
-> **Note:** There are also so-called *semisupervised* tasks,
-> where only some of the data come with response variable labels/values,
-> but the vast majority don't.
-> The goal is to try to uncover underlying structure in the data
-> that allows one to guess the missing labels.
-> This sort of task is beneficial, for example,
-> when one has an unlabeled data set that is too large to manually label,
-> but one is willing to provide a few informative example labels as a "seed"
+> **Note:** There are also so-called *semisupervised* tasks,
+> where only some of the data come with response variable labels/values,
+> but the vast majority don't.
+> The goal is to try to uncover underlying structure in the data
+> that allows one to guess the missing labels.
+> This sort of task is beneficial, for example,
+> when one has an unlabeled data set that is too large to manually label,
+> but one is willing to provide a few informative example labels as a "seed"
> to guess the labels for all the data.
-**An illustrative example**
+**An illustrative example**
```{index} Palmer penguins
```
Here we will present an illustrative example using a data set from
-[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) {cite:p}`palmerpenguins`. This
+[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) {cite:p}`palmerpenguins`. This
data set was collected by Dr. Kristen Gorman and
the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
measurements for adult penguins found near there {cite:p}`penguinpaper`. We have
modified the data set for use in this chapter. Here we will focus on using two
-variables—penguin bill and flipper length, both in millimeters—to determine whether
+variables—penguin bill and flipper length, both in millimeters—to determine whether
there are distinct types of penguins in our data.
Understanding this might help us with species discovery and classification in a data-driven
way.
@@ -151,18 +151,18 @@ name: 09-penguins
Gentoo penguin.
```
-To learn about K-means clustering
+To learn about K-means clustering
we will work with `penguin_data` in this chapter.
-`penguin_data` is a subset of 18 observations of the original data,
-which has already been standardized
-(remember from Chapter {ref}`classification`
-that scaling is part of the standardization process).
-We will discuss scaling for K-means in more detail later in this chapter.
+`penguin_data` is a subset of 18 observations of the original data,
+which has already been standardized
+(remember from Chapter {ref}`classification`
+that scaling is part of the standardization process).
+We will discuss scaling for K-means in more detail later in this chapter.
Before we get started, we will set a random seed.
This will ensure that our analysis will be reproducible.
-As we will learn in more detail later in the chapter,
-setting the seed here is important
+As we will learn in more detail later in the chapter,
+setting the seed here is important
because the K-means clustering algorithm uses random numbers.
```{index} seed; numpy.random.seed
@@ -191,7 +191,7 @@ penguin_data
```
-Next, we can create a scatter plot using this data set
+Next, we can create a scatter plot using this data set
to see if we can detect subtypes or groups in our data set.
```{code-cell} ipython3
@@ -213,7 +213,7 @@ glue('scatter_plot', scatter_plot, display=True)
```
:::{glue:figure} scatter_plot
-:figwidth: 700px
+:figwidth: 700px
:name: scatter_plot
Scatter plot of standardized bill length versus standardized flipper length.
@@ -222,8 +222,8 @@ Scatter plot of standardized bill length versus standardized flipper length.
```{index} altair, altair; mark_circle
```
-Based on the visualization
-in {numref}`scatter_plot`,
+Based on the visualization
+in {numref}`scatter_plot`,
we might suspect there are a few subtypes of penguins within our data set.
We can see roughly 3 groups of observations in {numref}`scatter_plot`,
including:
@@ -236,17 +236,17 @@ including:
```
Data visualization is a great tool to give us a rough sense of such patterns
-when we have a small number of variables.
-But if we are to group data—and select the number of groups—as part of
+when we have a small number of variables.
+But if we are to group data—and select the number of groups—as part of
a reproducible analysis, we need something a bit more automated.
-Additionally, finding groups via visualization becomes more difficult
+Additionally, finding groups via visualization becomes more difficult
as we increase the number of variables we consider when clustering.
-The way to rigorously separate the data into groups
+The way to rigorously separate the data into groups
is to use a clustering algorithm.
-In this chapter, we will focus on the *K-means* algorithm,
-a widely used and often very effective clustering method,
-combined with the *elbow method*
-for selecting the number of clusters.
+In this chapter, we will focus on the *K-means* algorithm,
+a widely used and often very effective clustering method,
+combined with the *elbow method*
+for selecting the number of clusters.
This procedure will separate the data into groups;
{numref}`colored_scatter_plot` shows these groups
denoted by colored scatter points.
@@ -276,7 +276,7 @@ glue('colored_scatter_plot', colored_scatter_plot, display=True)
```
:::{glue:figure} colored_scatter_plot
-:figwidth: 700px
+:figwidth: 700px
:name: colored_scatter_plot
Scatter plot of standardized bill length versus standardized flipper length with colored groups.
@@ -290,7 +290,7 @@ where we can easily visualize the clusters on a scatter plot, we can give
human-made labels to the groups using their positions on
the plot:
-- small flipper length and small bill length (orange cluster),
+- small flipper length and small bill length (orange cluster),
- small flipper length and large bill length (blue cluster),
- and large flipper length and large bill length (yellow cluster).
@@ -298,9 +298,9 @@ Once we have made these determinations, we can use them to inform our species
classifications or ask further questions about our data. For example, we might
be interested in understanding the relationship between flipper length and bill
length, and that relationship may differ depending on the type of penguin we
-have.
+have.
-## K-means
+## K-means
### Measuring cluster quality
@@ -319,11 +319,11 @@ The K-means algorithm is a procedure that groups data into K clusters.
It starts with an initial clustering of the data, and then iteratively
improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
-the "quality" of a clustering, and what does it mean to improve it?
+the "quality" of a clustering, and what does it mean to improve it?
In K-means clustering, we measure the quality of a cluster by its
*within-cluster sum-of-squared-distances* (WSSD), also called *inertia*. Computing this involves two steps.
-First, we find the cluster centers by computing the mean of each variable
-over data points in the cluster. For example, suppose we have a
+First, we find the cluster centers by computing the mean of each variable
+over data points in the cluster. For example, suppose we have a
cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
Then we would compute the coordinates, $\mu_x$ and $\mu_y$, of the cluster center via
@@ -345,8 +345,8 @@ glue("mean_bill_len_std_glue", mean_bill_len_std)
```
-In the first cluster from the example, there are {glue:}`clus_rows_glue` data points. These are shown with their cluster center
-(flipper_length_standardized = {glue:}`mean_flipper_len_std_glue` and bill_length_standardized = {glue:}`mean_bill_len_std_glue`) highlighted
+In the first cluster from the example, there are {glue:}`clus_rows_glue` data points. These are shown with their cluster center
+(flipper_length_standardized = {glue:}`mean_flipper_len_std_glue` and bill_length_standardized = {glue:}`mean_bill_len_std_glue`) highlighted
in {numref}`toy-example-clus1-center-1`
@@ -361,18 +361,18 @@ Cluster 1 from the penguin_data data set example. Observations are in blue, with
```{index} distance; K-means
```
-The second step in computing the WSSD is to add up the squared distance
-between each point in the cluster
+The second step in computing the WSSD is to add up the squared distance
+between each point in the cluster
and the cluster center.
-We use the straight-line / Euclidean distance formula
+We use the straight-line / Euclidean distance formula
that we learned about in Chapter {ref}`classification`.
-In the {glue:}`clus_rows_glue`-observation cluster example above,
+In the {glue:}`clus_rows_glue`-observation cluster example above,
we would compute the WSSD $S^2$ via
$S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right) + \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right)$
-These distances are denoted by lines in {numref}`toy-example-clus1-dists-1` for the first cluster of the penguin data example.
+These distances are denoted by lines in {numref}`toy-example-clus1-dists-1` for the first cluster of the penguin data example.
```{figure} img/toy-example-clus1-dists-1.png
---
@@ -385,9 +385,9 @@ Cluster 1 from the penguin_data data set example. Observations are in blue, with
The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
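
To make the WSSD computation concrete, here is a minimal sketch for a single cluster using `pandas`; the four observations below are hypothetical standardized values rather than actual rows of `penguin_data`.

```{code-cell} ipython3
import pandas as pd

# a hypothetical cluster of four standardized observations
cluster = pd.DataFrame({
    "flipper_length_standardized": [-0.4, -0.2, 0.1, 0.3],
    "bill_length_standardized": [0.2, -0.1, 0.0, 0.4],
})

# step 1: the cluster center is the mean of each variable over the cluster
center = cluster.mean()

# step 2: the WSSD is the sum of squared distances from each point to the center
wssd = ((cluster - center) ** 2).sum().sum()
wssd
```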
-After we have calculated the WSSD for all the clusters,
+After we have calculated the WSSD for all the clusters,
we sum them together to get the *total WSSD*.
-For our example,
+For our example,
this means adding up all the squared distances for the 18 observations.
These distances are denoted by black lines in
{numref}`toy-example-all-clus-dists-1`
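
As a sketch of this summation, suppose we had a small data frame with a hypothetical `cluster` column recording each observation's cluster assignment (the values below are made up for illustration); the total WSSD is then the sum of the per-cluster WSSDs:

```{code-cell} ipython3
import pandas as pd

# illustrative only: a tiny labeled data set with a hypothetical cluster column
labeled_data = pd.DataFrame({
    "flipper_length_standardized": [-0.9, -0.7, 0.1, 0.3, 1.0, 1.2],
    "bill_length_standardized": [-0.8, -0.6, 0.9, 1.1, 0.2, 0.4],
    "cluster": [0, 0, 1, 1, 2, 2],
})

def wssd(group):
    # sum of squared distances from each point in the group to the group's center
    return ((group - group.mean()) ** 2).sum().sum()

# total WSSD: sum the WSSD of each cluster
total_wssd = (
    labeled_data
    .groupby("cluster")[["flipper_length_standardized", "bill_length_standardized"]]
    .apply(wssd)
    .sum()
)
total_wssd
```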
@@ -407,8 +407,8 @@ All clusters from the penguin_data data set example. Observations are in orange,
```{index} K-means; algorithm
```
-We begin the K-means algorithm by picking K,
-and randomly assigning a roughly equal number of observations
+We begin the K-means algorithm by picking K,
+and randomly assigning a roughly equal number of observations
to each of the K clusters.
An example random initialization is shown in {numref}`toy-kmeans-init-1`
@@ -433,10 +433,10 @@ sum of WSSDs over all the clusters, i.e., the *total WSSD*:
2. **Label update:** Reassign each data point to the cluster with the nearest center.
These two steps are repeated until the cluster assignments no longer change.
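
To make these two steps concrete, here is a minimal from-scratch sketch of the center-update/label-update loop. Later in the chapter we will use `scikit-learn`'s `KMeans` to do this work for us, so this function is purely illustrative; its name, arguments, and defaults are made up for demonstration.

```{code-cell} ipython3
import numpy as np

def simple_kmeans(data, k, max_iter=100, seed=1):
    """Illustrative K-means: alternate center and label updates until convergence."""
    rng = np.random.default_rng(seed)
    X = np.asarray(data, dtype=float)
    # random initialization: assign each observation to one of the K clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # center update: the mean of each variable over the points in each cluster
        centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c)
            else X[rng.integers(len(X))]  # reseed an empty cluster with a random point
            for c in range(k)
        ])
        # label update: reassign each point to the cluster with the nearest center
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
    return labels, centers
```

Calling `simple_kmeans(penguin_data, k=3)`, for instance, would return a cluster label for each of the 18 observations along with the three cluster centers.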
-We show what the first four iterations of K-means would look like in
-{numref}`toy-kmeans-iter-1`
+We show what the first four iterations of K-means would look like in
+{numref}`toy-kmeans-iter-1`.
There each row corresponds to an iteration,
-where the left column depicts the center update,
+where the left column depicts the center update,
and the right column depicts the reassignment of data to clusters.
```{figure} img/toy-kmeans-iter-1.png
@@ -461,15 +461,15 @@ in the fourth iteration; both the centers and labels will remain the same from t
> ways to assign the data to clusters. So at some point, the total WSSD must stop decreasing, which means none of the assignments
> are changing, and the algorithm terminates.
-What kind of data is suitable for K-means clustering?
+What kind of data is suitable for K-means clustering?
In the simplest version of K-means clustering that we have presented here,
-the straight-line distance is used to measure the
-distance between observations and cluster centers.
+the straight-line distance is used to measure the
+distance between observations and cluster centers.
This means that only quantitative data should be used with this algorithm.
-There are variants on the K-means algorithm,
-as well as other clustering algorithms entirely,
-that use other distance metrics
-to allow for non-quantitative data to be clustered.
+There are variants on the K-means algorithm,
+as well as other clustering algorithms entirely,
+that use other distance metrics
+to allow for non-quantitative data to be clustered.
These, however, are beyond the scope of this book.
### Random restarts
@@ -508,15 +508,15 @@ and pick the clustering that has the lowest final total WSSD.
### Choosing K
-In order to cluster data using K-means,
+In order to cluster data using K-means,
we also have to pick the number of clusters, K.
-But unlike in classification, we have no response variable
+But unlike in classification, we have no response variable
and cannot perform cross-validation with some measure of model prediction error.
Further, if K is chosen too small, then multiple clusters get grouped together;
-if K is too large, then clusters get subdivided.
-In both cases, we will potentially miss interesting structure in the data.
-{numref}`toy-kmeans-vary-k-1` illustrates the impact of K
-on K-means clustering of our penguin flipper and bill length data
+if K is too large, then clusters get subdivided.
+In both cases, we will potentially miss interesting structure in the data.
+{numref}`toy-kmeans-vary-k-1` illustrates the impact of K
+on K-means clustering of our penguin flipper and bill length data
by showing the different clusterings for K's ranging from 1 to 9.
```{figure} img/toy-kmeans-vary-k-1.png
@@ -530,11 +530,11 @@ Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster cente
```{index} elbow method
```
-If we set K less than 3, then the clustering merges separate groups of data; this causes a large
-total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
-the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
-decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
-clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") when we reach roughly
+If we set K less than 3, then the clustering merges separate groups of data; this causes a large
+total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
+the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
+decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
+clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") when we reach roughly
the right number of clusters ({numref}`toy-kmeans-elbow-1`).
```{figure} img/toy-kmeans-elbow-1.png
@@ -550,22 +550,22 @@ Total WSSD for K clusters ranging from 1 to 9.
```{index} pair: standardization; K-means
```
-Similar to K-nearest neighbors classification and regression, K-means
-clustering uses straight-line distance to decide which points are similar to
+Similar to K-nearest neighbors classification and regression, K-means
+clustering uses straight-line distance to decide which points are similar to
each other. Therefore, the *scale* of each of the variables in the data
will influence which cluster data points end up being assigned.
-Variables with a large scale will have a much larger
-effect on deciding cluster assignment than variables with a small scale.
+Variables with a large scale will have a much larger
+effect on deciding cluster assignment than variables with a small scale.
To address this problem, we typically standardize our data before clustering,
which ensures that each variable has a mean of 0 and standard deviation of 1.
-The `StandardScaler()` function in Python can be used to do this.
-We show an example of how to use this function
+The `StandardScaler()` function from the `scikit-learn` package can be used to do this.
+We show an example of how to use this function
below using an unscaled and unstandardized version of the data set in this chapter.
```{code-cell} ipython3
:tags: ["remove-cell"]
-unstandardized_data = pd.read_csv("data/toy_penguins.csv", usecols=["bill_length_mm", "flipper_length_mm"])
+unstandardized_data = pd.read_csv("data/toy_penguins.csv", usecols=["bill_length_mm", "flipper_length_mm"])
unstandardized_data.to_csv("data/penguins_not_standardized.csv", index=False)
unstandardized_data
```
@@ -579,7 +579,7 @@ not_standardized_data = pd.read_csv("data/penguins_not_standardized.csv")
not_standardized_data
```
-And then we apply the `StandardScaler()` function to both the columns in the data frame
+And then we apply the `StandardScaler()` function to both columns in the data frame
using `fit_transform()`:
@@ -588,7 +588,7 @@ using `fit_transform()`
scaler = StandardScaler()
standardized_data = pd.DataFrame(
scaler.fit_transform(not_standardized_data), columns = ['bill_length_mm', 'flipper_length_mm'])
-
+
standardized_data
```
@@ -622,9 +622,9 @@ print(f"Cluster labels : {penguin_clust.labels_}")
```{index} K-means; inertia_, K-means; cluster_centers_, K-means; labels_, K-means; predict
```
-As you can see above, the clustering object is returned by `KMeans`
+As you can see above, the clustering object returned by `KMeans`
has a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
-To obtain the information in the clustering object, we will call the `predict` function. (We can also call the `labels_` attribute)
+To obtain the cluster assignments from the clustering object, we will call the `predict` function (we can also access the `labels_` attribute).
```{code-cell} ipython3
predictions = penguin_clust.predict(standardized_data)
@@ -634,7 +634,7 @@ predictions
Let's start by visualizing the clustering
as a colored scatter plot. To do that,
-we will add a new column and store assign the above predictions to that. The final
+we will add a new column and assign the above predictions to it. The final
data frame will contain the data and the cluster assignments for
each point:
@@ -665,8 +665,8 @@ cluster_plot = (
glue('cluster_plot', cluster_plot, display=True)
```
-:::{glue:figure} cluster_plot
-:figwidth: 700px
+:::{glue:figure} cluster_plot
+:figwidth: 700px
:name: cluster_plot
The data colored by the cluster assignments returned by K-means.
@@ -679,7 +679,7 @@ The data colored by the cluster assignments returned by K-means.
```
As mentioned above, we also need to select K by finding
-where the "elbow" occurs in the plot of total WSSD versus the number of clusters.
+where the "elbow" occurs in the plot of total WSSD versus the number of clusters.
We can obtain the total WSSD (inertia) from our
clustering using the `.inertia_` attribute. For example:
@@ -689,7 +689,7 @@ penguin_clust.inertia_
To calculate the total WSSD for a variety of Ks, we will
create a data frame with a column named `k` with rows containing
-each value of K we want to run K-means with (here, 1 to 9).
+each value of K we want to run K-means with (here, 1 to 9).
```{code-cell} ipython3
import numpy as np
@@ -699,8 +699,8 @@ penguin_clust_ks = pd.DataFrame({"k": np.array(range(1, 10)).transpose()})
```{index} pandas.DataFrame; assign
```
-Then we use `assign()` to create a new column and `lambda` operator to apply the `KMeans` function
-within each row to each K.
+Then we use `assign()` to create a new column, using a `lambda` function to apply `KMeans`
+to the value of K in each row.
```{code-cell} ipython3
np.random.seed(12)
@@ -711,8 +711,8 @@ penguin_clust_ks = penguin_clust_ks.assign(
)
```
-If we take a look at our data frame `penguin_clust_ks` now,
-we see that it has two columns: one with the value for K,
+If we take a look at our data frame `penguin_clust_ks` now,
+we see that it has two columns: one with the value for K,
and the other holding the clustering model objects.
```{code-cell} ipython3
@@ -733,10 +733,10 @@ penguin_clust_ks.iloc[1]['penguin_clusts']
penguin_clust_ks.iloc[1]['penguin_clusts'].inertia_
```
-Next, we use `assign` again to add 2 new columns `inertia` and `n_iter`
-to each of the K-means clustering objects to get the clustering statistics
+Next, we use `assign` again to add 2 new columns, `inertia` and `n_iter`,
+holding the clustering statistics extracted from each of the K-means clustering objects.
-This results in a data frame with 4 columns, one for K, one for the
+This results in a data frame with 4 columns, one for K, one for the
K-means clustering objects, and 2 for the clustering statistics:
```{code-cell} ipython3
@@ -745,11 +745,11 @@ penguin_clust_ks = penguin_clust_ks.assign(
n_iter=penguin_clust_ks["penguin_clusts"].apply(lambda x: x.n_iter_)
)
-
+
penguin_clust_ks
```
-Now that we have `inertia` and `k` as columns in a data frame, we can make a line plot
+Now that we have `inertia` and `k` as columns in a data frame, we can make a line plot
({numref}`elbow_plot`) and search for the "elbow" to find which value of K to use. We will drop the column `penguin_clusts` to make the plotting in altair feasible.
```{code-cell} ipython3
@@ -776,7 +776,7 @@ glue('elbow_plot', elbow_plot, display=True)
```
:::{glue:figure} elbow_plot
-:figwidth: 700px
+:figwidth: 700px
:name: elbow_plot
A plot showing the total WSSD versus the number of clusters.
@@ -786,14 +786,14 @@ A plot showing the total WSSD versus the number of clusters.
```
It looks like 3 clusters is the right choice for this data.
-But why is there a "bump" in the total WSSD plot here?
-Shouldn't total WSSD always decrease as we add more clusters?
-Technically yes, but remember: K-means can get "stuck" in a bad solution.
+But why is there a "bump" in the total WSSD plot here?
+Shouldn't total WSSD always decrease as we add more clusters?
+Technically yes, but remember: K-means can get "stuck" in a bad solution.
Unfortunately, for K = 7 we had an unlucky initialization
-and found a bad clustering!
-We can help prevent finding a bad clustering
+and found a bad clustering!
+We can help prevent finding a bad clustering
by removing the `init='random'` argument from `KMeans`.
-The default value for `init` argument is `k-means++`, which selects
+The default value of the `init` argument is `k-means++`, which selects
initial cluster centers for K-means clustering in a smart way to speed up convergence.
The more times we perform K-means clustering,
@@ -834,8 +834,8 @@ elbow_plot=(
glue('elbow_plot2', elbow_plot, display=True)
```
-:::{glue:figure} elbow_plot2
-:figwidth: 700px
+:::{glue:figure} elbow_plot2
+:figwidth: 700px
:name: elbow_plot2
A plot showing the total WSSD versus the number of clusters when K-means is run without the `init` argument.
@@ -843,8 +843,8 @@ A plot showing the total WSSD versus the number of clusters when K-means is run
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
in the "Clustering" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/index.md b/source/index.md
index be402176..3d2c1f1e 100644
--- a/source/index.md
+++ b/source/index.md
@@ -23,7 +23,7 @@ the top left of the page.
For the R version of the textbook, please visit https://datasciencebook.ca.
You can purchase a PDF or print copy of the R version of the book
-on the [CRC Press website](https://www.routledge.com/Data-Science-A-First-Introduction/Timbers-Campbell-Lee/p/book/9780367524685) or
+on the [CRC Press website](https://www.routledge.com/Data-Science-A-First-Introduction/Timbers-Campbell-Lee/p/book/9780367524685) or
on [Amazon](https://www.amazon.com/Data-Science-First-Introduction-Chapman/dp/0367532174/ref=sr_[…]qid=1644637450&sprefix=data+science+timber%2Caps%2C166&sr=8-1).
diff --git a/source/inference.md b/source/inference.md
index 19103e26..bb1d78d4 100644
--- a/source/inference.md
+++ b/source/inference.md
@@ -38,9 +38,9 @@ analysis questions regarding how summaries, patterns, trends, or relationships
in a data set extend to the wider population are called *inferential
questions*. This chapter will start with the fundamental ideas of sampling from
populations and then introduce two common techniques in statistical inference:
-*point estimation* and *interval estimation*.
+*point estimation* and *interval estimation*.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
* Describe real-world examples of questions that can be answered with statistical inference.
@@ -56,7 +56,7 @@ By the end of the chapter, readers will be able to do the following:
+++
-## Why do we need sampling?
+## Why do we need sampling?
We often need to understand how quantities we observe in a subset
of data relate to the same quantities in the broader population. For example, suppose a
retailer is considering selling iPhone accessories, and they want to estimate
@@ -79,7 +79,7 @@ general, a population parameter is a numerical characteristic of the entire
population. To compute this number in the example above, we would need to ask
every single undergraduate in North America whether they own an iPhone. In
practice, directly computing population parameters is often time-consuming and
-costly, and sometimes impossible.
+costly, and sometimes impossible.
```{index} sample, sample; estimate, inference
```
@@ -87,17 +87,17 @@ costly, and sometimes impossible.
```{index} see: statistical inference; inference
```
-A more practical approach would be to make measurements for a **sample**, i.e., a
+A more practical approach would be to make measurements for a **sample**, i.e., a
subset of individuals collected from the population. We can then compute a
-**sample estimate**—a numerical characteristic of the sample—that
+**sample estimate**—a numerical characteristic of the sample—that
estimates the population parameter. For example, suppose we randomly selected
ten undergraduate students across North America (the sample) and computed the
proportion of those students who own an iPhone (the sample estimate). In that
case, we might suspect that proportion is a reasonable estimate of the
-proportion of students who own an iPhone in the entire population.
+proportion of students who own an iPhone in the entire population.
{numref}`fig:11-population-vs-sample` illustrates this process.
In general, the process of using a sample to make a conclusion about the
-broader population from which it is taken is referred to as **statistical inference**.
+broader population from which it is taken is referred to as **statistical inference**.
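
To make the distinction concrete, the short sketch below simulates a hypothetical population of iPhone ownership indicators and compares the population proportion to the proportion computed from a single random sample of ten people; the population size and true ownership rate here are made up purely for illustration.

```{code-cell} ipython3
import numpy as np

np.random.seed(1)

# hypothetical population: 1 means the student owns an iPhone, 0 means they do not
population = np.random.binomial(n=1, p=0.6, size=100_000)
population_parameter = population.mean()

# one random sample of 10 students, and the corresponding sample estimate
sample = np.random.choice(population, size=10, replace=False)
sample_estimate = sample.mean()

population_parameter, sample_estimate
```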
+++
@@ -113,7 +113,7 @@ Note that proportions are not the *only* kind of population parameter we might
be interested in. For example, suppose an undergraduate student studying at the University
of British Columbia in Canada is looking for an apartment
to rent. They need to create a budget, so they want to know something about
-studio apartment rental prices in Vancouver, BC. This student might
+studio apartment rental prices in Vancouver, BC. This student might
formulate the following question:
*What is the average price-per-month of studio apartment rentals in Vancouver, Canada?*
@@ -147,13 +147,13 @@ focus on two settings:
```{index} Airbnb
```
-We will look at an example using data from
+We will look at an example using data from
[Inside Airbnb](http://insideairbnb.com/) {cite:p}`insideairbnb`. Airbnb is an online
marketplace for arranging vacation rentals and places to stay. The data set
contains listings for Vancouver, Canada, in September 2020. Our data
includes an ID number, neighborhood, type of room, the number of people the
rental accommodates, number of bathrooms, bedrooms, beds, and the price per
-night.
+night.
```{code-cell} ipython3
@@ -952,18 +952,18 @@ distribution is centered at the population mean. Second, increasing the size of
the sample decreases the spread (i.e., the variability) of the sampling
distribution. Therefore, a larger sample size results in a more reliable point
estimate of the population parameter. And third, the distribution of the sample
-mean is roughly bell-shaped.
+mean is roughly bell-shaped.
> **Note:** You might notice that in the `n = 20` case in {numref}`fig:11-example-means7`,
> the distribution is not *quite* bell-shaped. There is a bit of skew towards the right!
> You might also notice that in the `n = 50` case and larger, that skew seems to disappear.
-> In general, the sampling distribution—for both means and proportions—only
+> In general, the sampling distribution—for both means and proportions—only
> becomes bell-shaped *once the sample size is large enough*.
-> How large is "large enough?" Unfortunately, it depends entirely on the problem at hand. But
+> How large is "large enough"? Unfortunately, it depends entirely on the problem at hand. But
> as a rule of thumb, often a sample size of at least 20 will suffice.
-
+++
@@ -980,7 +980,7 @@ mean is roughly bell-shaped.
+++
-### Overview
+### Overview
*Why all this emphasis on sampling distributions?*
@@ -1004,15 +1004,15 @@ estimate.
```
Unfortunately, we cannot construct the exact sampling distribution without
-full access to the population. However, if we could somehow *approximate* what
-the sampling distribution would look like for a sample, we could
+full access to the population. However, if we could somehow *approximate* what
+the sampling distribution would look like for a sample, we could
use that approximation to then report how uncertain our sample
point estimate is (as we did above with the *exact* sampling
-distribution). There are several methods to accomplish this; in this book, we
-will use the *bootstrap*. We will discuss **interval estimation** and
+distribution). There are several methods to accomplish this; in this book, we
+will use the *bootstrap*. We will discuss **interval estimation** and
construct
-**confidence intervals** using just a single sample from a population. A
-confidence interval is a range of plausible values for our population parameter.
+**confidence intervals** using just a single sample from a population. A
+confidence interval is a range of plausible values for our population parameter.
Here is the key idea. First, if you take a big enough sample, it *looks like*
the population. Notice the histograms' shapes for samples of different sizes
@@ -1078,17 +1078,17 @@ In the previous section, we took many samples of the same size *from our
population* to get a sense of the variability of a sample estimate. But if our
sample is big enough that it looks like our population, we can pretend that our
sample *is* the population, and take more samples (with replacement) of the
-same size from it instead! This very clever technique is
+same size from it instead! This very clever technique is
called **the bootstrap**. Note that by taking many samples from our single, observed
sample, we do not obtain the true sampling distribution, but rather an
-approximation that we call **the bootstrap distribution**.
+approximation that we call **the bootstrap distribution**.
> **Note:** We must sample *with* replacement when using the bootstrap.
> Otherwise, if we had a sample of size $n$, and obtained a sample from it of
> size $n$ *without* replacement, it would just return our original sample!
This section will explore how to create a bootstrap distribution from a single
-sample using Python. The process is visualized in {numref}`fig:11-intro-bootstrap-image`.
+sample using Python. The process is visualized in {numref}`fig:11-intro-bootstrap-image`.
For a sample of size $n$, you would do the following:
+++
@@ -1111,10 +1111,10 @@ Overview of the bootstrap process.
+++
-### Bootstrapping in Python
+### Bootstrapping in Python
Let’s continue working with our Airbnb example to illustrate how we might create
-and use a bootstrap distribution using just a single sample from the population.
+and use a bootstrap distribution using just a single sample from the population.
Once again, suppose we are
interested in estimating the population mean price per night of all Airbnb
listings in Vancouver, Canada, using a single sample size of 40.
@@ -1164,7 +1164,7 @@ this sample and estimate are the only data we can work with.
```
We now perform steps 1–5 listed above to generate a single bootstrap
-sample in Python and calculate a point estimate from that bootstrap sample. We will
+sample in Python and calculate a point estimate from that bootstrap sample. We will
use the `resample` function from the `scikit-learn` package. Critically, note that we now
pass `one_sample`—our single sample of size 40—as the first argument.
And since we need to sample with replacement,
@@ -1207,7 +1207,7 @@ that our single sample is close to the population, and we are trying to
mimic drawing another sample from the population by drawing one from our original
sample.
-Let's now take 20,000 bootstrap samples from the original sample (`one_sample`)
+Let's now take 20,000 bootstrap samples from the original sample (`one_sample`)
using `resample`, and calculate the means for
each of those replicates. Recall that this assumes that `one_sample` *looks like*
our original population; but since we do not have access to the population itself,
@@ -1306,7 +1306,7 @@ Distribution of the bootstrap sample means.
+++
-Let's compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with
+Let's compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with
the true sampling distribution—which corresponds to taking many samples from the population.
```{code-cell} ipython3
@@ -1396,18 +1396,18 @@ glue("one_sample_mean", round(one_sample["price"].mean(), 2))
```{index} sampling distribution; compared to bootstrap distribution
```
-There are two essential points that we can take away from
+There are two essential points that we can take away from
{numref}`fig:11-bootstrapping6`. First, the shape and spread of the true sampling
distribution and the bootstrap distribution are similar; the bootstrap
distribution lets us get a sense of the point estimate's variability. The
second important point is that the means of these two distributions are
-different. The sampling distribution is centered at
+different. The sampling distribution is centered at
\${glue:}`population_mean`, the population mean value. However, the bootstrap
-distribution is centered at the original sample's mean price per night,
+distribution is centered at the original sample's mean price per night,
\${glue:}`one_sample_mean`. Because we are resampling from the
original sample repeatedly, we see that the bootstrap distribution is centered
at the original sample's mean value (unlike the sampling distribution of the
-sample mean, which is centered at the population parameter value).
+sample mean, which is centered at the population parameter value).
{numref}`fig:11-bootstrapping7` summarizes the bootstrapping process.
The idea here is that we can use this distribution of bootstrap sample means to
@@ -1430,13 +1430,13 @@ Summary of bootstrapping process.
+++
-### Using the bootstrap to calculate a plausible range
+### Using the bootstrap to calculate a plausible range
```{index} confidence interval
```
Now that we have constructed our bootstrap distribution, let's use it to create
-an approximate 95\% percentile bootstrap confidence interval.
+an approximate 95\% percentile bootstrap confidence interval.
A **confidence interval** is a range of plausible values for the population parameter. We will
find the range of values covering the middle 95\% of the bootstrap
distribution, giving us a 95\% confidence interval. You may be wondering, what
@@ -1458,7 +1458,7 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol
+++
-1. Arrange the observations in the bootstrap distribution in ascending order.
+1. Arrange the observations in the bootstrap distribution in ascending order.
2. Find the value such that 2.5\% of observations fall below it (the 2.5\% percentile). Use that value as the lower bound of the interval.
3. Find the value such that 97.5\% of observations fall below it (the 97.5\% percentile). Use that value as the upper bound of the interval.
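
As a minimal sketch of these three steps, assume the bootstrap sample means are stored in a one-dimensional array; the synthetic `boot_means` below stands in for the bootstrap means computed earlier, and the variable names here are illustrative rather than the ones used in the chapter's own code.

```{code-cell} ipython3
import numpy as np

# synthetic stand-in for the bootstrap distribution of sample means
boot_means = np.random.default_rng(1).normal(loc=150, scale=10, size=20_000)

# the 2.5% and 97.5% percentiles give the lower and upper bounds of the interval;
# numpy sorts the values internally, which covers step 1
np.percentile(boot_means, [2.5, 97.5])
```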
@@ -1540,7 +1540,7 @@ Distribution of the bootstrap sample means with percentile lower and upper bound
To finish our estimation of the population parameter, we would report the point
estimate and our confidence interval's lower and upper bounds. Here the sample
-mean price-per-night of 40 Airbnb listings was
+mean price-per-night of 40 Airbnb listings was
\${glue:}`one_sample_mean`, and we are 95\% "confident" that the true
population mean price-per-night for all Airbnb listings in Vancouver is between
\$({glue:}`ci_lower`, {glue:}`ci_upper`).
@@ -1562,8 +1562,8 @@ statistical techniques you may learn about in the future!
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
in the two "Statistical inference" rows.
You can launch an interactive version of each worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/jupyter.md b/source/jupyter.md
index bff3eaed..ab413ce0 100644
--- a/source/jupyter.md
+++ b/source/jupyter.md
@@ -14,28 +14,28 @@ kernelspec:
---
(getting-started-with-jupyter)=
-# Combining code and text with Jupyter
+# Combining code and text with Jupyter
## Overview
A typical data analysis involves not only writing and executing code, but also writing text and displaying images
that help tell the story of the analysis. In fact, ideally, we would like to *interleave* these three media,
with the text and images serving as narration for the code and its output.
-In this chapter we will show you how to accomplish this using Jupyter notebooks, a common coding platform in
+In this chapter we will show you how to accomplish this using Jupyter notebooks, a common coding platform in
data science. Jupyter notebooks do precisely what we need: they let you combine text, images, and (executable!) code in a single
document. In this chapter, we will focus on the *use* of Jupyter notebooks to program in Python and write
-text via a web interface.
+text via a web interface.
These skills are essential to getting your analysis running; think of it like getting dressed in the morning!
Note that we assume that you already have Jupyter set up and ready to use. If that is not the case, please first read
the {ref}`move-to-your-own-machine` chapter to learn how to install and configure Jupyter on your own
-computer.
+computer.
```{note}
This book was originally written for the R programming language, and
has been edited to focus instead on Python. This chapter on Jupyter notebooks
-has not yet been fully updated to focus on Python; it has images and examples from
+has not yet been fully updated to focus on Python; it has images and examples from
the R version of the book. But the concepts related to Jupyter notebooks are generally
-the same. We are currently working on producing new Python-based images and examples
+the same. We are currently working on producing new Python-based images and examples
for this chapter.
```
@@ -54,17 +54,17 @@ By the end of the chapter, readers will be able to do the following:
```{index} Jupyter notebook, reproducible
```
-Jupyter is a web-based interactive development environment for creating, editing,
-and executing documents called Jupyter notebooks. Jupyter notebooks are
-documents that contain a mix of computer code (and its output) and formattable
-text. Given that they combine these two analysis artifacts in a single
-document—code is not separate from the output or written report—notebooks are
-one of the leading tools to create reproducible data analyses. Reproducible data
-analysis is one where you can reliably and easily re-create the same results when
-analyzing the same data. Although this sounds like something that should always
-be true of any data analysis, in reality, this is not often the case; one needs
+Jupyter is a web-based interactive development environment for creating, editing,
+and executing documents called Jupyter notebooks. Jupyter notebooks are
+documents that contain a mix of computer code (and its output) and formattable
+text. Given that they combine these two analysis artifacts in a single
+document—code is not separate from the output or written report—notebooks are
+one of the leading tools to create reproducible data analyses. A reproducible data
+analysis is one where you can reliably and easily re-create the same results when
+analyzing the same data. Although this sounds like something that should always
+be true of any data analysis, in reality, this is not often the case; one needs
to make a conscious effort to perform data analysis in a reproducible manner.
-An example of what a Jupyter notebook looks like is shown in
+An example of what a Jupyter notebook looks like is shown in
{numref}`img-jupyter`.
@@ -80,14 +80,14 @@ A screenshot of a Jupyter Notebook.
```{index} JupyterHub
```
-One of the easiest ways to start working with Jupyter is to use a
-web-based platform called JupyterHub. JupyterHubs often have Jupyter, Python, a number of Python
-packages, and collaboration tools installed, configured and ready to use.
+One of the easiest ways to start working with Jupyter is to use a
+web-based platform called JupyterHub. JupyterHubs often have Jupyter, Python, a number of Python
+packages, and collaboration tools installed, configured and ready to use.
JupyterHubs are usually created and provisioned by organizations,
and require authentication to gain access. For example, if you are reading
this book as part of a course, your instructor may have a JupyterHub
-already set up for you to use! Jupyter can also be installed on your
own computer; see the {ref}`move-to-your-own-machine` chapter for instructions.
@@ -96,15 +96,15 @@ own computer; see the {ref}`move-to-your-own-machine` chapter for instructions.
```{index} Jupyter notebook; code cell
```
-The sections of a Jupyter notebook that contain code are referred to as code cells.
-A code cell that has not yet been
-executed has no number inside the square brackets to the left of the cell
-({numref}`code-cell-not-run`). Running a code cell will execute all of
+The sections of a Jupyter notebook that contain code are referred to as code cells.
+A code cell that has not yet been
+executed has no number inside the square brackets to the left of the cell
+({numref}`code-cell-not-run`). Running a code cell will execute all of
the code it contains, and the output (if any exists) will be displayed directly
-underneath the code that generated it. Outputs may include printed text or
-numbers, data frames and data visualizations. Cells that have been executed
-also have a number inside the square brackets to the left of the cell.
-This number indicates the order in which the cells were run
+underneath the code that generated it. Outputs may include printed text or
+numbers, data frames and data visualizations. Cells that have been executed
+also have a number inside the square brackets to the left of the cell.
+This number indicates the order in which the cells were run
({numref}`code-cell-run`).
```{figure} img/code-cell-not-run.png
@@ -130,19 +130,19 @@ A code cell in Jupyter that has been executed.
```{index} Jupyter notebook; cell execution
```
-Code cells can be run independently or as part of executing the entire notebook
-using one of the "**Run all**" commands found in the **Run** or **Kernel** menus
-in Jupyter. Running a single code cell independently is a workflow typically
-used when editing or writing your own Python code. Executing an entire notebook is a
-workflow typically used to ensure that your analysis runs in its entirety before
-sharing it with others, and when using a notebook as part of an automated
+Code cells can be run independently or as part of executing the entire notebook
+using one of the "**Run all**" commands found in the **Run** or **Kernel** menus
+in Jupyter. Running a single code cell independently is a workflow typically
+used when editing or writing your own Python code. Executing an entire notebook is a
+workflow typically used to ensure that your analysis runs in its entirety before
+sharing it with others, and when using a notebook as part of an automated
process.
To run a code cell independently, the cell needs to first be activated. This
is done by clicking on it with the cursor. Jupyter will indicate a cell has been
activated by highlighting it with a blue rectangle to its left. After the cell
-has been activated ({numref}`activate-and-run-button`), the cell can be run by either pressing
-the **Run** (▸) button in the toolbar, or by using a keyboard shortcut of
+has been activated ({numref}`activate-and-run-button`), the cell can be run by either pressing
+the **Run** (▸) button in the toolbar, or by using the keyboard shortcut
`Shift + Enter`.
```{figure} img/activate-and-run-button-annotated.png
@@ -184,19 +184,19 @@ Restarting the Python session can be accomplished by clicking Restart Kernel and
```{index} kernel, Jupyter notebook; kernel
```
-The kernel is a program that executes the code inside your notebook and
-outputs the results. Kernels for many different programming languages have
-been created for Jupyter, which means that Jupyter can interpret and execute
-the code of many different programming languages. To run Python code, your notebook
-will need an Python kernel. In the top right of your window, you can see a circle
-that indicates the status of your kernel. If the circle is empty
-(◯), the kernel is idle and ready to execute code. If the circle is filled in
+The kernel is a program that executes the code inside your notebook and
+outputs the results. Kernels for many different programming languages have
+been created for Jupyter, which means that Jupyter can interpret and execute
+the code of many different programming languages. To run Python code, your notebook
+will need a Python kernel. In the top right of your window, you can see a circle
+that indicates the status of your kernel. If the circle is empty
+(◯), the kernel is idle and ready to execute code. If the circle is filled in
(⬤), the kernel is busy running some code.
```{index} kernel; interrupt, kernel; restart
```
-You may run into problems where your kernel is stuck for an excessive amount
+You may run into problems where your kernel is stuck for an excessive amount
of time, your notebook is very slow and unresponsive, or your kernel loses its
connection. If this happens, try the following steps:
@@ -206,9 +206,9 @@ connection. If this happens, try the following steps:
### Creating new code cells
-To create a new code cell in Jupyter ({numref}`create-new-code-cell`), click the `+` button in the
-toolbar. By default, all new cells in Jupyter start out as code cells,
-so after this, all you have to do is write Python code within the new cell you just
+To create a new code cell in Jupyter ({numref}`create-new-code-cell`), click the `+` button in the
+toolbar. By default, all new cells in Jupyter start out as code cells,
+so after this, all you have to do is write Python code within the new cell you just
created!
```{figure} img/create-new-code-cell.png
@@ -223,13 +223,13 @@ New cells can be created by clicking the + button, and are by default code cells
```{index} markdown, Jupyter notebook; markdown cell
```
-Text cells inside a Jupyter notebook are called Markdown cells. Markdown cells
-are rich formatted text cells, which means you can **bold** and *italicize*
-text, create subject headers, create bullet and numbered lists, and more. These cells are
+Text cells inside a Jupyter notebook are called Markdown cells. Markdown cells
+are rich text cells, which means you can **bold** and *italicize*
+text, create section headers, create bullet and numbered lists, and more. These cells are
given the name "Markdown" because they use *Markdown language* to specify the rich text formatting.
-You do not need to learn Markdown to write text in the Markdown cells in
-Jupyter; plain text will work just fine. However, you might want to learn a bit
-about Markdown eventually to enable you to create nicely formatted analyses.
+You do not need to learn Markdown to write text in the Markdown cells in
+Jupyter; plain text will work just fine. However, you might want to learn a bit
+about Markdown eventually to enable you to create nicely formatted analyses.
See the additional resources at the end of this chapter to find out
where you can start learning Markdown.
@@ -237,9 +237,9 @@ where you can start learning Markdown.
To edit a Markdown cell in Jupyter, you need to double click on the cell. Once
you do this, the unformatted (or *unrendered*) version of the text will be
-shown ({numref}`markdown-cell-not-run`). You
+shown ({numref}`markdown-cell-not-run`). You
can then use your keyboard to edit the text. To view the formatted
-(or *rendered*) text ({numref}`markdown-cell-run`), click the **Run** (▸) button in the toolbar,
+(or *rendered*) text ({numref}`markdown-cell-run`), click the **Run** (▸) button in the toolbar,
or use the `Shift + Enter` keyboard shortcut.
```{figure} img/markdown-cell-not-run.png
@@ -258,10 +258,10 @@ A Markdown cell in Jupyter that has been rendered and exhibits rich text formatt
### Creating new Markdown cells
-To create a new Markdown cell in Jupyter, click the `+` button in the toolbar.
-By default, all new cells in Jupyter start as code cells, so
-the cell format needs to be changed to be recognized and rendered as a Markdown
-cell. To do this, click on the cell with your cursor to
+To create a new Markdown cell in Jupyter, click the `+` button in the toolbar.
+By default, all new cells in Jupyter start as code cells, so
+the cell format needs to be changed to be recognized and rendered as a Markdown
+cell. To do this, click on the cell with your cursor to
ensure it is activated. Then click on the drop-down box on the toolbar that says "Code" (it
is next to the ⏭ button), and change it from "**Code**" to "**Markdown**" ({numref}`convert-to-markdown-cell`).
@@ -274,13 +274,13 @@ New cells are by default code cells. To create Markdown cells, the cell format m
## Saving your work
-As with any file you work on, it is critical to save your work often so you
-don't lose your progress! Jupyter has an autosave feature, where open files are
-saved periodically. The default for this is every two minutes. You can also
-manually save a Jupyter notebook by selecting **Save Notebook** from the
+As with any file you work on, it is critical to save your work often so you
+don't lose your progress! Jupyter has an autosave feature, where open files are
+saved periodically. The default for this is every two minutes. You can also
+manually save a Jupyter notebook by selecting **Save Notebook** from the
**File** menu, by clicking the disk icon on the toolbar,
or by using a keyboard shortcut (`Control + S` for Windows, or `Command + S` for
-Mac OS).
+macOS).
## Best practices for running a notebook
@@ -290,32 +290,32 @@ Mac OS).
```
As you might know (or at least imagine) by now, Jupyter notebooks are great for
-interactively editing, writing and running Python code; this is what they were
-designed for! Consequently, Jupyter notebooks are flexible in regards to code
-cell execution order. This flexibility means that code cells can be run in any
-arbitrary order using the **Run** (▸) button. But this flexibility has a downside:
-it can lead to Jupyter notebooks whose code cannot be executed in a linear
-order (from top to bottom of the notebook). A nonlinear notebook is problematic
-because a linear order is the conventional way code documents are run, and
-others will have this expectation when running your notebook. Finally, if the
-code is used in some automated process, it will need to run in a linear order,
-from top to bottom of the notebook.
-
-The most common way to inadvertently create a nonlinear notebook is to rely solely
-on using the (▸) button to execute cells. For example,
-suppose you write some Python code that creates an Python object, say a variable named
-`y`. When you execute that cell and create `y`, it will continue
-to exist until it is deliberately deleted with Python code, or when the Jupyter
-notebook Python session (*i.e.*, kernel) is stopped or restarted. It can also be
-referenced in another distinct code cell ({numref}`out-of-order-1`).
+interactively editing, writing and running Python code; this is what they were
+designed for! Consequently, Jupyter notebooks are flexible with regard to code
+cell execution order. This flexibility means that code cells can be run in any
+arbitrary order using the **Run** (▸) button. But this flexibility has a downside:
+it can lead to Jupyter notebooks whose code cannot be executed in a linear
+order (from top to bottom of the notebook). A nonlinear notebook is problematic
+because a linear order is the conventional way code documents are run, and
+others will have this expectation when running your notebook. Finally, if the
+code is used in some automated process, it will need to run in a linear order,
+from top to bottom of the notebook.
+
+The most common way to inadvertently create a nonlinear notebook is to rely solely
+on using the (▸) button to execute cells. For example,
+suppose you write some Python code that creates a Python object, say a variable named
+`y`. When you execute that cell and create `y`, it will continue
+to exist until it is deliberately deleted with Python code, or when the Jupyter
+notebook Python session (*i.e.*, kernel) is stopped or restarted. It can also be
+referenced in another distinct code cell ({numref}`out-of-order-1`).
Together, this means that you could then write a code cell further above in the
-notebook that references `y` and execute it without error in the current session
-({numref}`out-of-order-2`). This could also be done successfully in
-future sessions if, and only if, you run the cells in the same unconventional
-order. However, it is difficult to remember this unconventional order, and it
-is not the order that others would expect your code to be executed in. Thus, in
-the future, this would lead
-to errors when the notebook is run in the conventional
+notebook that references `y` and execute it without error in the current session
+({numref}`out-of-order-2`). This could also be done successfully in
+future sessions if, and only if, you run the cells in the same unconventional
+order. However, it is difficult to remember this unconventional order, and it
+is not the order that others would expect your code to be executed in. Thus, in
+the future, this would lead
+to errors when the notebook is run in the conventional
linear order ({numref}`out-of-order-3`).
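+To make this concrete, here is a minimal, hypothetical pair of cells (not the ones pictured in the figures below), shown in the order they were *executed* rather than the order they appear on the page.
+```python
+# Executed first, even though this cell sits lower down in the notebook:
+y = 42
+
+# Executed second, even though this cell sits higher up in the notebook:
+print(y * 2)  # works in this session, but running the notebook from top to
+              # bottom in a fresh kernel would raise a NameError instead
+```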
```{figure} img/out-of-order-1.png
@@ -351,83 +351,83 @@ notebook.
You can also accidentally create a nonfunctioning notebook by
-creating an object in a cell that later gets deleted. In such a
-scenario, that object only exists for that one particular Python session and will
-not exist once the notebook is restarted and run again. If that
-object was referenced in another cell in that notebook, an error
+creating an object in a cell that later gets deleted. In such a
+scenario, that object only exists for that one particular Python session and will
+not exist once the notebook is restarted and run again. If that
+object was referenced in another cell in that notebook, an error
would occur when the notebook was run again in a new session.
-These events may not negatively affect the current Python session when
-the code is being written; but as you might now see, they will likely lead to
-errors when that notebook is run in a future session. Regularly executing
-the entire notebook in a fresh Python session will help guard
+These events may not negatively affect the current Python session when
+the code is being written; but as you might now see, they will likely lead to
+errors when that notebook is run in a future session. Regularly executing
+the entire notebook in a fresh Python session will help guard
against this. If you restart your session and new errors seem to pop up when
you run all of your cells in linear order, you can at least be aware that there
-is an issue. Knowing this sooner rather than later will allow you to
+is an issue. Knowing this sooner rather than later will allow you to
fix the issue and ensure your notebook can be run linearly from start to finish.
We recommend as a best practice to run the entire notebook in a fresh Python session
at least 2–3 times within any period of work. Note that,
critically, you *must do this in a fresh Python session* by restarting your kernel.
-We recommend using either the **Kernel** >>
-**Restart Kernel and Run All Cells...** command from the menu or the ⏭
-button in the toolbar. Note that the **Run** >> **Run All Cells**
-menu item will not restart the kernel, and so it is not sufficient
+We recommend using either the **Kernel** >>
+**Restart Kernel and Run All Cells...** command from the menu or the ⏭
+button in the toolbar. Note that the **Run** >> **Run All Cells**
+menu item will not restart the kernel, and so it is not sufficient
to guard against these errors.
### Best practices for including Python packages in notebooks
-Most data analyses these days depend on functions from external Python packages that
-are not built into Python. One example is the `pandas` package that we
-heavily rely on in this book. This package provides us access to functions like
+Most data analyses these days depend on functions from external Python packages that
+are not built into Python. One example is the `pandas` package that we
+heavily rely on in this book. This package provides us access to functions like
`read_csv` for reading data, and `loc[]` for subsetting rows and columns.
-We also use the `altair` package for creating high-quality graphics.
+We also use the `altair` package for creating high-quality graphics.
-As mentioned earlier in the book, external Python packages need to be loaded before
-the functions they contain can be used. Our recommended way to do this is via
+As mentioned earlier in the book, external Python packages need to be loaded before
+the functions they contain can be used. Our recommended way to do this is via
`import package_name`, and perhaps also to give it a shorter alias like
-`import package_name as pn`. But where should this line of code be written in a
-Jupyter notebook? One idea could be to load the library right before the
-function is used in the notebook. However, although this technically works, this
-causes hidden, or at least non-obvious, Python package dependencies when others view
-or try to run the notebook. These hidden dependencies can lead to errors when
-the notebook is executed on another computer if the needed Python packages are not
-installed. Additionally, if the data analysis code takes a long time to run,
-uncovering the hidden dependencies that need to be installed so that the
+`import package_name as pn`. But where should this line of code be written in a
+Jupyter notebook? One idea could be to load the library right before the
+function is used in the notebook. However, although this technically works, this
+causes hidden, or at least non-obvious, Python package dependencies when others view
+or try to run the notebook. These hidden dependencies can lead to errors when
+the notebook is executed on another computer if the needed Python packages are not
+installed. Additionally, if the data analysis code takes a long time to run,
+uncovering the hidden dependencies that need to be installed so that the
analysis can run without error can take a great deal of time.
-Therefore, we recommend you load all Python packages in a code cell near the top of
-the Jupyter notebook. Loading all your packages at the start ensures that all
-packages are loaded before their functions are called, assuming the notebook is
-run in a linear order from top to bottom as recommended above. It also makes it
-easy for others viewing or running the notebook to see what external Python packages
-are used in the analysis, and hence, what packages they should install on
+Therefore, we recommend you load all Python packages in a code cell near the top of
+the Jupyter notebook. Loading all your packages at the start ensures that all
+packages are loaded before their functions are called, assuming the notebook is
+run in a linear order from top to bottom as recommended above. It also makes it
+easy for others viewing or running the notebook to see what external Python packages
+are used in the analysis, and hence, what packages they should install on
their computer to run the analysis successfully.
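+For example, a single cell near the top of a notebook for an analysis like the ones in this book might look something like the following; the exact set of packages will, of course, depend on the analysis.
+```python
+# Load all external packages used in the analysis in one place,
+# near the top of the notebook.
+import pandas as pd
+import altair as alt
+```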
### Summary of best practices for running a notebook
1. Write code so that it can be executed in a linear order.
-2. As you write code in a Jupyter notebook, run the notebook in a linear order
-and in its entirety often (2–3 times every work session) via the **Kernel** >>
+2. As you write code in a Jupyter notebook, run the notebook in a linear order
+and in its entirety often (2–3 times every work session) via the **Kernel** >>
**Restart Kernel and Run All Cells...** command from the Jupyter menu or the ⏭
button in the toolbar.
-3. Write the code that loads external Python packages near the top of the Jupyter
+3. Write the code that loads external Python packages near the top of the Jupyter
notebook.
## Exploring data files
It is essential to preview data files before you try to read them into Python to see
-whether or not there are column names, what the delimiters are, and if there are
-lines you need to skip. In Jupyter, you preview data files stored as plain text
-files (e.g., comma- and tab-separated files) in their plain text format ({numref}`open-data-w-editor-2`) by
-right-clicking on the file's name in the Jupyter file explorer, selecting
-**Open with**, and then selecting **Editor** ({numref}`open-data-w-editor-1`).
-Suppose you do not specify to open
-the data file with an editor. In that case, Jupyter will render a nice table
-for you, and you will not be able to see the column delimiters, and therefore
-you will not know which function to use, nor which arguments to use and values
+whether or not there are column names, what the delimiters are, and if there are
+lines you need to skip. In Jupyter, you preview data files stored as plain text
+files (e.g., comma- and tab-separated files) in their plain text format ({numref}`open-data-w-editor-2`) by
+right-clicking on the file's name in the Jupyter file explorer, selecting
+**Open with**, and then selecting **Editor** ({numref}`open-data-w-editor-1`).
+If you do not open
+the data file with an editor, Jupyter will render a nice table
+for you, but you will not be able to see the column delimiters, and therefore
+you will not know which function to use, nor which arguments to use and values
to specify for them.
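+If you prefer to stay inside the notebook, you can also peek at the raw text of a data file with a few lines of Python; the file name below is purely hypothetical.
+```python
+# Print the first five lines of a (hypothetical) plain text data file,
+# to check for column names, delimiters, and lines that need to be skipped.
+with open("data/housing.csv") as file:
+    for _ in range(5):
+        print(file.readline(), end="")
+```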
```{figure} img/open_data_w_editor_01.png
@@ -446,46 +446,46 @@ A data file as viewed in an editor in Jupyter.
-## Exporting to a different file format
+## Exporting to a different file format
```{index} Jupyter notebook; export
```
-In Jupyter, viewing, editing and running Python code is done in the Jupyter notebook
-file format with file extension `.ipynb`. This file format is not easy to open and
-view outside of Jupyter. Thus, to share your analysis with people who do not
-commonly use Jupyter, it is recommended that you export your executed analysis
-as a more common file type, such as an `.html` file, or a `.pdf`. We recommend
-exporting the Jupyter notebook after executing the analysis so that you can
+In Jupyter, viewing, editing and running Python code is done in the Jupyter notebook
+file format with file extension `.ipynb`. This file format is not easy to open and
+view outside of Jupyter. Thus, to share your analysis with people who do not
+commonly use Jupyter, it is recommended that you export your executed analysis
+as a more common file type, such as an `.html` file, or a `.pdf`. We recommend
+exporting the Jupyter notebook after executing the analysis so that you can
also share the outputs of your code. Note, however, that your audience will not be
able to *run* your analysis using a `.html` or `.pdf` file. If you want your audience
to be able to reproduce the analysis, you must provide them with the `.ipynb` Jupyter notebook file.
### Exporting to HTML
-Exporting to `.html` will result in a shareable file that anyone can open
+Exporting to `.html` will result in a shareable file that anyone can open
using a web browser (e.g., Firefox, Safari, Chrome, or Edge). The `.html`
-output will produce a document that is visually similar to what the Jupyter notebook
-looked like inside Jupyter. One point of caution here is that if there are
-images in your Jupyter notebook, you will need to share the image files and the
+output will produce a document that is visually similar to what the Jupyter notebook
+looked like inside Jupyter. One point of caution here is that if there are
+images in your Jupyter notebook, you will need to share the image files and the
`.html` file to see them.
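+The export itself is usually done through Jupyter's interface, but for completeness, here is one programmatic sketch that uses the `nbconvert` package underlying Jupyter's export functionality; the notebook name is a placeholder.
+```python
+from nbconvert import HTMLExporter
+
+# Convert a (hypothetical) notebook into an HTML document.
+body, resources = HTMLExporter().from_filename("analysis.ipynb")
+with open("analysis.html", "w") as file:
+    file.write(body)
+```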
### Exporting to PDF
-Exporting to `.pdf` will result in a shareable file that anyone can open
-using many programs, including Adobe Acrobat, Preview, web browsers and many
-more. The benefit of exporting to PDF is that it is a standalone document,
-even if the Jupyter notebook included references to image files.
-Unfortunately, the default settings will result in a document
-that visually looks quite different from what the Jupyter notebook looked
-like. The font, page margins, and other details will appear different in the `.pdf` output.
+Exporting to `.pdf` will result in a shareable file that anyone can open
+using many programs, including Adobe Acrobat, Preview, web browsers and many
+more. The benefit of exporting to PDF is that it is a standalone document,
+even if the Jupyter notebook included references to image files.
+Unfortunately, the default settings will result in a document
+that visually looks quite different from what the Jupyter notebook looked
+like. The font, page margins, and other details will appear different in the `.pdf` output.
## Creating a new Jupyter notebook
-At some point, you will want to create a new, fresh Jupyter notebook for your
-own project instead of viewing, running or editing a notebook that was started
-by someone else. To do this, navigate to the **Launcher** tab, and click on
-the Python icon under the **Notebook** heading. If no **Launcher** tab is visible,
-you can get a new one via clicking the **+** button at the top of the Jupyter
-file explorer ({numref}`launcher`).
+At some point, you will want to create a new, fresh Jupyter notebook for your
+own project instead of viewing, running or editing a notebook that was started
+by someone else. To do this, navigate to the **Launcher** tab, and click on
+the Python icon under the **Notebook** heading. If no **Launcher** tab is visible,
+you can get a new one by clicking the **+** button at the top of the Jupyter
+file explorer ({numref}`launcher`).
```{figure} img/launcher-annotated.png
---
@@ -496,19 +496,19 @@ Clicking on the Python icon under the Notebook heading will create a new Jupyter
+++
-Once you have created a new Jupyter notebook, be sure to give it a descriptive
-name, as the default file name is `Untitled.ipynb`. You can rename files by
-first right-clicking on the file name of the notebook you just created, and
-then clicking **Rename**. This will make
-the file name editable. Use your keyboard to
-change the name. Pressing `Enter` or clicking anywhere else in the Jupyter
+Once you have created a new Jupyter notebook, be sure to give it a descriptive
+name, as the default file name is `Untitled.ipynb`. You can rename files by
+first right-clicking on the file name of the notebook you just created, and
+then clicking **Rename**. This will make
+the file name editable. Use your keyboard to
+change the name. Pressing `Enter` or clicking anywhere else in the Jupyter
interface will save the changed file name.
-We recommend not using white space or non-standard characters in file names.
-Doing so will not prevent you from using that file in Jupyter. However, these
-sorts of things become troublesome as you start to do more advanced data
-science projects that involve repetition and automation. We recommend naming
-files using lower case characters and separating words by a dash (`-`) or an
+We recommend not using white space or non-standard characters in file names.
+Doing so will not prevent you from using that file in Jupyter. However, these
+sorts of things become troublesome as you start to do more advanced data
+science projects that involve repetition and automation. We recommend naming
+files using lower case characters and separating words by a dash (`-`) or an
underscore (`_`).
## Additional resources
diff --git a/source/preface-text.md b/source/preface-text.md
index 139fe55c..96ba1b1a 100644
--- a/source/preface-text.md
+++ b/source/preface-text.md
@@ -20,18 +20,18 @@ kernelspec:
-This textbook aims to be an approachable introduction to the world of data science.
+This textbook aims to be an approachable introduction to the world of data science.
In this book, we define **data science** as the process of generating
-insight from data through **reproducible** and **auditable** processes.
+insight from data through **reproducible** and **auditable** processes.
If you analyze some data and give your analysis to a friend or colleague, they should
be able to re-run the analysis from start to finish and get the same result you did (*reproducibility*).
They should also be able to see and understand all the steps in the analysis, as well as the history of how
-the analysis developed (*auditability*). Creating reproducible and auditable
+the analysis developed (*auditability*). Creating reproducible and auditable
analyses allows both you and others to easily double-check and validate your work.
-At a high level, in this book, you will learn how to
+At a high level, in this book, you will learn how to
-(1) identify common problems in data science, and
+(1) identify common problems in data science, and
(2) solve those problems with reproducible and auditable workflows.
{numref}`preface-overview-fig` summarizes what you will learn in each chapter
@@ -43,16 +43,16 @@ while answering descriptive and exploratory data analysis questions. In the next
six chapters, you will learn how to answer predictive, exploratory, and inferential
data analysis questions with common methods in data science, including
classification, regression, clustering, and estimation.
-In the final chapters
+In the final chapters
you will learn how to combine Python code, formatted text, and images
in a single coherent document with Jupyter, use version control for
collaboration, and install and configure the software needed for data science
on your own computer. If you are reading this book as part of a course that you are
-taking, the instructor may have set up all of these tools already for you; in this
+taking, the instructor may have set up all of these tools already for you; in this
case, you can continue on through the book reading the chapters in order.
-But if you are reading this independently, you may want to jump to these last three chapters
+But if you are reading this independently, you may want to jump to these last three chapters
early before going on to make sure your computer is set up in such a way that you can
-try out the example code that we include throughout the book.
+try out the example code that we include throughout the book.
```{figure} img/chapter_overview.jpeg
---
@@ -66,9 +66,9 @@ Where are we going?
Each chapter in the book has an accompanying worksheet that provides exercises
to help you practice the concepts you will learn. We strongly recommend that you
-work through the worksheet when you finish reading each chapter
+work through the worksheet when you finish reading each chapter
before moving on to the next chapter. All of the worksheets
-are available at
+are available at
[https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme);
the "Exercises" section at the end of each chapter points you to the right worksheet for that chapter.
For each worksheet, you can either launch an interactive version of the worksheet in your browser by clicking the "launch binder" button,
diff --git a/source/references.md b/source/references.md
index 942a1a56..a3a864de 100644
--- a/source/references.md
+++ b/source/references.md
@@ -13,6 +13,6 @@ kernelspec:
name: python3
---
-# References
+# References
diff --git a/source/regression1.md b/source/regression1.md
index 78af1628..3dca3312 100644
--- a/source/regression1.md
+++ b/source/regression1.md
@@ -59,7 +59,7 @@ however that is beyond the scope of this book.
+++
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
* Recognize situations where a simple regression analysis would be appropriate for making predictions.
@@ -80,27 +80,27 @@ By the end of the chapter, readers will be able to do the following:
Regression, like classification, is a predictive problem setting where we want
to use past information to predict future observations. But in the case of
-regression, the goal is to predict *numerical* values instead of *categorical* values.
-The variable that you want to predict is often called the *response variable*.
+regression, the goal is to predict *numerical* values instead of *categorical* values.
+The variable that you want to predict is often called the *response variable*.
For example, we could try to use the number of hours a person spends on
-exercise each week to predict their race time in the annual Boston marathon. As
+exercise each week to predict their race time in the annual Boston marathon. As
another example, we could try to use the size of a house to
-predict its sale price. Both of these response variables—race time and sale price—are
+predict its sale price. Both of these response variables—race time and sale price—are
numerical, and so predicting them given past data is considered a regression problem.
```{index} classification; comparison to regression
```
-Just like in the classification setting, there are many possible methods that we can use
+Just like in the classification setting, there are many possible methods that we can use
to predict numerical response variables. In this chapter we will
focus on the **K-nearest neighbors** algorithm {cite:p}`knnfix,knncover`, and in the next chapter
we will study **linear regression**.
In your future studies, you might encounter regression trees, splines,
and general local regression methods; see the additional resources
section at the end of the next chapter for where to begin learning more about
-these other methods.
+these other methods.
-Many of the concepts from classification map over to the setting of regression. For example,
+Many of the concepts from classification map over to the setting of regression. For example,
a regression model predicts a new observation's response variable based on the response variables
for similar observations in the data set of past observations. When building a regression model,
we first split the data into training and test sets, in order to ensure that we assess the performance
@@ -131,30 +131,30 @@ is that we are now predicting numerical variables instead of categorical variabl
```{index} Sacramento real estate, question; regression
```
-In this chapter and the next, we will study
-a data set of
-[932 real estate transactions in Sacramento, California](https://support.spatialkey.com/spatialkey-sample-csv-data/)
+In this chapter and the next, we will study
+a data set of
+[932 real estate transactions in Sacramento, California](https://support.spatialkey.com/spatialkey-sample-csv-data/)
originally reported in the *Sacramento Bee* newspaper.
We first need to formulate a precise question that
we want to answer. In this example, our question is again predictive:
Can we use the size of a house in the Sacramento, CA area to predict
its sale price? A rigorous, quantitative answer to this question might help
-a realtor advise a client as to whether the price of a particular listing
+a realtor advise a client as to whether the price of a particular listing
is fair, or perhaps how to set the price of a new listing.
We begin the analysis by loading and examining the data.
```{code-cell} ipython3
:tags: [remove-cell]
-# In this chapter and the next, we will study
-# a data set \index{Sacramento real estate} of
-# [932 real estate transactions in Sacramento, California](https://support.spatialkey.com/spatialkey-sample-csv-data/)
+# In this chapter and the next, we will study
+# a data set \index{Sacramento real estate} of
+# [932 real estate transactions in Sacramento, California](https://support.spatialkey.com/spatialkey-sample-csv-data/)
# originally reported in the *Sacramento Bee* newspaper.
# We first need to formulate a precise question that
# we want to answer. In this example, our question is again predictive:
# \index{question!regression} Can we use the size of a house in the Sacramento, CA area to predict
# its sale price? A rigorous, quantitative answer to this question might help
-# a realtor advise a client as to whether the price of a particular listing
+# a realtor advise a client as to whether the price of a particular listing
# is fair, or perhaps how to set the price of a new listing.
# We begin the analysis by loading and examining the data, and setting the seed value.
@@ -179,10 +179,10 @@ the data as a scatter plot where we place the predictor variable
(house size) on the x-axis, and we place the target/response variable that we
want to predict (sale price) on the y-axis.
-> **Note:** Given that the y-axis unit is dollars in {numref}`fig:07-edaRegr`,
-> we format the axis labels to put dollar signs in front of the house prices,
+> **Note:** Given that the y-axis unit is dollars in {numref}`fig:07-edaRegr`,
+> we format the axis labels to put dollar signs in front of the house prices,
> as well as commas to increase the readability of the larger numbers.
-> We can do this in `altair` by passing the `axis=alt.Axis(format='$,.0f')` argument
+> We can do this in `altair` by passing the `axis=alt.Axis(format='$,.0f')` argument
> to the `y` encoding channel in an `altair` specification.
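+As a minimal, standalone sketch of that axis option (using a small made-up data frame rather than the Sacramento data used in the chapter's own code below):
+```python
+import altair as alt
+import pandas as pd
+
+# Made-up data, purely to illustrate the dollar-formatted y-axis.
+homes = pd.DataFrame({"sqft": [1000, 1500, 2000],
+                      "price": [200_000, 280_000, 350_000]})
+
+alt.Chart(homes).mark_circle().encode(
+    x=alt.X("sqft", title="House size (square feet)"),
+    y=alt.Y("price", title="Price (USD)", axis=alt.Axis(format="$,.0f")),
+)
+```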
```{code-cell} ipython3
@@ -220,7 +220,7 @@ size of a house increases, so does its sale price. Thus, we can reason that we
may be able to use the size of a not-yet-sold house (for which we don't know
the sale price) to predict its final sale price. Note that we do not suggest here
that a larger house size *causes* a higher sale price; just that house price
-tends to increase with house size, and that we may be able to use the latter to
+tends to increase with house size, and that we may be able to use the latter to
predict the former.
+++
@@ -230,10 +230,10 @@ predict the former.
```{index} K-nearest neighbors; regression
```
-Much like in the case of classification,
-we can use a K-nearest neighbors-based
-approach in regression to make predictions.
-Let's take a small sample of the data in {numref}`fig:07-edaRegr`
+Much like in the case of classification,
+we can use a K-nearest neighbors-based
+approach in regression to make predictions.
+Let's take a small sample of the data in {numref}`fig:07-edaRegr`
and walk through how K-nearest neighbors (KNN) works
in a regression context before we dive in to creating our model and assessing
how well it predicts house sale price. This subsample is taken to allow us to
@@ -243,9 +243,9 @@ this chapter we will use all the data.
```{index} pandas.DataFrame; sample
```
-To take a small random sample of size 30, we'll use the
+To take a small random sample of size 30, we'll use the
`sample` method of a `pandas.DataFrame` object, and input the number of rows
-to randomly select (`n`) and the random seed (`random_state`).
+to randomly select (`n`) and the random seed (`random_state`).
```{code-cell} ipython3
small_sacramento = sacramento.sample(n=30, random_state=10)
@@ -299,9 +299,9 @@ Scatter plot of price (USD) versus house size (square feet) with vertical line i
We will employ the same intuition from the classification chapter, and use the
neighboring points to the new point of interest to suggest/predict what its
-sale price might be.
-For the example shown in {numref}`fig:07-small-eda-regr`,
-we find and label the 5 nearest neighbors to our observation
+sale price might be.
+For the example shown in {numref}`fig:07-small-eda-regr`,
+we find and label the 5 nearest neighbors to our observation
of a house that is 2,000 square feet.
```{code-cell} ipython3
@@ -350,7 +350,7 @@ Scatter plot of price (USD) versus house size (square feet) with lines to 5 near
{numref}`fig:07-knn5-example` illustrates the difference between the house sizes
of the 5 nearest neighbors (in terms of house size) to our new
-2,000 square-foot house of interest. Now that we have obtained these nearest neighbors,
+2,000 square-foot house of interest. Now that we have obtained these nearest neighbors,
we can use their values to predict the
sale price for the new home. Specifically, we can take the mean (or
average) of these 5 values as our predicted value, as illustrated by
@@ -393,12 +393,12 @@ classification: which $K$ do we choose, and is our model any good at making
predictions? In the next few sections, we will address these questions in the
context of KNN regression.
-One strength of the KNN regression algorithm
+One strength of the KNN regression algorithm
that we would like to draw attention to at this point
is its ability to work well with non-linear relationships
(i.e., if the relationship is not a straight line).
This stems from the use of nearest neighbors to predict values.
-The algorithm really has very few assumptions
+The algorithm really has very few assumptions
about what the data must look like for it to work.
+++ {"tags": []}
@@ -408,13 +408,13 @@ about what the data must look like for it to work.
```{index} training data, test data
```
-As usual,
-we must start by putting some test data away in a lock box
-that we will come back to only after we choose our final model.
-Let's take care of that now.
-Note that for the remainder of the chapter
-we'll be working with the entire Sacramento data set,
-as opposed to the smaller sample of 30 points
+As usual,
+we must start by putting some test data away in a lock box
+that we will come back to only after we choose our final model.
+Let's take care of that now.
+Note that for the remainder of the chapter
+we'll be working with the entire Sacramento data set,
+as opposed to the smaller sample of 30 points
that we used earlier in the chapter ({numref}`fig:07-small-eda-regr`).
+++
@@ -438,8 +438,8 @@ Next, we'll use cross-validation to choose $K$. In KNN classification, we used
accuracy to see how well our predictions matched the true labels. We cannot use
the same metric in the regression setting, since our predictions will almost never
*exactly* match the true response variable values. Therefore in the
-context of KNN regression we will use root mean square prediction error (RMSPE) instead.
-The mathematical formula for calculating RMSPE is:
+context of KNN regression we will use root mean square prediction error (RMSPE) instead.
+The mathematical formula for calculating RMSPE is:
$$\text{RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
@@ -449,13 +449,13 @@ where:
- $y_i$ is the observed value for the $i^\text{th}$ observation, and
- $\hat{y}_i$ is the forecasted/predicted value for the $i^\text{th}$ observation.
-In other words, we compute the *squared* difference between the predicted and true response
+In other words, we compute the *squared* difference between the predicted and true response
value for each observation in our test (or validation) set, compute the average, and then finally
take the square root. The reason we use the *squared* difference (and not just the difference)
is that the differences can be positive or negative, i.e., we can overshoot or undershoot the true
response value. {numref}`fig:07-verticalerrors` illustrates both positive and negative differences
between predicted and true response values.
-So if we want to measure error—a notion of distance between our predicted and true response values—we
+So if we want to measure error—a notion of distance between our predicted and true response values—we
want to make sure that we are only adding up positive values, with larger positive values representing larger
mistakes.
If the predictions are very close to the true values, then
@@ -524,18 +524,18 @@ Scatter plot of price (USD) versus house size (square feet) with example predict
```{index} RMSPE; comparison with RMSE
```
-> **Note:** When using many code packages, the evaluation output
+> **Note:** When using many code packages, the evaluation output
> we will get to assess the prediction quality of
> our KNN regression models is labeled "RMSE", or "root mean squared
-> error". Why is this so, and why not RMSPE?
+> error". Why is this so, and why not RMSPE?
> In statistics, we try to be very precise with our
> language to indicate whether we are calculating the prediction error on the
-> training data (*in-sample* prediction) versus on the testing data
-> (*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
+> training data (*in-sample* prediction) versus on the testing data
+> (*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
> say RMSE. By contrast, when predicting and evaluating prediction quality
-> on the testing or validation data, we say RMSPE.
+> on the testing or validation data, we say RMSPE.
> The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the $y$s are
-> training or testing data. But many people just use RMSE for both,
+> training or testing data. But many people just use RMSE for both,
> and rely on context to denote which data the root mean squared error is being calculated on.
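+As a quick numerical check of the formula above, here is RMSPE computed by hand on three made-up predictions.
+```python
+import numpy as np
+
+# Made-up true and predicted sale prices, purely for illustration.
+price_true = np.array([350_000, 420_000, 510_000])
+price_pred = np.array([340_000, 450_000, 490_000])
+
+RMSPE = np.sqrt(np.mean((price_true - price_pred) ** 2))
+print(RMSPE)  # roughly 21,602
+```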
```{index} scikit-learn, scikit-learn; pipeline, scikit-learn; make_pipeline, scikit-learn; make_column_transformer
@@ -545,29 +545,29 @@ Now that we know how we can assess how well our model predicts a numerical
value, let's use Python to perform cross-validation and to choose the optimal $K$.
First, we will create a recipe for preprocessing our data.
Note that we include standardization
-in our preprocessing to build good habits, but since we only have one
+in our preprocessing to build good habits, but since we only have one
predictor, it is technically not necessary; there is no risk of comparing two predictors
of different scales.
-Next we create a model pipeline for K-nearest neighbors regression. Note
+Next we create a model pipeline for K-nearest neighbors regression. Note
that we use `KNeighborsRegressor`
now in the model specification to denote a regression problem, as opposed to the classification
-problems from the previous chapters.
+problems from the previous chapters.
The use of `KNeighborsRegressor` essentially
tells `scikit-learn` that we need to use different metrics (instead of accuracy)
-for tuning and evaluation.
-Next we specify a dictionary of parameter grid containing the numbers of neighbors ranging from 1 to 200.
+for tuning and evaluation.
+Next we specify a parameter grid dictionary containing the numbers of neighbors, ranging from 1 to 200.
Then we create a 5-fold `GridSearchCV` object, and pass in the pipeline and parameter grid. We also
need to specify `scoring="neg_root_mean_squared_error"` to get *negative* RMSPE from `scikit-learn`.
-The reason that `scikit-learn` negates the regular RMSPE is that the function always tries to maximize
-the scores, while RMSPE should be minimized. Hence, in order to see the actual RMSPE, we need to negate
+The reason that `scikit-learn` negates the regular RMSPE is that the function always tries to maximize
+the scores, while RMSPE should be minimized. Hence, in order to see the actual RMSPE, we need to negate
back the `mean_test_score`.
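+A sketch of that tuning workflow is shown below, run on a small made-up data frame; the column names (`sqft`, `price`) and object names are stand-ins for the chapter's actual Sacramento training data, and are only meant to illustrate the shape of the code.
+```python
+import pandas as pd
+from sklearn.compose import make_column_transformer
+from sklearn.model_selection import GridSearchCV
+from sklearn.neighbors import KNeighborsRegressor
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+
+# Made-up training data standing in for the Sacramento training set.
+train = pd.DataFrame({
+    "sqft": [850, 1100, 1400, 1750, 2000, 2300, 2650, 3000, 3400, 3900],
+    "price": [120_000, 150_000, 205_000, 250_000, 280_000,
+              320_000, 370_000, 410_000, 460_000, 520_000],
+})
+
+# Preprocessor (standardization) and KNN regression model in one pipeline.
+pipeline = make_pipeline(
+    make_column_transformer((StandardScaler(), ["sqft"])),
+    KNeighborsRegressor(),
+)
+
+# Parameter grid; the chapter tunes over 1 to 200 neighbors.
+param_grid = {"kneighborsregressor__n_neighbors": list(range(1, 6))}
+
+gridsearch = GridSearchCV(
+    pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
+)
+gridsearch.fit(train[["sqft"]], train["price"])
+
+# scikit-learn reports *negative* RMSPE, so negate it to recover RMSPE.
+results = pd.DataFrame(gridsearch.cv_results_)
+results["RMSPE"] = -results["mean_test_score"]
+results[["param_kneighborsregressor__n_neighbors", "RMSPE"]]
+```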
+++
In the output of the `sacr_results`
results data frame, we see that the `param_kneighborsregressor__n_neighbors` variable contains the values of $K$,
-the `RMSPE` variable contains the value of the RMSPE estimated via cross-validation,
-which was obtained through negating the `mean_test_score` variable,
+the `RMSPE` variable contains the value of the RMSPE estimated via cross-validation,
+which was obtained through negating the `mean_test_score` variable,
and the standard error (`std_test_score`) contains a value corresponding to a measure of how uncertain we are in the mean value. A detailed treatment of this
is beyond the scope of this chapter; but roughly, if your estimated mean is 100,000 and standard
error is 1,000, you can expect the *true* RMSPE to be somewhere roughly between 99,000 and 101,000 (although it may
@@ -581,13 +581,13 @@ as they do not provide any additional insight.
# value, let's use R to perform cross-validation and to choose the optimal $K$.
# First, we will create a recipe for preprocessing our data.
# Note that we include standardization
-# in our preprocessing to build good habits, but since we only have one
+# in our preprocessing to build good habits, but since we only have one
# predictor, it is technically not necessary; there is no risk of comparing two predictors
# of different scales.
-# Next we create a model specification for K-nearest neighbors regression. Note
+# Next we create a model specification for K-nearest neighbors regression. Note
# that we use `set_mode("regression")`
# now in the model specification to denote a regression problem, as opposed to the classification
-# problems from the previous chapters.
+# problems from the previous chapters.
# The use of `set_mode("regression")` essentially
# tells `tidymodels` that we need to use different metrics (RMSPE, not accuracy)
# for tuning and evaluation.
@@ -641,7 +641,7 @@ sacr_results[
```{code-cell} ipython3
:tags: [remove-cell]
-# Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200.
+# Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200.
# The following code tunes
# the model and returns the RMSPE for each number of neighbors. In the output of the `sacr_results`
# results data frame, we see that the `neighbors` variable contains the value of $K$,
@@ -708,7 +708,7 @@ The smallest RMSPE occurs when $K =$ {glue:}`kmin`.
## Underfitting and overfitting
Similar to the setting of classification, by setting the number of neighbors
to be too small or too large, we cause the RMSPE to increase, as shown in
-{numref}`fig:07-choose-k-knn-plot`. What is happening here?
+{numref}`fig:07-choose-k-knn-plot`. What is happening here?
{numref}`fig:07-howK` visualizes the effect of different settings of $K$ on the
regression model. Each plot shows the predicted values for house sale price from
@@ -792,9 +792,9 @@ Predicted values for house price (represented as a blue line) from KNN regressio
```
{numref}`fig:07-howK` shows that when $K$ = 1, the blue line runs perfectly
-through (almost) all of our training observations.
+through (almost) all of our training observations.
This happens because our
-predicted values for a given region (typically) depend on just a single observation.
+predicted values for a given region (typically) depend on just a single observation.
In general, when $K$ is too small, the line follows the training data quite
closely, even if it does not match it perfectly.
If we used a different training data set of house prices and sizes
@@ -805,32 +805,32 @@ predictions on new observations which, generally, will not have the same fluctua
as the original training data.
Recall from the classification
chapters that this behavior—where the model is influenced too much
-by the noisy data—is called *overfitting*; we use this same term
+by the noisy data—is called *overfitting*; we use this same term
in the context of regression.
```{index} underfitting; regression
```
-What about the plots in {numref}`fig:07-howK` where $K$ is quite large,
-say, $K$ = 250 or 699?
+What about the plots in {numref}`fig:07-howK` where $K$ is quite large,
+say, $K$ = 250 or 699?
In this case the blue line becomes extremely smooth, and actually becomes flat
-once $K$ is equal to the number of datapoints in the entire data set.
+once $K$ is equal to the number of datapoints in the entire data set.
This happens because our predicted values for a given x value (here, home
+size) depend on many neighboring observations; in the case where $K$ is equal
+size), depend on many neighboring observations; in the case where $K$ is equal
to the size of the dataset, the prediction is just the mean of the house prices
-in the dataset (completely ignoring the house size).
-In contrast to the $K=1$ example,
+in the dataset (completely ignoring the house size).
+In contrast to the $K=1$ example,
the smooth, inflexible blue line does not follow the training observations very closely.
In other words, the model is *not influenced enough* by the training data.
Recall from the classification
chapters that this behavior is called *underfitting*; we again use this same
-term in the context of regression.
+term in the context of regression.
Ideally, what we want is neither of the two situations discussed above. Instead,
we would like a model that (1) follows the overall "trend" in the training data, so the model
actually uses the training data to learn something useful, and (2) does not follow
the noisy fluctuations, so that we can be confident that our model will transfer/generalize
-well to other new data. If we explore
+well to other new data. If we explore
the other values for $K$, in particular $K$ = {glue:}`kmin` (as suggested by cross-validation),
we can see it achieves this goal: it follows the increasing trend of house price
versus house size, but is not influenced too much by the idiosyncratic variations
@@ -843,20 +843,20 @@ chapter.
# Changed from ...
-# What about the plots in Figure \@ref(fig:07-howK) where $K$ is quite large,
-# say, $K$ = 250 or 932?
+# What about the plots in Figure \@ref(fig:07-howK) where $K$ is quite large,
+# say, $K$ = 250 or 932?
```
## Evaluating on the test set
To assess how well our model might do at predicting on unseen data, we will
assess its RMSPE on the test data. To do this, we will first
-re-train our KNN regression model on the entire training data set,
+re-train our KNN regression model on the entire training data set,
using $K =$ {glue:}`kmin` neighbors. Then we will
use `predict` to make predictions on the test data, and use the `mean_squared_error`
-function to compute the mean squared prediction error (MSPE). Finally take the
+function to compute the mean squared prediction error (MSPE). Finally, we take the
square root of MSPE to get RMSPE. The reason that we do not use the `score` method (as we
-did in Classification chapter) is that the scoring metric `score` uses for `KNeighborsRegressor`
+did in the classification chapter) is that the scoring metric that `score` uses for `KNeighborsRegressor`
is $R^2$ instead of RMSPE.
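+The sketch below illustrates that sequence of steps on a small made-up data set; in the chapter itself, the tuned pipeline and the Sacramento training and test splits take the place of these stand-in objects.
+```python
+import numpy as np
+import pandas as pd
+from sklearn.metrics import mean_squared_error
+from sklearn.neighbors import KNeighborsRegressor
+
+# Made-up train/test data standing in for the Sacramento splits.
+train = pd.DataFrame({"sqft": [900, 1200, 1600, 2100, 2800],
+                      "price": [130_000, 170_000, 230_000, 300_000, 400_000]})
+test = pd.DataFrame({"sqft": [1000, 2500],
+                     "price": [140_000, 360_000]})
+
+# Re-train the model on the full training set, then predict on the test set.
+knn = KNeighborsRegressor(n_neighbors=2)
+knn.fit(train[["sqft"]], train["price"])
+predictions = knn.predict(test[["sqft"]])
+
+# mean_squared_error gives MSPE; take the square root to obtain RMSPE.
+RMSPE = np.sqrt(mean_squared_error(y_true=test["price"], y_pred=predictions))
+print(RMSPE)
+```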
```{code-cell} ipython3
@@ -864,7 +864,7 @@ is $R^2$ instead of RMSPE.
# To assess how well our model might do at predicting on unseen data, we will
# assess its RMSPE on the test data. To do this, we will first
-# re-train our KNN regression model on the entire training data set,
+# re-train our KNN regression model on the entire training data set,
# using $K =$ `r sacr_min |> pull(neighbors)` neighbors. Then we will
# use `predict` to make predictions on the test data, and use the `metrics`
# function again to compute the summary of regression quality. Because
@@ -906,27 +906,27 @@ glue("test_RMSPE", "{0:,.0f}".format(int(RMSPE)))
glue("cv_RMSPE", "{0:,.0f}".format(int(sacr_min['RMSPE'])))
```
-Our final model's test error as assessed by RMSPE
-is \$ {glue:text}`test_RMSPE`.
+Our final model's test error as assessed by RMSPE
+is \$ {glue:text}`test_RMSPE`.
Note that RMSPE is measured in the same units as the response variable.
-In other words, on new observations, we expect the error in our prediction to be
-*roughly* \$ {glue:text}`test_RMSPE`.
+In other words, on new observations, we expect the error in our prediction to be
+*roughly* \$ {glue:text}`test_RMSPE`.
From one perspective, this is good news: this is about the same as the cross-validation
-RMSPE estimate of our tuned model
-(which was \$ {glue:text}`cv_RMSPE`,
+RMSPE estimate of our tuned model
+(which was \$ {glue:text}`cv_RMSPE`),
so we can say that the model appears to generalize well
to new data that it has never seen before.
However, much like in the case of KNN classification, whether this value for RMSPE is *good*—i.e.,
whether an error of around \$ {glue:text}`test_RMSPE`
-is acceptable—depends entirely on the application.
+is acceptable—depends entirely on the application.
In this application, this error
-is not prohibitively large, but it is not negligible either;
+is not prohibitively large, but it is not negligible either;
\$ {glue:text}`test_RMSPE`
might represent a substantial fraction of a home buyer's budget, and
-could make or break whether or not they could afford put an offer on a house.
+could make or break whether or not they could afford to put an offer on a house.
Finally, {numref}`fig:07-predict-all` shows the predictions that our final model makes across
-the range of house sizes we might encounter in the Sacramento area—from 500 to 5000 square feet.
+the range of house sizes we might encounter in the Sacramento area—from 500 to 5000 square feet.
You have already seen a few plots like this in this chapter, but here we also provide the code that generated it
as a learning challenge.
@@ -974,18 +974,18 @@ Predicted values of house price (blue line) for the final KNN regression model.
## Multivariable KNN regression
As in KNN classification, we can use multiple predictors in KNN regression.
-In this setting, we have the same concerns regarding the scale of the predictors. Once again,
+In this setting, we have the same concerns regarding the scale of the predictors. Once again,
predictions are made by identifying the $K$
observations that are nearest to the new point we want to predict; any
variables that are on a large scale will have a much larger effect than
-variables on a small scale. Hence, we should re-define the preprocessor in the
+variables on a small scale. Hence, we should re-define the preprocessor in the
pipeline to incorporate all predictor variables.
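+A sketch of what that re-defined preprocessor and pipeline might look like is shown below; the column names `sqft` and `beds` are assumptions about the predictor columns in the data frame.
+```python
+from sklearn.compose import make_column_transformer
+from sklearn.neighbors import KNeighborsRegressor
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+
+# Standardize *both* predictors before they are passed to KNN regression.
+preprocessor_multi = make_column_transformer((StandardScaler(), ["sqft", "beds"]))
+pipeline_multi = make_pipeline(preprocessor_multi, KNeighborsRegressor())
+```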
```{code-cell} ipython3
:tags: [remove-cell]
# As in KNN classification, we can use multiple predictors in KNN regression.
-# In this setting, we have the same concerns regarding the scale of the predictors. Once again,
+# In this setting, we have the same concerns regarding the scale of the predictors. Once again,
# predictions are made by identifying the $K$
# observations that are nearest to the new point we want to predict; any
# variables that are on a large scale will have a much larger effect than
@@ -996,14 +996,14 @@ pipeline to incorporate all predictor variables.
Note that we also have the same concern regarding the selection of predictors
in KNN regression as in KNN classification: having more predictors is **not** always
better, and the choice of which predictors to use has a potentially large influence
-on the quality of predictions. Fortunately, we can use the predictor selection
+on the quality of predictions. Fortunately, we can use the predictor selection
algorithm from the classification chapter in KNN regression as well.
As the algorithm is the same, we will not cover it again in this chapter.
```{index} K-nearest neighbors; multivariable regression, Sacramento real estate
```
-We will now demonstrate a multivariable KNN regression analysis of the
+We will now demonstrate a multivariable KNN regression analysis of the
Sacramento real estate data using `scikit-learn`. This time we will use
house size (measured in square feet) as well as number of bedrooms as our
predictors, and continue to use house sale price as our outcome/target variable
@@ -1115,11 +1115,11 @@ If we want to compare this multivariable KNN regression model to the model with
predictor *as part of the model tuning process* (e.g., if we are running forward selection as described
in the chapter on evaluating and tuning classification models),
then we must compare the prediction error estimated using only the training data via cross-validation.
-Looking back, the estimated cross-validation accuracy for the single-predictor
+Looking back, the estimated cross-validation RMSPE for the single-predictor
model was {glue:}`cv_RMSPE`.
The estimated cross-validation RMSPE for the multivariable model is
{glue:text}`cv_RMSPE_2pred`.
-Thus in this case, we did not improve the model
+Thus in this case, we did not improve the model
by a large amount by adding this additional predictor.
Regardless, let's continue the analysis to see how we can make predictions with a multivariable KNN regression model
@@ -1157,9 +1157,9 @@ glue("RMSPE_mult", "{0:,.0f}".format(RMSPE_mult))
```
This time, when we performed KNN regression on the same data set, but also
-included number of bedrooms as a predictor, we obtained a RMSPE test error
+included number of bedrooms as a predictor, we obtained an RMSPE test error
of {glue:text}`RMSPE_mult`.
-{numref}`fig:07-knn-mult-viz` visualizes the model's predictions overlaid on top of the data. This
+{numref}`fig:07-knn-mult-viz` visualizes the model's predictions overlaid on top of the data. This
time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2
predictors instead of 1.
@@ -1225,7 +1225,7 @@ KNN regression model’s predictions represented as a surface in 3D space overla
+++
We can see that the predictions in this case, where we have 2 predictors, form
-a surface instead of a line. Because the newly added predictor (number of bedrooms) is
+a surface instead of a line. Because the newly added predictor (number of bedrooms) is
related to price (as price changes, so does number of bedrooms)
and is not totally determined by house size (our other predictor),
we get additional and useful information for making our
@@ -1238,13 +1238,13 @@ bedrooms, we would predict the same price for these two houses.
## Strengths and limitations of KNN regression
-As with KNN classification (or any prediction algorithm for that matter), KNN
+As with KNN classification (or any prediction algorithm for that matter), KNN
regression has both strengths and weaknesses. Some are listed here:
**Strengths:** K-nearest neighbors regression
1. is a simple, intuitive algorithm,
-2. requires few assumptions about what the data must look like, and
+2. requires few assumptions about what the data must look like, and
3. works well with non-linear relationships (i.e., if the relationship is not a straight line).
**Weaknesses:** K-nearest neighbors regression
@@ -1257,8 +1257,8 @@ regression has both strengths and weaknesses. Some are listed here:
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
in the "Regression I: K-nearest neighbors" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/regression2.md b/source/regression2.md
index 2e7c76be..9b38045d 100644
--- a/source/regression2.md
+++ b/source/regression2.md
@@ -35,9 +35,9 @@ from IPython.display import HTML
from myst_nb import glue
```
-## Overview
+## Overview
Up to this point, we have solved all of our predictive problems—both classification
-and regression—using K-nearest neighbors (KNN)-based approaches. In the context of regression,
+and regression—using K-nearest neighbors (KNN)-based approaches. In the context of regression,
there is another commonly used method known as *linear regression*. This chapter provides an introduction
to the basic concept of linear regression, shows how to use `scikit-learn` to perform linear regression in Python,
and characterizes its strengths and weaknesses compared to KNN regression. The focus is, as usual,
@@ -45,7 +45,7 @@ on the case where there is a single predictor and single response variable of in
concludes with an example using *multivariable linear regression* when there is more than one
predictor.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
* Use Python and `scikit-learn` to fit a linear regression model on training data.
@@ -63,9 +63,9 @@ By the end of the chapter, readers will be able to do the following:
At the end of the previous chapter, we noted some limitations of KNN regression.
While the method is simple and easy to understand, KNN regression does not
predict well beyond the range of the predictors in the training data, and
-the method gets significantly slower as the training data set grows.
+the method gets significantly slower as the training data set grows.
Fortunately, there is an alternative to KNN regression—*linear regression*—that addresses
-both of these limitations. Linear regression is also very commonly
+both of these limitations. Linear regression is also very commonly
used in practice because it provides an interpretable mathematical equation that describes
the relationship between the predictor and response variables. In this first part of the chapter, we will focus on *simple* linear regression,
which involves only one predictor variable and one response variable; later on, we will consider
@@ -74,7 +74,7 @@ which involves only one predictor variable and one response variable; later on,
predicting a numerical response variable (like race time, house price, or height);
but *how* it makes those predictions for a new observation is quite different from KNN regression.
Instead of looking at the K nearest neighbors and averaging
-over their values for a prediction, in simple linear regression, we create a
+over their values for a prediction, in simple linear regression, we create a
straight line of best fit through the training data and then
"look up" the prediction using the line.
@@ -83,16 +83,16 @@ straight line of best fit through the training data and then
```{index} regression; logistic
```
-> **Note:** Although we did not cover it in earlier chapters, there
+> **Note:** Although we did not cover it in earlier chapters, there
> is another popular method for classification called *logistic
> regression* (it is used for classification even though the name, somewhat confusingly,
> has the word "regression" in it). In logistic regression—similar to linear regression—you
> "fit" the model to the training data and then "look up" the prediction for each new observation.
-> Logistic regression and KNN classification have an advantage/disadvantage comparison
+> Logistic regression and KNN classification have an advantage/disadvantage comparison
> similar to that of linear regression and KNN
> regression. It is useful to have a good understanding of linear regression before learning about
> logistic regression. After reading this chapter, see the "Additional Resources" section at the end of the
-> classification chapters to learn more about logistic regression.
+> classification chapters to learn more about logistic regression.
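For readers who want a quick preview, the sketch below (using made-up toy data) shows that `scikit-learn`'s `LogisticRegression` follows the same fit-and-predict pattern used throughout these chapters; how the method actually works is left to the resources mentioned above.

```{code-cell} ipython3
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: a single predictor and two class labels
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression()
log_reg.fit(X, y)

log_reg.predict([[2.5], [4.5]])        # predicted class labels
log_reg.predict_proba([[2.5], [4.5]])  # estimated class probabilities
```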
+++
@@ -101,7 +101,7 @@ straight line of best fit through the training data and then
Let's return to the Sacramento housing data from Chapter {ref}`regression1` to learn
how to apply linear regression and compare it to KNN regression. For now, we
-will consider
+will consider
a smaller version of the housing data to help make our visualizations clear.
Recall our predictive question: can we use the size of a house in the Sacramento, CA area to predict
its sale price? In particular, recall that we have come across a new 2,000 square-foot house we are interested
@@ -155,7 +155,7 @@ Scatter plot of sale price versus size with line of best fit for subset of the S
```{index} straight line; equation
```
-The equation for the straight line is:
+The equation for the straight line is:
$$\text{house sale price} = \beta_0 + \beta_1 \cdot (\text{house size}),$$
where
@@ -163,18 +163,18 @@ where
- $\beta_0$ is the *vertical intercept* of the line (the price when house size is 0)
- $\beta_1$ is the *slope* of the line (how quickly the price increases as you increase house size)
-Therefore using the data to find the line of best fit is equivalent to finding coefficients
+Therefore using the data to find the line of best fit is equivalent to finding coefficients
$\beta_0$ and $\beta_1$ that *parametrize* (correspond to) the line of best fit.
Now of course, in this particular problem, the idea of a 0 square-foot house is a bit silly;
-but you can think of $\beta_0$ here as the "base price," and
+but you can think of $\beta_0$ here as the "base price," and
$\beta_1$ as the increase in price for each square foot of space.
-Let's push this thought even further: what would happen in the equation for the line if you
+Let's push this thought even further: what would happen in the equation for the line if you
tried to evaluate the price of a house with size 6 *million* square feet?
Or what about *negative* 2,000 square feet? As it turns out, nothing in the formula breaks; linear
regression will happily make predictions for crazy predictor values if you ask it to. But even though
you *can* make these wild predictions, you shouldn't. You should only make predictions roughly within
the range of your original data, and perhaps a bit beyond it only if it makes sense. For example,
-the data in {numref}`fig:08-lin-reg1` only reaches around 600 square feet on the low end, but
+the data in {numref}`fig:08-lin-reg1` only reaches around 600 square feet on the low end, but
it would probably be reasonable to use the linear regression model to make a prediction at 500 square feet, say.
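To make this concrete, here is a small arithmetic sketch; the coefficient values below are made up for illustration and are not the fitted Sacramento coefficients.

```{code-cell} ipython3
# Hypothetical coefficients, for illustration only
beta_0 = 10_000  # "base price"
beta_1 = 150     # extra dollars per additional square foot

def predict_price(house_size):
    return beta_0 + beta_1 * house_size

predict_price(2_000), predict_price(6_000_000), predict_price(-2_000)
```

The formula happily returns 310,000 for the reasonable input, 900,010,000 for the 6 million square-foot "house", and a nonsensical negative price for the negative size; nothing in the mathematics stops us, so we have to apply that judgment ourselves.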
Back to the example! Once we have the coefficients $\beta_0$ and $\beta_1$, we can use the equation
@@ -234,10 +234,10 @@ Scatter plot of sale price versus size with line of best fit and a red dot at th
+++
By using simple linear regression on this small data set to predict the sale price
-for a 2,000 square-foot house, we get a predicted value of
+for a 2,000 square-foot house, we get a predicted value of
\${glue:text}`pred_2000`. But wait a minute...how
exactly does simple linear regression choose the line of best fit? Many
-different lines could be drawn through the data points.
+different lines could be drawn through the data points.
Some plausible examples are shown in {numref}`fig:08-several-lines`.
```{code-cell} ipython3
@@ -292,8 +292,8 @@ Scatter plot of sale price versus size with many possible lines that could be dr
Simple linear regression chooses the straight line of best fit by choosing
the line that minimizes the **average squared vertical distance** between itself and
-each of the observed data points in the training data. {numref}`fig:08-verticalDistToMin` illustrates
-these vertical distances as red lines. Finally, to assess the predictive
+each of the observed data points in the training data. {numref}`fig:08-verticalDistToMin` illustrates
+these vertical distances as red lines. Finally, to assess the predictive
accuracy of a simple linear regression model,
we use RMSPE—the same measure of predictive performance we used with KNN regression.
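The following sketch spells out that criterion on a made-up toy data set; `np.polyfit` is used here only as a convenient way to obtain the least-squares line, not as the workflow we use in the rest of the chapter.

```{code-cell} ipython3
import numpy as np

# Toy data for illustration only
size = np.array([600, 1000, 1400, 1800, 2200])
price = np.array([120_000, 195_000, 250_000, 330_000, 405_000])

def avg_sq_vertical_dist(intercept, slope):
    """Average squared vertical distance between the data and a candidate line."""
    predictions = intercept + slope * size
    return np.mean((price - predictions) ** 2)

# Least-squares fit: a degree-1 polynomial, i.e., a straight line
slope_ls, intercept_ls = np.polyfit(size, price, deg=1)

# The least-squares line beats an arbitrary alternative line on this criterion
avg_sq_vertical_dist(intercept_ls, slope_ls) < avg_sq_vertical_dist(50_000, 100)
```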
@@ -342,8 +342,8 @@ Scatter plot of sale price versus size with red lines denoting the vertical dist
```
We can perform simple linear regression in Python using `scikit-learn` in a
-very similar manner to how we performed KNN regression.
-To do this, instead of creating a `KNeighborsRegressor` model specification,
+very similar manner to how we performed KNN regression.
+To do this, instead of creating a `KNeighborsRegressor` model specification,
we use a `LinearRegression` model specification.
Another difference is that we do not need to choose $K$ in the
context of linear regression, and so we do not need to perform cross-validation.
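As a minimal sketch of that workflow (the small data frame below is a synthetic stand-in for the chapter's training split, which is constructed elsewhere in the chapter):

```{code-cell} ipython3
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Sacramento training data
train_sketch = pd.DataFrame({
    "sqft": [600, 1000, 1400, 1800, 2200],
    "price": [120_000, 195_000, 250_000, 330_000, 405_000],
})

lm_sketch = LinearRegression()  # no K to tune, so no cross-validation needed
lm_sketch.fit(train_sketch[["sqft"]], train_sketch["price"])

lm_sketch.intercept_, lm_sketch.coef_  # beta_0 and beta_1
```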
@@ -361,7 +361,7 @@ can come back to after we choose our final model. Let's take care of that now.
:tags: [remove-cell]
# We can perform simple linear regression in R using `tidymodels` \index{tidymodels} in a
-# very similar manner to how we performed KNN regression.
+# very similar manner to how we performed KNN regression.
# To do this, instead of creating a `nearest_neighbor` model specification with
# the `kknn` engine, we use a `linear_reg` model specification
# with the `lm` engine. Another difference is that we do not need to choose $K$ in the
@@ -431,17 +431,17 @@ glue("train_lm_intercept_f", "{0:,.0f}".format(lm.intercept_[0]))
+++
-Our coefficients are
+Our coefficients are
(intercept) $\beta_0=$ {glue:}`train_lm_intercept`
and (slope) $\beta_1=$ {glue:}`train_lm_slope`.
This means that the equation of the line of best fit is
$\text{house sale price} =$ {glue:}`train_lm_intercept` $+$ {glue:}`train_lm_slope` $\cdot (\text{house size}).$
-In other words, the model predicts that houses
+In other words, the model predicts that houses
start at \${glue:text}`train_lm_intercept_f` for 0 square feet, and that
-every extra square foot increases the cost of
-the house by \${glue:text}`train_lm_slope_f`. Finally,
+every extra square foot increases the cost of
+the house by \${glue:text}`train_lm_slope_f`. Finally,
we predict on the test data set to assess how well our model does:
```{code-cell} ipython3
@@ -472,7 +472,7 @@ glue("sacr_RMSPE", "{0:,.0f}".format(RMSPE))
```
Our final model's test error as assessed by RMSPE
-is {glue:text}`sacr_RMSPE`.
+is {glue:text}`sacr_RMSPE`.
Remember that this is in units of the target/response variable, and here that
is US Dollars (USD). Does this mean our model is "good" at predicting house
sale price based off of the predictor of home size? Again, answering this is
@@ -480,10 +480,10 @@ tricky and requires knowledge of how you intend to use the prediction.
To visualize the simple linear regression model, we can plot the predicted house
sale price across all possible house sizes we might encounter superimposed on a scatter
-plot of the original housing price data. There is a function in
+plot of the original housing price data. There is a function in
the `altair` package, `transform_regression`, that
allows us to add a layer on our plot with the simple
-linear regression predicted line of best fit.
+linear regression predicted line of best fit.
{numref}`fig:08-lm-predict-all` displays the result.
```{code-cell} ipython3
@@ -491,7 +491,7 @@ linear regression predicted line of best fit.
# To visualize the simple linear regression model, we can plot the predicted house
# sale price across all possible house sizes we might encounter superimposed on a scatter
-# plot of the original housing price data. There is a plotting function in
+# plot of the original housing price data. There is a plotting function in
# the `tidyverse`, `geom_smooth`, that
# allows us to add a layer on our plot with the simple
# linear regression predicted line of best fit. By default `geom_smooth` adds some other information
@@ -680,20 +680,20 @@ What differences do we observe in {numref}`fig:08-compareRegression`? One obviou
difference is the shape of the blue lines. In simple linear regression we are
restricted to a straight line, whereas in KNN regression our line is much more
flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the
-model to a straight line. A
+model to a straight line. A
straight line can be defined by two numbers, the
vertical intercept and the slope. The intercept tells us what the prediction is when
all of the predictors are equal to 0; and the slope tells us what unit increase in the target/response
variable we predict given a unit increase in the predictor
variable. KNN regression, as simple as it is to implement and understand, has no such
-interpretability from its wiggly line.
+interpretability from its wiggly line.
```{index} underfitting; regression
```
There can, however, also be a disadvantage to using a simple linear regression
model in some cases, particularly when the relationship between the target and
-the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In
+the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In
these cases the prediction model from a simple linear regression
will underfit (have high bias), meaning that model/predicted values do not
match the actual observed values very well. Such a model would probably have a
@@ -704,7 +704,7 @@ are other types of regression you can learn about in future books that may do
even better at predicting with such data.
How do these two models compare on the Sacramento house prices data set? In
-{numref}`fig:08-compareRegression`, we also printed the RMSPE as calculated from
+{numref}`fig:08-compareRegression`, we also printed the RMSPE as calculated from
predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear
regression model is slightly lower than the RMSPE for the KNN regression model.
Considering that the simple linear regression model is also more interpretable,
@@ -720,10 +720,10 @@ predicts a constant slope. Predicting outside the range of the observed
data is known as *extrapolation*; KNN and linear models behave quite differently
when extrapolating. Depending on the application, the flat
or constant slope trend may make more sense. For example, if our housing
-data were slightly different, the linear model may have actually predicted
+data were slightly different, the linear model may have actually predicted
a *negative* price for a small house (if the intercept $\beta_0$ was negative),
which obviously does not match reality. On the other hand, the trend of increasing
-house size corresponding to increasing house price probably continues for large houses,
+house size corresponding to increasing house price probably continues for large houses,
so the "flat" extrapolation of KNN likely does not match reality.
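The sketch below uses made-up toy data, refit here only to probe behavior outside the observed range, and shows the two extrapolation behaviors side by side.

```{code-cell} ipython3
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

train = pd.DataFrame({
    "sqft": [600, 1000, 1400, 1800, 2200],
    "price": [120_000, 195_000, 250_000, 330_000, 405_000],
})

linear = LinearRegression().fit(train[["sqft"]], train["price"])
knn = KNeighborsRegressor(n_neighbors=3).fit(train[["sqft"]], train["price"])

# Predictions well beyond the largest observed size (2,200 square feet)
big_houses = pd.DataFrame({"sqft": [3_000, 5_000, 10_000]})
linear.predict(big_houses)  # keeps increasing at a constant slope
knn.predict(big_houses)     # flat: always the average of the 3 largest training houses
```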
+++
@@ -739,49 +739,49 @@ so the "flat" extrapolation of KNN likely does not match reality.
```
As in KNN classification and KNN regression, we can move beyond the simple
-case of only one predictor to the case with multiple predictors,
+case of only one predictor to the case with multiple predictors,
known as *multivariable linear regression*.
To do this, we follow a very similar approach to what we did for
-KNN regression: we just specify the training data by adding more predictors.
+KNN regression: we just specify the training data by adding more predictors.
But recall that we do not need to use cross-validation to choose any parameters,
-nor do we need to standardize (i.e., center and scale) the data for linear regression.
+nor do we need to standardize (i.e., center and scale) the data for linear regression.
Note once again that we have the same concerns regarding multiple predictors
as in the settings of multivariable KNN regression and classification: having more predictors is **not** always
-better. But because the same predictor selection
+better. But because the same predictor selection
algorithm from the classification chapter extends to the setting of linear regression,
it will not be covered again in this chapter.
```{index} Sacramento real estate
```
-We will demonstrate multivariable linear regression using the Sacramento real estate
+We will demonstrate multivariable linear regression using the Sacramento real estate
data with both house size
(measured in square feet) as well as number of bedrooms as our predictors, and
-continue to use house sale price as our response variable. We will start by
-specifying the training data to
+continue to use house sale price as our response variable. We will start by
+specifying the training data to
include both the `sqft` and `beds` variables as predictors:
```{code-cell} ipython3
:tags: [remove-cell]
# As in KNN classification and KNN regression, we can move beyond the simple
-# case of only one predictor to the case with multiple predictors,
+# case of only one predictor to the case with multiple predictors,
# known as *multivariable linear regression*. \index{regression!multivariable linear}\index{regression!multivariable linear equation|see{plane equation}}
# To do this, we follow a very similar approach to what we did for
-# KNN regression: we just add more predictors to the model formula in the
+# KNN regression: we just add more predictors to the model formula in the
# recipe. But recall that we do not need to use cross-validation to choose any parameters,
-# nor do we need to standardize (i.e., center and scale) the data for linear regression.
+# nor do we need to standardize (i.e., center and scale) the data for linear regression.
# Note once again that we have the same concerns regarding multiple predictors
# as in the settings of multivariable KNN regression and classification: having more predictors is **not** always
-# better. But because the same predictor selection
+# better. But because the same predictor selection
# algorithm from the classification chapter extends to the setting of linear regression,
# it will not be covered again in this chapter.
# We will demonstrate multivariable linear regression using the Sacramento real estate \index{Sacramento real estate}
# data with both house size
# (measured in square feet) as well as number of bedrooms as our predictors, and
-# continue to use house sale price as our response variable. We will start by
-# changing the formula in the recipe to
+# continue to use house sale price as our response variable. We will start by
+# changing the formula in the recipe to
# include both the `sqft` and `beds` variables as predictors:
```
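A minimal, self-contained sketch of that specification (synthetic data standing in for the Sacramento training split) looks like this:

```{code-cell} ipython3
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Sacramento training split
train = pd.DataFrame({
    "sqft": [800, 1200, 1500, 1800, 2200, 2600],
    "beds": [2, 2, 3, 3, 4, 5],
    "price": [150_000, 210_000, 275_000, 310_000, 420_000, 500_000],
})

mlm_sketch = LinearRegression()
mlm_sketch.fit(train[["sqft", "beds"]], train["price"])  # two predictors this time

mlm_sketch.coef_, mlm_sketch.intercept_  # one slope per predictor, plus the intercept
```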
@@ -899,8 +899,8 @@ Linear regression plane of best fit overlaid on top of the data (using price, ho
+++
We see that the predictions from linear regression with two predictors form a
-flat plane. This is the hallmark of linear regression, and differs from the
-wiggly, flexible surface we get from other methods such as KNN regression.
+flat plane. This is the hallmark of linear regression, and differs from the
+wiggly, flexible surface we get from other methods such as KNN regression.
As discussed, this can be advantageous in one aspect, which is that for each
predictor, we can get slopes/intercept from linear regression, and thus describe the
plane mathematically. We can extract those slope values from our model object
@@ -917,7 +917,7 @@ mlm.intercept_
```{index} plane equation
```
-And then use those slopes to write a mathematical equation to describe the prediction plane:
+And then use those slopes to write a mathematical equation to describe the prediction plane:
$$\text{house sale price} = \beta_0 + \beta_1\cdot(\text{house size}) + \beta_2\cdot(\text{number of bedrooms}),$$
where:
@@ -944,9 +944,9 @@ $\text{house sale price} =$ {glue:text}`icept` $+$ {glue:text}`sqftc` $\cdot (\t
This model is more interpretable than the multivariable KNN
regression model; we can write a mathematical equation that explains how
-each predictor is affecting the predictions. But as always, we should
+each predictor is affecting the predictions. But as always, we should
question how well multivariable linear regression is doing compared to
-the other tools we have, such as simple linear regression
+the other tools we have, such as simple linear regression
and multivariable KNN regression. If this comparison is part of
the model tuning process—for example, if we are trying
out many different sets of predictors for multivariable linear
@@ -963,25 +963,25 @@ lm_mult_test_RMSPE
```{index} RMSPE
```
-We obtain an RMSPE for the multivariable linear regression model
+We obtain an RMSPE for the multivariable linear regression model
of {glue:text}`sacr_mult_RMSPE`. This prediction error
is less than the prediction error for the multivariable KNN regression model,
indicating that we should likely choose linear regression for predictions of
house sale price on this data set. Revisiting the simple linear regression model
-with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was
-{glue:text}`sacr_RMSPE`,
+with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was
+{glue:text}`sacr_RMSPE`,
which is slightly higher than that of our more complex model. Our model with two predictors
-provided a slightly better fit on test data than our model with just one.
+provided a slightly better fit on test data than our model with just one.
As mentioned earlier, this is not always the case: sometimes including more
-predictors can negatively impact the prediction performance on unseen
+predictors can negatively impact the prediction performance on unseen
test data.
+++
## Multicollinearity and outliers
-What can go wrong when performing (possibly multivariable) linear regression?
-This section will introduce two common issues—*outliers* and *collinear predictors*—and
+What can go wrong when performing (possibly multivariable) linear regression?
+This section will introduce two common issues—*outliers* and *collinear predictors*—and
illustrate their impact on predictions.
+++
@@ -993,7 +993,7 @@ illustrate their impact on predictions.
Outliers are data points that do not follow the usual pattern of the rest of the data.
In the setting of linear regression, these are points that
- have a vertical distance to the line of best fit that is either much higher or much lower
+ have a vertical distance to the line of best fit that is either much higher or much lower
than you might expect based on the rest of the data. The problem with outliers is that
they can have *too much influence* on the line of best fit. In general, it is very difficult
to judge accurately which data are outliers without advanced techniques that are beyond
@@ -1003,7 +1003,7 @@ But to illustrate what can happen when you have outliers, {numref}`fig:08-lm-out
shows a small subset of the Sacramento housing data again, except we have added a *single* data point (highlighted
in red). This house is 5,000 square feet in size, and sold for only \$50,000. Unbeknownst to the
data analyst, this house was sold by a parent to their child for an absurdly low price. Of course,
-this is not representative of the real housing market values that the other data points follow;
+this is not representative of the real housing market values that the other data points follow;
the data point is an *outlier*. In blue we plot the original line of best fit, and in red
we plot the new line of best fit including the outlier. You can see how different the red line
is from the blue line, which is entirely caused by that one extra outlier data point.
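The sketch below uses entirely synthetic data to show the same effect numerically: adding one extreme point to an otherwise well-behaved data set changes the fitted slope dramatically.

```{code-cell} ipython3
import pandas as pd
from sklearn.linear_model import LinearRegression

clean = pd.DataFrame({
    "sqft": [800, 1200, 1500, 1800, 2200, 2600],
    "price": [150_000, 210_000, 275_000, 310_000, 420_000, 500_000],
})

# Add a single outlier: a huge house that sold for almost nothing
with_outlier = pd.concat(
    [clean, pd.DataFrame({"sqft": [5_000], "price": [50_000]})],
    ignore_index=True,
)

slope_clean = LinearRegression().fit(clean[["sqft"]], clean["price"]).coef_[0]
slope_outlier = LinearRegression().fit(
    with_outlier[["sqft"]], with_outlier["price"]
).coef_[0]

slope_clean, slope_outlier  # one outlier pulls the slope down substantially
```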
@@ -1144,7 +1144,7 @@ Scatter plot of the full data, with outlier highlighted in red.
```
The second, and much more subtle, issue can occur when performing multivariable
-linear regression. In particular, if you include multiple predictors that are
+linear regression. In particular, if you include multiple predictors that are
strongly linearly related to one another, the coefficients that describe the
plane of best fit can be very unreliable—small changes to the data can
result in large changes in the coefficients. Consider an extreme example using
@@ -1263,15 +1263,15 @@ book; see the list of additional resources at the end of this chapter to find ou
We were quite fortunate in our initial exploration to find a predictor variable (house size)
that seems to have a meaningful and nearly linear relationship with our response variable (sale price).
But what should we do if we cannot immediately find such a nice variable?
-Well, sometimes it is just a fact that the variables in the data do not have enough of
+Well, sometimes it is just a fact that the variables in the data do not have enough of
a relationship with the response variable to provide useful predictions. For example,
if the only available predictor was "the current house owner's favorite ice cream flavor",
we likely would have little hope of using that variable to predict the house's sale price
-(barring any future remarkable scientific discoveries about the relationship between
-the housing market and homeowner ice cream preferences). In cases like these,
+(barring any future remarkable scientific discoveries about the relationship between
+the housing market and homeowner ice cream preferences). In cases like these,
the only option is to obtain measurements of more useful variables.
-There are, however, a wide variety of cases where the predictor variables do have a
+There are, however, a wide variety of cases where the predictor variables do have a
meaningful relationship with the response variable, but that relationship does not fit
the assumptions of the regression method you have chosen. For example, a data frame `df`
with two variables—`x` and `y`—with a nonlinear relationship between the two variables
@@ -1328,7 +1328,7 @@ Example of a data set with a nonlinear relationship between the predictor and th
Instead of trying to predict the response `y` using a linear regression on `x`,
we might have some scientific background about our problem to suggest that `y`
should be a cubic function of `x`. So before performing regression,
-we might *create a new predictor variable* `z`:
+we might *create a new predictor variable* `z`:
```{code-cell} ipython3
df["z"] = df["x"] ** 3
@@ -1336,8 +1336,8 @@ df["z"] = df["x"] ** 3
Then we can perform linear regression for `y` using the predictor variable `z`,
as shown in {numref}`fig:08-predictor-design-2`.
-Here you can see that the transformed predictor `z` helps the
-linear regression model make more accurate predictions.
+Here you can see that the transformed predictor `z` helps the
+linear regression model make more accurate predictions.
Note that none of the `y` response values have changed between {numref}`fig:08-predictor-design`
and {numref}`fig:08-predictor-design-2`; the only change is that the `x` values
have been replaced by `z` values.
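A self-contained sketch of this idea (synthetic cubic data standing in for `df`, which is constructed earlier in the chapter) compares a fit on `x` directly with a fit on the engineered predictor `z`:

```{code-cell} ipython3
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a cubic relationship between x and y
rng = np.random.default_rng(1)
df_sketch = pd.DataFrame({"x": np.linspace(-3, 3, 50)})
df_sketch["y"] = 2 * df_sketch["x"] ** 3 + rng.normal(0, 2, size=50)

# Feature engineering: create the new predictor z = x^3
df_sketch["z"] = df_sketch["x"] ** 3

lm_x = LinearRegression().fit(df_sketch[["x"]], df_sketch["y"])
lm_z = LinearRegression().fit(df_sketch[["z"]], df_sketch["y"])

# R^2 on the training data: the fit on z tracks the cubic trend much more closely
lm_x.score(df_sketch[["x"]], df_sketch["y"]), lm_z.score(df_sketch[["z"]], df_sketch["y"])
```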
@@ -1378,7 +1378,7 @@ The process of
transforming predictors (and potentially combining multiple predictors in the process)
is known as *feature engineering*. In real data analysis
problems, you will need to rely on
-a deep understanding of the problem—as well as the wrangling tools
+a deep understanding of the problem—as well as the wrangling tools
from previous chapters—to engineer useful new features that improve
predictive performance.
@@ -1395,11 +1395,11 @@ So far in this textbook we have used regression only in the context of
prediction. However, regression can also be seen as a method to understand and
quantify the effects of individual variables on a response / outcome of interest.
In the housing example from this chapter, beyond just using past data
-to predict future sale prices,
+to predict future sale prices,
we might also be interested in describing the
individual relationships of house size and the number of bedrooms with house price,
quantifying how strong each of these relationships are, and assessing how accurately we
-can estimate their magnitudes. And even beyond that, we may be interested in
+can estimate their magnitudes. And even beyond that, we may be interested in
understanding whether the predictors *cause* changes in the price.
These sides of regression are well beyond the scope of this book; but
the material you have learned here should give you a foundation of knowledge
@@ -1409,8 +1409,8 @@ that will serve you well when moving to more advanced books on the topic.
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
in the "Regression II: linear regression" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -1426,7 +1426,7 @@ and guidance that the worksheets provide will function as intended.
- The [`scikit-learn` website](https://scikit-learn.org/stable/) is an excellent
reference for more details on, and advanced usage of, the functions and
- packages in the past two chapters. Aside from that, it also offers many
+ packages in the past two chapters. Aside from that, it also offers many
useful [tutorials](https://scikit-learn.org/stable/tutorial/index.html) and [an extensive list
of more advanced examples](https://scikit-learn.org/stable/auto_examples/index.html#general-examples)
that you can use to continue learning beyond the scope of this book.
@@ -1438,7 +1438,7 @@ and guidance that the worksheets provide will function as intended.
"explanatory" / "inferential" approach to regression in general (in Chapters 5,
6, and 10), which provides a nice complement to the predictive tack we take in
the present book.
-- *An Introduction to Statistical Learning* {cite:p}`james2013introduction` provides
+- *An Introduction to Statistical Learning* {cite:p}`james2013introduction` provides
a great next stop in the process of
learning about regression. Chapter 3 covers linear regression at a slightly
more mathematical level than we do here, but it is not too large a leap and so
@@ -1459,7 +1459,7 @@ and guidance that the worksheets provide will function as intended.
# packages in the past two chapters. Aside from that, it also has a [nice
# beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list
# of more advanced examples](https://www.tidymodels.org/learn/) that you can use
-# to continue learning beyond the scope of this book.
+# to continue learning beyond the scope of this book.
# - *Modern Dive* [@moderndive] is another textbook that uses the
# `tidyverse` / `tidymodels` framework. Chapter 6 complements the material in
# the current chapter well; it covers some slightly more advanced concepts than
@@ -1468,7 +1468,7 @@ and guidance that the worksheets provide will function as intended.
# "explanatory" / "inferential" approach to regression in general (in Chapters 5,
# 6, and 10), which provides a nice complement to the predictive tack we take in
# the present book.
-# - *An Introduction to Statistical Learning* [@james2013introduction] provides
+# - *An Introduction to Statistical Learning* [@james2013introduction] provides
# a great next stop in the process of
# learning about regression. Chapter 3 covers linear regression at a slightly
# more mathematical level than we do here, but it is not too large a leap and so
diff --git a/source/setup.md b/source/setup.md
index ee8fbe95..5efe5697 100644
--- a/source/setup.md
+++ b/source/setup.md
@@ -18,7 +18,7 @@ kernelspec:
## Overview
-In this chapter, you'll learn how to install all of the software
+In this chapter, you'll learn how to install all of the software
needed to do the data science covered in this book on your own computer.
## Chapter learning objectives
@@ -31,8 +31,8 @@ By the end of the chapter, readers will be able to do the following:
## Installing software on your own computer
-This section will provide instructions for installing the software required by
-this book on your own computer.
+This section will provide instructions for installing the software required by
+this book on your own computer.
Given that installation instructions can vary widely based on the computer setup,
we have created instructions for multiple operating systems.
In particular, the installation instructions below have been verified to work
@@ -45,25 +45,25 @@ on a computer that:
### Git
-As shown in Chapter \@ref(Getting-started-with-version-control),
-Git \index{git!installation} is a very useful tool for version controlling your projects,
-as well as sharing your work with others. Here's how to install Git on
-the following operating systems:
+As shown in Chapter \@ref(Getting-started-with-version-control),
+Git \index{git!installation} is a very useful tool for version controlling your projects,
+as well as sharing your work with others. Here's how to install Git on
+the following operating systems:
**Windows:** To install
-Git on Windows, go to and download the Windows
-version of Git. Once the download has finished, run the installer and accept
+Git on Windows, go to [https://git-scm.com/download/win](https://git-scm.com/download/win) and download the Windows
+version of Git. Once the download has finished, run the installer and accept
the default configuration for all pages.
-**MacOS:** To install Git on Mac OS,
-open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY))
+**MacOS:** To install Git on Mac OS,
+open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY))
and type the following command:
```
xcode-select --install
```
-**Ubuntu:** To install Git on Ubuntu, open the terminal
+**Ubuntu:** To install Git on Ubuntu, open the terminal
and type the following commands:
```
@@ -75,31 +75,31 @@ sudo apt install git
### Miniconda
-To run Jupyter notebooks on your computer,
-you will need to install the web-based platform JupyterLab.
+To run Jupyter notebooks on your computer,
+you will need to install the web-based platform JupyterLab.
But JupyterLab relies on Python, so we need to install Python first.
We can install Python via
the \index{miniconda} [miniconda Python package distribution](https://docs.conda.io/en/latest/miniconda.html).
-**Windows:** To install miniconda on Windows, download
-the [latest Python 64-bit version from here](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe).
-Once the download has finished, run the installer
-and accept the default configuration for all pages.
+**Windows:** To install miniconda on Windows, download
+the [latest Python 64-bit version from here](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe).
+Once the download has finished, run the installer
+and accept the default configuration for all pages.
After installation, you can open the Anaconda Prompt
-by opening the Start Menu and searching for the program called
-"Anaconda Prompt (miniconda3)".
-When this opens, you will see a prompt similar to
-`(base) C:\Users\your_name`.
+by opening the Start Menu and searching for the program called
+"Anaconda Prompt (miniconda3)".
+When this opens, you will see a prompt similar to
+`(base) C:\Users\your_name`.
-**MacOS:** To install miniconda on MacOS, you will need to use a different
+**MacOS:** To install miniconda on MacOS, you will need to use a different
installation method depending on the type of processor chip your computer has.
-If your Mac computer has an Intel x86 processor chip you can download
-the [latest Python 64-bit version from here](https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.pkg).
-After the download has finished, run the installer and accept the default
+If your Mac computer has an Intel x86 processor chip you can download
+the [latest Python 64-bit version from here](https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.pkg).
+After the download has finished, run the installer and accept the default
configuration for all pages.
-If your Mac computer has an Apple M1 processor chip you can download
+If your Mac computer has an Apple M1 processor chip you can download
the [latest Python 64-bit version from here](https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh).
After the download has finished, you need to run the downloaded script in the terminal using a command
like:
@@ -109,21 +109,21 @@ bash path/to/Miniconda3-latest-MacOSX-arm64.sh
```
Make sure to replace `path/to/` with the path of the folder
-containing the downloaded script. Most computers will save downloaded files to the `Downloads` folder.
+containing the downloaded script. Most computers will save downloaded files to the `Downloads` folder.
If this is the case for your computer, you can run the script in the terminal by typing:
```
bash Downloads/Miniconda3-latest-MacOSX-arm64.sh
```
-The instructions for the installation will then appear.
+The instructions for the installation will then appear.
Follow the prompts and agree to accepting the license,
the default installation location,
and to running `conda init`, which makes `conda` available from the terminal.
**Ubuntu:** To install miniconda on Ubuntu, first download
-the [latest Python 64-bit version from here](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh).
-After the download has finished, open the terminal and execute the following
+the [latest Python 64-bit version from here](https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh).
+After the download has finished, open the terminal and execute the following
command:
```
@@ -138,14 +138,14 @@ If this is the case for your computer, you can run the script in the terminal by
bash Downloads/Miniconda3-latest-Linux-x86_64.sh
```
-The instructions for the installation will then appear.
+The instructions for the installation will then appear.
Follow the prompts and agree to accepting the license,
the default installation location,
and to running `conda init`, which makes `conda` available from the terminal.
### JupyterLab
-With miniconda set up, we can now install JupyterLab \index{JupyterLab installation} and the Jupyter Git \index{git!Jupyter extension} extension.
+With miniconda set up, we can now install JupyterLab \index{JupyterLab installation} and the Jupyter Git \index{git!Jupyter extension} extension.
Type the following into the Anaconda Prompt (Windows) or the terminal (MacOS and Ubuntu) and press enter:
```
@@ -154,17 +154,17 @@ conda install -y nodejs
pip install --upgrade jupyterlab-git
```
-To test that your JupyterLab installation is functional, you can type
-`jupyter lab` into the Anaconda Prompt (Windows)
-or terminal (MacOS and Ubuntu) and press enter. This should open a new
-tab in your default browser with the JupyterLab interface. To exit out of
-JupyterLab you can click `File -> Shutdown`, or go to the terminal from which
+To test that your JupyterLab installation is functional, you can type
+`jupyter lab` into the Anaconda Prompt (Windows)
+or terminal (MacOS and Ubuntu) and press enter. This should open a new
+tab in your default browser with the JupyterLab interface. To exit out of
+JupyterLab you can click `File -> Shutdown`, or go to the terminal from which
you launched JupyterLab, hold `Ctrl`, and press `C` twice.
-To improve the experience of using R in JupyterLab, you should also add an extension
+To improve the experience of using R in JupyterLab, you should also add an extension
that allows you to set up keyboard shortcuts for inserting text.
-By default,
-this extension creates shortcuts for inserting two of the most common R
+By default,
+this extension creates shortcuts for inserting two of the most common R
operators: `<-` and `|>`. Type the following in the Anaconda Prompt (Windows)
or terminal (MacOS and Ubuntu) and press enter:
@@ -172,35 +172,35 @@ or terminal (MacOS and Ubuntu) and press enter:
jupyter labextension install @techrah/text-shortcuts
```
-### R, R packages, and the IRkernel
+### R, R packages, and the IRkernel
-To have the software \index{R installation} used in this book available to you in JupyterLab,
+To have the software \index{R installation} used in this book available to you in JupyterLab,
you will need to install the R programming language,
several R packages,
and the \index{kernel!installation} IRkernel.
-To install versions of these that are compatible with the accompanying worksheets,
-type the command shown below into the Anaconda Prompt (Windows)
-or terminal (MacOS and Ubuntu).
+To install versions of these that are compatible with the accompanying worksheets,
+type the command shown below into the Anaconda Prompt (Windows)
+or terminal (MacOS and Ubuntu).
```
conda env update --file https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-worksheets/main/environment.yml
```
This command installs the specific R and package versions specified in
-the `environment.yml` file found in
+the `environment.yml` file found in
[the worksheets repository](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets).
We will always keep the versions in the `environment.yml` file updated
so that they are compatible with the exercise worksheets that accompany the book.
-> You can also install the *latest* version of R
-> and the R packages used in this book by typing the commands shown below
-> in the Anaconda Prompt (Windows)
+> You can also install the *latest* version of R
+> and the R packages used in this book by typing the commands shown below
+> in the Anaconda Prompt (Windows)
> or terminal (MacOS and Ubuntu) and pressing enter.
> **Be careful though:** this may install package versions that are
> incompatible with the worksheets that accompany the book; the automated
> exercise feedback might tell you your answers are not correct even though
> they are!
->
+>
> ```
> conda install -c conda-forge -y \
> r-base \
@@ -221,13 +221,13 @@ so that they are compatible with the exercise worksheets that accompany the book
### LaTeX
-To be able to render `.ipynb` files to `.pdf` you need to install a LaTeX
-distribution. These can be quite large, so we will opt to use `tinytex`, a
-light-weight cross-platform, portable, and easy-to-maintain LaTeX distribution
-based on TeX Live.
+To be able to render `.ipynb` files to `.pdf` you need to install a LaTeX
+distribution. These can be quite large, so we will opt to use `tinytex`, a
+lightweight, cross-platform, portable, and easy-to-maintain LaTeX distribution
+based on TeX Live.
-**MacOS:** To install `tinytex`
-we need to make sure that `/usr/local/bin` is writable.
+**MacOS:** To install `tinytex`
+we need to make sure that `/usr/local/bin` is writable.
To do this, type the following in the terminal:
```
@@ -236,15 +236,15 @@ sudo chown -R $(whoami):admin /usr/local/bin
> **Note:** You might be asked to enter your password during installation.
-**All operating systems:**
-To install LaTeX, open JupyterLab by typing `jupyter lab`
+**All operating systems:**
+To install LaTeX, open JupyterLab by typing `jupyter lab`
in the Anaconda Prompt (Windows) or terminal (MacOS and Ubuntu) and press Enter.
-Then from JupyterLab, open an R console, type the commands listed below, and
+Then from JupyterLab, open an R console, type the commands listed below, and
press Shift + Enter to install `tinytex`:
```
tinytex::install_tinytex()
-tinytex::tlmgr_install(c("eurosym",
+tinytex::tlmgr_install(c("eurosym",
"adjustbox",
"caption",
"collectbox",
@@ -265,10 +265,10 @@ tinytex::tlmgr_install(c("eurosym",
"upquote"))
```
-**Ubuntu:**
-To append the TinyTex executables to our `PATH` we need to edit our `.bashrc file`.
-The TinyTex executables are usually installed in `~/bin`.
-Thus, add the lines below to the bottom of your `.bashrc` file
+**Ubuntu:**
+To append the TinyTeX executables to our `PATH`, we need to edit our `.bashrc` file.
+The TinyTeX executables are usually installed in `~/bin`.
+Thus, add the lines below to the bottom of your `.bashrc` file
(which you can open by typing `nano ~/.bashrc` in the terminal) and save the file:
```
@@ -276,23 +276,23 @@ Thus, add the lines below to the bottom of your `.bashrc` file
export PATH="$PATH:~/bin"
```
-> **Note:** If you used `nano` to open your `.bashrc` file,
-follow the keyboard shortcuts at the bottom of the nano text editor
+> **Note:** If you used `nano` to open your `.bashrc` file,
+> follow the keyboard shortcuts at the bottom of the nano text editor
> to save and close the file.
## Finishing up installation
It is good practice to restart all the programs you used when installing this
software stack before you proceed to doing your data analysis.
-This includes restarting JupyterLab as well as the terminal (MacOS and Ubuntu)
+This includes restarting JupyterLab as well as the terminal (MacOS and Ubuntu)
or the Anaconda Prompt (Windows).
-This will ensure all the software and settings you put in place are
-correctly sourced.
+This will ensure all the software and settings you put in place are
+correctly sourced.
## Downloading the worksheets for this book
-The worksheets containing practice exercises for this book
-can be downloaded by visiting
+The worksheets containing practice exercises for this book
+can be downloaded by visiting
[https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets),
clicking the green "Code" button, and then selecting "Download ZIP".
The worksheets are contained within the compressed zip folder that will be downloaded.
diff --git a/source/version-control.md b/source/version-control.md
index 962df7b3..0fc70616 100644
--- a/source/version-control.md
+++ b/source/version-control.md
@@ -16,9 +16,9 @@ kernelspec:
(getting-started-with-version-control)=
# Collaboration with version control
-> *You mostly collaborate with yourself,
+> *You mostly collaborate with yourself,
> and me-from-two-months-ago never responds to email.*
->
+>
> --Mark T. Holder
+++
@@ -28,24 +28,24 @@ kernelspec:
```{index} git, GitHub
```
-This chapter will introduce the concept of using version control systems
-to track changes to a project over its lifespan, to share
-and edit code in a collaborative team,
+This chapter will introduce the concept of using version control systems
+to track changes to a project over its lifespan, to share
+and edit code in a collaborative team,
and to distribute the finished project to its intended audience.
-This chapter will also introduce how to use
-the two most common version control tools: Git for local version control,
-and GitHub for remote version control.
-We will focus on the most common version control operations
-used day-to-day in a standard data science project.
-There are many user interfaces for Git; in this chapter
-we will cover the Jupyter Git interface.
+This chapter will also introduce how to use
+the two most common version control tools: Git for local version control,
+and GitHub for remote version control.
+We will focus on the most common version control operations
+used day-to-day in a standard data science project.
+There are many user interfaces for Git; in this chapter
+we will cover the Jupyter Git interface.
```{note}
This book was originally written for the R programming language, and
has been edited to focus instead on Python. This chapter on version control
-has not yet been fully updated to focus on Python; it has images and examples from
+has not yet been fully updated to focus on Python; it has images and examples from
the R version of the book. But the concepts related to version control are generally
-the same. We are currently working on producing new Python-based images and examples
+the same. We are currently working on producing new Python-based images and examples
for this chapter.
```
@@ -55,7 +55,7 @@ By the end of the chapter, readers will be able to do the following:
- Describe what version control is and why data analysis projects can benefit from it.
- Create a remote version control repository on GitHub.
-- Use Jupyter's Git version control tools for project versioning and collaboration:
+- Use Jupyter's Git version control tools for project versioning and collaboration:
- Clone a remote version control repository to create a local repository.
- Commit changes to a local version control repository.
- Push local changes to a remote version control repository.
@@ -67,35 +67,35 @@ By the end of the chapter, readers will be able to do the following:
## What is version control, and why should I use it?
-Data analysis projects often require iteration
+Data analysis projects often require iteration
and revision to move from an initial idea to a finished product
-ready for the intended audience.
-Without deliberate and conscious effort towards tracking changes
-made to the analysis, projects tend to become messy.
-This mess can have serious, negative repercussions on an analysis project,
+ready for the intended audience.
+Without deliberate and conscious effort towards tracking changes
+made to the analysis, projects tend to become messy.
+This mess can have serious, negative repercussions on an analysis project,
including interesting results files that your code cannot reproduce,
-temporary files with snippets of ideas that are forgotten or
+temporary files with snippets of ideas that are forgotten or
not easy to find, mind-boggling file names that make it unclear which is
-the current working version of the file (e.g., `document_final_draft_final.txt`,
-`to_hand_in_final_v2.txt`, etc.), and more.
+the current working version of the file (e.g., `document_final_draft_final.txt`,
+`to_hand_in_final_v2.txt`, etc.), and more.
-Additionally, the iterative nature of data analysis projects
+Additionally, the iterative nature of data analysis projects
means that most of the time, the final version of the analysis that is
-shared with the audience is only a fraction of what was explored during
-the development of that analysis.
-Changes in data visualizations and modeling approaches,
-as well as some negative results, are often not observable from
+shared with the audience is only a fraction of what was explored during
+the development of that analysis.
+Changes in data visualizations and modeling approaches,
+as well as some negative results, are often not observable from
reviewing only the final, polished analysis.
The lack of observability of these parts of the analysis development
-can lead to others repeating things that did not work well,
-instead of seeing what did not work well,
+can lead to others repeating things that did not work well,
+instead of seeing what did not work well,
and using that as a springboard to new, more fruitful approaches.
-Finally, data analyses are typically completed by a team of people
-rather than a single person.
-This means that files need to be shared across multiple computers,
-and multiple people often end up editing the project simultaneously.
-In such a situation, determining who has the latest version of the
+Finally, data analyses are typically completed by a team of people
+rather than a single person.
+This means that files need to be shared across multiple computers,
+and multiple people often end up editing the project simultaneously.
+In such a situation, determining who has the latest version of the
project—and how to resolve conflicting edits—can be a real challenge.
```{index} version control
@@ -120,50 +120,50 @@ and what you're planning to do next!
```{index} version control;system, version control;repository hosting
```
-To version control a project, you generally need two things:
-a *version control system* and a *repository hosting service*.
-The version control system is the software responsible
-for tracking changes, sharing changes you make with others,
+To version control a project, you generally need two things:
+a *version control system* and a *repository hosting service*.
+The version control system is the software responsible
+for tracking changes, sharing changes you make with others,
obtaining changes from others, and resolving conflicting edits.
-The repository hosting service is responsible for storing a copy
-of the version-controlled project online (a *repository*),
-where you and your collaborators can access it remotely,
-discuss issues and bugs, and distribute your final product.
+The repository hosting service is responsible for storing a copy
+of the version-controlled project online (a *repository*),
+where you and your collaborators can access it remotely,
+discuss issues and bugs, and distribute your final product.
For both of these items, there is a wide variety of choices.
-In this textbook we'll use Git for version control,
-and GitHub for repository hosting,
+In this textbook we'll use Git for version control,
+and GitHub for repository hosting,
because both are currently the most widely used platforms.
-In the
+In the
additional resources section at the end of the chapter,
-we list many of the common version control systems
+we list many of the common version control systems
and repository hosting services in use today.
-> **Note:** Technically you don't *have to* use a repository hosting service.
+> **Note:** Technically you don't *have to* use a repository hosting service.
> You can, for example, version control a project
-> that is stored only in a folder on your computer—never
-> sharing it on a repository hosting service.
-> But using a repository hosting service provides a few big benefits,
+> that is stored only in a folder on your computer—never
+> sharing it on a repository hosting service.
+> But using a repository hosting service provides a few big benefits,
> including managing collaborator access permissions,
-> tools to discuss and track bugs,
-> and the ability to have external collaborators contribute work,
-> not to mention the safety of having your work backed up in the cloud.
-> Since most repository hosting services now offer free accounts,
-> there are not many situations in which you wouldn't
-> want to use one for your project.
+> tools to discuss and track bugs,
+> and the ability to have external collaborators contribute work,
+> not to mention the safety of having your work backed up in the cloud.
+> Since most repository hosting services now offer free accounts,
+> there are not many situations in which you wouldn't
+> want to use one for your project.
## Version control repositories
-```{index} repository, repository;local, repository;remote
+```{index} repository, repository;local, repository;remote
```
-Typically, when we put a data analysis project under version control,
-we create two copies of the repository ({numref}`vc1-no-changes`).
+Typically, when we put a data analysis project under version control,
+we create two copies of the repository ({numref}`vc1-no-changes`).
One copy we use as our primary workspace where we create, edit, and delete files.
This copy is commonly referred to as the **local repository**. The local
repository most commonly exists on our computer or laptop, but can also exist within
a workspace on a server (e.g., JupyterHub).
The other copy is typically stored in a repository hosting service (e.g., GitHub), where
-we can easily share it with our collaborators.
+we can easily share it with our collaborators.
This copy is commonly referred to as the **remote repository**.
```{figure} img/vc1-no-changes.png
@@ -177,28 +177,28 @@ Schematic of local and remote version control repositories.
```
Both copies of the repository have a **working directory**
-where you can create, store, edit, and delete
+where you can create, store, edit, and delete
files (e.g., `analysis.ipynb` in {numref}`vc1-no-changes`).
-Both copies of the repository also maintain a full project history
+Both copies of the repository also maintain a full project history
({numref}`vc1-no-changes`). This history is a record of all versions of the
project files that have been created. The repository history is not
automatically generated; Git must be explicitly told when to record
-a version of the project. These records are called **commits**. They
+a version of the project. These records are called **commits**. They
are a snapshot of the file contents as well as
metadata about the repository at the time the record was created (who made the
commit, when it was made, etc.). In the local and remote repositories shown in
{numref}`vc1-no-changes`, there are two commits represented as gray
-circles. Each commit can be identified by a
+circles. Each commit can be identified by a
human-readable **message**, which you write when you make a commit, and a
-**commit hash** that Git automatically adds for you.
+**commit hash** that Git automatically adds for you.
-The purpose of the message is to contain a brief, rich description
+The purpose of the message is to contain a brief, rich description
of what work was done since the last commit.
-Messages act as a very useful narrative
-of the changes to a project over its lifespan.
+Messages act as a very useful narrative
+of the changes to a project over its lifespan.
If you ever want to view or revert to an earlier version of the project,
the message can help you identify which commit to view or revert to.
-In {numref}`vc1-no-changes`, you can see two such messages,
+In {numref}`vc1-no-changes`, you can see two such messages,
one for each commit: `Created README.md` and `Added analysis draft`.
```{index} hash
@@ -208,7 +208,7 @@ one for each commit: `Created README.md` and `Added analysis draft`.
The hash is a string of about 40 letters and numbers.
The hash serves as a unique identifier for the commit,
-and is used by Git to index project history. Although hashes are quite long—imagine
+and is used by Git to index project history. Although hashes are quite long—imagine
having to type out 40 precise characters to view an old project version!—Git is able
to work with shorter versions of hashes. In {numref}`vc1-no-changes`, you can see
two of these shortened hashes, one for each commit: `Daa29d6` and `884c7ce`.
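As an optional aside for readers who use Git at a terminal: the commit history, with shortened hashes and their messages, can be listed with `git log`. This is a minimal sketch, assuming Git is installed and that the command is run inside the repository folder.
```bash
# Print the project history, one line per commit:
# the shortened hash followed by its commit message
# (for the repository in {numref}`vc1-no-changes`, this would list the two commits shown there)
git log --oneline
```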
@@ -216,7 +216,7 @@ two of these shortened hashes, one for each commit: `Daa29d6` and `884c7ce`.
## Version control workflows
When you work in a local version-controlled repository, there are generally three additional
-steps you must take as part of your regular workflow. In addition to
+steps you must take as part of your regular workflow. In addition to
just working on files—creating,
editing, and deleting files as you normally would—you must:
@@ -227,9 +227,9 @@ editing, and deleting files as you normally would—you must:
In this section we will discuss all three of these steps in detail.
(commit-changes)=
-### Committing changes to a local repository
+### Committing changes to a local repository
-When working on files in your local version control
+When working on files in your local version control
repository (e.g., using Jupyter) and saving your work, these changes will only initially exist in the
working directory of the local repository ({numref}`vc2-changes`).
@@ -243,16 +243,16 @@ Local repository with changes to files.
```{index} git;add, staging area
```
-Once you reach a point that you want Git to keep a record
-of the current version of your work, you need to commit
+Once you reach a point where you want Git to keep a record
+of the current version of your work, you need to commit
(i.e., snapshot) your changes. A prerequisite to this is telling Git which
-files should be included in that snapshot. We call this step **adding** the
-files to the **staging area**.
-Note that the staging area is not a real physical location on your computer;
+files should be included in that snapshot. We call this step **adding** the
+files to the **staging area**.
+Note that the staging area is not a real physical location on your computer;
it is instead a conceptual placeholder for these files until they are committed.
-The benefit of the Git version control system using a staging area is that you
-can choose to commit changes in only certain files. For example,
-in {numref}`vc-ba2-add`, we add only the two files
+The benefit of the Git version control system using a staging area is that you
+can choose to commit changes in only certain files. For example,
+in {numref}`vc-ba2-add`, we add only the two files
that are important to the analysis project (`analysis.ipynb` and `README.md`)
and not our personal scratch notes for the project (`notes.txt`).
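If you prefer the command line, the equivalent of this adding step is `git add`, run from inside the repository folder. This is only an illustrative aside (the rest of the chapter uses graphical tools); the file names are those from {numref}`vc-ba2-add`.
```bash
# Stage only the files that should be part of the next snapshot
git add analysis.ipynb README.md
# notes.txt is deliberately left unstaged, so it will not be committed
```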
@@ -265,11 +265,11 @@ Adding modified files to the staging area in the local repository.
-Once the files we wish to commit have been added
+Once the files we wish to commit have been added
to the staging area, we can then commit those files to the repository history ({numref}`vc-ba3-commit`).
-When we do this, we are required to include a helpful *commit message* to tell
+When we do this, we are required to include a helpful *commit message* to tell
collaborators (which often includes future you!) about the changes that were
-made. In {numref}`vc-ba3-commit`, the message is `Message about changes...`; in
+made. In {numref}`vc-ba3-commit`, the message is `Message about changes...`; in
your work you should make sure to replace this with an
informative message about what changed. It is also important to note here that
these changes are only being committed to the local repository's history. The
@@ -292,11 +292,11 @@ Committing the modified files in the staging area to the local repository histor
-Once you have made one or more commits that you want to share with your collaborators,
-you need to **push** (i.e., send) those commits back to GitHub ({numref}`vc5-push`). This updates
-the history in the remote repository (i.e., GitHub) to match what you have in your
+Once you have made one or more commits that you want to share with your collaborators,
+you need to **push** (i.e., send) those commits back to GitHub ({numref}`vc5-push`). This updates
+the history in the remote repository (i.e., GitHub) to match what you have in your
local repository. Now when collaborators interact with the remote repository, they will be able
-to see the changes you made. And you can also take comfort in the fact that your work is now backed
+to see the changes you made. And you can also take comfort in the fact that your work is now backed
up in the cloud!
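At a terminal, the corresponding command is `git push`. The brief sketch below assumes the local repository was cloned from GitHub, so Git already knows which remote repository should receive the commits.
```bash
# Send any local commits that GitHub does not yet have to the remote repository
git push
```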
```{figure} img/vc5-push.png
@@ -312,7 +312,7 @@ Pushing the commit to send the changes to the remote repository on GitHub.
If you are working on a project with collaborators, they will also be making changes to files
(e.g., to the analysis code in a Jupyter notebook and the project's README file),
committing them to their own local repository, and pushing their commits to the remote GitHub repository
-to share them with you. When they push their changes, those changes will only initially exist in
+to share them with you. When they push their changes, those changes will only initially exist in
the remote GitHub repository and not in your local repository ({numref}`vc6-remote-changes`).
```{figure} img/vc6-remote-changes.png
@@ -330,7 +330,7 @@ to **pull** those changes to your own local repository. By pulling changes,
you synchronize your local repository to what is present on GitHub ({numref}`vc7-pull`).
Additionally, until you pull changes from the remote repository, you will not
be able to push any more changes yourself (though you will still be able to
-work and make commits in your own local repository).
+work and make commits in your own local repository).
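The terminal equivalent is `git pull`, shown here as a brief, optional sketch under the same assumption that the local repository was cloned from GitHub.
```bash
# Retrieve new commits from the remote repository on GitHub
# and update the local repository to match
git pull
```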
```{figure} img/vc7-pull.png
---
@@ -341,20 +341,20 @@ Pulling changes from the remote GitHub repository to synchronize your local repo
-## Working with remote repositories using GitHub
+## Working with remote repositories using GitHub
```{index} repository;remote, GitHub, git;clone
```
-Now that you have been introduced to some of the key general concepts
-and workflows of Git version control, we will walk through the practical steps.
+Now that you have been introduced to some of the key general concepts
+and workflows of Git version control, we will walk through the practical steps.
There are several different ways to start using version control
-with a new project. For simplicity and ease of setup,
+with a new project. For simplicity and ease of setup,
we recommend creating a remote repository first.
This section covers how to both create and edit a remote repository on GitHub.
-Once you have a remote repository set up, we recommend **cloning** (or copying) that
+Once you have a remote repository set up, we recommend **cloning** (or copying) that
repository to create a local repository in which you primarily work.
You can clone the repository either
on your own computer or in a workspace on a server (e.g., a JupyterHub server).
@@ -362,34 +362,34 @@ Section {numref}`local-repo-jupyter` below will cover this second step in detail
### Creating a remote repository on GitHub
-Before you can create remote repositories on GitHub,
-you will need a GitHub account; you can sign up for a free account
+Before you can create remote repositories on GitHub,
+you will need a GitHub account; you can sign up for a free account
at [https://github.com/](https://github.com/).
-Once you have logged into your account, you can create a new repository to host
-your project by clicking on the "+" icon in the upper right-hand
-corner, and then on "New Repository," as shown in
+Once you have logged into your account, you can create a new repository to host
+your project by clicking on the "+" icon in the upper right-hand
+corner, and then on "New Repository," as shown in
{numref}`new-repository-01`.
```{figure} img/version_control/new_repository_01.png
---
name: new-repository-01
---
-New repositories on GitHub can be created by clicking on "New Repository" from the + menu.
+New repositories on GitHub can be created by clicking on "New Repository" from the + menu.
```
```{index} repository;public
```
-Repositories can be set up with a variety of configurations, including a name,
-optional description, and the inclusion (or not) of several template files.
+Repositories can be set up with a variety of configurations, including a name,
+optional description, and the inclusion (or not) of several template files.
One of the most important configuration items to choose is the visibility to the outside world,
either public or private. *Public* repositories can be viewed by anyone.
*Private* repositories can be viewed by only you. Both public and private repositories
are only editable by you, but you can change that by giving access to other collaborators.
-To get started with a *public* repository having a template `README.md` file, take the
-following steps shown in {numref}`new-repository-02`:
+To get started with a *public* repository having a template `README.md` file, take the
+following steps shown in {numref}`new-repository-02`:
1. Enter the name of your project repository. In the example below, we use `canadian_languages`. Most repository names follow a similar convention: only lowercase letters, with words separated by either underscores or hyphens.
2. Choose an option for the privacy of your repository.
@@ -424,8 +424,8 @@ Respository configuration for a project that is public and initialized with a RE
```{index} GitHub; pen tool
```
-The pen tool can be used to edit existing plain text files. When you click on
-the pen tool, the file will be opened in a text box where you can use your
+The pen tool can be used to edit existing plain text files. When you click on
+the pen tool, the file will be opened in a text box where you can use your
keyboard to make changes ({numref}`pen-tool-01` and {numref}`pen-tool-02`).
```{figure} img/version_control/pen-tool_01.png
@@ -448,11 +448,11 @@ The text box where edits can be made after clicking on the pen tool.
-After you are done with your edits, they can be "saved" by *committing* your
-changes. When you *commit a file* in a repository, the version control system
-takes a snapshot of what the file looks like. As you continue working on the
-project, over time you will possibly make many commits to a single file; this
-generates a useful version history for that file. On GitHub, if you click the
+After you are done with your edits, they can be "saved" by *committing* your
+changes. When you *commit a file* in a repository, the version control system
+takes a snapshot of what the file looks like. As you continue working on the
+project, over time you will possibly make many commits to a single file; this
+generates a useful version history for that file. On GitHub, if you click the
green "Commit changes" button, it will save the file and then make a commit
({numref}`pen-tool-03`).
@@ -460,7 +460,7 @@ Recall from {numref}`commit-changes` that you normally have to add files
to the staging area before committing them. Why don't we have to do that when
we work directly on GitHub? Behind the scenes, when you click the green "Commit changes"
button, GitHub *is* adding that one file to the staging area prior to committing it.
-But note that on GitHub you are limited to committing changes to only one file at a time.
+But note that on GitHub you are limited to committing changes to only one file at a time.
When you work in your own local repository, you can commit
changes to multiple files simultaneously. This is especially useful when one
"improvement" to the project involves modifying multiple files.
@@ -481,9 +481,9 @@ Saving changes using the pen tool requires committing those changes, and an asso
-The "Add file" menu can be used to create new plain text files and upload files
-from your computer. To create a new plain text file, click the "Add file"
-drop-down menu and select the "Create new file" option
+The "Add file" menu can be used to create new plain text files and upload files
+from your computer. To create a new plain text file, click the "Add file"
+drop-down menu and select the "Create new file" option
({numref}`create-new-file-01`).
```{figure} img/version_control/create-new-file_01.png
@@ -498,14 +498,14 @@ New plain text files can be created directly on GitHub.
-A page will open with a small text box for the file name to be entered, and a
-larger text box where the desired file content text can be entered. Note the two
-tabs, "Edit new file" and "Preview". Toggling between them lets you enter and
+A page will open with a small text box for the file name to be entered, and a
+larger text box where the desired file content text can be entered. Note the two
+tabs, "Edit new file" and "Preview". Toggling between them lets you enter and
edit text and view what the text will look like when rendered, respectively
-({numref}`create-new-file-02`).
-Note that GitHub understands and renders `.md` files using a
-[markdown syntax](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf)
-very similar to Jupyter notebooks, so the "Preview" tab is especially helpful
+({numref}`create-new-file-02`).
+Note that GitHub understands and renders `.md` files using a
+[markdown syntax](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf)
+very similar to Jupyter notebooks, so the "Preview" tab is especially helpful
for checking markdown code correctness.
```{figure} img/version_control/create-new-file_02.png
@@ -515,7 +515,7 @@ name: create-new-file-02
New plain text files require a file name in the text box circled in red, and file content entered in the larger text box (red arrow).
```
-Save and commit your changes by clicking the green "Commit changes" button at the
+Save and commit your changes by clicking the green "Commit changes" button at the
bottom of the page ({numref}`create-new-file-03`).
```{figure} img/version_control/create-new-file_03.png
@@ -525,13 +525,13 @@ name: create-new-file-03
To be saved, newly created files are required to be committed along with an associated commit message.
```
-You can also upload files that you have created on your local machine by using
+You can also upload files that you have created on your local machine by using
the "Add file" drop-down menu and selecting "Upload files"
({numref}`upload-files-01`).
-To select the files from your local computer to upload, you can either drag and
-drop them into the gray box area shown below, or click the "choose your files"
-link to access a file browser dialog. Once the files you want to upload have
-been selected, click the green "Commit changes" button at the bottom of the
+To select the files from your local computer to upload, you can either drag and
+drop them into the gray box area shown below, or click the "choose your files"
+link to access a file browser dialog. Once the files you want to upload have
+been selected, click the green "Commit changes" button at the bottom of the
page ({numref}`upload-files-02`).
```{figure} img/version_control/upload-files_01.png
@@ -551,7 +551,7 @@ committed along with an associated commit message.
```
-Note that Git and GitHub are designed to track changes in individual files.
+Note that Git and GitHub are designed to track changes in individual files.
**Do not** upload your whole project in an archive file (e.g., `.zip`). If you do,
then Git can only keep track of changes to the entire `.zip` file, which will not
be human-readable. Committing one big archive defeats the whole purpose of using
@@ -574,8 +574,8 @@ remote repository that was created on GitHub to a local coding environment. Thi
can be done by creating and working in a local copy of the repository.
In this chapter, we focus on interacting with Git via Jupyter using
the Jupyter Git extension. The Jupyter Git extension
-can be run by Jupyter on your local computer, or on a JupyterHub server.
-*Note: we recommend reading the {ref}`getting-started-with-jupyter` chapter*
+can be run by Jupyter on your local computer, or on a JupyterHub server.
+*Note: we recommend reading the {ref}`getting-started-with-jupyter` chapter*
*to learn how to use Jupyter before reading this chapter.*
### Generating a GitHub personal access token
@@ -585,23 +585,23 @@ can be run by Jupyter on your local computer, or on a JupyterHub server.
-To send and retrieve work between your local repository
+To send and retrieve work between your local repository
and the remote repository on GitHub,
-you will frequently need to authenticate with GitHub
+you will frequently need to authenticate with GitHub
to prove you have the required permission.
-There are several methods to do this,
-but for beginners we recommend using the HTTPS method
+There are several methods to do this,
+but for beginners we recommend using the HTTPS method
because it is easier and requires less setup.
-In order to use the HTTPS method,
+In order to use the HTTPS method,
GitHub requires you to provide a *personal access token*.
A personal access token is like a password—so keep it a secret!—but it gives
you more fine-grained control over what parts of your account
the token can be used to access, and lets you set an expiry date for the authentication.
-To generate a personal access token,
+To generate a personal access token,
you must first visit [https://github.com/settings/tokens](https://github.com/settings/tokens),
which will take you to the "Personal access tokens" page in your account settings.
Once there, click "Generate new token" ({numref}`generate-pat-01`).
-Note that you may be asked to re-authenticate with your username
+Note that you may be asked to re-authenticate with your username
and password to proceed.
@@ -615,13 +615,13 @@ access token. It is found in the "Personal access tokens" section of the
```
-You will be asked to add a note to describe the purpose for your personal access token.
+You will be asked to add a note describing the purpose of your personal access token.
Next, you need to select permissions for the token; this is where
you can control what parts of your account the token can be used to access.
Make sure to choose only those permissions that you absolutely require. In
-{numref}`generate-pat-02`, we tick only the "repo" box, which gives the
-token access to our repositories (so that we can push and pull) but none of our other GitHub
-account features. Finally, to generate the token, scroll to the bottom of that page
+{numref}`generate-pat-02`, we tick only the "repo" box, which gives the
+token access to our repositories (so that we can push and pull) but none of our other GitHub
+account features. Finally, to generate the token, scroll to the bottom of that page
and click the green "Generate token" button ({numref}`generate-pat-02`).
```{figure} img/generate-pat_02.png
@@ -632,16 +632,16 @@ Webpage for creating a new personal access token.
```
-Finally, you will be taken to a page where you will be able to see
-and copy the personal access token you just generated ({numref}`generate-pat-03`).
+Finally, you will be taken to a page where you will be able to see
+and copy the personal access token you just generated ({numref}`generate-pat-03`).
Since it provides access to certain parts of your account, you should
-treat this token like a password; for example, you should consider
+treat this token like a password; for example, you should consider
securely storing it (and your other passwords and tokens, too!) using a password manager.
Note that this page will only display the token to you once,
so make sure you store it in a safe place right away. If you accidentally forget to
-store it, though, do not fret—you can delete that token by clicking the
-"Delete" button next to your token, and generate a new one from scratch.
-To learn more about GitHub authentication,
+store it, though, do not fret—you can delete that token by clicking the
+"Delete" button next to your token, and generate a new one from scratch.
+To learn more about GitHub authentication,
see the additional resources section at the end of this chapter.
```{figure} img/generate-pat_03.png
@@ -653,7 +653,7 @@ Display of the newly generated personal access token.
-### Cloning a repository using Jupyter
+### Cloning a repository using Jupyter
@@ -664,9 +664,9 @@ the next step is -->
*Cloning* a remote repository from GitHub
-to create a local repository results in a
-copy that knows where it was obtained from so that it knows where to send/receive
-new committed edits. In order to do this, first copy the URL from the HTTPS tab
+to create a local repository produces a copy that remembers where it came from,
+so Git knows where to send and retrieve new commits.
+To do this, first copy the URL from the HTTPS tab
of the Code drop-down menu on GitHub ({numref}`clone-02`).
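As an aside for command-line users, the same HTTPS URL can be given to `git clone` at a terminal to create the local copy. The URL below is a placeholder; substitute the address you copied from your own repository's Code menu.
```bash
# Clone the remote repository into a new folder named after the repository
# (replace the URL with the HTTPS address copied from GitHub)
git clone https://github.com/your-username/canadian_languages.git
```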
```{figure} img/version_control/clone_02.png
@@ -676,7 +676,7 @@ name: clone-02
The green "Code" drop-down menu contains the remote address (URL) corresponding to the location of the remote GitHub repository.
```
-Open Jupyter, and click the Git+ icon on the file browser tab
+Open Jupyter, and click the Git+ icon on the file browser tab
({numref}`clone-01`).
```{figure} img/version_control/clone_01.png
@@ -688,7 +688,7 @@ The Jupyter Git Clone icon (red circle).
-Paste the URL of the GitHub project repository you
+Paste the URL of the GitHub project repository you
created and click the blue "CLONE" button ({numref}`clone-03`).
```{figure} img/version_control/clone_03.png
@@ -711,11 +711,11 @@ Cloned GitHub repositories can been seen and accessed via the Jupyter file brows
### Specifying files to commit
Now that you have cloned the remote repository from GitHub to create a local repository,
-you can get to work editing, creating, and deleting files.
-For example, suppose you created and saved a new file (named `eda.ipynb`) that you would
+you can get to work editing, creating, and deleting files.
+For example, suppose you created and saved a new file (named `eda.ipynb`) that you would
like to send back to the project repository on GitHub ({numref}`git-add-01`).
To "add" this modified file to the staging area (i.e., flag that this is a
-file whose changes we would like to commit), click the Jupyter Git extension
+file whose changes we would like to commit), click the Jupyter Git extension
icon on the far left-hand side of Jupyter ({numref}`git-add-01`).
```{figure} img/version_control/git_add_01.png
@@ -731,8 +731,8 @@ Jupyter Git extension icon (circled in red).
This opens the Jupyter Git graphical user interface pane. Next,
click the plus sign (+) beside the file(s) that you want to "add"
-({numref}`git-add-02`). Note that because this is the
-first change for this file, it falls under the "Untracked" heading.
+({numref}`git-add-02`). Note that because this is the
+first change for this file, it falls under the "Untracked" heading.
However, next time you edit this file and want to add the changes,
you will find it under the "Changed" heading.
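Outside of Jupyter, the same information (which files are untracked, changed, or already staged) is reported by `git status` at a terminal; a quick, optional sketch:
```bash
# Summarize the state of the working directory:
# untracked files, modified files, and files staged for the next commit
git status
```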
@@ -748,11 +748,11 @@ name: git-add-02
`eda.ipynb` is added to the staging area via the plus sign (+).
```
-Clicking the plus sign (+) moves the file from the "Untracked" heading to the "Staged" heading,
-so that Git knows you want a snapshot of its current state
-as a commit ({numref}`git-add-03`). Now you are ready to "commit" the changes.
+Clicking the plus sign (+) moves the file from the "Untracked" heading to the "Staged" heading,
+so that Git knows you want a snapshot of its current state
+as a commit ({numref}`git-add-03`). Now you are ready to "commit" the changes.
Make sure to include a (clear and helpful!) message about what was changed
-so that your collaborators (and future you) know what happened in this commit.
+so that your collaborators (and future you) know what happened in this commit.
```{figure} img/version_control/git_add_03.png
@@ -770,16 +770,16 @@ Adding `eda.ipynb` makes it visible in the staging area.
-To snapshot the changes with an associated commit message,
-you must put a message in the text box at the bottom of the Git pane
-and click on the blue "Commit" button ({numref}`git-commit-01`).
-It is highly recommended to write useful and meaningful messages about what
-was changed. These commit messages, and the datetime stamp for a given
-commit, are the primary means to navigate through the project's history in the
- event that you need to view or retrieve a past version of a file, or
+To snapshot the changes with an associated commit message,
+you must put a message in the text box at the bottom of the Git pane
+and click on the blue "Commit" button ({numref}`git-commit-01`).
+It is highly recommended to write useful and meaningful messages about what
+was changed. These commit messages, and the datetime stamp for a given
+commit, are the primary means to navigate through the project's history in the
+event that you need to view or retrieve a past version of a file, or
revert your project to an earlier state.
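The terminal counterpart of the blue "Commit" button is `git commit` with the `-m` flag for the message. This is an optional sketch; the message below is a placeholder that you would replace with a description of what actually changed.
```bash
# Record a snapshot of the staged files, labeled with a descriptive message
git commit -m "Add exploratory data analysis notebook"
```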
-When you click the "Commit" button for the first time, you will be prompted to
-enter your name and email. This only needs to be done once for each machine
+When you click the "Commit" button for the first time, you will be prompted to
+enter your name and email. This only needs to be done once for each machine
you use Git on.
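If you ever use Git directly at a terminal on a new machine, the same one-time setup can be done with `git config`; the name and email below are placeholders.
```bash
# Tell Git who you are; this is stored once per machine and used to label your commits
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
```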
```{figure} img/version_control/git_commit_01.png
@@ -790,7 +790,7 @@ A commit message must be added into the Jupyter Git extension commit text box be
```
-After "committing" the file(s), you will see there are 0 "Staged" files.
+After "committing" the file(s), you will see there are 0 "Staged" files.
You are now ready to push your changes
to the remote repository on GitHub ({numref}`git-commit-03`).
@@ -811,9 +811,9 @@ After recording a commit, the staging area should be empty.
-To send the committed changes back to the remote repository on
-GitHub, you need to *push* them. To do this,
-click on the cloud icon with the up arrow on the Jupyter Git tab
+To send the committed changes back to the remote repository on
+GitHub, you need to *push* them. To do this,
+click on the cloud icon with the up arrow on the Jupyter Git tab
({numref}`git-push-01`).
```{figure} img/version_control/git_push_01.png
@@ -824,9 +824,9 @@ The Jupyter Git extension "push" button (circled in red).
```
-You will then be prompted to enter your GitHub username
+You will then be prompted to enter your GitHub username
and the personal access token that you generated
-earlier (not your account password!). Click
+earlier (not your account password!). Click
the blue "OK" button to initiate the push ({numref}`git-push-02`).
```{figure} img/version_control/git_push_02.png
@@ -837,8 +837,8 @@ Enter your Git credentials to authorize the push to the remote repository.
```
-If the files were successfully pushed to the project repository on
-GitHub, you will be shown a success message ({numref}`git-push-03`).
+If the files were successfully pushed to the project repository on
+GitHub, you will be shown a success message ({numref}`git-push-03`).
Click "Dismiss" to continue working in Jupyter.
```{figure} img/version_control/git_push_03.png
@@ -849,7 +849,7 @@ The prompt that the push was successful.
```
-If you visit the remote repository on GitHub,
+If you visit the remote repository on GitHub,
you will see that the changes now exist there too
({numref}`git-push-04`)!
@@ -870,10 +870,10 @@ The GitHub web interface shows a preview of the commit message, and the time of
-As mentioned earlier, GitHub allows you to control who has access to your
-project. The default of both public and private projects are that only the
-person who created the GitHub repository has permissions to create, edit and
-delete files (*write access*). To give your collaborators write access to the
+As mentioned earlier, GitHub allows you to control who has access to your
+project. By default, for both public and private repositories, only the
+person who created the GitHub repository has permission to create, edit, and
+delete files (*write access*). To give your collaborators write access to the
project, navigate to the "Settings" tab ({numref}`add-collab-01`).
```{figure} img/version_control/add_collab_01.png
@@ -901,7 +901,7 @@ name: add-collab-03
The "Invite a collaborator" button on the GitHub web interface.
```
-Type in the collaborator's GitHub username or email,
+Type in the collaborator's GitHub username or email,
and select their name when it appears ({numref}`add-collab-04`).
```{figure} img/version_control/add_collab_04.png
@@ -920,15 +920,15 @@ name: add-collab-05
The confirmation button for adding a collaborator to a repository on the GitHub web interface.
```
-After this, you should see your newly added collaborator listed under the
-"Manage access" tab. They should receive an email invitation to join the
-GitHub repository as a collaborator. They need to accept this invitation
+After this, you should see your newly added collaborator listed under the
+"Manage access" tab. They should receive an email invitation to join the
+GitHub repository as a collaborator. They need to accept this invitation
to enable write access.
### Pulling changes from GitHub using Jupyter
-We will now walk through how to use the Jupyter Git extension tool to pull changes
-to our `eda.ipynb` analysis file that were made by a collaborator
+We will now walk through how to use the Jupyter Git extension tool to pull changes
+to our `eda.ipynb` analysis file that were made by a collaborator
({numref}`git-pull-00`).
```{figure} img/version_control/git_pull_00.png
@@ -941,7 +941,7 @@ The GitHub interface indicates the name of the last person to push a commit to t
```{index} git;pull
```
-You can tell Git to "pull" by clicking on the cloud icon with
+You can tell Git to "pull" by clicking on the cloud icon with
the down arrow in Jupyter ({numref}`git-pull-01`).
```{figure} img/version_control/git_pull_01.png
@@ -972,7 +972,7 @@ Changes made by the collaborator to `eda.ipynb` (code highlighted by red arrows)
```
It can be very useful to review the history of the changes to your project. You
-can do this directly in Jupyter by clicking "History" in the Git tab
+can do this directly in Jupyter by clicking "History" in the Git tab
({numref}`git-pull-04`).
```{figure} img/version_control/git_pull_04.png
@@ -983,12 +983,12 @@ Version control repository history viewed using the Jupyter Git extension.
```
-It is good practice to pull any changes at the start of *every* work session
-before you start working on your local copy.
-If you do not do this,
-and your collaborators have pushed some changes to the project to GitHub,
-then you will be unable to push your changes to GitHub until you pull.
-This situation can be recognized by the error message
+It is good practice to pull any changes at the start of *every* work session
+before you start working on your local copy.
+If you do not do this,
+and your collaborators have pushed some changes to the project to GitHub,
+then you will be unable to push your changes to GitHub until you pull.
+This situation can be recognized by the error message
shown in {numref}`merge-conflict-01`.
```{figure} img/version_control/merge_conflict_01.png
@@ -1031,7 +1031,7 @@ the changes.
To fix the merge conflict, you need to open the offending file
in a plain text editor and look for special marks that Git puts in the file to
-tell you where the merge conflict occurred ({numref}`merge-conflict-04`).
+tell you where the merge conflict occurred ({numref}`merge-conflict-04`).
```{figure} img/version_control/merge_conflict_04.png
@@ -1042,9 +1042,9 @@ How to open a Jupyter notebook as a plain text file view in Jupyter.
```
The beginning of the merge
-conflict is preceded by `<<<<<<< HEAD` and the end of the merge conflict is
-marked by `>>>>>>>`. Between these markings, Git also inserts a separator
-(`=======`). The version of the change before the separator is your change, and
+conflict is preceded by `<<<<<<< HEAD` and the end of the merge conflict is
+marked by `>>>>>>>`. Between these markings, Git also inserts a separator
+(`=======`). The version of the change before the separator is your change, and
the version that follows the separator was the change that existed on GitHub.
In {numref}`merge-conflict-05`, you can see that in your local repository
there is a line of code that calls `scale_color_manual` with three color values (`deeppink2`, `cyan4`, and `purple1`).
@@ -1059,7 +1059,7 @@ Merge conflict identifiers (highlighted in red).
Once you have decided which version of the change (or what combination!) to
keep, you need to use the plain text editor to remove the special marks that
-Git added ({numref}`merge-conflict-06`).
+Git added ({numref}`merge-conflict-06`).
```{figure} img/version_control/merge_conflict_06.png
---
@@ -1068,14 +1068,14 @@ name: merge-conflict-06
File where a merge conflict has been resolved.
```
-The file must be saved, added to the staging area, and then committed before you will be able to
+The file must be saved, added to the staging area, and then committed before you will be able to
push your changes to GitHub.
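For command-line users, the same resolution steps would look roughly like the sketch below. The file name is illustrative (use whichever file contained the conflict), and the commit message is a placeholder.
```bash
# After editing the conflicted file to remove the markers and saving it:
git add eda.ipynb
git commit -m "Resolve merge conflict"
git push
```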
### Communicating using GitHub issues
When working on a project in a team, you don't just want a historical record of who changed
-what file and when in the project—you also want a record of decisions that were made,
-ideas that were floated, problems that were identified and addressed, and all other
+what file and when in the project—you also want a record of decisions that were made,
+ideas that were floated, problems that were identified and addressed, and all other
communication surrounding the project. Email and messaging apps are both very popular for general communication, but are not
designed for project-specific communication: they generally lack facilities for organizing conversations by project subtopics,
searching for conversations related to particular bugs or software versions, etc.
@@ -1083,19 +1083,19 @@ searching for conversations related to particular bugs or software versions, etc
```{index} GitHub;issues
```
-GitHub *issues* are an alternative written communication medium to email and
-messaging apps, and were designed specifically to facilitate project-specific
+GitHub *issues* are an alternative written communication medium to email and
+messaging apps, and were designed specifically to facilitate project-specific
communication. Issues are *opened* from the "Issues" tab on the project's
-GitHub page, and they persist there even after the conversation is over and the issue is *closed* (in
+GitHub page, and they persist there even after the conversation is over and the issue is *closed* (in
contrast to email, issues are not usually deleted). One issue thread is usually created
-per topic, and they are easily searchable using GitHub's search tools. All
-issues are accessible to all project collaborators, so no one is left out of
-the conversation. Finally, issues can be set up so that team members get email
-notifications when a new issue is created or a new post is made in an issue
+per topic, and they are easily searchable using GitHub's search tools. All
+issues are accessible to all project collaborators, so no one is left out of
+the conversation. Finally, issues can be set up so that team members get email
+notifications when a new issue is created or a new post is made in an issue
thread. Replying to issues from email is also possible. Given all of these advantages,
we highly recommend the use of issues for project-related communication.
-To open a GitHub issue,
+To open a GitHub issue,
first click on the "Issues" tab ({numref}`issue-01`).
```{figure} img/version_control/issue_01.png
@@ -1114,7 +1114,7 @@ name: issue-02
The "New issues" button on the GitHub web interface.
```
-Add an issue title (which acts like an email subject line), and then put the
+Add an issue title (which acts like an email subject line), and then put the
body of the message in the larger text box. Finally, click "Submit new issue"
to post the issue to share with others ({numref}`issue-03`).
@@ -1136,8 +1136,8 @@ Dialog box for replying to GitHub issues.
```
-When a conversation is resolved, you can click "Close issue".
-The closed issue can be later viewed by clicking the "Closed" header link
+When a conversation is resolved, you can click "Close issue".
+The closed issue can be later viewed by clicking the "Closed" header link
in the "Issue" tab ({numref}`issue-06`).
```{figure} img/version_control/issue_06.png
@@ -1149,8 +1149,8 @@ The "Closed" issues tab on the GitHub web interface.
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets#readme)
in the "Collaboration with version control" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -1162,11 +1162,11 @@ and guidance that the worksheets provide will function as intended.
## Additional resources
-Now that you've picked up the basics of version control with Git and GitHub,
+Now that you've picked up the basics of version control with Git and GitHub,
you can expand your knowledge through the resources listed below:
- GitHub's [guides website](https://guides.github.com/) and [YouTube
- channel](https://www.youtube.com/githubguides), and [*Happy Git and GitHub
+ channel](https://www.youtube.com/githubguides), and [*Happy Git and GitHub
for the useR*](https://happygitwithr.com/) are great resources to take the next steps in
learning about Git and GitHub.
- [Good enough practices in scientific