Fix typos
Skipper Seabold committed Nov 2, 2017
1 parent a2afecd commit 94b6b36
Showing 3 changed files with 37 additions and 18 deletions.
0 - Introduction.ipynb (4 changes: 3 additions & 1 deletion)
@@ -66,7 +66,9 @@
"\n",
"\n",
"“Programs are meant to be read by humans and only incidentally for computers to execute.” <br />\n",
"— H. Abelson and G. Sussman (in “Structure and Interpretation of Computer Programs”)"
"— H. Abelson and G. Sussman (in “Structure and Interpretation of Computer Programs”)\n",
"\n",
"<img src=\"https://mitpress.mit.edu/sicp/full-text/book/cover.jpg\" style=\"width:25%\">"
]
},
{
2 - Data Wrangling with Pandas.ipynb (4 changes: 2 additions & 2 deletions)
@@ -142,7 +142,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Or you can use the **geitem** syntax that relies on square brackets `[]`, which is familiar from dealing with dictionaries."
"Or you can use the **getitem** syntax that relies on square brackets `[]`, which is familiar from dealing with dictionaries."
]
},
{
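For context on the getitem fix above, a minimal sketch of the bracket syntax the cell describes; the small DataFrame here is invented for illustration:

```python
import pandas as pd

dta = pd.DataFrame({"station": ["A", "B", "C"], "rides": [120, 87, 254]})

dta["rides"]     # getitem with square brackets, like a dictionary lookup
dta.rides        # equivalent attribute access when the column name is a valid identifier
```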
@@ -242,7 +242,7 @@
"metadata": {},
"outputs": [],
"source": [
"dta.iloc[:1335320]"
"dta.loc[:1335320]"
]
},
{
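The iloc-to-loc change above hinges on the difference between position-based and label-based slicing; a hedged sketch with an invented integer index:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=[1335318, 1335319, 1335320, 1335321])

s.iloc[:2]        # positional slicing: first two rows, end point excluded
s.loc[:1335320]   # label-based slicing: rows up to and including the label 1335320
```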
5 - Modeling with scikit-learn.ipynb (47 changes: 32 additions & 15 deletions)
@@ -47,7 +47,7 @@
"\n",
"* **n_features**: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.\n",
"\n",
"The number of features must be fixed in advance. However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample. This is a case where `scipy.sparse` matrices and other techniques can be useful, in that they are much more memory-efficient than numpy arrays."
"The number of features (almost always) must be fixed in advance. However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample. This is a case where `scipy.sparse` matrices and other techniques can be useful, in that they are much more memory-efficient than numpy arrays."
]
},
{
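A small sketch of the memory point made above; the matrix shape is arbitrary:

```python
import numpy as np
from scipy import sparse

dense = np.zeros((1000, 10000))          # mostly-zero feature matrix
dense[0, 0] = 1.0

X_sparse = sparse.csr_matrix(dense)      # only non-zero entries are stored

print(dense.nbytes)                      # 80,000,000 bytes of mostly zeros
print(X_sparse.data.nbytes)              # 8 bytes: the single stored value
```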
@@ -90,6 +90,22 @@
"x = np.array([1, 2, 3, 4, 5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can't assign something that's not an integer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x[0] = 'a'"
]
},
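What the new cell above is meant to show, sketched with the error caught; the exact message depends on the NumPy version:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])     # integer dtype is fixed when the array is created

try:
    x[0] = 'a'                     # a string cannot be coerced to the array's integer dtype
except ValueError as err:
    print(err)                     # e.g. invalid literal for int() with base 10: 'a'
```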
{
"cell_type": "markdown",
"metadata": {},
@@ -596,7 +612,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The **predictor** interface extends the notion of an estimator by adding a predict method that takes an array X_test and produces predictions based on the learned parameters of the estimator. In the case of supervised learning estimators, this method typically returns the predicted labels or values computed by the model. Some unsupervised learning estimators may also implement the predict interface, such as k-means, where the predicted values are the cluster labels."
"The **predictor** interface extends the notion of an estimator by adding a predict method that takes an array `X_test` and produces predictions based on the learned parameters of the estimator. In the case of supervised learning estimators, this method typically returns the predicted labels or values computed by the model. Some unsupervised learning estimators may also implement the predict interface, such as k-means, where the predicted values are the cluster labels."
]
},
{
@@ -605,16 +621,16 @@
"source": [
"all **supervised estimators** are expected to have the following methods:\n",
"\n",
"* `model.predict` : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.\n",
"* `model.predict_proba` : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().\n",
"* `model.predict` : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. `model.predict(X_new)`), and returns the learned label for each object in the array.\n",
"* `model.predict_proba` : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by `model.predict()`.\n",
"* `model.score` : for classification or regression problems, most (all?) estimators implement a score method. Scores are between 0 and 1, with a larger score indicating a better fit."
]
},
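A minimal sketch of the estimator and predictor methods listed above, using the bundled iris data as a stand-in for the notebook's own dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)       # estimator: learn parameters from training data
model.predict(X_test)             # predictor: labels for new data
model.predict_proba(X_test)       # per-class probabilities (classification only)
model.score(X_test, y_test)       # mean accuracy for classifiers
```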
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a **transformer** interface which defines a transform method. It takes as input some new data `X_test` and yields as output a transformed version. Preprocessing, feature selection, feature extraction and dimensionality reduction algorithms are all provided as transformers within the library."
"Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a **transformer** interface which defines a `transform` method. It takes as input some new data `X_test` and yields as output a transformed version. Preprocessing, feature selection, feature extraction and dimensionality reduction algorithms are all provided as transformers within the library."
]
},
{
@@ -623,7 +639,7 @@
"source": [
"**unsupervised estimators** will always have these methods:\n",
"\n",
"* `model.transform` : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.\n",
"* `model.transform` : given an unsupervised model, transform new data into the new basis. This also accepts one argument `X_new`, and returns the new representation of the data based on the unsupervised model.\n",
"* `model.fit_transform` : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data."
]
},
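And a sketch of the transform/fit_transform pattern described in the two cells above, with a tiny made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[1.5, 15.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit (learn means and scales) and transform in one step
X_new_scaled = scaler.transform(X_new)           # apply the same learned transformation to new data
```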
@@ -684,7 +700,7 @@
"2. Using this vocabulary, assign a number to the count of each word occuring in any document.\n",
"\n",
"What you're left with is a matrix $X$, where each value $X[i,j]$ is the count of word $j$ in document $i$.\n",
"$X$ is a matrix of dimension n_documents by n_vocabulary. This is large. Luckily, most words don't occur in every document. If they did, we would not be able to separate the documents according to topics.\n",
"$X$ is a matrix of dimension `n_documents` by `n_vocabulary`. This is large. Luckily, most words don't occur in every document. If they did, we would not be able to separate the documents according to topics.\n",
"\n",
"For this reason, bag of words documents are often high-dimensional, sparse datasets. We don't need to keep the zeros in memory."
]
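A hand-rolled sketch of the document-term matrix described above, with two made-up documents:

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the mat"]

vocabulary = sorted(set(" ".join(docs).split()))
X = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(X)            # X[i][j] is the count of vocabulary[j] in docs[i]; most entries are 0
```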
@@ -704,7 +720,7 @@
"\n",
"We turn human writing into a set of feature vectors by taking care of these issues. This process is called tokenization.\n",
"\n",
"scikit-learn provides some nice facilities for building a dictionary of features and transform documents to feature vectors. The first of these that we will look at is the **CountVectorizer** transformer.\n",
"scikit-learn provides some nice facilities for building a dictionary of features and transforming documents to feature vectors. The first of these that we will look at is the **CountVectorizer** transformer.\n",
"\n",
"Recall from above that a transformer is an estimator that provides a transform method."
]
@@ -762,7 +778,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the case of `CountVectorizer` this is a dictionary called `vocabulary_` which stores a mapping from the known vocabulary to the column in the sparse matrix which contains the counts for that word. "
"In the case of `CountVectorizer` there is a dictionary called `vocabulary_` which stores a mapping from the known vocabulary to the column in the sparse matrix which contains the counts for that word. "
]
},
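A minimal sketch of CountVectorizer and its fitted `vocabulary_` attribute, both discussed above; the documents are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat on the mat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # scipy.sparse matrix of shape (n_documents, n_vocabulary)

print(X.toarray())                   # dense view, fine for a toy corpus
print(vectorizer.vocabulary_)        # word -> column index in the sparse matrix
```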
{
@@ -955,7 +971,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's prepare our TfidfVectorizer. We'll remove stop-words, remove any words that don't occur in at least 100 documents and remove words that occur in 85% or more documents.\n",
"Let's prepare our TfidfVectorizer. We'll remove stop-words, remove any words that don't occur in at least 50 documents and remove words that occur in 85% or more documents.\n",
"\n",
"Finally, we'll use a **regular expression** pattern to determine what exactly a token (or word) is. In this case, we deviate from the scikit-learn default by not allowing numbers to be words."
]
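A hedged sketch of how those settings might be passed to TfidfVectorizer; the notebook's exact arguments, in particular its token pattern, may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",               # drop common English stop-words
    min_df=50,                          # keep words appearing in at least 50 documents
    max_df=0.85,                        # drop words appearing in 85% or more of documents
    token_pattern=r"\b[a-zA-Z]\w+\b",   # tokens must start with a letter, so numbers are not words
)
```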
@@ -1023,12 +1039,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at another kind of transformer in scikit-learn, one that provides dimensionality reduction. Here we'll use Truncated SVD on the tf-idf matrix. Formally, this is known as Latent Semantic Analysis (LSA), because it transforms the documents to a low-dimensional \"semantic\" space. Formally, truncated SVD is a lot like Principle Components Analysis (PCA), except that the decomposition is on the documents rather than the covariance matrix. \n",
"Let's take a look at another kind of transformer in scikit-learn, one that provides dimensionality reduction. Here we'll use Truncated SVD on the tf-idf matrix. Formally, this is known as Latent Semantic Analysis (LSA), because it transforms the documents to a low-dimensional \"semantic\" space. Truncated SVD is a lot like Principle Components Analysis (PCA), except that the decomposition is on the documents rather than the covariance matrix. \n",
"\n",
"\n",
"Mathematically, truncated SVD applied to training samples X produces a low-rank approximation $X_k$:\n",
"\n",
"$$X \\approx X_k = U_k \\Sigma_k V_k^\\top$$\n",
"$$X \\approx X = U_k \\Sigma_k V_k^\\top$$\n",
"\n",
"After this operation, $U_k \\Sigma_k^\\top$ is the transformed training set with k features (called `n_components` in the API).\n",
"\n",
@@ -1329,7 +1345,8 @@
"source": [
"import re\n",
"\n",
"result = re.search(\"(?<=Comments:)(.+)\", \"1. This is a violation. Comments: This was a really egregious violation.\")\n",
"result = re.search(\"(?<=Comments:)(.+)\", \n",
" \"1. This is a violation. Comments: This was a really egregious violation.\")\n",
"\n",
"result"
]
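A small standalone sketch of how the match from that lookbehind pattern would typically be read out:

```python
import re

text = "1. This is a violation. Comments: This was a really egregious violation."
result = re.search(r"(?<=Comments:)(.+)", text)

if result is not None:
    print(result.group(1).strip())   # "This was a really egregious violation."
```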
@@ -1822,7 +1839,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: If you can't run this part, don't worry. You'll need to have graphviz installed."
"**Note**: If you can't run this part, don't worry. You'll need to have graphviz installed. (`conda install graphviz` *should* do the trick.)"
]
},
{
@@ -1952,7 +1969,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, we have a roughly 1:3 class imbalance here, so we may also want to check the AUC score and have a look at the confusing matrix.\n",
"Of course, we have a roughly 1:3 class imbalance here, so we may also want to check the AUC score and have a look at the confusion matrix.\n",
"\n",
"In the confusion matrix the true labels are the rows, and the predicted labels are the columns. Here we see that we're slightly high on our false positive rate. Inspections that were a Fail are being predicted as Pass."
]
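A toy sketch of the confusion matrix and AUC check described above; the labels and scores are made up, standing in for the notebook's Pass/Fail predictions:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 1, 1, 1]                      # roughly 1:3 imbalance, like the inspections data
y_pred  = [0, 1, 1, 1, 1, 1, 1, 0]
y_score = [0.2, 0.6, 0.8, 0.7, 0.9, 0.55, 0.75, 0.4]    # predicted probabilities of the positive class

print(confusion_matrix(y_true, y_pred))   # rows are true labels, columns are predicted labels
print(roc_auc_score(y_true, y_score))     # threshold-free summary, more informative under imbalance
```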
