Update Part 1 - Introduction to Machine Learning with scikit-learn.md
cfiutak1 authored Apr 2, 2019
1 parent e5c7cf7 commit aad70a9
Showing 1 changed file with 8 additions and 10 deletions.
18 changes: 8 additions & 10 deletions Part 1 - Introduction to Machine Learning with scikit-learn.md
@@ -71,18 +71,16 @@ X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
```

In the above example, we import the `train_test_split` function from scikit-learn's `model_selection` module and use it to split the dataset into four smaller arrays:
* `X_train`, a two-dimensional array containing a subset of the entries from the main dataset, without the expected outcome of each entry.
* `y_train`, a one-dimensional array containing the expected outcome of each entry in `X_train`.
* `X_test`, a two-dimensional array containing a separate subset of the entries from the main dataset, again without the expected outcome of each entry.
* `y_test`, a one-dimensional array containing the expected outcome of each entry in `X_test`.
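
To make the shapes of these four arrays concrete, here is a minimal sketch of the split. It assumes that `digits` was loaded with scikit-learn's `load_digits` and that the split uses `test_size=0.5`; the `random_state=0` argument is an addition for reproducibility and is not part of the original example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: `digits` is scikit-learn's built-in handwritten digits dataset.
digits = load_digits()

# random_state=0 is an illustrative choice so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0
)

# X_* arrays are two-dimensional: (number of entries, number of features).
# y_* arrays are one-dimensional: one expected outcome per entry.
print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("X_test: ", X_test.shape, "y_test: ", y_test.shape)
```

Each row of `X_train` lines up with the entry at the same index in `y_train`, and likewise for `X_test` and `y_test`.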

Continuing our analogy of studying for a math exam:
* `X_train` contains all of the practice problems
* `y_train` contains all of the correct answers to the practice problems
* `X_test` contains all of the problems on the real exam
* `y_test` contains all of the correct answers to the real exam


🤔 **Food for Thought:** It can be tough to find a good ratio between the training and testing set sizes. In this case, we split the data evenly (`test_size=0.5`), but many practitioners use a much smaller testing set (closer to `test_size=0.2`). Although it may be tempting to improve your model's accuracy by enlarging the training set, every entry moved into the training set is one fewer in the testing set, and a smaller testing set widens the margin of error of your measured testing accuracy.
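
As a rough illustration of this trade-off, the sketch below estimates the margin of error of a measured accuracy for a few testing set sizes. It uses the normal approximation to a binomial proportion, which is a common rule of thumb rather than anything specific to scikit-learn. The helper `accuracy_margin_of_error`, the 95% accuracy figure, and the choice of test sizes are all hypothetical; 1797 is the number of entries in scikit-learn's digits dataset, assumed here to be the dataset in use.

```python
import math


def accuracy_margin_of_error(accuracy, n_test, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_test entries.

    Uses the normal approximation to a binomial proportion:
    z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)


n_samples = 1797         # entries in scikit-learn's digits dataset (assumed dataset)
assumed_accuracy = 0.95  # hypothetical measured testing accuracy

for test_size in (0.5, 0.2, 0.05):
    n_test = math.ceil(test_size * n_samples)
    margin = accuracy_margin_of_error(assumed_accuracy, n_test)
    print(f"test_size={test_size}: n_test={n_test}, margin of error ~ +/-{margin:.3f}")
```

The margin shrinks only with the square root of the testing set size, so halving the testing set does not double the uncertainty, but very small testing sets make the reported accuracy noticeably less trustworthy.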