replace configuration with subset where appropriate (#2993)
Showing 15 changed files with 65 additions and 83 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,21 +1,21 @@ |
| 1 | | - # Splits and configurations |
| | 1 | + # Splits and subsets |
| 2 | 2 | |
| 3 | | - Machine learning datasets are commonly organized in *splits* and they may also have *configurations*. These internal structures provide the scaffolding for building out a dataset, and determines how a dataset should be split and organized. Understanding a dataset's structure can help you create your own dataset, and know which subset of data you should use when during model training and evaluation. |
| | 3 | + Machine learning datasets are commonly organized in *splits* and they may also have *subsets* (also called *configurations*). These internal structures provide the scaffolding for building out a dataset, and determines how a dataset should be split and organized. Understanding a dataset's structure can help you create your own dataset, and know which subset of data you should use when during model training and evaluation. |
| 4 | 4 | |
| 5 | 5 | ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif) |
| 6 | 6 | |
| 7 | 7 | ## Splits |
| 8 | 8 | |
| 9 | | - Every processed and cleaned dataset contains *splits*, specific subsets of data reserved for specific needs. The most common splits are: |
| | 9 | + Every processed and cleaned dataset contains *splits*, specific parts of the data reserved for specific needs. The most common splits are: |
| 10 | 10 | |
| 11 | 11 | * `train`: data used to train a model; this data is exposed to the model |
| 12 | 12 | * `validation`: data reserved for evaluation and improving model hyperparameters; this data is hidden from the model |
| 13 | 13 | * `test`: data reserved for evaluation only; this data is completely hidden from the model and ourselves |
| 14 | 14 | |
| 15 | 15 | The `validation` and `test` sets are especially important to ensure a model is actually learning instead of *overfitting*, or just memorizing the data. |
| 16 | 16 | |
| 17 | | - ## Configurations |
| | 17 | + ## Subsets |
| 18 | 18 | |
| 19 | | - A *configuration* is a higher-level internal structure than a split, and a configuration contains splits. You can think of a configuration as a sub-dataset contained within a larger dataset. It is a useful structure for adding additional layers of organization to a dataset. For example, if you take a look at the [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) dataset, you'll notice there are eight different languages. While you can create a dataset containing all eight languages, it's probably neater to create a dataset with each language as a configuration. This way, users can instantly load a dataset with their language of interest instead of preprocessing the dataset to filter for a specific language. |
| | 19 | + A *subset* (also called *configuration*) is a higher-level internal structure than a split, and a subset contains splits. You can think of a subset as a sub-dataset contained within a larger dataset. It is a useful structure for adding additional layers of organization to a dataset. For example, if you take a look at the [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) dataset, you'll notice there are eight different languages. While you can create a dataset containing all eight languages, it's probably neater to create a dataset with each language as a subset. This way, users can instantly load a dataset with their language of interest instead of preprocessing the dataset to filter for a specific language. |
| 20 | 20 | |
| 21 | | - Configurations are flexible, and can be used to organize a dataset along whatever objective you'd like. For example, the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset uses configurations to organize the dataset by task. One configuration is dedicated to segmenting the whole image, while the other configuration is for instance segmentation. |
| | 21 | + Subsets are flexible, and can be used to organize a dataset along whatever objective you'd like. For example, the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset uses subsets to organize the dataset by task. One subset is dedicated to segmenting the whole image, while the other subset is for instance segmentation. |
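
To picture the hierarchy the changed documentation describes (a dataset contains subsets, and each subset contains splits), here is a minimal, library-free Python sketch. The `make_splits` and `load` helpers, the 10% validation and test fractions, and the toy per-language rows are all hypothetical and exist only for illustration; this is not how the `datasets` library is implemented.

```python
import random


def make_splits(rows, seed=0, val_fraction=0.1, test_fraction=0.1):
    """Shuffle rows and partition them into train/validation/test splits.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    return {
        "test": rows[:n_test],
        "validation": rows[n_test:n_test + n_val],
        "train": rows[n_test + n_val:],
    }


# Hypothetical toy data: one subset per language, as in the MLS example.
raw = {
    "german": [f"de-{i}" for i in range(20)],
    "french": [f"fr-{i}" for i in range(10)],
}

# A subset sits above splits: dataset -> subset -> split -> rows.
dataset = {name: make_splits(rows) for name, rows in raw.items()}


def load(dataset, subset, split):
    """Select one split of one subset without touching the other subsets."""
    return dataset[subset][split]


german_train = load(dataset, "german", "train")
print(len(german_train))  # 16 of the 20 German rows end up in train
```

With the actual `datasets` library, the same selection is usually expressed by passing the subset name and a `split` argument to `load_dataset`; the nested dictionary above is only a way to visualize that lookup.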