replace configuration with subset where appropriate (#2993)
severo authored Jul 22, 2024
1 parent a438bd4 commit acb17bd
Showing 15 changed files with 65 additions and 83 deletions.
4 changes: 2 additions & 2 deletions docs/source/_toctree.yml
@@ -11,7 +11,7 @@
- local: valid
title: Check dataset validity
- local: splits
title: List splits and configurations
title: List splits and subsets
- local: info
title: Get dataset information
- local: first_rows
@@ -49,7 +49,7 @@
- title: Conceptual Guides
sections:
- local: configs_and_splits
title: Splits and configurations
title: Splits and subsets
- local: data_types
title: Data types
- local: server
12 changes: 6 additions & 6 deletions docs/source/configs_and_splits.md
@@ -1,21 +1,21 @@
# Splits and configurations
# Splits and subsets

Machine learning datasets are commonly organized in *splits* and they may also have *configurations*. These internal structures provide the scaffolding for building out a dataset, and determines how a dataset should be split and organized. Understanding a dataset's structure can help you create your own dataset, and know which subset of data you should use when during model training and evaluation.
Machine learning datasets are commonly organized in *splits*, and they may also have *subsets* (also called *configurations*). These internal structures provide the scaffolding for building out a dataset and determine how a dataset should be split and organized. Understanding a dataset's structure can help you create your own dataset, and know which part of the data to use during model training and evaluation.

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## Splits

Every processed and cleaned dataset contains *splits*, specific subsets of data reserved for specific needs. The most common splits are:
Every processed and cleaned dataset contains *splits*, specific parts of the data reserved for specific needs. The most common splits are:

* `train`: data used to train a model; this data is exposed to the model
* `validation`: data reserved for evaluation and improving model hyperparameters; this data is hidden from the model
* `test`: data reserved for evaluation only; this data is completely hidden from the model and ourselves

The `validation` and `test` sets are especially important to ensure a model is actually learning instead of *overfitting*, or just memorizing the data.

## Configurations
## Subsets

A *configuration* is a higher-level internal structure than a split, and a configuration contains splits. You can think of a configuration as a sub-dataset contained within a larger dataset. It is a useful structure for adding additional layers of organization to a dataset. For example, if you take a look at the [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) dataset, you'll notice there are eight different languages. While you can create a dataset containing all eight languages, it's probably neater to create a dataset with each language as a configuration. This way, users can instantly load a dataset with their language of interest instead of preprocessing the dataset to filter for a specific language.
A *subset* (also called *configuration*) is a higher-level internal structure than a split, and a subset contains splits. You can think of a subset as a sub-dataset contained within a larger dataset. It is a useful structure for adding additional layers of organization to a dataset. For example, if you take a look at the [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) dataset, you'll notice there are eight different languages. While you can create a dataset containing all eight languages, it's probably neater to create a dataset with each language as a subset. This way, users can instantly load a dataset with their language of interest instead of preprocessing the dataset to filter for a specific language.

Configurations are flexible, and can be used to organize a dataset along whatever objective you'd like. For example, the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset uses configurations to organize the dataset by task. One configuration is dedicated to segmenting the whole image, while the other configuration is for instance segmentation.
Subsets are flexible, and can be used to organize a dataset along whatever objective you'd like. For example, the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset uses subsets to organize the dataset by task. One subset is dedicated to segmenting the whole image, while the other subset is for instance segmentation.
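
To make the idea concrete, here is a minimal sketch of loading a single subset with the 🤗 Datasets library. The `german` subset name is a hypothetical value used for illustration; check the dataset page for the actual subset names.

```python
from datasets import load_dataset

# Load one subset (also called "configuration") of a dataset.
# "german" is a hypothetical subset name, used here for illustration only.
mls = load_dataset(
    "facebook/multilingual_librispeech",  # dataset repository on the Hub
    "german",                             # subset name
    split="train",                        # load a single split of that subset
)
```

Passing the subset name as the second argument gives users only the language they asked for, without preprocessing the other languages away.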
2 changes: 1 addition & 1 deletion docs/source/croissant.md
@@ -54,7 +54,7 @@ curl https://huggingface.co/api/datasets/ibm/duorc/croissant \

Under the hood it uses the `https://datasets-server.huggingface.co/croissant-crumbs` endpoint and enriches it with the Hub metadata.

The endpoint response is a [JSON-LD](https://json-ld.org/) containing the metadata in the Croissant format. For example, the [`ibm/duorc`](https://huggingface.co/datasets/ibm/duorc) dataset has two configurations, `ParaphraseRC` and `SelfRC` (see the [List splits and configurations](./splits) guide for more details about splits and configurations). The metadata links to their Parquet files and describes the type of each of the six columns: `plot_id`, `plot`, `title`, `question_id`, `question`, and `no_answer`:
The endpoint response is a [JSON-LD](https://json-ld.org/) containing the metadata in the Croissant format. For example, the [`ibm/duorc`](https://huggingface.co/datasets/ibm/duorc) dataset has two subsets, `ParaphraseRC` and `SelfRC` (see the [List splits and subsets](./splits) guide for more details about splits and subsets). The metadata links to their Parquet files and describes the type of each of the six columns: `plot_id`, `plot`, `title`, `question_id`, `question`, and `no_answer`:

```json
{
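
For instance, a minimal Python sketch of fetching this Croissant metadata via the Hub endpoint shown above. The top-level `recordSet` field is part of the Croissant format; treat the exact response shape as illustrative.

```python
import requests

# Fetch the Croissant (JSON-LD) metadata of the ibm/duorc dataset.
url = "https://huggingface.co/api/datasets/ibm/duorc/croissant"
response = requests.get(url)
response.raise_for_status()
croissant = response.json()

# Each subset (ParaphraseRC, SelfRC) is described by a record set entry.
print([record_set.get("name") for record_set in croissant.get("recordSet", [])])
```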
4 changes: 2 additions & 2 deletions docs/source/filter.md
@@ -13,7 +13,7 @@ Feel free to also try it out with [ReDoc](https://redocly.github.io/redoc/?url=h

The `/filter` endpoint accepts the following query parameters:
- `dataset`: the dataset name, for example `nyu-mll/glue` or `mozilla-foundation/common_voice_10_0`
- `config`: the configuration name, for example `cola`
- `config`: the subset name, for example `cola`
- `split`: the split name, for example `train`
- `where`: the filter condition
- `orderby`: the order-by clause
@@ -44,7 +44,7 @@ either the string "name" column is equal to 'Simone' or the integer "children" c
The `orderby` parameter must contain the column name (in double quotes) whose values will be sorted (in ascending order by default).
To sort the rows in descending order, use the DESC keyword, like `orderby="age" DESC`.

For example, let's filter those rows with no_answer=false in the `train` split of the `SelfRC` configuration of the `ibm/duorc` dataset restricting the results to the slice 150-151:
For example, let's filter the rows with no_answer=false in the `train` split of the `SelfRC` subset of the `ibm/duorc` dataset, restricting the results to the slice 150-151:

<inferencesnippet>
<python>
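
A minimal Python sketch of that request follows; the `offset` and `length` pagination parameters are an assumption for selecting the 150-151 slice, and the response is assumed to carry the rows under a `rows` key, as for `/rows`.

```python
import requests

# Filter rows of the SelfRC subset of ibm/duorc where no_answer is false.
params = {
    "dataset": "ibm/duorc",
    "config": "SelfRC",            # the subset name
    "split": "train",
    "where": '"no_answer"=false',  # column names are double-quoted in the filter syntax
    "offset": 150,                 # assumed pagination parameters for the 150-151 slice
    "length": 2,
}
response = requests.get("https://datasets-server.huggingface.co/filter", params=params)
response.raise_for_status()
data = response.json()  # rows are assumed to be under a "rows" key
```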
2 changes: 1 addition & 1 deletion docs/source/first_rows.md
@@ -9,7 +9,7 @@ This guide shows you how to use the dataset viewer's `/first-rows` endpoint to p
The `/first-rows` endpoint accepts three query parameters:

- `dataset`: the dataset name, for example `nyu-mll/glue` or `mozilla-foundation/common_voice_10_0`
- `config`: the configuration name, for example `cola`
- `config`: the subset name, for example `cola`
- `split`: the split name, for example `train`

<inferencesnippet>
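
For example, a minimal sketch of calling `/first-rows` with the three parameters listed above, using the `cola` subset of `nyu-mll/glue` from the parameter examples:

```python
import requests

# Preview the first rows of the "cola" subset of nyu-mll/glue.
params = {
    "dataset": "nyu-mll/glue",
    "config": "cola",  # the subset name
    "split": "train",
}
response = requests.get("https://datasets-server.huggingface.co/first-rows", params=params)
response.raise_for_status()
first_rows = response.json()
```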
2 changes: 1 addition & 1 deletion docs/source/info.md
@@ -5,7 +5,7 @@ The dataset viewer provides an `/info` endpoint for exploring the general inform
The `/info` endpoint accepts two query parameters:

- `dataset`: the dataset name
- `config`: the configuration name
- `config`: the subset name

<inferencesnippet>
<python>
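
For example, a minimal sketch of calling `/info`, filtering on one subset; the dataset and subset names are taken from the `/info` examples in the OpenAPI spec below.

```python
import requests

# Get the metadata of the "ax" subset of nyu-mll/glue.
params = {
    "dataset": "nyu-mll/glue",
    "config": "ax",  # optional: omit it to get one entry per subset
}
response = requests.get("https://datasets-server.huggingface.co/info", params=params)
response.raise_for_status()
dataset_info = response.json()["dataset_info"]
```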
74 changes: 28 additions & 46 deletions docs/source/openapi.json
@@ -1292,7 +1292,7 @@
"X-Error-Code-DatasetWithTooManyConfigsError": {
"type": "string",
"const": "DatasetWithTooManyConfigsError",
"description": "The number of configs of a dataset exceeded the limit."
"description": "The number of subsets of a dataset exceeded the limit."
},
"X-Error-Code-ExternalAuthenticatedError": {
"type": "string",
@@ -1356,8 +1356,8 @@
},
"examples": {
"InexistentConfigError": {
"summary": "The response is not found because the config does not exist.",
"description": "try with config=inexistent-config.",
"summary": "The response is not found because the subset does not exist.",
"description": "try with config=inexistent-subset.",
"value": {
"error": "Not found."
}
@@ -1468,7 +1468,7 @@
"RequiredConfig": {
"name": "config",
"in": "query",
"description": "The dataset configuration (or subset).",
"description": "The dataset subset (also called 'configuration').",
"required": true,
"schema": {
"type": "string"
@@ -1510,7 +1510,7 @@
"OptionalConfig": {
"name": "config",
"in": "query",
"description": "The dataset configuration (or subset) on which to filter the response.",
"description": "The dataset subset on which to filter the response.",
"schema": {
"type": "string"
},
@@ -1520,7 +1520,7 @@
"value": "cola"
},
"yangdong/ecqa": {
"summary": "The default configuration given by the 🤗 Datasets library",
"summary": "The default subset given by the 🤗 Datasets library",
"value": "default"
}
}
@@ -1827,8 +1827,8 @@
"failed": []
}
},
"splits for a single config": {
"summary": "dair-ai/emotion has two configs. Setting config=unsplit only returns the splits for this config.",
"splits for a single subset": {
"summary": "dair-ai/emotion has two subsets. Setting config=unsplit only returns the splits for this subset.",
"description": "Try with https://datasets-server.huggingface.co/splits?dataset=dair-ai/emotion&config=unsplit.",
"value": {
"splits": [
@@ -1840,8 +1840,8 @@
]
}
},
"one of the config has an error": {
"summary": "one of the configs require manual download, and fails to give the split names",
"one of the subsets has an error": {
"summary": "one of the subsets require manual download, and fails to give the split names",
"description": "Try with https://datasets-server.huggingface.co/splits?dataset=superb.",
"value": {
"splits": [
@@ -2023,8 +2023,8 @@
"$ref": "#/components/schemas/CustomError"
},
"examples": {
"too many configs in the dataset": {
"summary": "The dataset has too many configs. The server does not support more than 3,000 configs.",
"too many subsets in the dataset": {
"summary": "The dataset has too many subsets. The server does not support more than 3,000 subsets.",
"description": "Try with https://datasets-server.huggingface.co/splits?dataset=facebook/flores",
"value": {
"error": "The maximum number of configs allowed is 3000, dataset has 41617 configs."
@@ -4774,7 +4774,7 @@
"/info": {
"get": {
"summary": "Get the metadata of a dataset.",
"description": "Returns the metadata of the dataset: description, homepage, features, etc. Use the optional config parameter to filter the response.",
"description": "Returns the metadata of the dataset: description, homepage, features, etc. Use the optional config parameter to filter the response on a subset.",
"externalDocs": {
"description": "The response is a dump of the DatasetInfo object from the datasets library",
"url": "https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetInfo"
@@ -4815,7 +4815,7 @@
},
"examples": {
"dataset metadata": {
"summary": "metadata of a dataset. It's an object, with one key per config",
"summary": "metadata of a dataset. It's an object, with one key per subset",
"description": "Try with https://datasets-server.huggingface.co/info?dataset=mnist",
"value": {
"dataset_info": {
@@ -4903,7 +4903,7 @@
}
},
"config metadata": {
"summary": "metadata for a dataset config",
"summary": "metadata for a dataset subset",
"description": "Try with https://datasets-server.huggingface.co/info?dataset=nyu-mll/glue&config=ax",
"value": {
"dataset_info": {
@@ -4943,8 +4943,8 @@
"partial": false
}
},
"dataset metadata with failed configs": {
"summary": "metadata of a dataset which has failed configs. The failed configs are listed in 'failed'.",
"dataset metadata with failed subsets": {
"summary": "metadata of a dataset which has failed subsets. The failed subsets are listed in 'failed'.",
"description": "Try with https://datasets-server.huggingface.co/info?dataset=atomic",
"value": {
"dataset_info": {},
@@ -5066,7 +5066,7 @@
"/size": {
"get": {
"summary": "Get the size of a dataset.",
"description": "Returns the size (number of rows, storage) of the dataset. Use the optional config parameter to filter the response.",
"description": "Returns the size (number of rows, storage) of the dataset. Use the optional config parameter to filter the response on a subset.",
"externalDocs": {
"description": "See size in the Hub docs.",
"url": "https://huggingface.co/docs/datasets-server/size"
@@ -5156,7 +5156,7 @@
}
},
"config size": {
"summary": "size of a dataset config",
"summary": "size of a dataset subset",
"description": "Try with https://datasets-server.huggingface.co/size?dataset=nyu-mll/glue&config=ax",
"value": {
"size": {
@@ -5184,8 +5184,8 @@
"partial": false
}
},
"dataset size with failed configs": {
"summary": "size of a dataset which has failed configs. The failed configs are listed in 'failed'.",
"dataset size with failed subsets": {
"summary": "size of a dataset which has failed subsets. The failed subsets are listed in 'failed'.",
"description": "Try with https://datasets-server.huggingface.co/size?dataset=atomic",
"value": {
"size": {
@@ -5317,7 +5317,7 @@
"/opt-in-out-urls": {
"get": {
"summary": "Get the number of opted-in and opted-out image URLs in a dataset.",
"description": "Based on the API of spawning.ai, returns the number of image URLs that have been opted-in and opted-out. Use the optional config and splits parameters to filter the response. Only a sample of the rows is scanned, the first 100K rows at the moment.",
"description": "Based on the API of spawning.ai, returns the number of image URLs that have been opted-in and opted-out. Use the optional config and split parameters to filter the response. Only a sample of the rows is scanned, the first 100K rows at the moment.",
"externalDocs": {
"description": "See spawning.io (Hub docs). The doc is still missing for the endpoint, see https://github.com/huggingface/dataset-viewer/issues/1664.",
"url": "https://huggingface.co/docs/datasets-server/"
@@ -5373,8 +5373,8 @@
"full_scan": false
}
},
"number of URLS for a config": {
"summary": "number of URLs for a config.",
"number of URLS for a subset": {
"summary": "number of URLs for a subset.",
"description": "Try with https://datasets-server.huggingface.co/opt-in-out-urls?dataset=conceptual_captions&config=labeled",
"value": {
"urls_columns": ["image_url"],
@@ -6432,29 +6432,11 @@
"std": 60.07286,
"histogram": {
"hist": [
1734,
1637,
1326,
121,
10,
3,
1,
3,
1,
2
1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2
],
"bin_edges": [
256,
318,
380,
442,
504,
566,
628,
690,
752,
814,
873
256, 318, 380, 442, 504, 566, 628, 690, 752,
814, 873
]
}
}
@@ -6492,7 +6474,7 @@
}
}
}
],
],
"partial": false
}
}
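
Many of the example URLs in this spec exercise the `/splits` endpoint. As a sketch, the `dair-ai/emotion` example above translates to the following call; the `splits` and `failed` response keys are the ones shown in the spec examples.

```python
import requests

# List the splits of the "unsplit" subset of dair-ai/emotion (see the spec example).
params = {"dataset": "dair-ai/emotion", "config": "unsplit"}
response = requests.get("https://datasets-server.huggingface.co/splits", params=params)
response.raise_for_status()
payload = response.json()
print(payload["splits"])  # the splits of the requested subset
print(payload["failed"])  # subsets whose splits could not be computed (may be empty)
```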
