replace configuration with subset where appropriate (#2993)
severo authored Jul 22, 2024
1 parent a438bd4 commit acb17bd
Showing 15 changed files with 65 additions and 83 deletions.
4 changes: 2 additions & 2 deletions docs/source/_toctree.yml
@@ -11,7 +11,7 @@
- local: valid
title: Check dataset validity
- local: splits
title: List splits and configurations
title: List splits and subsets
- local: info
title: Get dataset information
- local: first_rows
@@ -49,7 +49,7 @@
- title: Conceptual Guides
sections:
- local: configs_and_splits
title: Splits and configurations
title: Splits and subsets
- local: data_types
title: Data types
- local: server
12 changes: 6 additions & 6 deletions docs/source/configs_and_splits.md
@@ -1,21 +1,21 @@
# Splits and configurations
# Splits and subsets

Machine learning datasets are commonly organized in *splits* and they may also have *configurations*. These internal structures provide the scaffolding for building out a dataset, and determines how a dataset should be split and organized. Understanding a dataset's structure can help you create your own dataset, and know which subset of data you should use when during model training and evaluation.
Machine learning datasets are commonly organized in *splits*, and they may also have *subsets* (also called *configurations*). These internal structures provide the scaffolding for building out a dataset and determine how a dataset should be split and organized. Understanding a dataset's structure can help you create your own dataset, and know which part of the data to use during model training and evaluation.

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## Splits

Every processed and cleaned dataset contains *splits*, specific subsets of data reserved for specific needs. The most common splits are:
Every processed and cleaned dataset contains *splits*, specific parts of the data reserved for specific needs. The most common splits are:

* `train`: data used to train a model; this data is exposed to the model
* `validation`: data reserved for evaluation and improving model hyperparameters; this data is hidden from the model
* `test`: data reserved for evaluation only; this data is completely hidden from the model and ourselves

The `validation` and `test` sets are especially important to ensure a model is actually learning instead of *overfitting*, or just memorizing the data.

## Configurations
## Subsets

A *configuration* is a higher-level internal structure than a split, and a configuration contains splits. You can think of a configuration as a sub-dataset contained within a larger dataset. It is a useful structure for adding additional layers of organization to a dataset. For example, if you take a look at the [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) dataset, you'll notice there are eight different languages. While you can create a dataset containing all eight languages, it's probably neater to create a dataset with each language as a configuration. This way, users can instantly load a dataset with their language of interest instead of preprocessing the dataset to filter for a specific language.
A *subset* (also called *configuration*) is a higher-level internal structure than a split, and a subset contains splits. You can think of a subset as a sub-dataset contained within a larger dataset. It is a useful structure for adding additional layers of organization to a dataset. For example, if you take a look at the [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) dataset, you'll notice there are eight different languages. While you can create a dataset containing all eight languages, it's probably neater to create a dataset with each language as a subset. This way, users can instantly load a dataset with their language of interest instead of preprocessing the dataset to filter for a specific language.

Configurations are flexible, and can be used to organize a dataset along whatever objective you'd like. For example, the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset uses configurations to organize the dataset by task. One configuration is dedicated to segmenting the whole image, while the other configuration is for instance segmentation.
Subsets are flexible, and can be used to organize a dataset along whatever objective you'd like. For example, the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset uses subsets to organize the dataset by task. One subset is dedicated to segmenting the whole image, while the other subset is for instance segmentation.
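
To make the idea concrete, here is a minimal sketch of loading a single subset with the 🤗 Datasets library. The `german` subset name is a hypothetical value used for illustration; check the dataset page for the actual subset names.

```python
from datasets import load_dataset

# Load one subset (also called "configuration") of a dataset.
# "german" is a hypothetical subset name, used here for illustration only.
mls = load_dataset(
    "facebook/multilingual_librispeech",  # dataset repository on the Hub
    "german",                             # subset name
    split="train",                        # load a single split of that subset
)
```

Passing the subset name as the second argument gives users only the language they asked for, without preprocessing the other languages away.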
2 changes: 1 addition & 1 deletion docs/source/croissant.md
@@ -54,7 +54,7 @@ curl https://huggingface.co/api/datasets/ibm/duorc/croissant \

Under the hood it uses the `https://datasets-server.huggingface.co/croissant-crumbs` endpoint and enriches it with the Hub metadata.

The endpoint response is a [JSON-LD](https://json-ld.org/) containing the metadata in the Croissant format. For example, the [`ibm/duorc`](https://huggingface.co/datasets/ibm/duorc) dataset has two configurations, `ParaphraseRC` and `SelfRC` (see the [List splits and configurations](./splits) guide for more details about splits and configurations). The metadata links to their Parquet files and describes the type of each of the six columns: `plot_id`, `plot`, `title`, `question_id`, `question`, and `no_answer`:
The endpoint response is a [JSON-LD](https://json-ld.org/) containing the metadata in the Croissant format. For example, the [`ibm/duorc`](https://huggingface.co/datasets/ibm/duorc) dataset has two subsets, `ParaphraseRC` and `SelfRC` (see the [List splits and subsets](./splits) guide for more details about splits and subsets). The metadata links to their Parquet files and describes the type of each of the six columns: `plot_id`, `plot`, `title`, `question_id`, `question`, and `no_answer`:

```json
{
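
For instance, a minimal Python sketch of fetching this Croissant metadata via the Hub endpoint shown above. The top-level `recordSet` field is part of the Croissant format; treat the exact response shape as illustrative.

```python
import requests

# Fetch the Croissant (JSON-LD) metadata of the ibm/duorc dataset.
url = "https://huggingface.co/api/datasets/ibm/duorc/croissant"
response = requests.get(url)
response.raise_for_status()
croissant = response.json()

# Each subset (ParaphraseRC, SelfRC) is described by a record set entry.
print([record_set.get("name") for record_set in croissant.get("recordSet", [])])
```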
4 changes: 2 additions & 2 deletions docs/source/filter.md
@@ -13,7 +13,7 @@ Feel free to also try it out with [ReDoc](https://redocly.github.io/redoc/?url=h

The `/filter` endpoint accepts the following query parameters:
- `dataset`: the dataset name, for example `nyu-mll/glue` or `mozilla-foundation/common_voice_10_0`
- `config`: the configuration name, for example `cola`
- `config`: the subset name, for example `cola`
- `split`: the split name, for example `train`
- `where`: the filter condition
- `orderby`: the order-by clause
@@ -44,7 +44,7 @@ either the string "name" column is equal to 'Simone' or the integer "children" c
The `orderby` parameter must contain the column name (in double quotes) whose values will be sorted (in ascending order by default).
To sort the rows in descending order, use the DESC keyword, like `orderby="age" DESC`.

For example, let's filter those rows with no_answer=false in the `train` split of the `SelfRC` configuration of the `ibm/duorc` dataset restricting the results to the slice 150-151:
For example, let's filter the rows with no_answer=false in the `train` split of the `SelfRC` subset of the `ibm/duorc` dataset, restricting the results to the slice 150-151:

<inferencesnippet>
<python>
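
A minimal Python sketch of that request follows; the `offset` and `length` pagination parameters are an assumption for selecting the 150-151 slice, and the response is assumed to carry the rows under a `rows` key, as for `/rows`.

```python
import requests

# Filter rows of the SelfRC subset of ibm/duorc where no_answer is false.
params = {
    "dataset": "ibm/duorc",
    "config": "SelfRC",            # the subset name
    "split": "train",
    "where": '"no_answer"=false',  # column names are double-quoted in the filter syntax
    "offset": 150,                 # assumed pagination parameters for the 150-151 slice
    "length": 2,
}
response = requests.get("https://datasets-server.huggingface.co/filter", params=params)
response.raise_for_status()
data = response.json()  # rows are assumed to be under a "rows" key
```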
2 changes: 1 addition & 1 deletion docs/source/first_rows.md
@@ -9,7 +9,7 @@ This guide shows you how to use the dataset viewer's `/first-rows` endpoint to p
The `/first-rows` endpoint accepts three query parameters:

- `dataset`: the dataset name, for example `nyu-mll/glue` or `mozilla-foundation/common_voice_10_0`
- `config`: the configuration name, for example `cola`
- `config`: the subset name, for example `cola`
- `split`: the split name, for example `train`

<inferencesnippet>
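
For example, a minimal sketch of calling `/first-rows` with the three parameters listed above, using the `cola` subset of `nyu-mll/glue` from the parameter examples:

```python
import requests

# Preview the first rows of the "cola" subset of nyu-mll/glue.
params = {
    "dataset": "nyu-mll/glue",
    "config": "cola",  # the subset name
    "split": "train",
}
response = requests.get("https://datasets-server.huggingface.co/first-rows", params=params)
response.raise_for_status()
first_rows = response.json()
```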
2 changes: 1 addition & 1 deletion docs/source/info.md
@@ -5,7 +5,7 @@ The dataset viewer provides an `/info` endpoint for exploring the general inform
The `/info` endpoint accepts two query parameters:

- `dataset`: the dataset name
- `config`: the configuration name
- `config`: the subset name

<inferencesnippet>
<python>
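
For example, a minimal sketch of calling `/info`, filtering on one subset; the dataset and subset names are taken from the `/info` examples in the OpenAPI spec below.

```python
import requests

# Get the metadata of the "ax" subset of nyu-mll/glue.
params = {
    "dataset": "nyu-mll/glue",
    "config": "ax",  # optional: omit it to get one entry per subset
}
response = requests.get("https://datasets-server.huggingface.co/info", params=params)
response.raise_for_status()
dataset_info = response.json()["dataset_info"]
```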
74 changes: 28 additions & 46 deletions docs/source/openapi.json
@@ -1292,7 +1292,7 @@
"X-Error-Code-DatasetWithTooManyConfigsError": {
"type": "string",
"const": "DatasetWithTooManyConfigsError",
"description": "The number of configs of a dataset exceeded the limit."
"description": "The number of subsets of a dataset exceeded the limit."
},
"X-Error-Code-ExternalAuthenticatedError": {
"type": "string",
@@ -1356,8 +1356,8 @@
},
"examples": {
"InexistentConfigError": {
"summary": "The response is not found because the config does not exist.",
"description": "try with config=inexistent-config.",
"summary": "The response is not found because the subset does not exist.",
"description": "try with config=inexistent-subset.",
"value": {
"error": "Not found."
}
@@ -1468,7 +1468,7 @@
"RequiredConfig": {
"name": "config",
"in": "query",
"description": "The dataset configuration (or subset).",
"description": "The dataset subset (also called 'configuration').",
"required": true,
"schema": {
"type": "string"
@@ -1510,7 +1510,7 @@
"OptionalConfig": {
"name": "config",
"in": "query",
"description": "The dataset configuration (or subset) on which to filter the response.",
"description": "The dataset subset on which to filter the response.",
"schema": {
"type": "string"
},
@@ -1520,7 +1520,7 @@
"value": "cola"
},
"yangdong/ecqa": {
"summary": "The default configuration given by the 🤗 Datasets library",
"summary": "The default subset given by the 🤗 Datasets library",
"value": "default"
}
}
@@ -1827,8 +1827,8 @@
"failed": []
}
},
"splits for a single config": {
"summary": "dair-ai/emotion has two configs. Setting config=unsplit only returns the splits for this config.",
"splits for a single subset": {
"summary": "dair-ai/emotion has two subsets. Setting config=unsplit only returns the splits for this subset.",
"description": "Try with https://datasets-server.huggingface.co/splits?dataset=dair-ai/emotion&config=unsplit.",
"value": {
"splits": [
@@ -1840,8 +1840,8 @@
]
}
},
"one of the config has an error": {
"summary": "one of the configs require manual download, and fails to give the split names",
"one of the subsets has an error": {
"summary": "one of the subsets require manual download, and fails to give the split names",
"description": "Try with https://datasets-server.huggingface.co/splits?dataset=superb.",
"value": {
"splits": [
@@ -2023,8 +2023,8 @@
"$ref": "#/components/schemas/CustomError"
},
"examples": {
"too many configs in the dataset": {
"summary": "The dataset has too many configs. The server does not support more than 3,000 configs.",
"too many subsets in the dataset": {
"summary": "The dataset has too many subsets. The server does not support more than 3,000 subsets.",
"description": "Try with https://datasets-server.huggingface.co/splits?dataset=facebook/flores",
"value": {
"error": "The maximum number of configs allowed is 3000, dataset has 41617 configs."
@@ -4774,7 +4774,7 @@
"/info": {
"get": {
"summary": "Get the metadata of a dataset.",
"description": "Returns the metadata of the dataset: description, homepage, features, etc. Use the optional config parameter to filter the response.",
"description": "Returns the metadata of the dataset: description, homepage, features, etc. Use the optional config parameter to filter the response on a subset.",
"externalDocs": {
"description": "The response is a dump of the DatasetInfo object from the datasets library",
"url": "https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetInfo"
@@ -4815,7 +4815,7 @@
},
"examples": {
"dataset metadata": {
"summary": "metadata of a dataset. It's an object, with one key per config",
"summary": "metadata of a dataset. It's an object, with one key per subset",
"description": "Try with https://datasets-server.huggingface.co/info?dataset=mnist",
"value": {
"dataset_info": {
@@ -4903,7 +4903,7 @@
}
},
"config metadata": {
"summary": "metadata for a dataset config",
"summary": "metadata for a dataset subset",
"description": "Try with https://datasets-server.huggingface.co/info?dataset=nyu-mll/glue&config=ax",
"value": {
"dataset_info": {
@@ -4943,8 +4943,8 @@
"partial": false
}
},
"dataset metadata with failed configs": {
"summary": "metadata of a dataset which has failed configs. The failed configs are listed in 'failed'.",
"dataset metadata with failed subsets": {
"summary": "metadata of a dataset which has failed subsets. The failed subsets are listed in 'failed'.",
"description": "Try with https://datasets-server.huggingface.co/info?dataset=atomic",
"value": {
"dataset_info": {},
@@ -5066,7 +5066,7 @@
"/size": {
"get": {
"summary": "Get the size of a dataset.",
"description": "Returns the size (number of rows, storage) of the dataset. Use the optional config parameter to filter the response.",
"description": "Returns the size (number of rows, storage) of the dataset. Use the optional config parameter to filter the response on a subset.",
"externalDocs": {
"description": "See size in the Hub docs.",
"url": "https://huggingface.co/docs/datasets-server/size"
@@ -5156,7 +5156,7 @@
}
},
"config size": {
"summary": "size of a dataset config",
"summary": "size of a dataset subset",
"description": "Try with https://datasets-server.huggingface.co/size?dataset=nyu-mll/glue&config=ax",
"value": {
"size": {
@@ -5184,8 +5184,8 @@
"partial": false
}
},
"dataset size with failed configs": {
"summary": "size of a dataset which has failed configs. The failed configs are listed in 'failed'.",
"dataset size with failed subsets": {
"summary": "size of a dataset which has failed subsets. The failed subsets are listed in 'failed'.",
"description": "Try with https://datasets-server.huggingface.co/size?dataset=atomic",
"value": {
"size": {
@@ -5317,7 +5317,7 @@
"/opt-in-out-urls": {
"get": {
"summary": "Get the number of opted-in and opted-out image URLs in a dataset.",
"description": "Based on the API of spawning.ai, returns the number of image URLs that have been opted-in and opted-out. Use the optional config and splits parameters to filter the response. Only a sample of the rows is scanned, the first 100K rows at the moment.",
"description": "Based on the API of spawning.ai, returns the number of image URLs that have been opted-in and opted-out. Use the optional config and split parameters to filter the response. Only a sample of the rows is scanned, the first 100K rows at the moment.",
"externalDocs": {
"description": "See spawning.io (Hub docs). The doc is still missing for the endpoint, see https://github.com/huggingface/dataset-viewer/issues/1664.",
"url": "https://huggingface.co/docs/datasets-server/"
@@ -5373,8 +5373,8 @@
"full_scan": false
}
},
"number of URLS for a config": {
"summary": "number of URLs for a config.",
"number of URLS for a subset": {
"summary": "number of URLs for a subset.",
"description": "Try with https://datasets-server.huggingface.co/opt-in-out-urls?dataset=conceptual_captions&config=labeled",
"value": {
"urls_columns": ["image_url"],
@@ -6432,29 +6432,11 @@
"std": 60.07286,
"histogram": {
"hist": [
1734,
1637,
1326,
121,
10,
3,
1,
3,
1,
2
1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2
],
"bin_edges": [
256,
318,
380,
442,
504,
566,
628,
690,
752,
814,
873
256, 318, 380, 442, 504, 566, 628, 690, 752,
814, 873
]
}
}
@@ -6492,7 +6474,7 @@
}
}
}
],
],
"partial": false
}
}
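
Many of the example URLs in this spec exercise the `/splits` endpoint. As a sketch, the `dair-ai/emotion` example above translates to the following call; the `splits` and `failed` response keys are the ones shown in the spec examples.

```python
import requests

# List the splits of the "unsplit" subset of dair-ai/emotion (see the spec example).
params = {"dataset": "dair-ai/emotion", "config": "unsplit"}
response = requests.get("https://datasets-server.huggingface.co/splits", params=params)
response.raise_for_status()
payload = response.json()
print(payload["splits"])  # the splits of the requested subset
print(payload["failed"])  # subsets whose splits could not be computed (may be empty)
```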
