Stats for images (#2712)
* add ImageColumn (compute histogram of widths)

* add test image dataset

* refactor

* update openapi.json

* update Hub's doc

* update workers' readme
polinaeterna authored Apr 17, 2024
1 parent abca742 commit c7fb237
Showing 12 changed files with 457 additions and 122 deletions.
86 changes: 85 additions & 1 deletion docs/source/openapi.json
@@ -1098,7 +1098,8 @@
"string_text",
"bool",
"list",
- "audio"
+ "audio",
+ "image"
]
},
"Histogram": {
@@ -6232,6 +6233,89 @@
],
"partial": false
}
},
"A split (Matthijs/snacks) with image column": {
"summary": "Statistics on an image column 'image'.",
"description": "Try with https://datasets-server.huggingface.co/statistics?dataset=Matthijs/snacks&config=default&split=train.",
"value": {
"num_examples": 4838,
"statistics": [
{
"column_name": "image",
"column_type": "image",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 256,
"max": 873,
"mean": 327.99339,
"median": 341.0,
"std": 60.07286,
"histogram": {
"hist": [
1734,
1637,
1326,
121,
10,
3,
1,
3,
1,
2
],
"bin_edges": [
256,
318,
380,
442,
504,
566,
628,
690,
752,
814,
873
]
}
}
},
{
"column_name": "label",
"column_type": "class_label",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"no_label_count": 0,
"no_label_proportion": 0.0,
"n_unique": 20,
"frequencies": {
"apple": 250,
"banana": 250,
"cake": 249,
"candy": 249,
"carrot": 249,
"cookie": 249,
"doughnut": 250,
"grape": 250,
"hot dog": 250,
"ice cream": 250,
"juice": 250,
"muffin": 250,
"orange": 249,
"pineapple": 260,
"popcorn": 180,
"pretzel": 154,
"salad": 250,
"strawberry": 249,
"waffle": 250,
"watermelon": 250
}
}
}
],
"partial": false
}
}
}
}
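The new OpenAPI example can also be inspected programmatically. The following is a minimal sketch that assumes only the endpoint URL and the response keys shown in the example above; the helper names `statistics_url` and `image_column_stats` are hypothetical and not part of any library.

```python
from urllib.parse import urlencode

# Endpoint taken from the "description" field of the example above.
BASE_URL = "https://datasets-server.huggingface.co/statistics"


def statistics_url(dataset: str, config: str, split: str) -> str:
    """Build the /statistics URL for one split (hypothetical helper)."""
    query = urlencode(
        {"dataset": dataset, "config": config, "split": split},
        safe="/",  # keep the "/" in names such as "Matthijs/snacks" readable
    )
    return f"{BASE_URL}?{query}"


def image_column_stats(response: dict) -> dict:
    """Map each image column name to its `column_statistics` dict."""
    return {
        column["column_name"]: column["column_statistics"]
        for column in response["statistics"]
        if column["column_type"] == "image"
    }


# To fetch a live response (requires network access):
#   import json, urllib.request
#   url = statistics_url("Matthijs/snacks", "default", "train")
#   with urllib.request.urlopen(url) as resp:
#       stats = image_column_stats(json.load(resp))
```

Filtering on `column_type` rather than on the column name keeps the helper usable for datasets whose image column is not called `image`.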
61 changes: 60 additions & 1 deletion docs/source/statistics.md
@@ -165,7 +165,7 @@ The response JSON contains three keys:

## Response structure by data type

- Currently, statistics are supported for strings, float and integer numbers, lists, audio data and the special [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature type of the [`datasets`](https://huggingface.co/docs/datasets/) library.
+ Currently, statistics are supported for strings, float and integer numbers, lists, audio and image data and the special [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature type of the [`datasets`](https://huggingface.co/docs/datasets/) library.

`column_type` in response can be one of the following values:

@@ -177,6 +177,7 @@ Currently, statistics are supported for strings, float and integer numbers, list
* `string_text` - for string data types if they do not represent categories (see below)
* `list` - for lists of any other data types (including lists)
* `audio` - for audio data
* `image` - for image data

### `class_label`

@@ -532,3 +533,61 @@ For audio data, the distribution of audio files durations is computed. The follo

</p>
</details>


### `image`

For image data, the distribution of image widths is computed. The following measures are returned:

* minimum, maximum, mean, median, and standard deviation of image widths
* number and proportion of `null` values
* histogram of image widths with 10 bins

<details><summary>Example </summary>
<p>

```json
{
"column_name": "image",
"column_type": "image",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 256,
"max": 873,
"mean": 327.99339,
"median": 341.0,
"std": 60.07286,
"histogram": {
"hist": [
1734,
1637,
1326,
121,
10,
3,
1,
3,
1,
2
],
"bin_edges": [
256,
318,
380,
442,
504,
566,
628,
690,
752,
814,
873
]
}
}
}
```

</p>
</details>
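
The measures listed for image columns can be reproduced on a plain list of widths. The following is a sketch under stated assumptions, not the worker's actual implementation: the function name `width_statistics` is hypothetical, the integer bin size `ceil((max - min + 1) / n_bins)` is an assumption chosen because it yields whole-pixel edges like those in the example (step 62 from 256, with the last edge at the maximum 873), and the sample standard deviation is assumed.

```python
import math
import statistics


def width_statistics(widths, n_bins=10):
    """Summarize image widths in pixels; `None` marks a missing image.

    Assumes at least one non-null width is present.
    """
    present = [w for w in widths if w is not None]
    nan_count = len(widths) - len(present)
    lo, hi = min(present), max(present)

    # Assumption: integer bin size, so every edge is a whole pixel count.
    bin_size = math.ceil((hi - lo + 1) / n_bins)
    left_edges = list(range(lo, hi + 1, bin_size))  # at most n_bins bins
    hist = [0] * len(left_edges)
    for w in present:
        hist[(w - lo) // bin_size] += 1

    return {
        "nan_count": nan_count,
        "nan_proportion": round(nan_count / len(widths), 5),
        "min": lo,
        "max": hi,
        "mean": round(statistics.mean(present), 5),
        "median": float(statistics.median(present)),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std": round(statistics.stdev(present), 5) if len(present) > 1 else 0.0,
        # The last edge is the maximum itself, so the final bin is closed.
        "histogram": {"hist": hist, "bin_edges": left_edges + [hi]},
    }
```

With a minimum of 256 and a maximum of 873, this binning gives a bin size of 62 and the same eleven edges as the example above; the counts depend on the actual widths in the split.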
51 changes: 51 additions & 0 deletions services/worker/README.md
@@ -119,6 +119,7 @@ The response has three fields: `num_examples`, `statistics`, and `partial`. `par
* `bool` - for boolean dtype ("bool")
* `list` - for lists of other data types (including lists)
* `audio` - for audio data
* `image` - for image data

`column_statistics` content depends on the feature type, see examples below.
##### class_label
@@ -542,7 +543,57 @@ Shows distribution of audio files durations.
</p>
</details>

##### image

Shows the distribution of image widths.

<details><summary>example: </summary>
<p>

```python
{
"column_name": "image",
"column_type": "image",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 256,
"max": 873,
"mean": 327.99339,
"median": 341.0,
"std": 60.07286,
"histogram": {
"hist": [
1734,
1637,
1326,
121,
10,
3,
1,
3,
1,
2
],
"bin_edges": [
256,
318,
380,
442,
504,
566,
628,
690,
752,
814,
873
]
}
}
}
```
</p>
</details>
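
A response like the image example above can be checked for internal consistency. The sketch below assumes only the keys shown in the example; `check_histogram` is a hypothetical helper, and 4838 is the `num_examples` value given for the Matthijs/snacks train split in this commit's `openapi.json` example.

```python
def check_histogram(column_statistics: dict, num_examples: int) -> list:
    """Return a list of consistency problems (empty if none are found)."""
    hist = column_statistics["histogram"]["hist"]
    edges = column_statistics["histogram"]["bin_edges"]
    problems = []
    if len(edges) != len(hist) + 1:
        problems.append("bin_edges must have one more entry than hist")
    if edges != sorted(edges):
        problems.append("bin_edges must be in ascending order")
    if edges[0] != column_statistics["min"] or edges[-1] != column_statistics["max"]:
        problems.append("first and last edges must equal min and max")
    # Every non-null value falls into exactly one bin.
    if sum(hist) != num_examples - column_statistics["nan_count"]:
        problems.append("hist counts must sum to the number of non-null values")
    return problems


example = {
    "nan_count": 0,
    "min": 256,
    "max": 873,
    "histogram": {
        "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
        "bin_edges": [256, 318, 380, 442, 504, 566, 628, 690, 752, 814, 873],
    },
}
print(check_histogram(example, num_examples=4838))  # prints []
```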

### Splits worker
