
Add reduce function #5533

Closed
wants to merge 75 commits into from

Conversation


@AJDERS AJDERS commented Feb 15, 2023

This PR closes #5496.

I tried to imitate the reduce method from functools, i.e. the function input must be a binary operation. I assume that the input type has an empty element, i.e. input_type() is defined, since the accumulator is instantiated as this object. I'm not sure whether this is a reasonable assumption?

If batched=True, the reduction of each shard is not returned, but rather the reduction of the entire dataset. I was unsure whether this is an intuitive API, or whether it would make more sense to return the reduction of each shard?
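To illustrate the intended semantics, here is a rough per-column sketch in terms of functools.reduce (the helper name and signature below are just for illustration, not the PR code):

from functools import reduce as functools_reduce
from datasets import Dataset

def column_reduce(dataset, function, column):
    # Hypothetical helper: the accumulator starts out as an "empty" instance of
    # the value type, e.g. int() == 0, assuming the type has a zero-argument constructor.
    values = dataset[column]
    initializer = type(values[0])()
    return functools_reduce(function, values, initializer)

ds = Dataset.from_dict({"x": [1, 2, 3]})
column_reduce(ds, lambda a, b: a + b, "x")
# 6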

Member

@lhoestq lhoestq left a comment


The documentation is starting to take great shape! Thanks a lot :)

I added some suggestions, and also one question regarding the output which IMO should be of the same type as the initializer (maybe I should have noticed this earlier - sorry for the back and forth)

src/datasets/arrow_dataset.py (outdated review thread, resolved)
src/datasets/arrow_dataset.py (outdated review thread, resolved)
Comment on lines +3440 to +3441
>>> result
{'text': Counter({'and': 2, 'compassionately': 1, 'explores': 1, 'the': 1, 'seemingly': 1, 'irreconcilable': 1, 'situation': 1, 'between': 1, 'conservative': 1, 'christian': 1, 'parents': 1, 'their': 1, 'estranged': 1, 'gay': 1, 'lesbian': 1, 'children': 1, '.': 1})}
@lhoestq (Member)

Since the initializer is a Counter(), I think we can expect the output to be a Counter() as well, no?

Author

@AJDERS AJDERS Feb 21, 2023


I see your point! It would also align more with the behaviour of functools.reduce. On the other hand, it's a bit annoying that the output of different parameter variations is not the same type; as an example, r1 and r2 below would be int and dict respectively.

int_ds = Dataset.from_dict({"x": [1, 2, 3], "y": [1, 2, 3]})
add = lambda x, y: x + y
r1 = int_ds.reduce(add, initializer=0)
# 6
r2 = int_ds.reduce(add)
# {"x": 6, "y": 6}

I don't really have a strong opinion one way or the other; either would be confusing/annoying in some way. Which do you prefer?

src/datasets/arrow_dataset.py (outdated review thread, resolved)
src/datasets/arrow_dataset.py (outdated review thread, resolved)
@mariosasko
Collaborator

The proposed API doesn't seem intuitive to me - one can already use functools.reduce or Dataset.map for this purpose (see the Colab with examples), so perhaps we could have a section in the docs that uses these methods to perform reductions rather than introducing a new method (which needs to be maintained later).
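For reference, a small illustration of the functools.reduce route (a hedged sketch, not necessarily the exact code from the Colab):

import functools
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a b", "a c", "b"]})

# Count word occurrences over a column with plain functools.reduce (single process).
def count_words(acc, example):
    for word in example["text"].split():
        acc[word] = acc.get(word, 0) + 1
    return acc

word_counts = functools.reduce(count_words, ds, {})
# {'a': 2, 'b': 2, 'c': 1}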

@lhoestq
Member

lhoestq commented Feb 22, 2023

Thanks for sharing this Google Colab, it has nice examples!

Though I still think functools.reduce with multiprocessing can be a pain - we offer something easier here:

  • no need to use a pool yourself
  • no need to use map just to iterate on the dataset (not its main purpose)
  • native support for lambdas (using dill)
  • the combiner is mandatory for multiprocessing to avoid ending up with an incorrect result as in your example (see the sketch below)
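To make that last point concrete, here is a minimal, sequential sketch of the reduce/combine split (illustrative names, not the PR implementation):

from functools import reduce

def add(a, b):
    return a + b

data = list(range(10))
shards = [data[0::2], data[1::2]]                       # what each worker process would see
partials = [reduce(add, shard, 0) for shard in shards]  # per-shard (per-process) reductions
total = reduce(add, partials, 0)                        # the combiner merges the partial results: 45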

However, I agree that maintaining this can be challenging, especially considering how complex map already is, and if we also have to deal with dataset formatting.

@mariosasko
Collaborator

> native support for lambdas (using dill)

Replacing multiprocessing with multiprocess in the example would allow that.
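For example, something along these lines should work, assuming the multiprocess package is installed (a hedged sketch):

# multiprocess mirrors the multiprocessing API but pickles with dill,
# so lambdas can be sent to worker processes.
import multiprocess as mp

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        squares = pool.map(lambda x: x * x, range(10))
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]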

> no need to use map just to iterate on the dataset (not its main purpose)

Not the main purpose, but this was mentioned as a "feature" in the previous docs if I remember correctly.

And all this is related to the multi-processing case, which we can document.

Besides the linked issue, I can't find requests for Dataset.reduce, which makes me think functools.reduce does the job for most users.

@lhoestq
Member

lhoestq commented Feb 22, 2023

> Besides the linked issue, I can't find requests for Dataset.reduce, which makes me think functools.reduce does the job for most users.

I think @srush was looking for a way to do a word count but ended up using a single-process map. I also saw some users on the forum wanting to compute a max.

> Not the main purpose, but this was mentioned as a "feature" in the previous docs if I remember correctly.
>
> And all this is related to the multi-processing case, which we can document.

Yup indeed

@srush
Contributor

srush commented Feb 22, 2023

While counting is one example, I often find I want to compute different statistics over a dataset. This seems like a natural way to do it in a stateless manner.

I guess you could use functools.reduce, but that wouldn't allow batching, right?

@mariosasko
Collaborator

mariosasko commented Feb 22, 2023

I've updated the Colab with an example that reduces batches with map and then computes the final result. It would be nice to have a similar example (explained in detail) in the docs to show the full power of map.

Plus, for simple reductions such as max, one can do pc.max(ds.with_format("arrow")["col"]) to directly get the result (without loading the entire column in RAM).
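A hedged sketch of that batched map + final reduction pattern, using max text length as the statistic (illustrative code, not the exact Colab cells):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["short", "a longer sentence", "mid size"]})

# Each batch is reduced to a single partial result; the original columns are
# dropped so the intermediate dataset only holds the partial results.
partial = ds.map(
    lambda batch: {"max_len": [max(len(t) for t in batch["text"])]},
    batched=True,
    batch_size=2,
    remove_columns=ds.column_names,
)
final = max(partial["max_len"])
# 17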

@srush

> I guess you could use functools.reduce, but that wouldn't allow batching, right?

You can use .iter(batch_size) to get batches.
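For instance, a batched reduction with functools.reduce over Dataset.iter() could look like this (hedged sketch):

import functools
from datasets import Dataset

ds = Dataset.from_dict({"x": list(range(10))})

def add_batch(acc, batch):
    # each batch is a dict of lists, e.g. {"x": [0, 1, 2, 3]}
    return acc + sum(batch["x"])

total = functools.reduce(add_batch, ds.iter(batch_size=4), 0)
# 45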

@srush
Contributor

srush commented Feb 22, 2023

That functools example is clean. I didn't know about .iter(). That would handle my use case.

The stateful map with a global variable is pretty hairy. I don't think we should recommend people do that.

@AJDERS
Author

AJDERS commented Feb 22, 2023

Whenever I wanted to calculate statistics for datasets in the past, I used functools similarly to how it's described in the Colab, but I always felt it was a bit of a hassle to use together with multiprocessing, which is why I picked up the issue: to do it "once and for all".

@AJDERS
Author

AJDERS commented Feb 27, 2023

Should I close this and open another PR with descriptions of how to use map for reduction?

@lhoestq
Member

lhoestq commented Feb 27, 2023

Yes, I think good documentation is the way to go here. @mariosasko's examples are clear and efficient.

Maybe we could have an Aggregations section in the Process page with some guides on how to:

  • use .map() to compute aggregates
  • use .with_format("arrow") for max, min, etc. to save RAM and get max speed
  • use a multiprocessed .map() to get partial results in parallel and combine them (max text length example)
  • (advanced) use multiprocessing with an arbitrary accumulator (word count example; see the sketch after this list)
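As a rough sketch of that last item (illustrative code, assuming partial results are stored as JSON strings to keep the Arrow schema simple; not the final docs):

import json
from collections import Counter
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a b", "a c", "b b", "c"]})

def count_batch(batch):
    # Reduce one batch to a single partial word count, serialized as JSON.
    counts = Counter(word for text in batch["text"] for word in text.split())
    return {"counts": [json.dumps(counts)]}

partials = ds.map(
    count_batch,
    batched=True,
    batch_size=2,
    remove_columns=ds.column_names,
    num_proc=2,
)

# Combine the per-process partial results into the final count.
final = sum((Counter(json.loads(c)) for c in partials["counts"]), Counter())
# Counter({'b': 3, 'a': 2, 'c': 2})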

And also a new conceptual guide on Multiprocessed mapping, explaining that it helps speed up CPU-intensive processing, but also why it may lead to incorrect results when computing aggregates.

cc @stevhliu for visibility and if you have some comments

@stevhliu
Member

To be more exact, I would create a Reduce subsection under Map to demonstrate these examples, since we're showing how they can be done with the Dataset.map function. It'd also be good to add a link to the new concept guide from this section to solidify user understanding :)

@AJDERS
Author

AJDERS commented Feb 28, 2023

Coolio. I'll close this PR and get going on another one adding what we've discussed during the next couple of days!

@AJDERS AJDERS closed this Feb 28, 2023
@mariosasko mariosasko mentioned this pull request Jul 21, 2023
@taha-yassine

Is adding a section to the docs still planned? Couldn't find any related PR.

@lhoestq
Member

lhoestq commented Nov 25, 2024

There is a new integration with polars, which is convenient btw. Here is an example of computing the length of the longest dialogue in a dataset using polars:

>>> from datasets import load_dataset
>>> ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
>>> df = ds.to_polars()
>>> df.head()
shape: (5, 2)
┌─────────────────────────────────┬───────────────────┐
│ messages                        ┆ source            │
│ ---                             ┆ ---               │
│ list[struct[2]]                 ┆ str               │
╞═════════════════════════════════╪═══════════════════╡
│ [{"The function \( g(x) \) sat… ┆ numina-cot-100k   │
│ [{"Ben twice chooses a random … ┆ numina-cot-100k   │
│ [{"Find all values of $x$ that… ┆ numina-cot-100k   │
│ [{"How can you help me? I'm wr… ┆ smol-magpie-ultra │
│ [{"Extract and present the mai… ┆ smol-summarize    │
└─────────────────────────────────┴───────────────────┘
>>> df["messages"].list.len().max()
58

For very large scale datasets it can be worth using map() on batches of data to compute intermediate results, save some memory, and cache the result:

>>> import polars as pl
>>> f = lambda df: pl.DataFrame({"messages_max_length": [df["messages"].list.len().max()]})
>>> intermediate_ds = ds.with_format("polars").map(f, batched=True)  # you can also set batch_size=
>>> intermediate_ds.to_polars()["messages_max_length"].max()
58

This last method can be used to implement a map + intermediate reduce + final reduce approach.

Development

Successfully merging this pull request may close these issues.

Add a reduce method