Add reduce function #5533
Conversation
The documentation is starting to take great shape! Thanks a lot :)
I added some suggestions, and also one question regarding the output, which IMO should be of the same type as the initializer (maybe I should have noticed this earlier - sorry for the back and forth).
>>> result
{'text': Counter({'and': 2, 'compassionately': 1, 'explores': 1, 'the': 1, 'seemingly': 1, 'irreconcilable': 1, 'situation': 1, 'between': 1, 'conservative': 1, 'christian': 1, 'parents': 1, 'their': 1, 'estranged': 1, 'gay': 1, 'lesbian': 1, 'children': 1, '.': 1})}
Since the initializer is a Counter(), I think we can expect the output to be a Counter() as well, no?
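For reference, functools.reduce behaves this way: the result takes the type of the initializer. A minimal sketch, with made-up example texts:
>>> from collections import Counter
>>> from functools import reduce
>>> texts = ["conservative christian parents", "gay and lesbian children"]
>>> # the accumulator starts as the initializer and keeps its type throughout
>>> reduce(lambda acc, text: acc + Counter(text.split()), texts, Counter())
Counter({'conservative': 1, 'christian': 1, 'parents': 1, 'gay': 1, 'and': 1, 'lesbian': 1, 'children': 1})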
I see your point! It would also align more with the behaviour of functools.reduce. On the other hand, it's a bit annoying that the output of different parameter variations is not of the same type; for example, r1 and r2 below would be int and dict respectively.
from datasets import Dataset

int_ds = Dataset.from_dict({"x": [1, 2, 3], "y": [1, 2, 3]})
add = lambda x, y: x + y  # binary operation; avoids shadowing the builtin sum
r1 = int_ds.reduce(add, initializer=0)
# 6
r2 = int_ds.reduce(add)
# {"x": 6, "y": 6}
I don't really have a strong opinion one way or the other; either would be confusing/annoying in some way. Which do you prefer?
The proposed API doesn't seem intuitive to me - one can already use functools.reduce.
Thanks for sharing this Google Colab, it has nice examples! Though I still think
However, I agree that maintaining this can be challenging, especially if you think about how
Not the main purpose, but this was mentioned as a "feature" in the previous docs if I remember. And all this is related to the multi-processing case, which we can document. Besides the linked issue, I can't find requests for reduce.
I think @srush was looking for a way to do a word count but ended up using a single-processed map.
Yup indeed
While counting is one example, I often find I want to compute different statistics over a dataset. This seems like a natural way to do it in a stateless manner. I guess you could use functools.reduce, but that wouldn't allow batching, right?
I've updated the Colab with an example that reduces batches with functools.reduce. Plus, for simple reductions such as
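The Colab isn't reproduced here, but a minimal sketch of a batched reduction with functools.reduce might look as follows; the dataset, column name, and batch size are illustrative assumptions, not the Colab's actual code (Dataset.iter yields batches as dicts of lists):
>>> from collections import Counter
>>> from functools import reduce
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> def count_words(acc, batch):
...     # each batch maps column names to lists of values
...     for text in batch["text"]:
...         acc.update(text.split())
...     return acc
...
>>> word_counts = reduce(count_words, ds.iter(batch_size=1000), Counter())
>>> type(word_counts)
<class 'collections.Counter'>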
Whenever I wanted to calculate statistics for datasets in the past, I used
Should I close this and open another PR with descriptions of how to use
Yes, I think good documentation is the way to go here. @mariosasko's examples are clear and efficient. Maybe we could have an
And also a new conceptual guide on
cc @stevhliu for visibility and in case you have some comments.
I would create a
Coolio. I'll close this PR and get going on another one adding what we've discussed during the next couple of days!
Is adding a section to the docs still planned? Couldn't find any related PR.
There is a new integration with polars which is convenient btw. Here is an example for computing the length of the longest dialogue in a dataset using polars:
>>> from datasets import load_dataset
>>> ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
>>> df = ds.to_polars()
>>> df.head()
shape: (5, 2)
┌─────────────────────────────────┬───────────────────┐
│ messages ┆ source │
│ --- ┆ --- │
│ list[struct[2]] ┆ str │
╞═════════════════════════════════╪═══════════════════╡
│ [{"The function \( g(x) \) sat… ┆ numina-cot-100k │
│ [{"Ben twice chooses a random … ┆ numina-cot-100k │
│ [{"Find all values of $x$ that… ┆ numina-cot-100k │
│ [{"How can you help me? I'm wr… ┆ smol-magpie-ultra │
│ [{"Extract and present the mai… ┆ smol-summarize │
└─────────────────────────────────┴───────────────────┘
>>> df["messages"].list.len().max()
58
For very large-scale datasets it can be worth using map:
>>> import polars as pl
>>> f = lambda df: pl.DataFrame({"messages_max_length": [df["messages"].list.len().max()]})
>>> intermediate_ds = ds.with_format("polars").map(f, batched=True) # you can also set batch_size=
>>> intermediate_ds.to_polars()["messages_max_length"].max()
58
This last method can be used to implement a map + intermediate reduce + final reduce approach.
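For instance, the intermediate reduce can be parallelized by passing num_proc to map; a sketch reusing f from above (the num_proc value is an illustrative assumption):
>>> intermediate_ds = ds.with_format("polars").map(f, batched=True, num_proc=4)  # one intermediate reduce per batch, across 4 processes
>>> intermediate_ds.to_polars()["messages_max_length"].max()  # final reduce over the small intermediate table
58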
This PR closes #5496.
I tried to imitate the reduce method from functools, i.e. the function input must be a binary operation. I assume that the input type has an empty element, i.e. input_type() is defined, as the accumulator is instantiated as this object - I'm not sure whether this is a reasonable assumption?
If batched=True, the reduction of each shard is not returned, but the reduction of the entire dataset. I was unsure whether this was an intuitive API, or whether it would make more sense to return the reduction of each shard?
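A minimal sketch of those semantics using functools.reduce - my reading of the description above, not the PR's actual implementation: the accumulator is instantiated as the input type's empty element (e.g. int() == 0) and the binary operation folds in one element at a time.
>>> from functools import reduce
>>> column = [1, 2, 3]
>>> reduce(lambda x, y: x + y, column, int())  # accumulator starts as int() == 0
6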