Writing / Reading Bug involving writer chunk_bytes #393
dangthatsright added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Oct 10, 2024
Hi! Thanks for your contribution, great first issue!

Hey @dangthatsright, nice. Option 1 sounds good. Feel free to make a PR to fix it.

I think this is what was happening in my issue from before! Good fix #388

A simple fix would be to pad the `chunk_bytes` with the chunk header in the reader, the same as is done within the writer, so it is backward compatible.

That's a great idea, thank you!
🐛 Bug
When writing chunks, `chunk_bytes` is calculated in `litdata/src/litdata/streaming/writer.py` (line 237 in b9aa903).
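To illustrate the accounting mismatch, here is a minimal sketch (not the actual litdata code; the function name and header layout are assumptions for illustration): `chunk_bytes` counts only the serialized samples, while the file written to disk also carries a header with the item count and per-item offsets.

```python
# Hypothetical sketch of writer-side accounting, NOT litdata's real code:
# chunk_bytes sums only the sample payloads, but the on-disk file also
# contains a header (item count + one offset per item), so the complete
# file is strictly larger than chunk_bytes.
import struct

def serialize_chunk(samples: list[bytes]) -> tuple[bytes, int]:
    """Return (file_payload, chunk_bytes) for a list of serialized samples."""
    # chunk_bytes as recorded in the index: sum of sample sizes only.
    chunk_bytes = sum(len(s) for s in samples)

    # The on-disk file additionally contains a header of offsets.
    offsets, cursor = [], 0
    for s in samples:
        offsets.append(cursor)
        cursor += len(s)
    header = struct.pack(f"<I{len(offsets)}I", len(samples), *offsets)
    payload = header + b"".join(samples)
    return payload, chunk_bytes

payload, chunk_bytes = serialize_chunk([b"aaaa", b"bbbbbb"])
assert len(payload) > chunk_bytes  # the header makes the file bigger
```

Under this (assumed) layout, the real file is always `chunk_bytes` plus the header size, which is exactly the gap the reader's size check misses.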
When reading chunks, a separate thread downloads the chunks from the cloud, and a while loop spins until the file size is larger than `chunk_bytes`; see `litdata/src/litdata/streaming/item_loader.py` (line 146 in b9aa903).
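The waiting loop is roughly of this shape (a minimal sketch with illustrative names, not litdata's actual implementation): it treats "file size has reached `chunk_bytes`" as "download finished", which is unsound when the complete file is larger than `chunk_bytes`.

```python
# Illustrative sketch of the reader-side wait, NOT litdata's real code:
# the loop returns as soon as the partially downloaded file reaches
# chunk_bytes. Because the complete file is header + samples, which
# exceeds chunk_bytes, this can return while the tail of the file
# (and some offsets) is still missing.
import os
import time

def wait_for_chunk(path: str, chunk_bytes: int, poll_s: float = 0.01) -> None:
    while not os.path.exists(path) or os.path.getsize(path) < chunk_bytes:
        time.sleep(poll_s)
```

The race window is the interval between the file crossing `chunk_bytes` and the download actually completing; any index into the not-yet-written tail during that window fails.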
This means there are edge cases where the reader is still downloading the file: the file size passes `chunk_bytes` before the download completes, because the complete file is larger than `chunk_bytes`. The reader then thinks the file is ready and indexes into an offset that does not exist yet, leading to downstream errors.

To Reproduce
Since this is non-deterministic and involves large data, I don't have a repro script, but I can outline my scenario: create large chunks (I'm using the default of 64 MB), then index the last data point of each chunk (I have > 100 chunks); you'll most likely hit this issue.
My guess is that with even larger chunks holding a lot of data, as long as the offsets stored in the chunk are sufficiently large (since they are not accounted for in the `chunk_bytes` info) and you index the last element, you'll probably see it.

Expected behavior
This should work. I am happy to make a PR, but I'm unsure which direction to pursue. Several ideas:

- Change `chunk_bytes` to be the actual file size rather than just the size of the data points. This is obviously the easiest, but I'm not sure if this info is used somewhere else.