Writing / Reading Bug involving writer chunk_bytes information #393

Open
dangthatsright opened this issue Oct 10, 2024 · 5 comments · May be fixed by #394 or #395
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@dangthatsright
Contributor

🐛 Bug

When writing chunks, chunk_bytes is calculated via

current_chunk_bytes = sum([item.bytes for item in items])

but the actual file size is larger, since the serialized chunk starts with additional (potentially large) header metadata.
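
For intuition, here is a minimal sketch (hypothetical numbers; the header layout is taken from the writer snippet quoted later in this thread) of how the on-disk chunk exceeds the recorded chunk_bytes:

    # Hypothetical example: the header is one uint32 item count plus
    # (num_items + 1) uint32 offsets, so the file is larger than the
    # payload that chunk_bytes accounts for.
    num_items = 100_000
    payload_bytes = 64 * 1024 * 1024                 # sum of item.bytes
    header_bytes = 4 + 4 * (num_items + 1)           # ~400 KB of offsets
    actual_file_bytes = header_bytes + payload_bytes
    assert actual_file_bytes > payload_bytes         # reader only waits for payload_bytes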

When reading chunks, there is a separate thread that downloads the chunks from cloud storage, and a while loop that spins until the local file size is larger than chunk_bytes (see the reader's download logic).

This means there are edge cases where the reader is still downloading the file, yet the partially downloaded file already exceeds chunk_bytes, because the complete file is larger than that value. The reader then thinks the file is ready and indexes into an offset that doesn't exist yet, leading to downstream errors.
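
A simplified sketch of the failure mode (not the library's actual reader code; wait_for_chunk and the polling interval are made up for illustration):

    import os
    import time

    def wait_for_chunk(path: str, chunk_bytes: int, timeout: float = 60.0) -> None:
        # Returns as soon as the partially downloaded file reaches chunk_bytes.
        # Because the real file is chunk_bytes + header bytes long, the tail of
        # the file (the last samples) may still be missing at that point.
        start = time.time()
        while not os.path.exists(path) or os.path.getsize(path) < chunk_bytes:
            if time.time() - start > timeout:
                raise TimeoutError(f"{path} never reached {chunk_bytes} bytes")
            time.sleep(0.01)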

To Reproduce

Since this is non-deterministic and involves large data, I don't have repro code, but I can outline my scenario: create large chunks (I'm using the default of 64 MB), then index the last data point of each chunk (I have > 100 chunks); you'll most likely hit this issue.

My guess is that even with larger chunks holding a lot of data, as long as the offset table stored in the chunk is sufficiently large (since it isn't accounted for in chunk_bytes) and you index the last element, you'll probably see it.

Expected behavior

This should work. I am happy to make a PR but am unsure which direction to pursue. Several ideas:

  1. In the writer logic, set chunk_bytes to the actual file size rather than just the sum of the data point sizes (a minimal sketch follows this list). This is obviously the easiest, but I'm not sure whether this value is used somewhere else.
  2. Rewrite the reader logic such that it waits until the file size stops changing. A bit nastier.
  3. Use the FileLocks you already have for downloading and wait for them to be released. I haven't used FileLocks before, so I can't comment further on this.
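
A minimal sketch of option 1, assuming the writer snippet quoted further down in this thread (where data is the full serialized buffer, header plus samples); the surrounding writer code is omitted:

    # Option 1 (sketch): record the size of the serialized chunk instead of
    # only the sample payload, so the reader's size check matches the file.
    current_chunk_bytes = len(data)   # instead of sum([item.bytes for item in items])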
@dangthatsright dangthatsright added bug Something isn't working help wanted Extra attention is needed labels Oct 10, 2024

Hi! Thanks for your contribution, great first issue!

@tchaton
Collaborator

tchaton commented Oct 10, 2024

Hey @dangthatsright,

Nice, option 1 sounds good. Feel free to make a PR to fix it.

@jmoller93

I think this is what was happening in my issue from before! Good fix #388

@tchaton
Collaborator

tchaton commented Oct 10, 2024

A simple fix would be to pad chunk_bytes with the chunk header size in the reader, the same way the header is built in the writer. That keeps it backward compatible.

        # Excerpt from the writer (np is numpy): the chunk file is the uint32
        # item count, then the uint32 offset table, then the sample payloads.
        num_items = np.uint32(len(items))
        sizes = list(map(len, items))
        offsets = np.array([0] + sizes).cumsum().astype(np.uint32)
        offsets += len(num_items.tobytes()) + len(offsets.tobytes())  # shift offsets past the header
        sample_data = b"".join([item.data for item in items])
        data = num_items.tobytes() + offsets.tobytes() + sample_data
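
A hedged sketch of that reader-side padding (expected_file_bytes is a hypothetical helper; the header arithmetic mirrors the writer snippet above):

    import numpy as np

    def expected_file_bytes(chunk_bytes: int, num_items: int) -> int:
        # The writer prepends one uint32 item count plus (num_items + 1)
        # uint32 offsets, so pad the stored chunk_bytes by that header size
        # before comparing against the downloaded file's size.
        uint32_size = np.uint32(0).nbytes  # 4 bytes
        return chunk_bytes + uint32_size * (1 + num_items + 1)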

@dangthatsright
Copy link
Contributor Author

that's a great idea, thank you!

@dangthatsright dangthatsright linked a pull request Oct 10, 2024 that will close this issue
@tchaton tchaton linked a pull request Oct 11, 2024 that will close this issue