The role of dirty bit #36

Open
COST-97 opened this issue Dec 5, 2024 · 3 comments

Comments

@COST-97

COST-97 commented Dec 5, 2024

Hello:
I read through the data-loading code in _safe_load, and the dirty bit appears to be used only to distinguish samples that have not yet been selected. Once a sample is selected for training, it is marked dirty by this code:

dirty_bit = read_dirty_bit(read_chunk_dir)
dirty_bit[read_chunk_item_index] = 1
save_dirty_bit(read_chunk_dir, dirty_bit)

So, if I understand correctly, as training proceeds, more and more samples are marked dirty and fewer and fewer remain available for training. This differs from the usual practice of sampling uniformly from the dataset. Why is this data-reading scheme used?

Could you give me some advice? Thank you very much!

@csuastt
Collaborator

csuastt commented Dec 10, 2024

We use a buffer to temporarily store samples. The dirty bit is used to tell which samples have been loaded. When the consumer reads samples from the buffer, it skips dirty samples so that the same sample is not loaded repeatedly. The producer is responsible for replacing those dirty samples with new samples.
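
The producer/consumer scheme described in this reply can be illustrated with a small in-memory sketch. This is not the repository's implementation: the snippet quoted in this thread reads and saves dirty bits per chunk directory, whereas the sketch below keeps everything in Python lists. The class name `DirtyBitBuffer` and its methods `produce` and `consume` are hypothetical, chosen only for illustration.

```python
import random


class DirtyBitBuffer:
    """Toy fixed-size sample buffer (hypothetical, for illustration only)."""

    def __init__(self, size):
        self.items = [None] * size
        # 1 = slot is empty or already consumed, 0 = slot holds a fresh sample.
        self.dirty = [1] * size

    def produce(self, new_samples):
        """Producer: overwrite dirty slots with new samples and clear their bits."""
        written = 0
        source = iter(new_samples)
        for i, bit in enumerate(self.dirty):
            if bit == 0:
                continue  # slot still holds an unread sample; leave it alone
            try:
                sample = next(source)
            except StopIteration:
                break  # no more new samples to write
            self.items[i] = sample
            self.dirty[i] = 0
            written += 1
        return written

    def consume(self, rng=random):
        """Consumer: pick a random clean slot, mark it dirty, return its sample."""
        clean = [i for i, bit in enumerate(self.dirty) if bit == 0]
        if not clean:
            return None  # nothing fresh; the producer has to refill first
        i = rng.choice(clean)
        self.dirty[i] = 1
        return self.items[i]


buf = DirtyBitBuffer(size=4)
buf.produce(["a", "b", "c", "d"])
first = buf.consume()   # one slot becomes dirty
buf.produce(["e"])      # only that dirty slot is overwritten
```

Under these assumptions, the set of clean samples shrinks only until the producer runs again, so marking a sample dirty frees its slot for replacement rather than permanently removing data from training.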

@COST-97
Author

COST-97 commented Dec 12, 2024

Hello:
Thank you for your reply.
I understand that "The dirty bit is used to tell which samples have been loaded." However, during training, data in the dataset (or buffer) is usually sampled uniformly in batches, so some samples may well be drawn more than once. Why does your method avoid sampling the same data repeatedly?

Moreover, I only see the read data marked as dirty in _safe_load:

dirty_bit = read_dirty_bit(read_chunk_dir)
dirty_bit[read_chunk_item_index] = 1
save_dirty_bit(read_chunk_dir, dirty_bit)

I could not find the implementation of "replacing those dirty samples with new samples".
I may have missed some details, which would explain my confusion. I look forward to your feedback.
Thank you very much!

@csuastt
Collaborator

csuastt commented Dec 12, 2024
