📝 Update Docs: Merge multiple optimized datasets into one #385

Merged
merged 2 commits into Lightning-AI:main from docs/merge-datasets
Oct 2, 2024

@bhimrazy bhimrazy (Collaborator) commented Oct 2, 2024

What does this PR do?

  • Update the docs to show how to merge multiple optimized datasets into one

Usage

import numpy as np
from PIL import Image

from litdata import StreamingDataset, merge_datasets, optimize


def random_images(index):
    return {
        "index": index,
        "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)),
        "class": np.random.randint(10),
    }


if __name__ == "__main__":
    # Optimize four separate datasets of 250 samples each.
    out_dirs = ["fast_data_1", "fast_data_2", "fast_data_3", "fast_data_4"]  # or ["s3://my-bucket/fast_data_1", etc.]
    for out_dir in out_dirs:
        optimize(fn=random_images, inputs=list(range(250)), output_dir=out_dir, num_workers=4, chunk_bytes="64MB")

    # Merge the optimized datasets into a single output directory.
    merged_out_dir = "merged_fast_data"  # or "s3://my-bucket/merged_fast_data"
    merge_datasets(input_dirs=out_dirs, output_dir=merged_out_dir)

    # Stream the merged dataset; it holds 4 x 250 = 1000 samples.
    dataset = StreamingDataset(merged_out_dir)
    print(len(dataset))
    # out: 1000
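
A quick sanity check on a merge is that the merged length equals the sum of the input lengths. The following is a minimal sketch of that check, assuming only that each dataset supports `len()`. The helper name `check_merged_length` is hypothetical and is not part of litdata; plain `range` objects stand in for real datasets so the snippet runs without any optimized data on disk.

```python
from typing import Iterable, Sized


def check_merged_length(inputs: Iterable[Sized], merged: Sized) -> int:
    """Return the merged length, raising if it differs from the summed input lengths.

    Hypothetical helper for illustration; not a litdata API.
    """
    expected = sum(len(ds) for ds in inputs)
    actual = len(merged)
    if actual != expected:
        raise ValueError(f"merged dataset has {actual} samples, expected {expected}")
    return actual


if __name__ == "__main__":
    # Stand-ins for four optimized datasets of 250 samples and their merge.
    parts = [range(250) for _ in range(4)]
    merged = range(1000)
    print(check_merged_length(parts, merged))
```

With litdata, the same helper could presumably be called with `[StreamingDataset(d) for d in out_dirs]` and `StreamingDataset(merged_out_dir)` from the usage example.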

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@bhimrazy bhimrazy added the documentation Improvements or additions to documentation label Oct 2, 2024
@bhimrazy bhimrazy self-assigned this Oct 2, 2024
@bhimrazy bhimrazy requested a review from tchaton as a code owner October 2, 2024 04:43

codecov bot commented Oct 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78%. Comparing base (e253c6c) to head (aebe65a).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #385   +/-   ##
===================================
  Coverage    78%    78%           
===================================
  Files        34     34           
  Lines      5022   5022           
===================================
  Hits       3931   3931           
  Misses     1091   1091           
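
The 78% project coverage can be reproduced from the Hits and Lines rows of the table. This is a minimal sketch of that arithmetic, assuming coverage is simply hits divided by total lines; the function name `coverage_percent` is illustrative and is not part of any Codecov tooling.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Return line coverage as a percentage (hits / lines * 100)."""
    if lines <= 0:
        raise ValueError("lines must be positive")
    return hits / lines * 100


if __name__ == "__main__":
    hits, misses, lines = 3931, 1091, 5022
    # Hits and misses together account for every tracked line.
    assert hits + misses == lines
    print(f"{coverage_percent(hits, lines):.0f}%")  # prints "78%"
```

Because this PR only changes documentation, hits, misses, and lines are identical on both sides of the diff, which is why the coverage column shows no change.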

@tchaton tchaton (Collaborator) left a comment

Love it! I left it undone to see if you would do it, @bhimrazy ;) Tests passed!

@tchaton tchaton merged commit d3b11ae into Lightning-AI:main Oct 2, 2024
29 checks passed
@bhimrazy bhimrazy deleted the docs/merge-datasets branch October 2, 2024 07:22
2 participants