Fail to download the large numbers of data #32

Open
wonkyoc opened this issue Jul 3, 2024 · 0 comments
wonkyoc commented Jul 3, 2024

Dataset
https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated/tree/main/data

Problem
The dataset above has 1659 files, but the downloader fetches only 1000 of them. This is not a bug in the downloader itself but a consequence of how the file list is requested: a single HTTP GET request returns only the first 1000 entries.

curl https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated/tree/main/data -o a.out
cat a.out
...
  "2b6a58077011c0cdaf57675ab5d3f3cc64f1b36b","size":285632877,"lfs":{"oid":"d2115061684c0cd7b286c04f6d1a644490bbe8a91d7822480b9f8edbfd659c7e","size":285632877,"pointerSize":134},"path":"data/train-00999-of-01650-c966fff517a32923.parquet"}]
...

I have been looking into this but have not found a clear solution yet. Can you provide any information on this?
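
One direction that might be worth trying, as a minimal sketch rather than a confirmed fix: follow pagination until no further page is advertised. This assumes the Hub exposes a JSON tree endpoint under `/api/datasets/...` and signals further pages through a `Link` response header with `rel="next"`; neither is verified in this issue. The names `TREE_URL`, `next_link`, `http_fetch`, and `iter_entries` are hypothetical and not part of the downloader.

```python
import json
import re
import urllib.request
from typing import Callable, Iterator, Optional, Tuple

# Assumption: JSON listing endpoint for the dataset folder (not taken from the issue).
TREE_URL = (
    "https://huggingface.co/api/datasets/"
    "EleutherAI/the_pile_deduplicated/tree/main/data"
)

# One page of results: (list of file entries, raw Link header or None).
Page = Tuple[list, Optional[str]]


def next_link(link_header: Optional[str]) -> Optional[str]:
    """Return the URL marked rel="next" in a Link header, or None.

    Handles the common form `<url>; rel="next"` where rel is the
    first parameter after the URL.
    """
    if not link_header:
        return None
    match = re.search(r'<([^>]+)>\s*;\s*rel="?next"?', link_header)
    return match.group(1) if match else None


def http_fetch(url: str) -> Page:
    """Fetch one page and return its parsed JSON body and Link header."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp), resp.headers.get("Link")


def iter_entries(url: Optional[str],
                 fetch: Callable[[str], Page] = http_fetch) -> Iterator[dict]:
    """Yield entries from every page, following rel="next" links."""
    while url:
        entries, link = fetch(url)
        yield from entries
        url = next_link(link)


# Example (performs network requests):
# paths = [e["path"] for e in iter_entries(TREE_URL)]
```

If the assumptions hold, collecting `iter_entries(TREE_URL)` should list every file instead of stopping at 1000. The `fetch` parameter is injectable so the loop can be exercised without network access.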
