Executing V2 issues #80
Comments
I can verify that len(inputs_by_snapsh) is 0. Edit: I seem to have either the listings or the S3 bucket info wrong. Is it possible to get an example of a listings.txt? Also, I have set S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com" and DOCKER_S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com". Does this look right?
Hi @hicotton02, thanks for your question!
Once the environment variables are specified, you can get the listings via

    s5cmd --profile "$S3_PROFILE" --endpoint-url "$S3_ENDPOINT_URL" \
        ls "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}/*" | grep "\.json\.gz$" | awk '{print $NF}' >"${LISTINGS_FILE}"

which should produce a file with contents of the form:
Let me know if this helps! :)
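For readers who want to see what the `grep` and `awk` stages of the pipeline above do, here is a minimal Python sketch of the same filtering. The function name `extract_listings` and the sample lines are made up for illustration; the sample only assumes that each `ls` line ends with the object path as its last whitespace-separated field. It is not the project's own tooling and does not show the real contents of a listings file.

```python
def extract_listings(ls_output, suffix=".json.gz"):
    """Keep the last field of every line whose last field ends with `suffix`."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # awk's $NF is the last whitespace-separated field; the grep stage
        # keeps only lines that end in .json.gz, approximated here by
        # checking that last field.
        if fields and fields[-1].endswith(suffix):
            paths.append(fields[-1])
    return paths


# Hypothetical `ls` output, for illustration only.
sample = (
    "2023/10/01 12:00:00   1024 snapshot/0000/en_head.json.gz\n"
    "2023/10/01 12:00:00     10 snapshot/0000/readme.txt\n"
)
print("\n".join(extract_listings(sample)))
```

If this returns an empty list for your real `ls` output, the bucket or prefix variables are the first thing to check, since an empty listings file would leave the pipeline with no inputs.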
Thank you so much for the response. Is the s5cmd command supposed to point to my own S3 bucket or someone else's? I created a bucket, but it is empty at this time. I remember that in V1 we were downloading data from, I think, arXiv's bucket. Edit: As part of this workstream, do we download the ccnet data separately (I see their repo has been archived)?
There is no data that needs to be pulled from an external S3 bucket, only from your own, where you have the ccnet output stored. It is also only required to create the artifacts. Are you creating your own artifacts for a custom dataset, or are you trying to reproduce the quality signals we have provided? You can download the ccnet output from the public URLs (https://data.together.xyz/redpajama-data-v2/v1.0.0/) and then upload it to your own S3 bucket. Also check out the Hugging Face repo, which contains instructions on how to download the data.
I am going to end up doing both. Right now I am just learning how you guys did all this. Once I have that done and have some understanding of what is going on, I want to add and remove data to see how that affects everything. I am in my master's for AI/ML and am using this to learn in addition to what I am learning in school.
I wrote a basic Python script to download all the ccnet data based on your links above. It downloads in parallel and saturated both my connection and my server, which is about as efficient as this is going to get.
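The script itself is not included in the thread. As a rough illustration of the approach described (parallel downloads of a list of URLs), here is a minimal sketch using only the standard library. The function names, the `fetch` parameter, and the worker count are assumptions made for this example, and the list of URLs has to be supplied by the caller; nothing here reflects the actual layout of the public download directory.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlparse


def download_one(url, dest_dir, fetch=urllib.request.urlretrieve):
    """Download one URL under dest_dir, mirroring the URL path on disk."""
    target = Path(dest_dir) / urlparse(url).path.lstrip("/")
    if target.exists():
        # Already downloaded on a previous run; skip it.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file.
    tmp = target.with_name(target.name + ".part")
    fetch(url, str(tmp))
    tmp.replace(target)
    return target


def download_all(urls, dest_dir, max_workers=8, fetch=urllib.request.urlretrieve):
    """Download all URLs with a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: download_one(u, dest_dir, fetch), urls))
```

A call such as `download_all(urls, "ccnet_data", max_workers=16)` would then fetch every entry of `urls` into a local `ccnet_data` directory. Threads are a reasonable fit here because the work is network-bound; the `fetch` parameter exists only so the function can be exercised without network access.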
Since the new version came out, I have been trying to get things working. Here are a couple of issues that I ran into and resolved:
I needed s5cmd, so I had to install conda and then s5cmd.
I installed rootless Docker, but networking is unavailable in that mode, so for now I am running Docker as root.
default.conf is missing lines for the AWS secret and ID. I added them without any problem.
when running the
I modified to run in my environment (Ubuntu 22.04 WSL2):
I get the following error:
Is my listings parameter correct, or is there some other issue?