Executing V2 issues #80

Open
hicotton02 opened this issue Nov 1, 2023 · 6 comments

hicotton02 commented Nov 1, 2023

Since the new version came out, I have been trying to get things working. Here are a couple of issues that I ran into and resolved:

- Needed s5cmd, so I had to install conda and then s5cmd (a rough sketch of this setup is below).
- Installed rootless Docker, but networking is unavailable with it, so for now I am running Docker as root.
- default.conf is missing lines for the AWS access key ID and secret; I added them with no problem.
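
For reference, a minimal sketch of that setup, assuming the s5cmd package from conda-forge (the environment name is a placeholder):

# install s5cmd into a fresh conda environment
conda create -n rpv2 -c conda-forge s5cmd -y
conda activate rpv2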

When running the following command:

bash scripts/run_prep_artifacts.sh \
  --config configs/rp_v2.0.conf \
  --listings /path/to/listings/file.txt \
  --max_workers 32

which I modified to run in my environment (Ubuntu 22.04 WSL2):

sudo bash scripts/run_prep_artifacts.sh \
  --config configs/default.conf \
  --listings ../data/listings/listing.txt \
  --max_workers 32
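
(One note on this invocation: sudo strips most environment variables by default, so AWS_* credentials exported in the user shell will not reach the script, or the docker run it issues, unless they are preserved. A hedged example of preserving them:)

# preserve the caller's environment (including AWS_* variables) across sudo
sudo -E bash scripts/run_prep_artifacts.sh \
  --config configs/default.conf \
  --listings ../data/listings/listing.txt \
  --max_workers 32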

I get the following error:

Created run id: 7f15c068
Writing run id to file /nfs/slow/data/artifacts-7f15c068/_RUN_ID
copied listings file from ../data/listings/listing.txt to /nfs/slow/data/artifacts-7f15c068/listings/listings.txt
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-15
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-23
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-35
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-41
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-42
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-49
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-52
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-14
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-27
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-32
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-35
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-48
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-07
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-18
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-36
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-44
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-50
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-04
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-09
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-17
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-34
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-47
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-51
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-05
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-09
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-13
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-17
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-34
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-47
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-51
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-04
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-09
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-13
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-18
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-35
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-47
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-51
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-05
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-10
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-16
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-24
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-29
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-34
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-45
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-50
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-04
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-10
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-17
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-21
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-25
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-31
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-49
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-05
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-21
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-27
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-33
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-49
__SNAPSHOT_LISTINGS_SUCCESS__ 2023-06
__SNAPSHOT_LISTINGS_SUCCESS__ 2023-14
Toal number of listings: 83
__LANG_PREP_START__ en @ Wed Nov  1 12:50:35 MDT 2023
[sudo] password for theskaz:
[2023-11-01 18:50:41,592]::(PID 1)::INFO::Start preparing artifacts for en
[2023-11-01 18:50:41,592]::(PID 1)::INFO::num_samples: 500000
[2023-11-01 18:50:41,592]::(PID 1)::INFO::PYTHONHASHSEED: 42
[2023-11-01 18:50:41,596]::(PID 1)::INFO::CCNetDownloader(en) Start loading input listings...
[2023-11-01 18:50:41,597]::(PID 1)::INFO::CCNetDownloader(en) Partitioning inputs by snapshot...
Traceback (most recent call last):
  File "/usr/app/src/prep_artifacts.py", line 186, in <module>
    main(artifacts_dir=args.artifacts_dir,
  File "/usr/app/src/prep_artifacts.py", line 114, in main
    ccnet.run(logger=logger)
  File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 95, in run
    1, self._num_samples // len(inputs_by_snapsh)
       ~~~~~~~~~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~
ZeroDivisionError: integer division or modulo by zero
Error: scripts/run_prep_artifacts.sh:7: command `sudo docker run --env AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" --env AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" -v "${DATA_ROOT%/}":"${DOCKER_MNT_DIR%/}" -t "${DOCKER_REPO}" python3 src/prep_artifacts.py --artifacts_dir "${ARTIFACTS_DIR%/}" --cc_input "${ARTIFACTS_DIR%/}/listings/listings.txt" --cc_input_base_uri "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}" --cache_dir "${DOCKER_MNT_DIR%/}/.hf_cache" --lang "${lang}" --max_workers "${MAX_WORKERS}" --endpoint_url "$DOCKER_S3_ENDPOINT_URL" --dsir_num_samples "${DSIR_NUM_SAMPLES}" --dsir_feature_dim "${DSIR_FEATURE_DIM}" --classifiers_num_samples "${CLASSIFIERS_NUM_SAMPLES}" --max_paragraphs_per_book_sample "${MAX_PARAGRAPHS_PER_BOOK_SAMPLE}" --max_samples_per_book "${MAX_SAMPLES_PER_BOOK}"` failed with exit code 1

Is my listings parameter correct, or is there some other issue?


hicotton02 commented Nov 1, 2023

I can verify that len(inputs_by_snapsh) is 0.

Edit: I seem to not have the listings correct, or the S3 bucket info correct. Is it possible to get an example of a listings.txt and of these settings:

S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com"
S3_BUCKET="red-pajama"
S3_CCNET_PREFIX="/rs_cc_net"
S3_PROFILE="default"

DOCKER_S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com"
DOCKER_MNT_DIR="/mnt/data"
DOCKER_REPO="theskaz/red-pajama"

Does this look right?

mauriceweber (Collaborator) commented:

Hi @hicotton02, thanks for your question!

> default.conf is missing lines for the AWS Secret and ID

We deliberately left these out so that users specify them via export AWS_SECRET... (this reduces the risk of uploading access keys to GitHub).
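
For example (placeholder values):

export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."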

Once the environment variables are specified, you can get the listings via

s5cmd --profile "$S3_PROFILE" --endpoint-url "$S3_ENDPOINT_URL" \
    ls "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}/*" | grep "\.json\.gz$" | awk '{print $NF}' >"${LISTINGS_FILE}"

which should produce a file with contents of the form:

2014-15/0000/en_head.json.gz
2014-15/0000/en_middle.json.gz
2014-15/0001/en_head.json.gz
2014-15/0001/en_middle.json.gz
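
A quick sanity check on the resulting file before re-running the prep script (the script divides num_samples by the number of snapshots found in the listings, so an empty file is what triggers the ZeroDivisionError above):

wc -l "${LISTINGS_FILE}"
head -n 3 "${LISTINGS_FILE}"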

Let me know if this helps! :)


hicotton02 commented Nov 2, 2023

Thank you so much for the response.

Is the s5cmd command supposed to point to my own S3 bucket or someone else's?

I created a bucket, but it is blank at this time. I remember that in V1 we were downloading data from, I think, Arxiv's bucket.

Edit: As part of this workstream, do we download the ccnet data separately (I see their repo was archived)?

mauriceweber (Collaborator) commented:

There is no data that needs to be pulled from an external S3 bucket, only your own bucket where you have the ccnet output stored -- and it is only required to create the artifacts. Are you creating your own artifacts for a custom dataset, or are you trying to reproduce the quality signals we have provided?

You can download the ccnet output from the public URLs (https://data.together.xyz/redpajama-data-v2/v1.0.0/) and then upload it to your own S3 bucket. Also check out the Hugging Face repo, which contains instructions on how to download the data.
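
As a rough sketch of that flow (the bucket name, prefix, and shard path below are placeholders; adjust them to your own layout):

# one shard, following the <snapshot>/<shard>/<lang>_<part> pattern from the listings
BASE_URL="https://data.together.xyz/redpajama-data-v2/v1.0.0"
SHARD="2014-15/0000/en_head"

# download the ccnet output for this shard from the public URLs...
mkdir -p "documents/$(dirname "${SHARD}")"
wget "${BASE_URL}/documents/${SHARD}.json.gz" -O "documents/${SHARD}.json.gz"

# ...then copy it into your own bucket under whatever prefix S3_CCNET_PREFIX points at
s5cmd cp "documents/${SHARD}.json.gz" "s3://your-bucket/rs_cc_net/${SHARD}.json.gz"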

hicotton02 (Author) commented:

> There is no data that needs to be pulled from an external S3 bucket, only your own bucket where you have the ccnet output stored -- and it is only required to create the artifacts. Are you creating your own artifacts for a custom dataset, or are you trying to reproduce the quality signals we have provided?
>
> You can download the ccnet output from the public URLs (https://data.together.xyz/redpajama-data-v2/v1.0.0/) and then upload it to your own S3 bucket. Also check out the Hugging Face repo, which contains instructions on how to download the data.

I am going to end up doing both. Right now I am just learning how you did all this. Once I have that done and have some understanding of what is going on, I want to add and remove data to see how that affects everything. I am in my master's program for AI/ML and am using this to learn in addition to what I am learning in school.

hicotton02 (Author) commented:

I wrote a Python script to download all the ccnet data based on your links above. It is basic but downloads in parallel, and it saturates my connection and server, so the process runs about as efficiently as it can.

import os
import subprocess
import multiprocessing as mp

CC_SNAPSHOT_IDS = [
  "2014-15",
  "2014-23",
  "2014-35",
  "2014-41",
  "2014-42",
  "2014-49",
  "2014-52",
  "2015-14",
  "2015-22",
  "2015-27",
  "2015-32",
  "2015-35",
  "2015-40",
  "2015-48",
  "2016-07",
  "2016-18",
  "2016-22",
  "2016-26",
  "2016-30",
  "2016-36",
  "2016-40",
  "2016-44",
  "2016-50",
  "2017-04",
  "2017-09",
  "2017-17",
  "2017-22",
  "2017-26",
  "2017-30",
  "2017-34",
  "2017-39",
  "2017-43",
  "2017-47",
  "2017-51",
  "2018-05",
  "2018-09",
  "2018-13",
  "2018-17",
  "2018-22",
  "2018-26",
  "2018-30",
  "2018-34",
  "2018-39",
  "2018-43",
  "2018-47",
  "2018-51",
  "2019-04",
  "2019-09",
  "2019-13",
  "2019-18",
  "2019-22",
  "2019-26",
  "2019-30",
  "2019-35",
  "2019-39",
  "2019-43",
  "2019-47",
  "2019-51",
  "2020-05",
  "2020-10",
  "2020-16",
  "2020-24",
  "2020-29",
  "2020-34",
  "2020-40",
  "2020-45",
  "2020-50",
  "2021-04",
  "2021-10",
  "2021-17",
  "2021-21",
  "2021-25",
  "2021-31",
  "2021-39",
  "2021-43",
  "2021-49",
  "2022-05",
  "2022-21",
  "2022-27",
  "2022-33",
  "2022-40",
  "2022-49",
  "2023-06",
  "2023-14"
]


def download_snapshot(snapshot_id, semaphore):
    # the semaphore caps how many snapshots are downloaded concurrently
    with semaphore:
        LANG = "en"
        BASE_URL = "https://data.together.xyz/redpajama-data-v2/v1.0.0"
        PARTITION = "head_middle"
        listings_tag = f"{LANG}-{snapshot_id}-{PARTITION}"

        # fetch the per-snapshot listings file first
        os.makedirs("listings", exist_ok=True)
        subprocess.run(["wget", f"{BASE_URL}/listings/{listings_tag}.txt", "-O", f"listings/{listings_tag}.txt"])

        with open(f"listings/{listings_tag}.txt", "r") as listings_file:
            # download the documents and quality signals for every listed shard
            for line in listings_file:
                line = line.strip()
                url = f"{BASE_URL}/documents/{line}.json.gz"
                dest = f"documents/{line}.json.gz"
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                subprocess.run(["wget", url, "-O", dest])

                url = f"{BASE_URL}/quality_signals/{line}.signals.json.gz"
                dest = f"quality_signals/{line}.signals.json.gz"
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                subprocess.run(["wget", url, "-O", dest])

            # re-read the listings for the minhash and duplicates components
            COMPS = ["minhash", "duplicates"]
            for comp in COMPS:
                listings_file.seek(0)
                for line in listings_file:
                    line = line.strip()
                    url = f"{BASE_URL}/{comp}/{line}.{comp}.parquet"
                    dest = f"{comp}/{line}.{comp}.parquet"
                    os.makedirs(os.path.dirname(dest), exist_ok=True)
                    subprocess.run(["wget", url, "-O", dest])


if __name__ == "__main__":
    os.chdir("/nfs/slow/data/ccnet")
    # one process per snapshot, gated by a semaphore sized to the CPU count
    semaphore = mp.Semaphore(mp.cpu_count())
    processes = [mp.Process(target=download_snapshot, args=(i, semaphore)) for i in CC_SNAPSHOT_IDS]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    print("All downloads completed", flush=True)
