CM fails to build DLRMv2 99 #92

WarrenSchultz · 2024-07-02T02:22:41Z

I tried both the command that runs it via a Docker container and running it inside the ResNet50 container.

The end of the log follows:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
Collecting pyre_extensions
  Downloading pyre_extensions-0.0.30-py3-none-any.whl (12 kB)
Requirement already satisfied: typing-extensions in /home/cmuser/.local/lib/python3.8/site-packages (from pyre_extensions) (4.12.2)
Requirement already satisfied: typing-inspect in /home/cmuser/.local/lib/python3.8/site-packages (from pyre_extensions) (0.9.0)
Requirement already satisfied: mypy-extensions>=0.3.0 in /home/cmuser/.local/lib/python3.8/site-packages (from typing-inspect->pyre_extensions) (1.0.0)
Installing collected packages: pyre-extensions
Successfully installed pyre-extensions-0.0.30
             ! cd /home/cmuser/CM/repos/local/cache/60d83fede2d04cfd
             ! call /home/cmuser/CM/repos/gateoverflow@cm4mlops/script/get-generic-python-lib/run.sh from tmp-run.sh
             ! call "postprocess" from /home/cmuser/CM/repos/gateoverflow@cm4mlops/script/get-generic-python-lib/customize.py
            Detected version: 0.0.30
Traceback (most recent call last):
  File "/home/cmuser/.local/bin/cm", line 8, in <module>
    sys.exit(run())
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/cli.py", line 37, in run
    r = cm.access(argv, out='con')
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1490, in _run
    r = customize_code.preprocess(ii)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/script/run-mlperf-inference-app/customize.py", line 219, in preprocess
    r = cm.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 758, in access
    return cm.access(i)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1553, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta,  env, state, const, const_state, add_deps_recursive,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 2909, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 3080, in _run_deps
    r = self.cmind.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1380, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 2909, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 3080, in _run_deps
    r = self.cmind.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1490, in _run
    r = customize_code.preprocess(ii)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/script/get-preprocessed-dataset-criteo/customize.py", line 23, in preprocess
    output_dir = env['CM_DATASET_PREPROCESSED_PATH']
KeyError: 'CM_DATASET_PREPROCESSED_PATH'

arjunsuresh · 2024-07-02T14:58:37Z

The DLRM Docker container needs the Criteo dataset to be preprocessed outside of it. We need to add this option to the documentation page, but if you have the preprocessed data we can tell you how to use it.

@anandhu-eng we can sync on how to add this option in the documentation page.
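
Editor's note: the traceback above ends in a plain dictionary lookup, so a missing preprocessed dataset surfaces as a bare KeyError instead of a message saying what to provide. Below is a minimal sketch of a more defensive check. It assumes the return-dictionary convention of CM scripts ('return' equal to 0 on success, a positive 'return' plus an 'error' string on failure). It is not the actual code of get-preprocessed-dataset-criteo, and the 'output_dir' key in the result is purely illustrative.

```python
def preprocess(i):
    """Sketch of a CM-style preprocess hook that validates its input first."""
    env = i['env']

    # Look the key up defensively instead of indexing env[...] directly,
    # so that a missing path yields a readable error rather than a KeyError.
    output_dir = env.get('CM_DATASET_PREPROCESSED_PATH', '')
    if not output_dir:
        return {
            'return': 1,
            'error': 'CM_DATASET_PREPROCESSED_PATH is not set; '
                     'provide the path to the preprocessed Criteo data',
        }

    # 'output_dir' is a hypothetical key, returned here only for illustration.
    return {'return': 0, 'output_dir': output_dir}
```

With an empty environment this returns an error dictionary that names the missing variable, which is easier to act on than the traceback.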

WarrenSchultz · 2024-07-02T15:02:10Z

Huh, ok. I thought I saw it pulling down the full dataset, but I may have been mistaken. I'm working on a lot in parallel at the moment. :)
What's the correct command to do so at this point through CM?

arjunsuresh · 2024-07-02T15:07:03Z

Currently we only support plugging in the preprocessed data, as the download of Criteo stopped working without manual intervention. I believe we can share the preprocessed data with you. Doing the preprocessing yourself is heavy: it needs 6.4 TB of disk space, 600 GB+ of memory, and around 3 days of running. The preprocessed data is less than 300 GB. We can share it by the end of this week; we first need to test it for the expected accuracy.

WarrenSchultz · 2024-07-02T15:09:44Z

Great, thank you.

arjunsuresh · 2024-07-18T08:56:26Z

Hi @WarrenSchultz, MLCommons has just made the preprocessed dataset for DLRMv2 available. It's about a 150 GB download. We no longer need TBs of disk space and days of waiting.
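
Editor's note: since the download is still large, it can help to confirm there is enough free disk space before starting. The sketch below uses only the Python standard library. The 150 GB figure is the approximate size quoted above, and the path "." is a placeholder for wherever the CM cache actually lives, which this thread does not state.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    # "." is a stand-in for the real cache location (an assumption).
    print(has_free_space(".", 150))
```

The check only covers the download itself; any extra space needed for unpacking or caching is not accounted for.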

WarrenSchultz · 2024-07-18T17:36:14Z

That's great news, thanks! I'll give it a try soon.

pdrtrncs23 · 2024-08-15T21:43:15Z

I am trying now to go through the steps in the documentation and have to report two issues related to this thread:

(1) The commands in https://docs.mlcommons.org/inference/benchmarks/recommendation/dlrm-v2/#__tabbed_5_2 have the option "--model=dlrm_v2-99" but this should be "--model=dlrm-v2-99" in order to work

(2) After fixing this, the command proceeds through its steps and reaches the download of the dataset (the preprocessed data that you provided). Once the download of "day_23_sparse_multi_hot.npz" reaches 100%, it does not proceed.

I killed the command, renamed the partially downloaded file to the expected name, and re-ran the command, but now I get a core dump. I am wondering whether the download left something incomplete or whether the core dump is a separate issue. The error output is:

./run_local.sh: line 14: 162746 Illegal instruction     (core dumped) python python/main.py --profile $profile $common_opt --model $model --model-path $model_path --dataset $dataset --dataset-path $DATA_DIR --output $OUTPUT_DIR $EXTRA_OPS $@
./run.sh: line 59: 132: command not found
./run.sh: line 65: 132: command not found

CM error: Portable CM script failed (name = benchmark-program, return code = 32512)
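
Editor's note: the numbers in this output can be decoded with a little arithmetic. The sketch below assumes the return code reported by CM is a raw POSIX wait status (as returned by, for example, os.system), which this thread does not confirm. Under that assumption the high byte is the exit code of the shell and the low bits are the terminating signal.

```python
import signal


def decode_wait_status(status):
    """Split a 16-bit POSIX wait status into (exit_code, signal_number)."""
    exit_code = (status >> 8) & 0xFF   # high byte: exit code of the process
    signal_number = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    return exit_code, signal_number


# 32512 is the return code from the CM error line above.
print(decode_wait_status(32512))       # (127, 0)

# Shells report a process killed by a signal as 128 + the signal number.
print(128 + int(signal.SIGILL))        # 132 on Linux
```

Exit code 127 is the shell's conventional "command not found" status, which matches the two "132: command not found" lines, and 132 is 128 plus SIGILL, which matches "Illegal instruction (core dumped)". One plausible reading is that run.sh ended up trying to execute the previous exit status as a command, so the real failure is the illegal instruction in python/main.py rather than the final return code.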

arjunsuresh · 2024-08-16T14:40:49Z

@pdrtrncs23 Thank you for reporting the issue with the documentation. It should be resolved once you run cm pull repo.

Please run cm rm cache --tags=criteo,preprocessed -f to remove the stale download. We have just added a checksum check for the Criteo preprocessed dataset, and from now on the workflow will automatically report if the downloaded file is corrupt.
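
Editor's note: a checksum check of this kind can also be done by hand on a file that is suspected to be truncated. The sketch below is a generic illustration using the Python standard library, not the check that was added to the CM workflow. The choice of SHA-256 is an assumption, and the expected value must come from whoever publishes the dataset.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so that large files never need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex, algorithm="sha256"):
    """Compare the computed checksum with a published one, ignoring case."""
    return file_checksum(path, algorithm) == expected_hex.strip().lower()
```

A partially downloaded file that was merely renamed to the expected name would fail this comparison, which is the situation described in the previous comment.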

arjunsuresh · 2024-09-13T11:06:03Z

Closing this issue. Feel free to reopen it if the issue persists.

arjunsuresh added a commit that referenced this issue Jul 17, 2024

Merge pull request #92 from anandhu-eng/redhat_llama2

ee81196

VLLM Server Docker Support

arjunsuresh closed this as completed Sep 13, 2024
