CM fails to build DLRMv2 99 #92

Closed
WarrenSchultz opened this issue Jul 2, 2024 · 9 comments

@WarrenSchultz

I tried both the command that runs it via a Docker container and running it within the ResNet50 container.

The end of the log follows:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
Collecting pyre_extensions
  Downloading pyre_extensions-0.0.30-py3-none-any.whl (12 kB)
Requirement already satisfied: typing-extensions in /home/cmuser/.local/lib/python3.8/site-packages (from pyre_extensions) (4.12.2)
Requirement already satisfied: typing-inspect in /home/cmuser/.local/lib/python3.8/site-packages (from pyre_extensions) (0.9.0)
Requirement already satisfied: mypy-extensions>=0.3.0 in /home/cmuser/.local/lib/python3.8/site-packages (from typing-inspect->pyre_extensions) (1.0.0)
Installing collected packages: pyre-extensions
Successfully installed pyre-extensions-0.0.30
             ! cd /home/cmuser/CM/repos/local/cache/60d83fede2d04cfd
             ! call /home/cmuser/CM/repos/gateoverflow@cm4mlops/script/get-generic-python-lib/run.sh from tmp-run.sh
             ! call "postprocess" from /home/cmuser/CM/repos/gateoverflow@cm4mlops/script/get-generic-python-lib/customize.py
            Detected version: 0.0.30
Traceback (most recent call last):
  File "/home/cmuser/.local/bin/cm", line 8, in <module>
    sys.exit(run())
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/cli.py", line 37, in run
    r = cm.access(argv, out='con')
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1490, in _run
    r = customize_code.preprocess(ii)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/script/run-mlperf-inference-app/customize.py", line 219, in preprocess
    r = cm.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 758, in access
    return cm.access(i)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1553, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta,  env, state, const, const_state, add_deps_recursive,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 2909, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 3080, in _run_deps
    r = self.cmind.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1380, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 2909, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 3080, in _run_deps
    r = self.cmind.access(ii)
  File "/home/cmuser/.local/lib/python3.8/site-packages/cmind/core.py", line 602, in access
    r = action_addr(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 211, in run
    r = self._run(i)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/automation/script/module.py", line 1490, in _run
    r = customize_code.preprocess(ii)
  File "/home/cmuser/CM/repos/gateoverflow@cm4mlops/script/get-preprocessed-dataset-criteo/customize.py", line 23, in preprocess
    output_dir = env['CM_DATASET_PREPROCESSED_PATH']
KeyError: 'CM_DATASET_PREPROCESSED_PATH'
@arjunsuresh
Contributor

The DLRM Docker container needs the Criteo dataset to be preprocessed outside of it. We need to add this option to the documentation page, but if you have the preprocessed data, we can tell you how to use it.

@anandhu-eng we can sync on how to add this option to the documentation page.
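
The traceback above ends in a bare KeyError because the preprocess hook reads CM_DATASET_PREPROCESSED_PATH directly from the environment. A minimal sketch of a guard that reports the missing path in a readable way could look like the following. It is not the actual cm4mlops code: the function body is illustrative, the `output_dir` key in the result is hypothetical, and the `return`/`error` dictionary shape is assumed from how CM scripts generally report status.

```python
def preprocess(i):
    """Sketch of a CM-style preprocess hook that validates its input env."""
    env = i.get("env", {})

    # Key name taken from the traceback; use .get() so a missing key
    # produces a readable error instead of a KeyError.
    output_dir = env.get("CM_DATASET_PREPROCESSED_PATH", "")
    if not output_dir:
        return {
            "return": 1,
            "error": "CM_DATASET_PREPROCESSED_PATH is not set; "
                     "point it at an existing preprocessed Criteo dataset",
        }

    # 'output_dir' in the result is a hypothetical key used for illustration.
    return {"return": 0, "output_dir": output_dir}
```

With a check like this, a run without preprocessed data would stop with an explanatory message rather than a stack trace.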

@WarrenSchultz
Author

WarrenSchultz commented Jul 2, 2024

Huh, ok. I thought I saw it pulling down the full dataset, but I may have been mistaken. I'm working on a lot in parallel at the moment. :)
What's the correct command to do so at this point through CM?

@arjunsuresh
Contributor

Currently we only support plugging in the preprocessed data, as the download of Criteo stopped working without manual intervention. I believe we can share the preprocessed data with you. Doing the preprocessing yourself is heavy: it needs 6.4 TB of disk space, 600 GB+ of memory, and around 3 days of running. The preprocessed data is less than 300 GB. We can share it by the end of this week, once we have tested it for the expected accuracy.

@WarrenSchultz
Author

Great, thank you.

arjunsuresh added a commit that referenced this issue Jul 17, 2024
@arjunsuresh
Contributor

Hi @WarrenSchultz, MLCommons has just made the preprocessed dataset for DLRMv2 available. It's about a 150 GB download. We no longer need TBs of disk space and days of waiting.
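
Before starting a download of that size, it may be worth checking that the target filesystem has room for it. The sketch below uses only the Python standard library; the helper name and the "." path are placeholders, and the 150 GB figure is the approximate size quoted above, not an exact requirement.

```python
import shutil

GB = 10**9


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # "." is a placeholder; use the directory where the dataset will be stored.
    print(has_free_space(".", 150 * GB))
```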

@WarrenSchultz
Author

That's great news, thanks! I'll give it a try soon.

@pdrtrncs23

I am now going through the steps in the documentation and want to report two issues related to this thread:

(1) The commands in https://docs.mlcommons.org/inference/benchmarks/recommendation/dlrm-v2/#__tabbed_5_2 have the option "--model=dlrm_v2-99", but this should be "--model=dlrm-v2-99" in order to work.

(2) After fixing this, the command proceeds through its several steps and reaches the download of the dataset (the preprocessed data that you provided). Once the download of "day_23_sparse_multi_hot.npz" reaches 100%, it does not proceed.

I killed the command, renamed the partially downloaded file to the expected name, and re-ran the command, but now I get a core dump. I am wondering whether something was left incomplete by the download or whether the core dump is a separate issue. The error output is:

./run_local.sh: line 14: 162746 Illegal instruction     (core dumped) python python/main.py --profile $profile $common_opt --model $model --model-path $model_path --dataset $dataset --dataset-path $DATA_DIR --output $OUTPUT_DIR $EXTRA_OPS $@
./run.sh: line 59: 132: command not found
./run.sh: line 65: 132: command not found

CM error: Portable CM script failed (name = benchmark-program, return code = 32512)
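
One way to tell whether a renamed partial download is actually complete is to test the archive itself. An .npz file is a zip archive, so the standard `zipfile` module can check it without loading the arrays. The sketch below is only a structural check under that assumption: the helper name is hypothetical, it does not confirm that the contents match the published dataset, and it says nothing about whether the "Illegal instruction" crash has a different cause.

```python
import zipfile


def npz_looks_complete(path):
    """Return True if `path` is a readable zip archive whose members pass their CRC checks."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None if all are fine.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        # A truncated download typically lacks the zip central directory,
        # and a missing file raises an OSError subclass.
        return False
```

For a file of this size the CRC pass reads the whole archive, so it can take a while.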

@arjunsuresh
Contributor

@pdrtrncs23 Thank you for reporting the issue with the documentation. This should be solved if you do cm pull repo.

Please do cm rm cache --tags=criteo,preprocessed -f to remove the stale download. We have just added a checksum check for the Criteo preprocessed dataset, and from now on the workflow will automatically tell you if the downloaded file is corrupt.
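
For readers who want to verify a file by hand, the general idea of a checksum check can be sketched as follows. This is not the CM implementation: the helper names are hypothetical, and the hash algorithm (SHA-256 here) is an assumption, so use whichever algorithm and reference value the dataset publisher provides.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large downloads need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_checksum(path, expected_hex, algorithm="sha256"):
    """Compare the file's digest with a published hex string, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A mismatch means the file should be deleted and downloaded again rather than renamed into place.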

@arjunsuresh
Contributor

Closing this issue. Feel free to reopen if the issue persists.
