This is a collection of known issues that have been encountered on Aurora.

A known issues [page](https://apps.cels.anl.gov/confluence/display/inteldga/Known+Issues) can be found in the CELS Wiki space used for NDA content. Note that this page requires a JLSE Aurora early hw/sw resource account for access.

## Runtime Errors

### 1. `Cassini Event Queue overflow detected.`

`Cassini Event Queue overflow detected.` errors may occur for certain MPI communications and can happen for a variety of reasons: software and hardware, job placement, job routing, and the state of the machine. Simply put, it means that one of the network interfaces is receiving messages faster than it can process them:

```
libfabric:16642:1701636928::cxi:core:cxip_cq_eq_progress():531<warn> x4204c1s3b0n0: Cassini Event Queue overflow detected.
```

One workaround is to increase the completion queue size and adjust its fill threshold, e.g. by setting `FI_CXI_DEFAULT_CQ_SIZE=131072` and `FI_CXI_CQ_FILL_PERCENT=20`.

The value of `FI_CXI_DEFAULT_CQ_SIZE` can be set to something larger if issues persist. This is directly impacted by the number of unexpected messages sent and so may need to be increased as the scale of the job increases.
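
As a minimal sketch of where these settings go (assuming a PBS batch script and the `mpiexec` launcher; the select/walltime values, rank counts, and `./my_app` are placeholders):

```
#!/bin/bash -l
#PBS -l select=2
#PBS -l walltime=00:30:00

# Enlarge the libfabric CXI completion queue before launching the application
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_CQ_FILL_PERCENT=20

mpiexec -n 24 -ppn 12 ./my_app
```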

### 2. `double free detected` while running with the mpich/52.2/* modules

A core dump might indicate a problem in communicator cleanup, e.g. after calling `MPI_Comm_split_type`. A workaround is to unset a few config-file-related variables:
```
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
```
Additional information is available at https://github.com/pmodels/mpich/pull/6730.

### 3. Slower-than-expected GPU-aware MPI

If GPU-aware MPI is slower than expected, try one of the following two sets of environment variables.

- RDMA:

```
export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
export MPIR_CVAR_CH4_OFI_ENABLE_MR_HMEM=0
export MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0
export MPIR_CVAR_CH4_OFI_MAX_NICS=8
export MPIR_CVAR_CH4_OFI_GPU_RDMA_THRESHOLD=0
```

It may also be useful to try other libfabric environment settings, in particular those that Cray MPI sets by default; see [Cray MPI libfabric Settings](https://cpe.ext.hpe.com/docs/24.03/mpt/mpich/intro_mpi.html#libfabric-environment-variables-for-hpe-slingshot-nic-slingshot-11).

- Pipelining:

```
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_THRESHOLD=0
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=4194304
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_NUM_BUFFERS_PER_CHUNK=256
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_MAX_NUM_BUFFERS=256
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_D2H_ENGINE_TYPE=0
```
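
One way to switch between the two sets above is to keep each in its own file and source the desired one before launch (a sketch; `rdma.env`, `pipeline.env`, and the launch line are illustrative names, not ALCF-provided files):

```
# rdma.env and pipeline.env hold the corresponding export lines from above
source ./rdma.env        # or: source ./pipeline.env
mpiexec -n 24 -ppn 12 ./my_app
```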
### 4. `failed to convert GOTPCREL relocation`

If you see a compile- or link-time error in SYCL code like

```
_libm_template.c:(.text+0x7): failed to convert GOTPCREL relocation against '__libm_acos_chosen_core_func_x'; relink with --no-relax
```

try linking with `-flink-huge-device-code`.
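
As an illustrative sketch of where the flag goes (assuming the oneAPI `icpx` SYCL compiler driver; the file names are placeholders):

```
# Add -flink-huge-device-code at the link step (driver and file names are illustrative)
icpx -fsycl -c my_kernels.cpp -o my_kernels.o
icpx -fsycl my_kernels.o -flink-huge-device-code -o my_app
```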

### 5. General MPI Error

Similar to issue 1 above, it may be useful to try other libfabric environment settings. In particular, the settings below, which are what Cray MPI sets by default, may be worth trying; see [Cray MPI libfabric Settings](https://cpe.ext.hpe.com/docs/24.03/mpt/mpich/intro_mpi.html#libfabric-environment-variables-for-hpe-slingshot-nic-slingshot-11).
```
export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_RDZV_EAGER_SIZE=2048
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=1024
export FI_CXI_OFLOW_BUF_SIZE=12582912
export FI_CXI_OFLOW_BUF_COUNT=3
export FI_CXI_RX_MATCH_MODE=hardware
export FI_CXI_REQ_BUF_MIN_POSTED=6
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_MR_CACHE_MAX_SIZE=-1
export FI_MR_CACHE_MAX_COUNT=524288
export FI_CXI_REQ_BUF_MAX_CACHED=0
```

### 6. SYCL Device Free Memory Query Error

Note that if you are querying the free memory on a device with the Intel SYCL extension `get_info<sycl::ext::intel::info::device::free_memory>()`, you will need to set `export ZES_ENABLE_SYSMAN=1`. Otherwise you may see an error like:

```
x1921c1s4b0n0.hostmgmt2000.cm.americas.sgi.com 0: terminate called after throwing an instance of ...
 what(): The device does not have the ext_intel_free_memory aspect -33 (PI_ERROR_INVALID_DEVICE)
```
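
A minimal sketch of applying the workaround at launch time (rank counts and the application name are placeholders):

```
# Enable the Level Zero SysMan interface so the free-memory query succeeds
export ZES_ENABLE_SYSMAN=1
mpiexec -n 12 -ppn 12 ./my_sycl_app
```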

### 7. `No VNIs available in internal allocator.`

If you see an error like `start failed on x4102c5s2b0n0: No VNIs available in internal allocator`, pass `--no-vni` to `mpiexec`.
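
For example (rank counts and the application name are placeholders):

```
# Work around the VNI allocation failure by disabling VNIs at launch
mpiexec --no-vni -n 24 -ppn 12 ./my_app
```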

### 8. `PMIX ERROR: PMIX_ERR_NOT_FOUND` and `PMIX ERROR: PMIX_ERROR`

When running on a single node, you may observe these error messages:
```
PMIX ERROR: PMIX_ERR_NOT_FOUND in file dstore_base.c at line 1567
PMIX ERROR: PMIX_ERROR in file dstore_base.c at line 2334
```
These errors can be safely ignored.


## Submitting Jobs

Jobs may occasionally fail to start, particularly at higher node counts. If no error message is apparent, check the `comment` field in the full job information using `qstat -xfw [JOBID] | grep comment`.
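
For example (the job id is a placeholder):

```
# Inspect the scheduler's comment field for a job that failed to start
# (123456 is a placeholder job id)
qstat -xfw 123456 | grep comment
```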
* Interim Filesystem: The early access filesystem is not highly performant. Intermittent hangs or pauses should be expected; waiting for IO to complete is recommended, and IO operations should complete without failure. Jobs requiring significant filesystem performance must be avoided at this time.
* A large number of Machine Check Events from the PVC can cause nodes to panic and reboot.
* HBM mode is not automatically validated. Jobs requiring flat memory mode should verify it by checking that `numactl -H` reports 4 NUMA memory nodes instead of 16 on the nodes (a quick check is sketched after this list).
* Application failures at large node counts are being tracked in the CNDA Slack workspace. See this [canvas table](https://alcf-cnda.slack.com/canvas/C05HMK7DD4J?focus_section_id=temp:C:EYXdcf8f1d1b86d44428a9abab5b) for more information and to document your case. ESP and ECP project members with access to Aurora should have access to the CNDA Slack workspace. Contact [email protected] if you have access to Aurora and belong to an ESP or ECP project but are not in the CNDA Slack workspace.
* Application failures at single-node scale are tracked in the JLSE wiki/confluence [page](https://apps.cels.anl.gov/confluence/pages/viewpage.action?pageId=4784336).
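
As a quick sketch of the memory-mode check mentioned in the list above:

```
# On a compute node: list the NUMA memory nodes reported
numactl -H | grep available
```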