This is a collection of known issues that have been encountered on Aurora.

A known issues [page](https://apps.cels.anl.gov/confluence/display/inteldga/Known+Issues) can be found in the CELS Wiki space used for NDA content. Note that this page requires a JLSE Aurora early hw/sw resource account for access.

## Runtime Errors

### 1. `Cassini Event Queue overflow detected.`

`Cassini Event Queue overflow detected.` errors may occur for certain MPI communications and can happen for a variety of reasons: software and hardware, job placement, job routing, and the state of the machine. Simply put, it means that one of the network interfaces is receiving messages faster than it can process them:

```
libfabric:16642:1701636928::cxi:core:cxip_cq_eq_progress():531<warn> x4204c1s3b0n0: Cassini Event Queue overflow detected.
```

One workaround is to increase the completion queue size and adjust its fill threshold, e.g. by setting `FI_CXI_DEFAULT_CQ_SIZE=131072` and `FI_CXI_CQ_FILL_PERCENT=20`.

The value of `FI_CXI_DEFAULT_CQ_SIZE` can be set to something larger if issues persist. This is directly impacted by the number of unexpected messages sent and so may need to be increased as the scale of the job increases.
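
As a minimal sketch of where these settings go (assuming a PBS batch script and the `mpiexec` launcher; the select/walltime values, rank counts, and `./my_app` are placeholders):

```
#!/bin/bash -l
#PBS -l select=2
#PBS -l walltime=00:30:00

# Enlarge the libfabric CXI completion queue before launching the application
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_CQ_FILL_PERCENT=20

mpiexec -n 24 -ppn 12 ./my_app
```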

### 2. `double free detected` while running with the mpich/52.2/* modules

A core dump might indicate a problem in communicator cleanup, e.g. after calling `MPI_Comm_split_type`. A workaround is to unset a few config-file-related variables:
```
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
```
Additional information is available at https://github.com/pmodels/mpich/pull/6730.

### 3. Slower-than-expected GPU-aware MPI

If GPU-aware MPI is slower than expected, try one of the following two sets of environment variables.

- RDMA:

```
export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
export MPIR_CVAR_CH4_OFI_ENABLE_MR_HMEM=0
export MPIR_CVAR_CH4_OFI_ENABLE_MULTI_NIC_STRIPING=0
export MPIR_CVAR_CH4_OFI_MAX_NICS=8
export MPIR_CVAR_CH4_OFI_GPU_RDMA_THRESHOLD=0
```

It may also be useful to try other libfabric environment settings, in particular those that Cray MPI sets by default; see [Cray MPI libfabric Settings](https://cpe.ext.hpe.com/docs/24.03/mpt/mpich/intro_mpi.html#libfabric-environment-variables-for-hpe-slingshot-nic-slingshot-11).

- Pipelining:

```
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_THRESHOLD=0
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=4194304
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_NUM_BUFFERS_PER_CHUNK=256
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_MAX_NUM_BUFFERS=256
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_D2H_ENGINE_TYPE=0
```
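
One way to switch between the two sets above is to keep each in its own file and source the desired one before launch (a sketch; `rdma.env`, `pipeline.env`, and the launch line are illustrative names, not ALCF-provided files):

```
# rdma.env and pipeline.env hold the corresponding export lines from above
source ./rdma.env        # or: source ./pipeline.env
mpiexec -n 24 -ppn 12 ./my_app
```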
### 4. `failed to convert GOTPCREL relocation`

If you see a compile- or link-time error in SYCL code like

```
_libm_template.c:(.text+0x7): failed to convert GOTPCREL relocation against '__libm_acos_chosen_core_func_x'; relink with --no-relax
```

try linking with `-flink-huge-device-code`.
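
As an illustrative sketch of where the flag goes (assuming the oneAPI `icpx` SYCL compiler driver; the file names are placeholders):

```
# Add -flink-huge-device-code at the link step (driver and file names are illustrative)
icpx -fsycl -c my_kernels.cpp -o my_kernels.o
icpx -fsycl my_kernels.o -flink-huge-device-code -o my_app
```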

### 5. General MPI Error

Similar to issue 1 above, it may be useful to try other libfabric environment settings. In particular, the settings below, which are what Cray MPI sets by default, may be worth trying; see [Cray MPI libfabric Settings](https://cpe.ext.hpe.com/docs/24.03/mpt/mpich/intro_mpi.html#libfabric-environment-variables-for-hpe-slingshot-nic-slingshot-11).
```
export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_RDZV_EAGER_SIZE=2048
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_DEFAULT_TX_SIZE=1024
export FI_CXI_OFLOW_BUF_SIZE=12582912
export FI_CXI_OFLOW_BUF_COUNT=3
export FI_CXI_RX_MATCH_MODE=hardware
export FI_CXI_REQ_BUF_MIN_POSTED=6
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_MR_CACHE_MAX_SIZE=-1
export FI_MR_CACHE_MAX_COUNT=524288
export FI_CXI_REQ_BUF_MAX_CACHED=0
```

### 6. SYCL Device Free Memory Query Error

Note that if you are querying the free memory on a device with the Intel SYCL extension `get_info<sycl::ext::intel::info::device::free_memory>()`, you will need to set `export ZES_ENABLE_SYSMAN=1`. Otherwise you may see an error like:

```
x1921c1s4b0n0.hostmgmt2000.cm.americas.sgi.com 0: terminate called after throwing an instance of ...
 what(): The device does not have the ext_intel_free_memory aspect -33 (PI_ERROR_INVALID_DEVICE)
```
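
A minimal sketch of applying the workaround at launch time (rank counts and the application name are placeholders):

```
# Enable the Level Zero SysMan interface so the free-memory query succeeds
export ZES_ENABLE_SYSMAN=1
mpiexec -n 12 -ppn 12 ./my_sycl_app
```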

### 7. `No VNIs available in internal allocator.`

If you see an error like `start failed on x4102c5s2b0n0: No VNIs available in internal allocator`, pass `--no-vni` to `mpiexec`.
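
For example (rank counts and the application name are placeholders):

```
# Work around the VNI allocation failure by disabling VNIs at launch
mpiexec --no-vni -n 24 -ppn 12 ./my_app
```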

### 8. `PMIX ERROR: PMIX_ERR_NOT_FOUND` and `PMIX ERROR: PMIX_ERROR`

When running on a single node, you may observe these error messages:
```
PMIX ERROR: PMIX_ERR_NOT_FOUND in file dstore_base.c at line 1567
PMIX ERROR: PMIX_ERROR in file dstore_base.c at line 2334
```
These errors can be safely ignored.


## Submitting Jobs

Jobs may occasionally fail to start, particularly at higher node counts. If no error message is apparent, check the `comment` field in the full job information using `qstat -xfw [JOBID] | grep comment`.
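
For example (the job id is a placeholder):

```
# Inspect the scheduler's comment field for a job that failed to start
# (123456 is a placeholder job id)
qstat -xfw 123456 | grep comment
```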
* Interim Filesystem: The early access filesystem is not highly performant. Intermittent hangs or pauses should be expected; waiting for IO to complete is recommended, and IO operations should complete without failure. Jobs requiring significant filesystem performance must be avoided at this time.
* A large number of Machine Check Events from the PVC can cause nodes to panic and reboot.
* HBM mode is not automatically validated. Jobs requiring flat memory mode should verify it by checking that `numactl -H` reports 4 NUMA memory nodes instead of 16 on the nodes (a quick check is sketched after this list).
* Application failures at large node counts are being tracked in the CNDA Slack workspace. See this [canvas table](https://alcf-cnda.slack.com/canvas/C05HMK7DD4J?focus_section_id=temp:C:EYXdcf8f1d1b86d44428a9abab5b) for more information and to document your case. ESP and ECP project members with access to Aurora should have access to the CNDA Slack workspace. Contact [email protected] if you have access to Aurora and belong to an ESP or ECP project but are not in the CNDA Slack workspace.
* Application failures at single-node scale are tracked in the JLSE wiki/confluence [page](https://apps.cels.anl.gov/confluence/pages/viewpage.action?pageId=4784336).
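
As a quick sketch of the memory-mode check mentioned in the list above:

```
# On a compute node: list the NUMA memory nodes reported
numactl -H | grep available
```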