From d6db92bf7ad9364b30ba278f00719edfeedaf2df Mon Sep 17 00:00:00 2001
From: Gera Shegalov <gera@apache.org>
Date: Tue, 1 Feb 2022 09:48:14 -0800
Subject: [PATCH] Add language clarifying GPU address in Spark on YARN with 
 isolation (#93)

Explain why the user observes GPU index 0 for all executors, and limitations of using MIG in a common section regardless of the approach.

Signed-off-by: Gera Shegalov <gera@apache.org>
---
 README.md                                     | 10 +--
 examples/MIG-Support/README.md                | 61 +++++++++++++++++++
 examples/MIG-Support/yarn-unpatched/README.md |  7 +--
 3 files changed, 65 insertions(+), 13 deletions(-)
 create mode 100644 examples/MIG-Support/README.md

diff --git a/README.md b/README.md
index 0a0f80aea..bad587561 100644
--- a/README.md
+++ b/README.md
@@ -45,14 +45,8 @@ This is an example of the GPU accelerated PCA algorithm running on Spark. For de
 [guide](/examples/Spark-cuML/pca/README.md).
 
 ### 5. MIG support
-We provide some guides about the Multi-Instance GPU (MIG) feature based on the NVIDIA Ampere architecture (such as NVIDIA A100 and A30) GPU.
-- [YARN 3.3.0+ MIG GPU Plugin](/examples/MIG-Support/device-plugins/gpu-mig) for adding a Java-based plugin for MIG
-on top of the Pluggable Device Framework
-- [YARN 3.1.2 until YARN 3.3.0 MIG GPU Support](/examples/MIG-Support/resource-types/gpu-mig) for
-patching and rebuilding YARN code base to support MIG devices.
-- [YARN 3.1.2+ MIG GPU Support without modifying YARN / Device Plugin Code](/examples/MIG-Support/yarn-unpatched)
-relying on installing nvidia CLI wrappers written in `bash`, but unlike the solutions above without
-any Java code changes.
+We provide some [guides](/examples/MIG-Support/README.md) for the Multi-Instance GPU (MIG) feature of GPUs based on
+the NVIDIA Ampere architecture (such as NVIDIA A100 and A30).
 
 ## API
 ### 1. Xgboost examples API
diff --git a/examples/MIG-Support/README.md b/examples/MIG-Support/README.md
new file mode 100644
index 000000000..0e9c36ded
--- /dev/null
+++ b/examples/MIG-Support/README.md
@@ -0,0 +1,61 @@
+# Multi-Instance GPU (MIG) support in Apache Hadoop YARN
+
+There are multiple solutions for MIG scheduling on YARN that you can choose based on your environment and
+deployment requirements:
+
+- [YARN 3.3.0+ MIG GPU Plugin](/examples/MIG-Support/device-plugins/gpu-mig) for adding a Java-based plugin for MIG
+on top of the Pluggable Device Framework
+- [YARN 3.1.2 until YARN 3.3.0 MIG GPU Support](/examples/MIG-Support/resource-types/gpu-mig) for
+patching and rebuilding YARN code base to support MIG devices.
+- [YARN 3.1.2+ MIG GPU Support without modifying YARN / Device Plugin Code](/examples/MIG-Support/yarn-unpatched)
+relying on installing nvidia CLI wrappers written in `bash`, but unlike the solutions above without
+any Java code changes.
+
+## Limitations and Caveats
+
+Note that there are some common caveats for the solutions above.
+
+### Single MIG GPU per Container
+
+Please see the [MIG Application Considerations](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#app-considerations)
+and [CUDA Device Enumeration](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices).
+
+It is important to note that CUDA 11 only supports enumeration of a single MIG instance.
+We therefore recommend configuring YARN to allow only a single GPU to be requested per container. See
+the YARN config `yarn.resource-types.nvidia/miggpu.maximum-allocation` for the
+[Pluggable Device Framework](/examples/MIG-Support/device-plugins/gpu-mig) solution and
+`yarn.resource-types.yarn.io/gpu.maximum-allocation` for the remaining MIG support options above.
+
+### Metrics
+Some metrics are not and cannot be broken down by MIG device. For example, `utilization` is the
+aggregate utilization of the parent GPU, and there is no attribution of `temperature` to a
+particular MIG device.
+
+### GPU index / address as reported by Apache Spark in logs and UI
+
+When YARN isolation uses the NVIDIA Container Runtime to expose a single visible device
+to each Docker container running a Spark Executor, each Executor sees its own device list
+containing just that one device.
+Therefore, the user will observe index 0 reported for all executors. However, these indices refer
+to different GPU/MIG instances. You can verify this by running something like the following on a
+YARN worker node host OS:
+
+```bash
+for cid in $(sudo docker ps -q); do sudo docker exec $cid bash -c "printenv | grep VISIBLE; nvidia-smi -L"; done
+NVIDIA_VISIBLE_DEVICES=3
+GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
+  MIG 1g.6gb      Device  0: (UUID: MIG-70dc024a-e8d7-587c-81dd-57ad493b1d91)
+NVIDIA_VISIBLE_DEVICES=1
+GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
+  MIG 1c.2g.12gb  Device  0: (UUID: MIG-54cc2421-6f2d-59e9-b074-20707aadd71e)
+NVIDIA_VISIBLE_DEVICES=2
+GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
+  MIG 1g.6gb      Device  0: (UUID: MIG-7e5552bf-d328-57a8-b091-0720d4530ffb)
+NVIDIA_VISIBLE_DEVICES=0
+GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
+  MIG 1c.2g.12gb  Device  0: (UUID: MIG-e6af58f0-9af8-594f-825e-74d23e1a68c1)
+```
+
+
+
+
diff --git a/examples/MIG-Support/yarn-unpatched/README.md b/examples/MIG-Support/yarn-unpatched/README.md
index 4ec3279c4..c36e7f043 100644
--- a/examples/MIG-Support/yarn-unpatched/README.md
+++ b/examples/MIG-Support/yarn-unpatched/README.md
@@ -20,7 +20,8 @@ to discover GPUs. It replaces MIG-enabled GPUs with the list of `<gpu>` elements
 ## Installation
 
 These instructions assume NVIDIA Container Toolkit (nvidia-docker2) and YARN is already installed
-and configured with [CGroups enabled](https://hadoop.apache.org/docs/r3.1.2/hadoop-yarn/hadoop-yarn-site/UsingGpus.html).
+and configured with GPU Scheduling and
+[CGroups enabled](https://hadoop.apache.org/docs/r3.1.2/hadoop-yarn/hadoop-yarn-site/UsingGpus.html).
 
 Enable and configure your [GPUs with MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html) on all of the nodes
 it applies to.
@@ -76,7 +77,3 @@ environment = [ "MIG_AS_GPU_ENABLED=1",  "REAL_NVIDIA_SMI_PATH=/if/non-default/p
 Note, the values for `MIG_AS_GPU_ENABLED`, `REAL_NVIDIA_SMI_PATH`, `ENABLE_NON_MIG_GPUS` should be
 identical to the ones specified in `yarn-env.sh`.
 
-## Limitations and Caveats
-Some metrics are not and cannot be broken down by MIG device. For example, `utilization` is the
-aggregate utilization of the parent GPU, and there is no attribution of `temperature` to a
-particular MIG device.
\ No newline at end of file