UCF 2023 Schedule
| Date | Time | Topic | Speaker/Moderator |
|---|---|---|---|
| 12/5 | 09:00-09:15 | Opening Remarks and UCF. Unified Communication Framework (UCF): a collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications. In this talk we will present recent advances in the development of UCF projects, including Open UCX and Apache Spark UCX, as well as incubation projects in the areas of SmartNIC programming, benchmarking, and other areas of accelerated compute. | Gilad Shainer, NVIDIA. Gilad Shainer serves as senior vice president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. He serves as the chairman of the HPC-AI Advisory Council, the president of the UCF and CCIX consortiums, a member of the IBTA, and a contributor to the PCI-SIG PCI-X and PCIe specifications. He holds multiple patents in the field of high-speed networking, and is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. He holds MSc and BSc degrees in Electrical Engineering from the Technion Institute of Technology in Israel. |
| | 09:15-10:00 | Recent Advances in UCX for AMD GPUs. This talk will focus on recent developments in UCX to support AMD GPUs and the ROCm software stack. The presentation will cover some of the most relevant enhancements to the ROCm components in UCX over the last year, including: 1) enhancements to the uct/rocm-copy component's zero-copy functionality to improve device-to-host and host-to-device transfers: by allowing the zero-copy operations to run asynchronously, device-to-host and host-to-device transfers can overlap the various stages of the rendezvous protocol, leading to up to 30% performance improvements in our measurements; 2) support for dma-buf based memory registration for ROCm devices: the Linux kernel's dma-buf mechanism is a portable mechanism for sharing device buffers across multiple devices by creating a dma-buf handle at the source and importing the handle at the consumer side. ROCm 5.6 introduced the runtime functionality to export a user-space dma-buf handle of GPU device memory, and support has been added to the ROCm memory domain in UCX starting from release 1.15.0; 3) updates required to support the ROCm versions released during the year (ROCm 5.4, 5.5, 5.6). Furthermore, the presentation will include some details on ongoing work to take advantage of new interfaces available starting from ROCm 5.7, which allow explicit control over which DMA engine(s) to use for inter-process device-to-device data transfer operations. | Edgar Gabriel, AMD (BIO) |
| | 10:00-11:00 | UCX Backend for Realm: Design, Benefits, and Feature Gaps. Realm is a fully asynchronous low-level runtime for heterogeneous distributed-memory machines. It is a key part of the LLR software stack (Legate/Legion/Realm), which provides the foundation for a composable software ecosystem that can transparently scale to multi-GPU and multi-node systems. Realm enables users to express a parallel application as an explicit, dynamic task graph that is managed by the runtime system. With direct access to the task graph, Realm takes responsibility for all synchronization and scheduling, which not only removes the burden from programmers' shoulders but can also yield higher performance, as well as better performance portability. As a dynamic, explicit task-graph management system, Realm must manage the task graph as it is generated online at runtime. Thus, it is essential to lower the runtime system overheads, including the additional cost of communication in distributed-memory machines. In this talk, we will present the design and implementation of the UCX network module for Realm. More specifically, we describe how we implement Realm's active message API over UCX and discuss the advantages of using UCX compared to the existing GASNet-EX backend. We also point out the challenges and feature gaps we face in implementing Realm's network module over UCX, the workarounds we use to avoid them, and potential ways to address them within UCX itself. | Hessam Mirsadeghi, NVIDIA (BIO) |
| | 11:00-11:30 | Lunch | |
| | 11:30-12:15 | Use In-Chip Memory for RDMA Operations. Some modern RDMA devices contain fast on-chip memory that can be accessed without crossing the PCIe bus. This hardware capability can be leveraged to improve the performance of SHMEM atomic operations at large scale. In this work, we extend the UCX API to support allocating on-chip memory, use it in the OpenSHMEM layer to implement shmem_alloc_with_hint, and demonstrate performance improvements in existing atomic benchmarks. | Roie Danino, NVIDIA (BIO) |
| | 12:15-13:00 | Low-Latency MPI RMA: Implementation and Challenges. Many applications rely on active and local completion semantics, such as point-to-point or active MPI-RMA. The point-to-point approach has been heavily optimized over time in most MPI distributions. However, MPI-RMA semantics have been overlooked over the past few years and therefore suffer from inefficiencies in current implementations, especially for GPU-to-GPU communication. In this talk, we will present our current effort toward low-latency MPI-RMA. While MPI-RMA offers low latency, especially with local completion, it can easily be overwhelmed by the cost of synchronization and notification. In this work, we investigate different strategies for a local completion mechanism similar to the existing PSCW flavor. We will first detail those strategies as well as their implementations. Then, we will present our results comparing the different approaches and identify gaps in the interface that could be addressed as part of the MPI-5 standard. | Thomas Gillis, ANL (BIO) |
| | 13:00-13:45 | UCX Protocols for NVIDIA Grace Hopper. NVIDIA Grace Hopper provides developer-productivity features such as hardware-accelerated CPU-GPU memory coherence and the ability to perform inter-process communication across OS instances over NVLink in the presence of NVSwitches. While developers can pass the same malloc'd memory to GPU kernels and communication routines on the Grace Hopper platform, because pages belonging to this memory can migrate between CPU and GPU, there are performance tradeoffs in communication cost depending on memory allocation choices for multi-GPU workloads. Also, inter-process communication across OS instances over NVLink is available only for specific memory types. In this talk, we will discuss: 1) the roadmap for protocol choices that the UCX communication library can make on the Grace Hopper platform to take advantage of features such as on-demand paging (ODP), multi-node NVLink, page-residence queries, and other techniques; 2) expected performance from using different communication paths at the UCT level; 3) potential options for UCP protocols v2 in selecting communication paths that the UCT layer will expose; 4) how protocol choice affects execution time and subsequent potential page migrations; 5) how the application layer can help the communication layer by using memory binding, hints with allocation APIs, and more to avoid common overheads. | Akshay Venkatesh, NVIDIA (BIO) |
| | 13:45-14:00 | Adjourn | |
| 12/6 | 09:00-09:15 | Day 2 Open and Recap | Pavel Shamis (Pasha), NVIDIA. Pavel Shamis is a Principal Research Engineer at Arm. His work is focused on co-designing software and hardware building blocks for high-performance interconnect technologies, development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development of multiple projects in high-performance communication domains, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, Open MPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and a co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI. |
| | 09:15-10:00 | An Implementation of LCI Backend Using UCX. High-performance computing and artificial intelligence have evolved to be the primary data-processing engines for wide commercial use. HPC clouds host growing numbers of users and applications, and therefore need to carefully manage network resources and provide performance isolation between workloads. We'll explore best practices for optimizing network activity and supporting a variety of applications and users on the same network, including application examples from on-premise clusters and from the Microsoft Azure HPC Cloud. | Gilad Shainer, NVIDIA (bio above); Jithin Jose, Microsoft (Speaker Bio) |
| | 10:00-11:00 | Spark Shuffle Offload on DPU (abstract) | Manjunath Gorentla Venkata, NVIDIA. Manjunath Gorentla Venkata is a director of architecture and principal HPC architect at NVIDIA. He has researched, architected, and developed multiple HPC products and features. His team is primarily responsible for developing features for parallel programming models, libraries, and network libraries to address the needs of HPC and AI/DL systems. The innovations architected and designed by him and his team land as features in NVIDIA networking products including UCC, UCX, CX HCAs, and BlueField DPUs. Prior to NVIDIA, Manju worked as a research scientist at DOE's ORNL, focused on middleware for HPC systems, including InfiniBand and Cray systems. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico. With Valentine Petrov, NVIDIA (BIO); Ferrol Aderholdt, NVIDIA (BIO); Sergey Lebdev, NVIDIA (BIO) |
| | 11:00-11:30 | Lunch | |
| | 11:30-12:30 | Investigating Performance Scalability for Small Message Aggregation in OpenSHMEM on Irregular Access Patterns. In this talk, we will discuss the current state of MPICH support for the UCX library, focusing on changes since the last annual meeting. Topics covered will include build configuration, point-to-point communication, RMA, multi-threading, GPU support, and more. We also look toward future UCX development items for the coming year. | Yanfei Guo, Argonne National Laboratory. Dr. Yanfei Guo holds an appointment as an Assistant Computer Scientist at Argonne National Laboratory. He is a member of the Programming Models and Runtime Systems group. He has been working on multiple software projects including MPI, Yaksa, and OSHMPI. His research interests include parallel programming models and runtime systems in extreme-scale supercomputing systems, data-intensive computing, and cloud computing systems. Yanfei received the best paper award at the USENIX International Conference on Autonomic Computing 2013 (ICAC'13). His work on programming models and runtime systems has been published in peer-reviewed conferences and journals including the ACM/IEEE Supercomputing Conference (SC'14, SC'15) and IEEE Transactions on Parallel and Distributed Systems (TPDS). |
| | 12:30-13:15 | Wire Compatibility in UCX. Applications that take advantage of GPU capabilities often use stream abstractions to express dependencies and concurrency and to make the best use of the underlying hardware capabilities. Streams capture the notion of a queue of tasks that the GPU executes in order. This allows for enqueuing and dequeuing of compute tasks (such as GPU kernels) and communication tasks (such as a memory copy between host and device memory). The GPU is not required to maintain any ordering between tasks belonging to different streams, and hence applications commonly use multiple streams to increase occupancy of GPU resources. A task enqueued onto a stream is generally asynchronous from the CPU's perspective but synchronous with respect to other tasks enqueued on the same stream. A current limitation in UCX (and most of the libraries that take advantage of UCX) is that it does not provide abstractions for building dependencies between tasks enqueued onto streams and UCX communication operations. This means that if the CPU is required to send the result of a GPU kernel to another peer process, it must first synchronize with the stream onto which the GPU kernel was enqueued. This results in wasted CPU resources when there exist methods of building communication dependencies without explicit CPU intervention in the critical path. The problem is especially important to solve in applications dominated by short-running kernels, where kernel launch overheads present the primary bottleneck. Finally, such capabilities are already part of existing communication libraries such as NCCL, so the limitation in UCX presents a gap that applications are looking to have addressed for better composition. In this work, we plan to present: 1) the current shortcomings of CPU-synchronous communication; 2) alternatives for extending the UCX API to embed stream objects into communication tasks; 3) stream-synchronous send/receive and progress semantics; 4) interoperability with CPU-synchronous semantics; 5) implications on protocol implementations for performance and overlap. | Akshay Venkatesh, NVIDIA (Speaker Bio); Sreeram Potluri, NVIDIA (Speaker Bio); Jim Dinan, NVIDIA. Jim Dinan is a principal engineer at NVIDIA on the GPU communications team. Prior to joining NVIDIA, Jim was a principal engineer at Intel and a James Wallace Givens postdoctoral fellow at Argonne National Laboratory. He earned a Ph.D. in computer science from The Ohio State University and a B.S. in computer systems engineering from the University of Massachusetts Amherst. Jim has served for more than a decade on open standards committees for HPC parallel programming models, including MPI and OpenSHMEM, and he currently leads the MPI Hybrid & Accelerator Working Group. Hessam Mirsadeghi, NVIDIA (Speaker Bio) |
| | 13:15-14:00 | Symmetric Remote Key with UCX. In this paper, we present a framework for moving compute and data between processing elements in a distributed heterogeneous system. The implementation of the framework is based on the LLVM compiler toolchain combined with the UCX communication framework. The framework can generate binary machine code or LLVM bitcode for multiple CPU architectures and move the code to remote machines while dynamically optimizing and linking the code on the target platform. The remotely injected code can recursively propagate itself to other remote machines or generate new code. The goal of this paper is threefold: (a) to present an architecture and implementation of the framework that provides essential infrastructure for programming a new class of disaggregated systems wherein heterogeneous programming elements such as compute nodes and data processing units (DPUs) are distributed across the system; (b) to demonstrate how the framework can be integrated with modern, high-level programming languages such as Julia; and (c) to demonstrate and evaluate a new class of eXtended Remote Direct Memory Access (X-RDMA) communication operations that are enabled by this framework. To evaluate the capabilities of the framework, we used a cluster with Fujitsu CPUs and a heterogeneous cluster with Intel CPUs and BlueField-2 DPUs interconnected by a high-performance RDMA fabric. We demonstrated an X-RDMA pointer-chase application that outperforms an RDMA GET-based implementation by 70% and is as fast as active messages, but does not require function predeployment on remote platforms. | Luis E. Peña, Arm (Speaker bio) |
| | 14:00 | Adjourn | |
| 12/7 | 08:00-08:15 | Opening Remarks for OpenSHMEM Session. OpenSHMEM update. | Steve Poole, LANL (BIO) |
| | 08:15-08:45 | Rethinking OpenSHMEM Concepts for Better Small Message Performance. OpenSHMEM update. | Aaron Welch, University of Houston (BIO) |
| | 08:45-09:15 | QoS-based Interfaces for Taming Tail Latency. OpenSHMEM update. | Vishwanath Venkatesan, NVIDIA, and Manjunath Gorentla Venkata, NVIDIA (BIO) |
| | 09:15-10:00 | Break | |
| | 10:00-11:00 | Panel: Future Direction of OpenSHMEM and Related Technologies. Community discussion. | Steve Poole, LANL; Pavel Shamis, NVIDIA; Oscar Hernandez, NVIDIA; Tony Curtis, Stony Brook University; Jim Dinan, NVIDIA; Manjunath Gorentla Venkata, NVIDIA; Matthew Baker, Voltron Data (BIO) |
| | 11:00-11:05 | Adjourn | |