CSE 585: Advanced Scalable Systems for Generative AI (F'24)

Administrivia

  • Catalog Number: 34176
  • Lectures/Discussion: 1003 EECS, TTh: 10:30 AM – 12:00 PM
  • Projects/Makeup: 185 EWRE, F 1:30 PM – 2:30 PM
  • Counts as: Software Breadth and Depth (PhD); Technical Elective and 500-Level (MS/E)

Team

| Member (uniqname) | Role | Office Hours |
| --- | --- | --- |
| Mosharaf Chowdhury (mosharaf) | Faculty | 4820 BBB. By appointment only. |
| Insu Jang (insujang) | GSI | 4828 BBB. Friday 12:30PM - 1:30PM |

Piazza

ALL communication regarding this course must be via Piazza. This includes questions, discussions, announcements, as well as private messages.

Presentation slides and paper summaries should be emailed to [email protected].

Course Description

This iteration of CSE585 will introduce you to the key concepts and the state of the art in practical, scalable, and fault-tolerant software systems for emerging Generative AI (GenAI), and encourage you to think about either building new tools or applying existing ones in your own research.

Since datacenters and cloud computing form the backbone of modern computing, we will start with an overview of the two. We will then take a deep dive into systems for the Generative AI landscape, focusing on different types of problems. Our topics will include the basics of generative models from a systems perspective and systems for the GenAI lifecycle, including pre-training, fine-tuning/alignment, grounding, and inference serving. We will cover GenAI topics from top conferences that take a systems view of the relevant challenges.

Note that this course is NOT focused on AI methods. Instead, we will focus on how one can build software systems so that existing AI methods can be used in practice and new AI methods can emerge.

Prerequisites

Students are expected to have good programming skills and must have taken at least one undergraduate-level systems-related course (from operating systems/EECS482, databases/EECS484, distributed systems/EECS491, and networking/EECS489). Having taken an undergraduate ML/AI course may be helpful but is not required.

Textbook

This course has no textbooks. We will read recent papers from top venues to understand trends in scalable (GenAI) systems and their applications.

Tentative Schedule and Reading List

This is an evolving list and subject to changes due to the breakneck pace of GenAI innovations.

| Date | Readings | Presenter | Summary | Reviewer |
| --- | --- | --- | --- | --- |
| Aug 27 | Introduction | Mosharaf | | |
| | How to Read a Paper (Required) | | | |
| | How to Give a Bad Talk (Required) | | | |
| | Writing Reviews for Systems Conferences | | | |
| | The Datacenter as a Computer (Chapters 1 and 2) | | | |
| | The Llama 3 Herd of Models | | | |
| GenAI Basics | | | | |
| Aug 29 | The Illustrated Transformer (Required) | Insu | | |
| | The Illustrated GPT2 | | | |
| | Challenges and Applications of Large Language Models | | | |
| | Attention is All You Need | | | |
| Sep 3 | The Illustrated Stable Diffusion (Required) | Shiqi, Insu, Mosharaf | | |
| | Multimodality and Large Multimodal Models (LMMs) (Required) | | | |
| | Improved Baselines with Visual Instruction Tuning | | | |
| | NExT-GPT: Any-to-Any Multimodal LLM | | | |
| Sep 5 | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Required) | Mosharaf, Insu | | |
| | Mixture of Experts Explained | | | |
| | Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Required) | | | |
| | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | | | |
| Sep 10 | No Lecture: Work on Project Proposals | | | |
| | Worse is Better (Required) | | | |
| Sep 12 | No Lecture: Work on Project Proposals | | | |
| | Hints and Principles for Computer System Design (Required) | | | |
| Pre-Training | | | | |
| Sep 17 | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Required) | Kaiwen Xue, Yi Chen, Yunchi Lu | Haotian Gong, Zhongwei Xu, Zheng Li, Siyuan Dong | Brandon Zhang, Justin Paul, Sreya Gogineni, Sarah Stec |
| | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | | | |
| | Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (Required) | | | |
| | MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | | | |
| Sep 19 | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (Required) | Runyu Lu, Jeff Ma, Ruofan Wu | Melina O'Dell, Yutong Ai, Jeff Brill, Max Liu | Paul-Andrei Aldea, Ho Jung Kim, Michael Hwang, Jason Liang |
| | RingAttention with Blockwise Transformers for Near-Infinite Context | | | |
| | Tutel: Adaptive Mixture-of-Experts at Scale (Required) | | | |
| Sep 24 | ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (Required) | Haotian Gong, Zhongwei Xu, Zheng Li, Siyuan Dong | Kaiwen Xue, Yi Chen, Yunchi Lu | Melina O'Dell, Yutong Ai, Jeff Brill, Max Liu |
| | Reducing Activation Recomputation in Large Transformer Models (Required) | | | |
| | GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection | | | |
| Sep 26 | FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline (Required) | Alex Zhang, Jiyu Chen, Kasey Feng, Yuning Cong | Runyu Lu, Jeff Ma, Ruofan Wu | Kaiwen Xue, Yi Chen, Yunchi Lu |
| | Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters (Required) | | | |
| | Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement | | | |
| Post-Training | | | | |
| Oct 1 | LoRA: Low-Rank Adaptation of Large Language Models (Required) | Connor Wilkinson, Kevin Sun, Oskar Shiomi Jensen | Rishika Varma Kalidindi, Nicholas Mellon, Sitota Mersha, Kalab Assefa | Runyu Lu, Jeff Ma, Ruofan Wu |
| | LIMA: Less Is More for Alignment | | | |
| | The Llama 3 Herd of Models (Section 4) | | | |
| | Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Required) | | | |
| Inference | | | | |
| Oct 3 | Efficient Memory Management for Large Language Model Serving with PagedAttention (Required) | Yuchen Xia, Yeda Song, Hendrik Mayer, Erik Chi | Shmeelok Chakraborty, Peter Cao, Harsh Sinha, Divyam S. | Haotian Gong, Zhongwei Xu, Zheng Li, Siyuan Dong |
| | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention (Required) | | | |
| | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | | | |
| Oct 8 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (Required) | Aditya Singhvi, Alex de la Iglesia, Ammar Ahmed | Vatsal Joshi, Yoon Sung Ji, Omkar Yadav, Lohit Kamatham | Yuchen Xia, Yeda Song, Hendrik Mayer, Erik Chi |
| | FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics | | | |
| | SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification (Required) | | | |
| Oct 10 | Splitwise: Efficient Generative LLM Inference Using Phase Splitting (Required) | Shaurya Gunderia, Marissa Bhavsar, Raghav Ramesh, Chris Lin | Aryan Joshi, Adit Kolli, Anup Bagali, Keshav Singh | Shmeelok Chakraborty, Peter Cao, Harsh Sinha, Divyam S. |
| | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | | | |
| | Llumnix: Dynamic Scheduling for Large Language Model Serving (Required) | | | |
| Oct 15 | Fall Study Break | | | |
| Oct 17 | Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services (Required) | Rishika Varma Kalidindi, Nicholas Mellon, Sitota Mersha, Kalab Assefa | Paul-Andrei Aldea, Ho Jung Kim, Michael Hwang, Jason Liang | Aryan Joshi, Adit Kolli, Anup Bagali, Keshav Singh |
| | Fairness in Serving Large Language Models (Required) | | | |
| | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | | | |
| Oct 22 | Mid-Semester Presentations | | | |
| Oct 24 | Mid-Semester Presentations | | | |
| Oct 29 | No Lecture: Recalibrate Projects | | | |
| Oct 31 | No Lecture: Work on Projects | | | |
| Nov 5 | No Lecture: Work on Projects | | | |
| Nov 7 | dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (Required) | Vatsal Joshi, Yoon Sung Ji, Omkar Yadav, Lohit Kamatham | Shaurya Gunderia, Marissa Bhavsar, Raghav Ramesh, Chris Lin | Rishika Varma Kalidindi, Nicholas Mellon, Sitota Mersha, Kalab Assefa |
| | Mixture of LoRA Experts (Required) | | | |
| | MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving | | | |
| Nov 12 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration (Required) | Paul-Andrei Aldea, Ho Jung Kim, Michael Hwang, Jason Liang | Aditya Singhvi, Alex de la Iglesia, Ammar Ahmed | Shaurya Gunderia, Marissa Bhavsar, Raghav Ramesh, Chris Lin |
| | LLM in a flash: Efficient Large Language Model Inference with Limited Memory (Required) | | | |
| | SpotServe: Serving Generative Large Language Models on Preemptible Instances | | | |
| Grounding | | | | |
| Nov 14 | Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Required) | Brandon Zhang, Justin Paul, Sreya Gogineni, Sarah Stec | Alex Zhang, Jiyu Chen, Kasey Feng, Yuning Cong | Aditya Singhvi, Alex de la Iglesia, Ammar Ahmed |
| | Fast Vector Query Processing for Large Datasets Beyond GPU Memory with Reordered Pipelining (Required) | | | |
| | Improving Language Models by Retrieving from Trillions of Tokens | | | |
| GenAI (for) Systems | | | | |
| Nov 19 | Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (Required) | Aryan Joshi, Adit Kolli, Anup Bagali, Keshav Singh | Connor Wilkinson, Kevin Sun, Oskar Shiomi Jensen | Alex Zhang, Jiyu Chen, Kasey Feng, Yuning Cong |
| | The Shift from Models to Compound AI Systems | | | |
| | Automatic Root Cause Analysis via Large Language Models for Cloud Incidents (Required) | | | |
| Power and Energy Optimizations | | | | |
| Nov 21 | Perseus: Reducing Energy Bloat in Large Model Training (Required) | Shmeelok Chakraborty, Peter Cao, Harsh Sinha, Divyam S. | Brandon Zhang, Justin Paul, Sreya Gogineni, Sarah Stec | Connor Wilkinson, Kevin Sun, Oskar Shiomi Jensen |
| | DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency (Required) | | | |
| | Characterizing Power Management Opportunities for LLMs in the Cloud | | | |
| Ethical Considerations | | | | |
| Nov 26 | Sociotechnical Safety Evaluation of Generative AI Systems (Required) | Melina O'Dell, Yutong Ai, Jeff Brill, Max Liu | Yuchen Xia, Yeda Song, Hendrik Mayer, Erik Chi | Vatsal Joshi, Yoon Sung Ji, Omkar Yadav, Lohit Kamatham |
| | On the Dangers of Stochastic Parrots: Can Language Models be too Big?🦜 (Required) | | | |
| | Foundation Models and Fair Use | | | |
| Nov 28 | No Lecture: Thanksgiving Recess | | | |
| Dec 3 | Wrap Up | Mosharaf | | |
| | How to Write a Great Research Paper (Required) | | | |
| Dec 5 | Final Poster Presentations, Tishman Hall @Beyster (10:30AM - 12PM) | | | |
| | Template | | | |

Policies

Honor Code

The Engineering Honor Code applies to all activities related to this course.

Groups

All activities of this course will be performed in groups of 3-4 students.

Required Reading

Each lecture will have two required readings that everyone must read.
There will be one or more optional related reading(s) that only the presenter(s) should be familiar with. They are optional for the rest of the class.

Student Lectures

The course will be conducted as a seminar. Only one group will present in each class. Each group will be assigned at least one lecture over the course of the semester. Presentations should last at most 40 minutes without interruption. However, presenters should expect questions and interruptions throughout.

In the presentation, you should:

  • Provide necessary background and motivate the problem.
  • Present the high-level idea, approach, and/or insight (using examples, whenever appropriate) in the required reading as well as the additional reading.
  • Discuss technical details so that one can understand the key details without carefully reading the papers.
  • Explain the differences between related works.
  • Identify strengths and weaknesses of the required reading and propose directions of future research.

The slides for a presentation must be emailed to the instructor team at least 24 hours prior to the corresponding class. Use Google Slides to enable in-line comments and suggestions.

Lecture Summaries

Each group will also be assigned to write summaries for at least one lecture. A group will not be assigned to summarize the lecture it presented.

A paper summary must address the following five questions in sufficient detail (2-3 pages):

  • What is the problem addressed in the lecture, and why is this problem important?
  • What is the state of related work on this topic?
  • What is the proposed solution, and what key insight guides their solution?
  • What is one (or more) drawback or limitation of the proposal?
  • What are potential directions for future research?

The summary must be emailed to the instructor team within 24 hours after the corresponding presentation. Late summaries will not be counted. You should use this format for writing your summary. Use Google Docs to enable in-line comments and suggestions.

Allocate enough time for your reading, discuss as a group, write the summary carefully, and finally, include key observations from the class discussion.

Post-Presentation Panel Discussion

To foster a deeper understanding of the papers and encourage critical thinking, each lecture will be followed by a panel discussion. This discussion will involve three distinct roles played by different student groups, simulating an interactive and dynamic scholarly exchange.

Roles and Responsibilities

  1. The Authors
  • Group Assignment: The group that presents the paper and the group that writes the summary will play the role of the paper's authors.
  • Responsibility: As authors, you are expected to defend your paper against critiques, answer questions, and discuss how you might improve or extend your research in the future, akin to writing a rebuttal during the peer-review process.
  2. The Reviewers
  • Group Assignment: Each group will be assigned one slot in which to play the role of reviewers.
  • Responsibility: Reviewers critically assess the paper, posing challenging questions and highlighting potential weaknesses or areas for further investigation. Your goal is to engage in a constructive critique of the paper, simulating a peer review scenario.
  3. Rest of the Class
  • Responsibility:
    • You are required to submit one insightful question for each presented paper before each class.
    • During the panel discussions, feel free to actively ask questions and engage in the dialogue.

Participation

Given the discussion-based nature of this course, participation is required both for your own understanding and to improve the overall quality of the course. You are expected to attend all lectures (you may skip up to 2 lectures for legitimate reasons) and, more importantly, to participate in class discussions.

A key part of participation will be in the form of discussion on Piazza. The group in charge of the summary should initiate the discussion and the rest should participate. Not everyone must add something every day, but it is expected that everyone has something to say over the semester.

Project

You will have to complete substantive work on an instructor-approved problem and make an original contribution. Surveys are not permitted as projects; however, each project must contain a survey of background and related work.

You must meet the following milestones (unless otherwise specified in future announcements) to ensure a high-quality project at the end of the semester:

  • Form a group of 3-4 members and declare your group's membership and paper preferences by September 5. After this date, we will form groups from the remaining students.
  • Turn in a 2-page draft proposal (including references) by September 19. Remember to include the names and Michigan email addresses of the group members.
  • Each group must present mid-semester progress during class hours on October 22 and October 24.
  • Each group must turn in an 8-page final report and your code via email on or before 6:00PM EST on December 16. The report must be submitted as a PDF file, with formatting similar to that of the papers you've read in the class. The self-contained (i.e., include ALL dependencies) code must be submitted as a zip file. Each zip file containing the code must include a README file with a step-by-step guide on how to compile and run the provided code.
  • You can find how to access GPU resources here.

Tentative Grading

| Component | Weight |
| --- | --- |
| Paper Presentation | 15% |
| Paper Summary | 15% |
| Participation | 10% |
| Project Report | 40% |
| Project Presentations | 20% |
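
As a rough illustration of how the tentative weights above combine, here is a minimal Python sketch. The weights are copied from the grading table. The function name `weighted_total`, the 0-100 scale for each component, and the example scores are assumptions made for illustration only, not part of the course policy.

```python
# Tentative weights copied from the grading table (they may change).
WEIGHTS = {
    "Paper Presentation": 0.15,
    "Paper Summary": 0.15,
    "Participation": 0.10,
    "Project Report": 0.40,
    "Project Presentations": 0.20,
}


def weighted_total(scores):
    """Combine per-component scores (assumed to be on a 0-100 scale)
    into a single 0-100 course total using WEIGHTS."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, purely for illustration.
example_scores = {
    "Paper Presentation": 90,
    "Paper Summary": 80,
    "Participation": 100,
    "Project Report": 85,
    "Project Presentations": 95,
}

print(round(weighted_total(example_scores), 2))
```

Under these weights, the project (report plus presentations) accounts for 60% of the total, so it dominates the final result.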