diff --git a/pdfs/jia_thesis.pdf b/pdfs/jia_thesis.pdf
new file mode 100644
index 0000000..28f2b83
Binary files /dev/null and b/pdfs/jia_thesis.pdf differ
diff --git a/publications.md b/publications.md
index 6c3b6ff..98e5194 100644
--- a/publications.md
+++ b/publications.md
@@ -49,6 +49,7 @@ title: Publications
   * [Sean Treichler's Thesis (2016)](#treichler_thesis) \[[PDF]({{ "/pdfs/treichler_thesis.pdf" | relative_url }})]
   * [Elliott Slaughter's Thesis (2017)](#slaughter_thesis) \[[PDF]({{ "/pdfs/slaughter_thesis.pdf" | relative_url }})]
   * [Wonchan Lee's Thesis (2019)](#lee_thesis) \[[PDF]({{ "/pdfs/lee_thesis.pdf" | relative_url }})]
+  * [Zhihao Jia's Thesis (2020)](#jia_thesis) \[[PDF]({{ "/pdfs/jia_thesis.pdf" | relative_url }})]
   * [Rupanshu Soi's Thesis (2021)](#soi_thesis) \[[PDF]({{ "/pdfs/soi_thesis.pdf" | relative_url }})]

 ## Papers
@@ -896,6 +897,73 @@
 significantly improves the efficiency of tasking, and thereby brings
 the strong scalability of explicit parallelism to implicit task
 parallelism.
+
+**Automated Discovery of Machine Learning Optimizations** [PDF]({{ "/pdfs/jia_thesis.pdf" | relative_url }})<br>
+*Zhihao Jia*
+August 2020
+**Abstract:** The increasing complexity of machine learning (ML)
+models and ML-specific hardware architectures makes it increasingly
+challenging to build efficient and scalable ML systems. Today's ML
+systems rely heavily on human effort to optimize the deployment of ML
+models on modern hardware platforms, which requires tremendous
+engineering effort yet still yields suboptimal runtime performance.
+Moreover, the rapid evolution of ML models and ML-specific hardware
+makes it infeasible to manually optimize performance for every model
+and hardware combination.
+
+In this dissertation, we propose a search-based methodology for
+building performant ML systems by automatically discovering
+performance optimizations for ML computations. Instead of considering
+only the limited set of manually designed optimizations found in
+current ML systems, our approach introduces a significantly more
+comprehensive search space of possible strategies for deploying an ML
+model on a hardware platform. In addition, we design efficient search
+algorithms to explore this space and discover highly optimized
+strategies. The search is guided by a cost model that evaluates the
+performance of candidate strategies, and we propose several techniques
+that accelerate the search by exploiting the topology of the search
+space.
+
+This dissertation presents three ML systems that apply this
+methodology to different tasks in ML deployment. Compared to current
+ML systems that rely on manually designed optimizations, our systems
+achieve better runtime performance by automatically discovering novel
+optimizations that current systems miss. Moreover, the improvement
+comes with less engineering effort, since discovering these
+optimizations requires far less code than implementing them manually.
+
+First, we developed TASO, the first ML graph optimizer that
+automatically generates graph optimizations. TASO formally verifies
+the correctness of the generated optimizations using an automated
+theorem prover, and uses cost-based backtracking search to discover
+how to apply the verified optimizations. In addition to improving
+runtime performance and reducing engineering effort, TASO provides
+correctness guarantees via formal methods.
+
+Second, to generalize beyond today's manually designed
+parallelization strategies for distributed ML computations, we
+introduce the SOAP search space, which contains a comprehensive set of
+strategies for parallelizing ML computations by identifying
+parallelization opportunities across different Samples, Operators,
+Attributes, and Parameters. We developed FlexFlow, a deep learning
+engine that automatically searches over strategies in the SOAP search
+space. FlexFlow includes a novel execution simulator to evaluate the
+runtime performance of different strategies, and uses a Markov chain
+Monte Carlo (MCMC) search algorithm to find performant strategies.
+FlexFlow discovers strategies that significantly outperform existing
+ones, while requiring no manual effort during the search.
+
+Finally, we developed Roc, which automates data placement
+optimizations and minimizes data transfers in the memory hierarchy for
+large-scale graph neural network (GNN) computations. Roc formulates
+data placement as a cost minimization problem and uses a dynamic
+programming algorithm to discover a globally optimal data management
+plan that minimizes transfers between memories.
+
*Rupanshu Soi*
December 2021