From 44d3474f348f6eca07caff7b53153d7d496b6004 Mon Sep 17 00:00:00 2001
From: wildkid1024
Date: Sat, 7 Dec 2024 19:14:39 +0800
Subject: [PATCH] add index

---
 docs/index.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/index.md b/docs/index.md
index 01ea819..5dd6e38 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -93,6 +93,9 @@
 - The acceleration design mainly considers two things: quantization and NCHW conversion; for codegen, three approaches are used: JSON, C, and a specific format;
 - At runtime, the model's inputs, outputs, and weights are managed, which can make use of techniques such as memory reuse and a cache engine
 
+### Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems[arXiv'23][CMU]
+ - See the paper notes in [LLM_Serving_Survey](LLM_Serving_Survey.md)
+
 ## Model Optimization
 ### FastFormers: Highly Efficient Transformer Models for Natural Language Understanding[arxiv'20][MSRA]
 - An MSRA paper, but it only applies model pruning, distillation, and quantization; it is essentially a pile of experimental results