pick 6ae737e5 init npu_support
pick e52cae6f not compile _core_ext
pick 10da6698 support custom_op by native
pick 136be9f3 pad input tokens/positions
pick e26bc8c7 Some fixes for multi-prompt inference acc
pick 47e1d7c7 refactor
pick 89e298e0 refactor attention and slot indices
pick d6dd6208 support api server
pick 734b1a9b add ascend in platform
pick 098447b4 Add dockerfile and tiny fix
pick 6c42bc73 small fixes
pick 73c59d41 del mindie
pick 2d346939 extract torch.npu
pick 0ca6849d add mp distributed executor
pick a8a35d4b refactor MultiprocessingNPUExecutor
pick d5acf25b add PlatformMemoryProfiler
pick 0296cc84 simplify
pick 670e2177 fixes
pick b6f50b7c fix copy blocks
pick c163b208 some fixes
pick f494b81f add quay source
pick 08287ef6 [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (#8272)
pick 58fcc854 [Frontend] Add progress reporting to run_batch.py (#8060)
pick f9b4a2d4 [Bugfix] Correct adapter usage for cohere and jamba (#8292)
pick c7cb5c33 [Misc] GPTQ Activation Ordering (#8135)
pick 6cd5e5b0 [Misc] Fused MoE Marlin support for GPTQ (#8217)
pick a1d87422 Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (#8319)
pick da1a844e [Bugfix] Fix missing `post_layernorm` in CLIP (#8155)
pick 6234385f [CI/Build] enable ccache/scccache for HIP builds (#8327)
pick 8c054b7a [Frontend] Clean up type annotations for mistral tokenizer (#8314)
pick f421f3ce [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail (#8130)
pick 02751a7a Fix ppc64le buildkite job (#8309)
pick 5faedf1b [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224)
pick 04e7c4e7 [Misc] remove peft as dependency for prompt models (#8162)
pick b1f3e189 [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (#8342)
pick 22f3a4bc [Bugfix] lookahead block table with cuda graph max capture (#8340)
pick 1d5e397a [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (#8172)
pick 94144e72 [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043)
pick e497b8ae [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (#8329)
pick 1230263e [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (#8299)
pick efcf946a [Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112)
pick 6a512a00 [model] Support for Llava-Next-Video model (#7559)
pick cea95dfb [Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347)
pick 3b7fea77 [Model][VLM] Add Qwen2-VL model support (#7905)
pick 0b952af4 [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257)
pick aea02f30 [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation (#8373)
pick 7015417f [Bugfix] Add missing attributes in mistral tokenizer (#8364)
pick 73202dbe [Kernel][Misc] register ops to prevent graph breaks (#6917)
pick 8baa4549 [Misc] Move device options to a single place (#8322)
pick 775f00f8 [Speculative Decoding] Test refactor (#8317)
pick d394787e Pixtral (#8377)
pick 3fd2b0d2 Bump version to v0.6.1 (#8379)
pick a65cb160 [MISC] Dump model runner inputs when crashing (#8305)
pick f842a7af [misc] remove engine_use_ray (#8126)
pick b71c956d [TPU] Use Ray for default distributed backend (#8389)
pick b6c75e1c Fix the AMD weight loading tests (#8390)
pick 5a60699c [Bugfix]: Fix the logic for deciding if tool parsing is used (#8366)
pick 1bf2dd9d [Gemma2] add bitsandbytes support for Gemma2 (#8338)
pick 295c4730 [Misc] Raise error when using encoder/decoder model with cpu backend (#8355)
pick 42ffba11 [Misc] Use RoPE cache for MRoPE (#8396)
pick 7de49aa8 [torch.compile] hide slicing under custom op for inductor (#8384)
pick 520ca380 [Hotfix][VLM] Fixing max position embeddings for Pixtral (#8399)
pick e56bf277 [Bugfix] Fix InternVL2 inference with various num_patches (#8375)
pick c6202dae [Model] Support multiple images for qwen-vl (#8247)
pick 8a23e933 [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance (#8403)
pick 1f0c75af [BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (#8423)
pick f2e263b8 [Bugfix] Offline mode fix (#8376)
pick a6c0f365 [multi-step] add flashinfer backend (#7928)
pick 551ce010 [Core] Add engine option to return only deltas or final output (#7381)
pick 01987725 [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427)
pick c1636945 [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (#8425)
pick b61bd98f [CI/Build] Disable multi-node test for InternVL2 (#8428)
pick d31174a4 [Hotfix][Pixtral] Fix multiple images bugs (#8415)
pick a480939e [Bugfix] Fix weight loading issue by rename variable. (#8293)
pick 360ddbd3 [Misc] Update Pixtral example (#8431)
pick 8f44a92d [BugFix] fix group_topk (#8430)
pick 5ec9c0fb [Core] Factor out input preprocessing to a separate class (#7329)
pick 40c39653 [Bugfix] Mapping physical device indices for e2e test utils (#8290)
pick 3f79bc3d [Bugfix] Bump fastapi and pydantic version (#8435)
pick 84275504 [CI/Build] Update pixtral tests to use JSON (#8436)
pick 68210201 [Bugfix] Fix async log stats (#8417)
pick ba775279 [bugfix] torch profiler bug for single gpu with GPUExecutor (#8354)
pick acda0b35 bump version to v0.6.1.post1 (#8440)
pick 9b4a3b23 [CI/Build] Enable InternVL2 PP test only on single node (#8437)
pick cab69a15 [doc] recommend pip instead of conda (#8446)
pick 06311e29 [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442)
pick a2469127 [misc][ci] fix quant test (#8449)
pick ecd7a1d5 [Installation] Gate FastAPI version for Python 3.8 (#8456)
pick 0a4806f0 [plugin][torch.compile] allow to add custom compile backend (#8445)
pick a84e598e [CI/Build] Reorganize models tests (#7820)
pick f57092c0 [Doc] Add oneDNN installation to CPU backend documentation (#8467)
pick 18e9e1f7 [HotFix] Fix final output truncation with stop string + streaming (#8468)
pick 9ba0817f bump version to v0.6.1.post2 (#8473)
pick 85172520 [Hardware][intel GPU] bump up ipex version to 2.3 (#8365)
pick 1ef0d2ef [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)
pick 8a0cf1dd [Model] support minicpm3 (#8297)
pick a36e070d [torch.compile] fix functionalization (#8480)
pick 47790f3e [torch.compile] add a flag to disable custom op (#8488)
pick 50e9ec41 [TPU] Implement multi-step scheduling (#8489)
pick 3724d5f6 [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (#8490)
pick fc990f97 [Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (#8357)
pick a091e2da [Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
pick 837c1968 [Frontend] Expose revision arg in OpenAI server (#8501)
pick acd5511b [BugFix] Fix clean shutdown issues (#8492)
pick 781e3b9a [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506)
pick 5d73ae49 [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270)
pick 2759a43a [doc] update doc on testing and debugging (#8514)
pick 47f5e03b [Bugfix] Bind api server port before starting engine (#8491)
pick 5478c4b4 [perf bench] set timeout to debug hanging (#8516)
pick 5ce45eb5 [misc] small qol fixes for release process (#8517)
pick cca61642 [Bugfix] Fix 3.12 builds on main (#8510)
pick 546034b4 [refactor] remove triton based sampler (#8524)
pick 1c1bb388 [Frontend] Improve Nullable kv Arg Parsing (#8525)
pick ee2bceaa [Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521)
pick 99aa4edd [torch.compile] register allreduce operations as custom ops (#8526)
pick cbdb2522 [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (#8509)
pick 1b6de835 [Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495)
pick 1009e93c [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631)
pick 9855b995 [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434)
pick a54ed802 [Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)
pick 56c3de01 [Misc] Don't dump contents of kvcache tensors on errors (#8527)
pick 98f97133 [Bugfix] Fix TP > 1 for new granite (#8544)
pick fa0c114f [doc] improve installation doc (#8550)
pick 09deb472 [CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520)
pick 8110e445 [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012)
pick 95965d31 [CI/Build] fix Dockerfile.cpu on podman (#8540)
pick e3515729 [Misc] Add argument to disable FastAPI docs (#8554)
pick 6ffa3f31 [CI/Build] Avoid CUDA initialization (#8534)
pick 9d104b5b [CI/Build] Update Ruff version (#8469)
pick 7c7714d8 [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (#8157)
pick a8c1d161 [Core] *Prompt* logprobs support in Multi-step (#8199)
pick d65798f7 [Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543)
pick e18749ff [Model] Support Solar Model (#8386)
pick b3195bc9 [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380)
pick db9120cd [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039)
pick d9cd78eb [BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572)
pick 0d47bf3b [Bugfix] add `dead_error` property to engine client (#8574)
pick 4c34ce89 [Kernel] Remove marlin moe templating on thread_m_blocks (#8573)
pick 3118f633 [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545)
pick 02c9afa2 Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593)
pick c52ec5f0 [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616)
pick 855c8ae2 [MISC] remove engine_use_ray in benchmark_throughput.py (#8615)
pick 76515f30 [Frontend] Use MQLLMEngine for embeddings models too (#8584)
pick 9cc373f3 [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577)
pick e42c634a [Core] simplify logits resort in _apply_top_k_top_p (#8619)
pick ea4647b7 [Doc] Add documentation for GGUF quantization (#8618)
pick 9e99407e Create SECURITY.md (#8642)
pick 6cb748e1 [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (#8551)
pick de6f90a1 [Misc] guard against change in cuda library name (#8609)
pick 18ae428a [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (#8571)
pick 9e5ec35b [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (#8474)
pick 260d40b5 [Core] Support Lora lineage and base model metadata management (#6315)
pick 3b63de93 [Model] Add OLMoE (#7922)
pick 2940afa0 [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (#8670)
pick b28298f2 [Bugfix] Validate SamplingParam n is an int (#8548)
pick 035fa895 [Misc] Show AMD GPU topology in `collect_env.py` (#8649)
pick 2874bac6 [Bugfix] Config got an unexpected keyword argument 'engine' (#8556)
pick b4e4eda9 [Bugfix][Core] Fix tekken edge case for mistral tokenizer (#8640)
pick 7c8566aa [Doc] neuron documentation update (#8671)
pick 7f9c8902 [Hardware][AWS] update neuron to 2.20 (#8676)
pick 0f961b3c [Bugfix] Fix incorrect llava next feature size calculation (#8496)
pick 0057894e [Core] Rename `PromptInputs` and `inputs`(#8673)
pick d4bf085a [MISC] add support custom_op check (#8557)
pick 0455c46e [Core] Factor out common code in `SequenceData` and `Sequence` (#8675)
pick 0faab90e [beam search] add output for manually checking the correctness (#8684)
pick 71c60491 [Kernel] Build flash-attn from source (#8245)
pick 5e85f4f8 [VLM] Use `SequenceData.from_token_counts` to create dummy data (#8687)
pick 4dfdf431 [Doc] Fix typo in AMD installation guide (#8689)
pick ec4aaad8 [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (#8646)
pick 9dc7c6c7 [dbrx] refactor dbrx experts to extend FusedMoe class (#8518)
pick d66ac628 [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (#8643)
pick 13d88d41 [Bugfix] Refactor composite weight loading logic (#8656)
pick 0e40ac9b [ci][build] fix vllm-flash-attn (#8699)
pick 06ed2815 [Model] Refactor BLIP/BLIP-2 to support composite model loading (#8407)
pick 8ca5051b [Misc] Use NamedTuple in Multi-image example (#8705)
pick ca2b628b [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (#8703)
pick 5b595327 [Model][VLM] Add LLaVA-Onevision model support (#8486)
pick c6bd70d7 [SpecDec][Misc] Cleanup, remove bonus token logic. (#8701)
pick d4a2ac83 [build] enable existing pytorch (for GH200, aarch64, nightly) (#8713)
pick 92ba7e74 [misc] upgrade mistral-common (#8715)
pick 3dda7c22 [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (#8702)
pick 57a0702e [Bugfix] Fix CPU CMake build (#8723)
pick d23679eb [Bugfix] fix docker build for xpu (#8652)
pick 9b8c8ba1 [Core][Frontend] Support Passing Multimodal Processor Kwargs (#8657)
pick e551ca15 [Hardware][CPU] Refactor CPU model runner (#8729)
pick 3e83c12b [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (#8733)
pick a79e5229 [Model] Support pp for qwen2-vl (#8696)
pick f2bd246c [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (#8707)
pick ee5f34b1 [CI/Build] use setuptools-scm to set __version__ (#4738)
pick 86e9c8df [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
pick 9b0e3ec9 [Kernel][LoRA] Add assertion for punica sgmv kernels (#7585)
pick b05f5c92 [Core] Allow IPv6 in VLLM_HOST_IP with zmq (#8575)
pick 5f7bb584 Fix typical acceptance sampler with correct recovered token ids (#8562)
pick 1a2aef3e Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (#8335)
pick 530821d0 [Hardware][AMD] ROCm6.2 upgrade (#8674)
pick 88577ac9 Fix tests in test_scheduler.py that fail with BlockManager V2 (#8728)
pick 0250dd68 re-implement beam search on top of vllm core (#8726)
pick 3185fb0c Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (#8750)
pick b8747e8a [MISC] Skip dumping inputs when unpicklable (#8744)
pick 3f06bae9 [Core][Model] Support loading weights by ID within models (#7931)
pick 8ff7ced9 [Model] Expose Phi3v num_crops as a mm_processor_kwarg (#8658)
pick cc4325b6 [Bugfix] Fix potentially unsafe custom allreduce synchronization (#8558)
pick a928ded9 [Kernel] Split Marlin MoE kernels into multiple files (#8661)
pick 2529d09b [Frontend] Batch inference for llm.chat() API (#8648)
pick 72fc97a0 [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (#8748)
pick 2467b642 [CI/Build] fix setuptools-scm usage (#8771)
pick 1e7d5c01 [misc] soft drop beam search (#8763)
pick 13f9f7a3 [[Misc]Upgrade bitsandbytes to the latest version 0.44.0 (#8768)
pick 01b6f9e1 [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)
pick 6da1ab6b [Core] Adding Priority Scheduling (#5958)
pick 6e0c9d6b [Bugfix] Use heartbeats instead of health checks (#8583)
pick ee777d9c Fix test_schedule_swapped_simple in test_scheduler.py (#8780)
pick b4522474 [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (#8776)
pick fc3afc20 Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (#8752)
pick e3dd0692 [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (#8250)
pick c2395367 [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (#8770)
pick 3e073e66 [Bugfix] load fc bias from config for eagle (#8790)
pick 1ac3de09 [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer (#8672)
pick 3368c3ab [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (#8767)
pick 8fae5ed7 [Misc] Fix minor typo in scheduler (#8765)
pick 1c046447 [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade (#8777)
pick 300da091 [Kernel] Fullgraph and opcheck tests (#8479)
pick c6f2485c [[Misc]] Add extra deps for openai server image (#8792)
pick 0c4d2ad5 [VLM][Bugfix] internvl with num_scheduler_steps > 1 (#8614)
pick 28e1299e rename PromptInputs and inputs with backward compatibility (#8760)
pick 64840dfa [Frontend] MQLLMEngine supports profiling. (#8761)
pick 873edda6 [Misc] Support FP8 MoE for compressed-tensors (#8588)
pick 4f1ba084 Revert "rename PromptInputs and inputs with backward compatibility (#8760) (#8810)
pick 770ec602 [Model] Add support for the multi-modal Llama 3.2 model (#8811)
pick e2c6e0a8 [Doc] Update doc for Transformers 4.45 (#8817)
pick 7193774b [Misc] Support quantization of MllamaForCausalLM (#8822)
pick 4bb98f21 [Misc] Update config loading for Qwen2-VL and remove Granite (#8837)
pick f70bccac [Build/CI] Upgrade to gcc 10 in the base build Docker image (#8814)
pick 520db4db [Docs] Add README to the build docker image (#8825)
pick 68988d4e [CI/Build] Fix missing ci dependencies (#8834)
pick 70de39f6 [misc][installation] build from source without compilation (#8818)
pick d9cfbc89 [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (#8872)
pick 93d364da [Bugfix] Include encoder prompts len to non-stream api usage response (#8861)
pick b28d2104 [Misc] Change dummy profiling and BOS fallback warns to log once (#8820)
pick e2f6f26e [Bugfix] Fix print_warning_once's line info (#8867)
pick ee2da3e9 fix validation: Only set tool_choice `auto` if at least one tool is provided (#8568)
pick 71d21c73 [Bugfix] Fixup advance_step.cu warning (#8815)
pick 4b377d6f [BugFix] Fix test breakages from transformers 4.45 upgrade (#8829)
pick 1b49148e [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (#8764)
pick 344cd2b6 [Feature] Add support for Llama 3.1 and 3.2 tool use (#8343)
pick 3b00b9c2 [Core] rename`PromptInputs` and `inputs` (#8876)
pick dc4e3df5 [misc] fix collect env (#8894)
pick 0e088750 [MISC] Fix invalid escape sequence '\' (#8830)
pick 6d792d2f [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (#8892)
pick 8df2dc3c [TPU] Update pallas.py to support trillium (#8871)
pick a9b15c60 [torch.compile] use empty tensor instead of None for profiling (#8875)
pick 172d1cd2 [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271)
pick c5d55356 [Bugfix] fix for deepseek w4a16 (#8906)
pick c2ec430a [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378)
pick 18e60d7d [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (#8911)
pick bd429f2b [Core] Priority-based scheduling in async engine (#8850)
pick d86f6b2a [misc] fix wheel name (#8919)
pick 260024a3 [Bugfix][Intel] Fix XPU Dockerfile Build (#7824)
pick b0298aa8 [Misc] Remove vLLM patch of `BaichuanTokenizer` (#8921)
pick 39d3f8d9 [Bugfix] Fix code for downloading models from modelscope (#8443)
pick 19d02ff9 [Bugfix] Fix PP for Multi-Step (#8887)
pick e1a3f5e8 [CI/Build] Update models tests & examples (#8874)
pick 090e945e [Frontend] Make beam search emulator temperature modifiable (#8928)
pick e585b583 [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891)
pick cc276443 [doc] organize installation doc and expose per-commit docker (#8931)
pick d1537039 [Core] Improve choice of Python multiprocessing method (#8823)
pick 5bf8789b [Bugfix] Block manager v2 with preemption and lookahead slots (#8824)
pick d081da00 [Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
pick 26a68d5d [CI/Build] Add test decorator for minimum GPU memory (#8925)
pick 2e7fe7e7 [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (#8930)
pick bc2ef1f7 [Model] Support Qwen2.5-Math-RM-72B (#8896)
pick 3d49776b [Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199)
pick 31f46a0d [BugFix] Fix seeded random sampling with encoder-decoder models (#8870)
pick 1fb9c1b0 [Misc] Fix typo in BlockSpaceManagerV1 (#8944)
pick 6c9ba48f [Frontend] Added support for HF's new `continue_final_message` parameter (#8942)
pick f13a07b1 [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533)
pick a3cb9721 some fixes
pick 23c45ae0 fix type check
pick 1df3978e fixes
pick 26429a5c revert changes on test
pick 573b9097 fix acc
# Rebase 4ef41b84..573b9097 onto 4ef41b84 (281 commands)
#
# Commands:
# p, pick <commit> = use commit
# r, reword <commit> = use commit, but edit the commit message
# e, edit <commit> = use commit, but stop for amending
# s, squash <commit> = use commit, but meld into previous commit
# f, fixup [-C | -c] <commit> = like "squash" but keep only the previous
# commit's log message, unless -C is used, in which case
# keep only this commit's message; -c is same as -C but
# opens the editor
# x, exec <command> = run command (the rest of the line) using shell
# b, break = stop here (continue rebase later with 'git rebase --continue')
# d, drop <commit> = remove commit
# l, label <label> = label current HEAD with a name
# t, reset <label> = reset HEAD to a label
# m, merge [-C <commit> | -c <commit>] <label> [# <oneline>]
# . create a merge commit using the original merge commit's
# . message (or the oneline, if no original merge commit was
# . specified); use -c <commit> to reword the commit message
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
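# Example (an illustrative sketch, not part of the generated todo list): to
# fold the trailing cleanup commits into the commit that precedes them, the
# corresponding lines above could be changed from "pick" to "fixup", e.g.:
#
#   pick  a3cb9721 some fixes
#   fixup 23c45ae0 fix type check
#   fixup 1df3978e fixes
#
# Lines left as "pick" are applied unchanged; "fixup" melds a commit into the
# previous one and keeps only the previous commit's message (see the command
# list above).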