forked from ggerganov/llama.cpp
merge from upstream #22
Merged
Conversation
…ates (ggerganov#7565)
CUDA graphs require parameter updates to kernels associated with GGML_OP_CPY nodes. Previously the implementation only checked for a single CUDA kernel in such nodes, which caused a bug in cases where two such kernels exist. This fixes the issue by using a vector, so that multiple function pointers can be stored and checked against. Fixes ggerganov#7942
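A minimal sketch of the idea behind this fix, not the actual llama.cpp code: record every CPY kernel pointer seen while capturing the graph in a vector, and match against all of them when patching kernel parameters during graph updates. The struct and helper names here are illustrative assumptions.

```cpp
#include <vector>

// All CPY kernel function pointers encountered while capturing the graph.
struct cuda_graph_info {
    std::vector<void *> cpy_fn_ptrs;
};

// Record a kernel pointer if it has not been seen yet.
static void record_cpy_kernel(cuda_graph_info & g, void * fn) {
    for (void * p : g.cpy_fn_ptrs) {
        if (p == fn) {
            return;
        }
    }
    g.cpy_fn_ptrs.push_back(fn);
}

// During a graph update, a node's kernel must match *any* recorded CPY
// kernel, not just a single remembered one.
static bool is_cpy_kernel(const cuda_graph_info & g, void * fn) {
    for (void * p : g.cpy_fn_ptrs) {
        if (p == fn) {
            return true;
        }
    }
    return false;
}
```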
* update HIP_UMA ggerganov#7399: add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled; gives about 2x on prompt eval and 1.5x on token gen with ROCm 6.0 on a Ryzen 7940HX iGPU (780M/gfx1103)
* simplify code, more consistent style

Co-authored-by: slaren <[email protected]>
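A minimal sketch of what enabling coarse-grained UMA looks like with the HIP runtime, assuming a ROCm build; the helper name is hypothetical, only the two runtime calls are real HIP API:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Allocate unified (managed) memory and mark the range coarse-grained, which
// relaxes coherence and is what speeds up iGPU access under LLAMA_HIP_UMA.
static void * alloc_uma_coarse_grain(size_t size, int device) {
    void * ptr = nullptr;
    if (hipMallocManaged(&ptr, size) != hipSuccess) {
        return nullptr;
    }
    if (hipMemAdvise(ptr, size, hipMemAdviseSetCoarseGrain, device) != hipSuccess) {
        (void) hipFree(ptr);
        return nullptr;
    }
    return ptr;
}
```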
overriden -> overridden
* markdownish codeblock fix
* updating regexes
* ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci (see the sketch after this list)
* tests : add dim != 2 tests
* metal : generalize concat kernel
* tests : naming
* cuda : generalize concat kernel ggml-ci
* sycl : add warning and assert
* ggml : fix op params handling
* metal : bugfix kernel ggml-ci
* ggml : reimplement CPU and Metal
* cuda : add asserts ggml-ci
* ggml : fix ptrs ggml-ci
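To illustrate what "generalize" means here, a minimal CPU-style sketch of concatenating two contiguous 4-D tensors along an arbitrary dimension rather than only dim 2; the function names and the flat row-major layout are assumptions for illustration, not ggml's actual strided-tensor code:

```cpp
#include <cstdint>

// Flat offset for a contiguous tensor with extents ne[0..3], i0 fastest.
static int64_t flat_index(const int64_t ne[4], const int64_t i[4]) {
    return ((i[3] * ne[2] + i[2]) * ne[1] + i[1]) * ne[0] + i[0];
}

// Concatenate contiguous 4-D tensors a and b along dimension `dim` (0..3).
// All extents of a and b must match except along `dim`.
static void concat_4d(const float * a, const int64_t nea[4],
                      const float * b, const int64_t neb[4],
                      float * dst, int dim) {
    int64_t ned[4] = { nea[0], nea[1], nea[2], nea[3] };
    ned[dim] += neb[dim];

    for (int64_t i3 = 0; i3 < ned[3]; ++i3) {
        for (int64_t i2 = 0; i2 < ned[2]; ++i2) {
            for (int64_t i1 = 0; i1 < ned[1]; ++i1) {
                for (int64_t i0 = 0; i0 < ned[0]; ++i0) {
                    int64_t i[4] = { i0, i1, i2, i3 };
                    float v;
                    if (i[dim] < nea[dim]) {
                        v = a[flat_index(nea, i)];
                    } else {
                        i[dim] -= nea[dim]; // shift into b's coordinates
                        v = b[flat_index(neb, i)];
                    }
                    const int64_t j[4] = { i0, i1, i2, i3 };
                    dst[flat_index(ned, j)] = v;
                }
            }
        }
    }
}
```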
…v#7436)
* fix mul_mat_id to match the API change
* rm comment
* rm unused or duplicated code, rename as review comment
* github: add refactor issue template [no ci]
* Update 07-refactor.yml
* common : increase max number of experts to 160
* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by the DeepSeek-V2 MLA (multi-head latent attention) architecture
* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier
* convert-hf : add model conversion support for DeepseekV2ForCausalLM
* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models
* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor); see the sketch below
* llama : add inference support for LLM_ARCH_DEEPSEEK2

Co-authored-by: Stanisław Szymczyk <[email protected]>
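A minimal sketch of how the new scale_w/w_scale arguments act on routing weights: softmax over the selected experts' logits, then an optional constant scaling. This illustrates the described behaviour; it is not the actual llm_build_moe_ffn() graph code, and the function name is hypothetical:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax over the selected experts' logits, then optionally scale the
// routing weights by a constant factor (DeepSeek-V2 style).
static std::vector<float> moe_routing_weights(const std::vector<float> & logits,
                                              bool scale_w, float w_scale) {
    std::vector<float> w(logits.size());
    const float max_l = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        w[i] = std::exp(logits[i] - max_l); // subtract max for numerical stability
        sum += w[i];
    }
    for (float & x : w) {
        x /= sum;
    }
    if (scale_w) {
        for (float & x : w) {
            x *= w_scale; // expert weight scaling controlled by the new arguments
        }
    }
    return w;
}
```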
* rpc : resource management rework
* address review comments
* Add optional MLP bias for Granite models
Add optional MLP bias for ARCH_LLAMA to support Granite models. Partially addresses ggerganov/issues/7116. Still needs some more changes to properly support Granite.
* llama: honor add_space_prefix from the model configuration
Propagate the add_space_prefix configuration from the HF model configuration to the gguf file and honor it with the gpt2 tokenizer.
Signed-off-by: Giuseppe Scrivano <[email protected]>
* llama: add support for small granite models
It works only for the small models, 3b and 8b. The convert-hf-to-gguf.py script uses the vocabulary size of the granite models to detect granite and set the correct configuration.
Signed-off-by: Giuseppe Scrivano <[email protected]>

Co-authored-by: Steffen Roecker <[email protected]>
* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing:
  - Fix unicode edge case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no match is found.
* align GEMM dispatch
* ggml : use atomic_flag for critical section
* add windows shims
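A minimal sketch of a critical section built on an atomic flag, here using std::atomic_flag (the C++ analogue of the C11 atomic_flag the commit refers to); the function names are illustrative:

```cpp
#include <atomic>

// Global lock guarding the critical section.
static std::atomic_flag g_lock = ATOMIC_FLAG_INIT;

static void critical_section_start(void) {
    // Spin until the previous holder clears the flag.
    while (g_lock.test_and_set(std::memory_order_acquire)) {
    }
}

static void critical_section_end(void) {
    g_lock.clear(std::memory_order_release);
}
```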
* tests : add non-cont concat tests
* cuda : non-cont concat support ggml-ci
This enforces a check that -fno-finite-math-only was set and that the compiler is not operating in finite-math mode. During the rewrite of silu and softmax for CPU in ggerganov#7154, an issue emerged where the result observed with >1 slot was nondeterministic, as found by @JohannesGaessler. @LostRuins narrowed the problem down to -ffinite-math-only, which was theorised to cause SiLU to return NaN or some other garbage instead of flushing small values to 0. @jart proposed a fix that @ggerganov then implemented in this fix; ref ggerganov#7154 (comment)
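A minimal sketch of the kind of compile-time guard this describes, assuming GCC or Clang, which predefine __FINITE_MATH_ONLY__ as 1 under -ffinite-math-only:

```cpp
// Reject finite-math-only builds before SiLU/softmax can produce garbage.
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__ == 1
#error "finite-math-only is enabled: rebuild with -fno-finite-math-only"
#endif
```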
* Add per-token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
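A minimal sketch of a per-token attribute bitmask in the spirit of this change; the enum and helper names here are illustrative, not the actual llama.h declarations:

```cpp
#include <cstdint>

// Bitmask so a single token can carry several attributes at once.
enum token_attr : uint32_t {
    TOKEN_ATTR_NONE   = 0,
    TOKEN_ATTR_LSTRIP = 1u << 0, // consume whitespace to the left of the token
    TOKEN_ATTR_RSTRIP = 1u << 1, // consume whitespace to the right of the token
};

static bool has_attr(uint32_t attrs, token_attr a) {
    return (attrs & a) != 0;
}
```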
This adds tags and the Android NDK to the git ignore list.
* Improve hipBLAS support in CMake
This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.
* Set ROCM_PATH correctly
…erganov#7722)
compare-commits.sh : hide stdout, use -oe to print markdown
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters ggml-ci
* common : add missing header ggml-ci
* common : remove --random-prompt usages ggml-ci
* examples : migrate to gpt_params ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params ggml-ci
* common : passkey use gpt_params
Previously the code failed to cope when the number of nodes changed in an existing CUDA graph. This fixes the issue by removing an unnecessary conditional.
-ins and --instruct were moved in ggerganov#7675. I have adjusted the README accordingly. There was no trace of --chatml in the README.
github-actions bot added labels: documentation, SYCL, Nvidia GPU, Vulkan, testing, build, examples, devops, python, server, ggml, Kompute, Apple Metal, script, nix (Jun 5, 2024)
No description provided.