merge upstream #27

Merged
merged 47 commits into layla-build from merge on Jul 14, 2024
Conversation

l3utterfly
Owner

No description provided.

kevmo314 and others added 30 commits July 8, 2024 10:26
Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this change improving this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves sampling performance, with a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
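A minimal sketch of the preallocation pattern, with illustrative struct and variable names rather than the actual llama.cpp identifiers:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical per-token record, standing in for the sampler's candidate data.
struct token_data {
    int   id;
    float logit;
};

int main() {
    const int n_vocab = 32000;                 // illustrative vocab size
    std::vector<float> logits(n_vocab, 0.0f);  // stand-in for the model output

    // Slower: grow the vector one element at a time with emplace_back.
    std::vector<token_data> cur_slow;
    for (int i = 0; i < n_vocab; i++) {
        cur_slow.emplace_back(token_data{i, logits[i]});
    }

    // Faster: size the vector once, then write each slot directly.
    std::vector<token_data> cur_fast(n_vocab);
    for (int i = 0; i < n_vocab; i++) {
        cur_fast[i] = token_data{i, logits[i]};
    }

    std::printf("%zu %zu\n", cur_slow.size(), cur_fast.size());
    return 0;
}
```

Sizing the vector once avoids the repeated capacity growth and reallocations that `emplace_back` can trigger in a tight per-token loop.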
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed hardcoded CUDA usage

* restored test-conv-transpose.c

* removed unused arguments, and fixed a bug where a test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <[email protected]>
ggml-ci
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend

* Reduced verbosity of comment
…rganov#8283)

* Adding a simple program that provides a deprecation warning, to help people notice the binary name change from ggerganov#7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'
…gerganov#8402)

* Load server sampling parameters from the server context by default.

* Wordsmithing comment
* update internlm2

* remove unused file

* fix lint
* Upd gguf-py/readme

* Bump patch version for release
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for building the Q4_0_4_4 quant type
ggerganov#8404)

* Modify the deprecation-warning 'main' binary to build every time, instead of only when a legacy binary is present. This is to help users of tutorials and other instruction sets know what to do when the 'main' binary is missing and they are trying to follow instructions.

* Adjusting 'server' name-deprecation binary to build all the time, similar to the 'main' legacy name binary.
This should make it easier to explain how parse_special affects tokenization.
danbev and others added 15 commits July 11, 2024 17:53
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <[email protected]>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <[email protected]>

---------

Signed-off-by: Daniel Bevenius <[email protected]>
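A hedged illustration of the two approaches described in the commit above; `my_trap` and `no_device_code_sketch` are made-up names, and this is plain host C++ rather than the actual `common.cuh` device code:

```cpp
#include <cstdio>
#include <cstdlib>

// Stand-in for the device-side trap; in CUDA code this role is played by __trap().
[[noreturn]] static void my_trap() {
    std::abort();
}

[[noreturn]] static void no_device_code_sketch(const char * msg) {
    std::fprintf(stderr, "%s\n", msg);
    // A function declared 'noreturn' must not fall off the end. Either an
    // endless loop (the first approach in the commit):
    //     while (true) { }
    // or a call that itself never returns (the final approach) satisfies the
    // compiler and silences -Winvalid-noreturn.
    my_trap();
}

int main() {
    // no_device_code_sketch("unsupported device"); // would abort the process
    return 0;
}
```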
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <[email protected]>
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <[email protected]>

---------

Signed-off-by: Chen Xi <[email protected]>
Co-authored-by: Meng, Hengyu <[email protected]>
Co-authored-by: Chen Xi <[email protected]>
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
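A small sketch of the safer pattern; the buffer name and format string are made up for illustration:

```cpp
#include <cstdio>

int main() {
    char path[256];
    const int n_chunk = 3;

    // Unsafe: sprintf can overflow 'path' if the formatted string is too long.
    // sprintf(path, "model-%d.gguf", n_chunk);

    // Safer: snprintf bounds the write, and sizeof(path) tracks the buffer
    // size automatically instead of repeating a hardcoded constant.
    std::snprintf(path, sizeof(path), "model-%d.gguf", n_chunk);

    std::printf("%s\n", path);
    return 0;
}
```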
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <[email protected]>
…v#8441)

Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py, breaking how convert is called from within a Docker
container.
…anov#8420)

* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this is that currently the following warning is
generated on Windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
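A minimal sketch of the fix described above; the variable names are illustrative:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t token_id = 5;   // illustrative unsigned value

    // Triggers MSVC C4146: negating an unsigned value keeps it unsigned.
    // int32_t bad = -token_id;

    // The fix from the commit message: cast to int32_t first, then negate.
    int32_t good = -static_cast<int32_t>(token_id);

    std::printf("%d\n", good);
    return 0;
}
```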
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <[email protected]>

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
)

* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from ggerganov#8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with ggerganov#8379

* test-tokenizer-random : add a failing edge case for falcon
l3utterfly merged commit 899c8d9 into layla-build on Jul 14, 2024
12 of 15 checks passed
l3utterfly deleted the merge branch on July 14, 2024 at 05:03