
merge latest changes from upstream #6

Merged: 48 commits merged into layla-build on Mar 19, 2024
Conversation

l3utterfly (Owner)

No description provided.

ggerganov and others added 30 commits March 14, 2024 10:12
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embedded library build

ggml-ci

* ci : fix iOS builds to use embedded library
* Refactor dtype handling to be extensible

This code is equivalent to the previous version, but it is now prepared to
easily add more NumPy dtypes.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader
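The commits above extend the Python gguf package (GGUFWriter/GGUFReader); on the C side the same tensor types exist as GGML_TYPE_I8/I16/I32. A minimal sketch of writing an I32 tensor through the C gguf API (file and tensor names are made up; depending on the ggml version the gguf declarations live in ggml.h or a separate gguf.h):

```cpp
#include "ggml.h" // gguf_* declarations in older trees; newer trees use gguf.h

#include <cstring>
#include <vector>

int main() {
    std::vector<int32_t> ids = {1, 2, 3, 4};

    // small ggml context to hold the tensor metadata + data
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, ids.size());
    ggml_set_name(t, "token_ids");
    memcpy(t->data, ids.data(), ids.size()*sizeof(int32_t));

    // write the tensor out as a GGUF file
    struct gguf_context * gctx = gguf_init_empty();
    gguf_add_tensor(gctx, t);
    gguf_write_to_file(gctx, "ids.gguf", /*only_meta =*/ false);

    gguf_free(gctx);
    ggml_free(ctx);
    return 0;
}
```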
… (ggerganov#6037)

* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* additional methods to read model and ctx parameters

* vocab size as a part of a model metadata

* models without vocabulary, convert.py part

* models without vocabulary, llama.cpp part

* PR clean up

* converter script fixes

* llama_vocab_type update (renamed the new key)

* pr review fixes

* revert function renaming

* one more NoVocab assert
There are several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also, on Linux, llama-bench was missing an
fclose.
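A sketch of the pattern being fixed, assuming the standard gguf C API: once gguf_init_from_file succeeds, every early return must release the context (the helper below is hypothetical):

```cpp
#include "ggml.h"

// hypothetical loader showing the leak being fixed: the gguf context must
// be freed on the error paths, not only on the success path
static bool load_metadata(const char * fname) {
    struct gguf_init_params ip = {
        /*.no_alloc =*/ true,    // metadata only, do not load tensor data
        /*.ctx      =*/ nullptr,
    };
    struct gguf_context * ctx = gguf_init_from_file(fname, ip);
    if (!ctx) {
        return false;            // init failed: nothing to free
    }

    if (gguf_get_n_tensors(ctx) == 0) {
        gguf_free(ctx);          // the fix: free on this error path too
        return false;
    }

    // ... use the metadata ...

    gguf_free(ctx);
    return true;
}
```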
* gguf : add support for I64 and F64 arrays

GGML currently does not support I64 or F64 arrays, and they are not often
used in machine learning. However, adding them now, in case the need arises
in the future, keeps the types next to the other types I8, I16, I32 in the
enums, and it also reserves their type numbers.

Furthermore, with this addition the GGUF format becomes very usable for
most computational applications of NumPy (being compatible with the most
common NumPy dtypes: i8, i16, i32, i64, f32, f64), providing a faster
and more versatile alternative to the `npz` format, and a simpler
alternative to the `hdf5` format.

The change in this PR seems small, not significantly increasing the
maintenance burden. I tested this from Python using GGUFWriter/Reader
and `gguf-dump`, as well as from C, everything seems to work.

* Fix compiler warnings
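This commit targets the Python side; reading such a file back from C is type-agnostic and also covers the new I64/F64 tensors. A minimal sketch, reusing the hypothetical ids.gguf from the earlier example and printing roughly what `gguf-dump` shows:

```cpp
#include "ggml.h"

#include <cstdio>

int main() {
    struct ggml_context * data_ctx = nullptr;
    struct gguf_init_params ip = {
        /*.no_alloc =*/ false,   // also load tensor data into data_ctx
        /*.ctx      =*/ &data_ctx,
    };
    struct gguf_context * gctx = gguf_init_from_file("ids.gguf", ip);
    if (!gctx) {
        return 1;
    }

    // enumerate tensors by name, type, and element count
    for (int i = 0; i < gguf_get_n_tensors(gctx); ++i) {
        const char * name = gguf_get_tensor_name(gctx, i);
        const struct ggml_tensor * t = ggml_get_tensor(data_ctx, name);
        printf("%s: type=%s n=%lld\n", name, ggml_type_name(t->type),
               (long long) ggml_nelements(t));
    }

    gguf_free(gctx);
    ggml_free(data_ctx);
    return 0;
}
```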
* Fix non-Intel device selection

* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* Update ggml-sycl.cpp

Co-authored-by: Neo Zhang Jianyu <[email protected]>

---------

Co-authored-by: Abhilash Majumder <[email protected]>
Co-authored-by: Neo Zhang Jianyu <[email protected]>
Information about the Command-R 35B model (128k context) can be found at:
	https://huggingface.co/CohereForAI/c4ai-command-r-v01

Based on the llama2 model with a few changes:

1) New hyperparameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transformer layers have a single shared LayerNorm that feeds into both the
   self-attention and FFN layers in parallel. There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used

Find GGUF files here:
	https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF

To convert model to GGUF format yourself:

1) Download Command-R Hugging Face safetensors:
	git lfs install
	git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01

2) Run:
	python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01
* control vector api and implementation

* control-vectors : minor code style updates

* disable control vector when data == nullptr

use -1 for disabled range (also on init) in case we ever support controlling layer 0 (embeddings)

---------

Co-authored-by: Georgi Gerganov <[email protected]>
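For reference, the API this adds to llama.h looks roughly as follows (a sketch based on the commit messages above; the exact signature should be checked against llama.h). Passing data == nullptr clears the vector, and -1/-1 is the disabled layer range:

```cpp
#include "llama.h"

#include <vector>

// apply a control vector to layers [il_start, il_end] of a loaded context;
// data points to an n_embd x n_layers buffer (n_embd from llama_n_embd)
static void set_steering(struct llama_context * ctx, int32_t n_embd,
                         const std::vector<float> & data,
                         int32_t il_start, int32_t il_end) {
    llama_control_vector_apply(ctx, data.data(), data.size(),
                               n_embd, il_start, il_end);
}

// data == nullptr disables the control vector; -1 marks the disabled range
static void clear_steering(struct llama_context * ctx, int32_t n_embd) {
    llama_control_vector_apply(ctx, nullptr, 0, n_embd, -1, -1);
}
```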
* issues: ci - close inactive issue with workflow

* ci: close issue, change workflow schedule time
* Refactor nested if causing error C1061 on MSVC.

* Revert and remove the `else` branches.

* Add flag to track found arguments.
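For context, C1061 is MSVC's hard limit on block nesting depth (128 levels), and each `else if` in a long chain adds one level. A sketch of the flag pattern described above, with hypothetical option names:

```cpp
#include <string>

// before: a single if / else if / else if ... chain with hundreds of
// branches nests past MSVC's limit and triggers error C1061.
// after: sibling if blocks plus a flag tracking whether the arg matched.
static bool parse_arg(const std::string & arg, bool & found) {
    found = false;

    if (arg == "--help") {      // hypothetical option
        found = true;
        // ... handle ...
    }
    if (arg == "--verbose") {   // hypothetical option
        found = true;
        // ... handle ...
    }
    // ... many more options, each a sibling block, never nesting deeper ...

    return found; // caller reports an unknown argument when this is false
}
```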
* gritlm: add initial README.md to examples/gritlm

This commit adds a suggestion for an initial README.md for the gritlm
example.

Signed-off-by: Daniel Bevenius <[email protected]>

* squash! gritlm: add initial README.md to examples/gritlm

Use the `scripts/hf.sh` script to download the model file.

Signed-off-by: Daniel Bevenius <[email protected]>

* squash! gritlm: add initial README.md to examples/gritlm

Fix editorconfig-checker error in examples/gritlm/README.md.

Signed-off-by: Daniel Bevenius <[email protected]>

---------

Signed-off-by: Daniel Bevenius <[email protected]>
amiralimi and others added 18 commits March 16, 2024 17:52
* common: llama_load_model_from_url with libcurl dependency

Co-authored-by: Georgi Gerganov <[email protected]>
The old behaviour was to use f16, but bf16 to f16 is not a lossless conversion.
Change the default outtype to f32 so that the default conversion is lossless.
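The reasoning: bf16 keeps f32's 8-bit exponent with a truncated mantissa, so widening bf16 to f32 is exact, while f16's 5-bit exponent cannot cover bf16's range. A small self-contained check (illustrative only, not the converter's code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// bf16 is the top 16 bits of an IEEE-754 float32, so bf16 -> f32 is
// lossless: shift the bits back and zero-fill the truncated mantissa
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float big = 1.0e38f;                     // fits bf16's exponent range
    uint32_t bits;
    memcpy(&bits, &big, sizeof(bits));
    uint16_t bf = (uint16_t) (bits >> 16);   // truncate f32 -> bf16

    printf("bf16 -> f32: %g\n", bf16_to_f32(bf)); // exact widening, ~1e38
    // the largest finite f16 value is 65504, so this bf16 value would
    // overflow to +inf in f16: hence f32 as the lossless default outtype
    return 0;
}
```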
* Tidy-up argument parsing.

* Missing ref.

* common : minor

* common : add static classifier

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* backend : offload large batches to GPU

* fix hip

* code cleanup

* fix CUDA split buffers

* Update ggml-backend-impl.h

Co-authored-by: Johannes Gäßler <[email protected]>

* cuda : fix memset without set_device

* imatrix : remove sched affix from weight names

* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup

* update backends

ggml-ci

---------

Co-authored-by: Johannes Gäßler <[email protected]>
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)
  → 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set the general.split_count KV on all splits

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <[email protected]>
l3utterfly merged commit 57c46aa into layla-build on Mar 19, 2024
37 of 65 checks passed