Google Gemma 7B 2B OSS models are available on Hugging Face as of 20240221 #13
Comments
i9-13900KS running dual RTX-A4500 (20+20G, Ampere) and i9-14900K running dual RTX-4090 (24+24G, Ada); CPU first
|
Team, thank you for integrating Gemma support into llama.cpp yesterday - this was an extremely fast and efficient turnaround for a model that came out only a couple of hours earlier.
investigate TensorFlow 2 / keras support
|
gemma-7b model on dual RTX-4090 Suprim Liquid with 800W max and 2 x 24G = 48G VRAM. The model runs at 20% TDP, or 100+100W, because the model is shared across the PCIe bus at x8, which saturates up to about 75% of the 8 x 2 GB/s = 16 GB/s link (roughly 12 GB/s), as opposed to NVLink on the Ampere cards at 112 GB/s. Checking context length: outputs = model.generate(**input_ids, max_new_tokens=1000)
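For reference, a minimal sketch of this dual-GPU run, assuming torch, transformers and accelerate are installed and the token has accepted the google/gemma-7b license (an illustrative reconstruction, not the exact script used here):

```python
# Minimal sketch (assumed reconstruction, not the exact script from this thread):
# shard gemma-7b across both 24G cards with Accelerate's device_map="auto".
# Assumes: pip install torch transformers accelerate, and a Hugging Face token
# that has accepted the google/gemma-7b license.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~16G of weights, split across the two GPUs
    device_map="auto",           # Accelerate places layers on cuda:0 and cuda:1
)

input_ids = tokenizer("Why is NVLink faster than PCIe x8?", return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the layers are split across the two cards, every forward pass moves activations over the interconnect, which is why the PCIe x8 link (no NVLink on the 4090s) becomes the bottleneck and the GPUs sit near 20% TDP.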
python pip summary
run
|
7B testing on CUDA 12.3 on dual NVIDIA RTX-A4500 Ampere with NVLink
7B testing on CUDA 12.3 on dual RTX-4090 Ada MSI liquid 24G x 2 without NVLink - on PCIe x8
2B testing on CUDA 12.3 on RTX-A4000 Ampere desktop 16G
2B testing on CUDA 12.3 on RTX-5000 Turing mobile 16G
2B testing on CUDA 12.3 on RTX-3500 Ada mobile 12G - cold
2B testing on CUDA 12.3 on RTX-3500 Ada mobile 12G - thermal throttling
2B testing on Metal 2 on M2 Pro 16G
2B testing on Metal 2 on M1 Max 32G
2B testing on CPU 13800H 65G mobile Lenovo P1 Gen6 - with thermal throttling
7B testing on CPU 13800H 65G mobile Lenovo P1 Gen6 - with thermal throttling
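For the single-GPU, Metal and CPU entries above, a minimal single-device loading sketch (an illustrative assumption, not the exact test harness; it picks CUDA, Apple MPS or CPU in that order):

```python
# Minimal single-device sketch (assumed, not the exact test harness): pick CUDA,
# Apple Metal (MPS) or CPU in that order and run gemma-2b on it.
# Assumes: pip install torch transformers, and access to google/gemma-2b.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

if torch.cuda.is_available():
    device, dtype = "cuda", torch.bfloat16        # A4000 / RTX-5000 / RTX-3500 runs
elif torch.backends.mps.is_available():
    device, dtype = "mps", torch.float16          # M1 Max / M2 Pro runs
else:
    device, dtype = "cpu", torch.float32          # 13800H CPU runs

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)

inputs = tokenizer("Gemma is", return_tensors="pt").to(device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```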
|
L4 on GCP running Gemma 7B - cached model
Running on G2
FinOps
Finish installing software - or just use a Docker container
Download the code, extract the zip, add the Hugging Face token
Run the model - download it first at 3.5 Gbps; the GPU is throttled by either NVLink or straight PCIe
Results: 3:22 at 50% GPU saturation
Gemma 2B - 2 x L4 on GCP
however running without NVIDIA GRID, as below
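Since later runs reuse the cached model, a minimal sketch of pre-downloading the weights into the local Hugging Face cache (an assumption, not a step recorded in this thread; it requires huggingface_hub and a token that has accepted the Gemma license):

```python
# Minimal sketch (assumption, not from this thread): pre-download the weights once
# so later runs on the L4 instance hit the local cache instead of the network.
# Assumes: pip install huggingface_hub, and a token that has accepted the Gemma license.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("google/gemma-7b")  # cached under ~/.cache/huggingface
print("model cached at:", local_dir)
```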
|
selecting devices in code
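A minimal sketch of two common ways to pin the devices, assuming the usual PyTorch/transformers stack (illustrative, not the exact code used here):

```python
# Minimal sketch of two common ways to select devices (illustrative, not the exact
# code used here). Restrict visibility before torch initializes, or pin the model
# to one card explicitly via device_map.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose both cards; use "0" for single-GPU

import torch
from transformers import AutoModelForCausalLM

print(torch.cuda.device_count())  # 2 with the setting above

# Place the whole (hypothetical) single-card load on the second GPU:
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.bfloat16, device_map={"": 1}
)
```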
|
RTX-4090 single Gemma 2B
RTX-A4500 single Gemma 2B
|
No H100 or A100 (80/40G), but V100s are available in Amsterdam - expected at 32G, offered at 16G
The V100 is only 16G - less than the L4 at 24G - not the expected 32G
install libraries
Run Google Gemma 2B from the Hugging Face repo. First clone this repo.
Fix the Hugging Face token first - use yours or add one (not needed on the Windows L4 image, just the Linux V100).
Linux-specific: looks like transformers needs to be updated after the Hugging Face login to fix this.
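A minimal sketch of the token and upgrade step (an assumption, not the exact commands used); it presumes pip install -U transformers huggingface_hub accelerate has been run, and HF_TOKEN is a placeholder environment variable holding your own token:

```python
# Minimal sketch of the token step (assumption, not the exact commands used):
# assumes pip install -U transformers huggingface_hub accelerate has been run,
# and HF_TOKEN is a placeholder environment variable holding your own token.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # authenticate this machine for the gated repo

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-2b")  # quick check that access works
print(tok("hello"))
```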
trying from (16h ago)
Issue: I need Python 3.11 but this image runs 3.7
Installing Python 3.12
Need Python 3.11 - switching to a later image
|
Gemma on Vertex AI Model Garden |
36G vram
|
Rerun A6000 48G VRAM on rebuilt machine
I have not installed Visual Studio yet: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
2115
|
gemma 7b on dual A4500
|
gemma 7B on dual 4090
|
Single NVIDIA A6000 - Ampere GA102 (see the L40S equivalent on GCP), CPU - 14900K, 6400MHz RAM - overclocked
Dual NVIDIA 4090 - Ada AD102
Dual NVIDIA A4500 - Ampere GA102
CPU - 13900KS
CPU - 13900K
Dual L4 on GCP - Ada AD104
see #27
https://ai.google.dev/gemma/docs?hl=en
https://www.kaggle.com/models/google/gemma
Gemma on Vertex AI Model garden
https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335?_ga=2.34476193.-1036776313.1707424880&hl=en
https://obrienlabs.medium.com/google-gemma-7b-and-2b-llm-models-are-now-available-to-developers-as-oss-on-hugging-face-737f65688f0d
https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
https://blog.google/technology/developers/gemma-open-models/
https://huggingface.co/google/gemma-7b
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/l4/PB-11316-001_v01.pdf
Pull and remake the latest llama.cpp (see the previous article running Llama 70B - #7)
abetlen/llama-cpp-python#1207
ggerganov/llama.cpp@580111d
7B (the 32G model needs 64G on a CPU, or an RTX-A6000/RTX-5000 Ada) and 2B (on a MacBook M1 Max with 32G unified RAM) - working perfectly
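A minimal sketch of running a converted Gemma GGUF through the llama-cpp-python bindings linked above (the model path is a placeholder and the build/quantization details are assumptions, not taken from this thread):

```python
# Minimal sketch (assumed): run a Gemma GGUF through the llama-cpp-python bindings
# linked above. Assumes llama-cpp-python is installed with GPU (or Metal) support
# and that a GGUF file has already been produced; the path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/gemma-7b.gguf",  # placeholder path to your converted model
    n_gpu_layers=-1,                       # offload all layers to the GPU(s)
    n_ctx=4096,
)

out = llm("Summarize the Gemma technical report in two sentences.", max_tokens=256)
print(out["choices"][0]["text"])
```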
https://cloud.google.com/blog/products/ai-machine-learning/performance-deepdive-of-gemma-on-google-cloud