When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error #277

Closed · 1 of 3 tasks
MarioSimou opened this issue Jul 30, 2024 · 3 comments
Labels: bug (Something isn't working), requires triage (Requires triaging)

Comments

@MarioSimou

MarioSimou commented Jul 30, 2024

Issue description

When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error.

Expected Behavior

Inference runs on Cloud Run without any issues.

Actual Behavior

I have a simple microservice that exposes two HTTP endpoints. One endpoint is used to check the health of the service (/api/v1/healthcheck), and the other endpoint is used to run inference (/api/v1/analyze) using node-llama-cpp and a Hugging Face model.
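
For context, here is a minimal sketch of the shape of the service, assuming the node-llama-cpp v3-style API, Node's built-in http module, and a placeholder model path (the real service uses a Hugging Face model and its own setup, which are not shown here):

```typescript
import http from "node:http";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Load the model once at startup (placeholder path; the real model differs).
const llama = await getLlama();
const model = await llama.loadModel({modelPath: "./model.gguf"});
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

http.createServer(async (req, res) => {
    if (req.url === "/api/v1/healthcheck") {
        // This endpoint responds fine on Cloud Run.
        res.writeHead(200, {"content-type": "text/plain"}).end("ok");
        return;
    }

    if (req.url === "/api/v1/analyze") {
        // This is the endpoint that fails on Cloud Run: the process dies
        // during inference, and Cloud Run answers the request with a 503.
        const answer = await session.prompt("Analyze the request payload here");
        res.writeHead(200, {"content-type": "application/json"})
            .end(JSON.stringify({answer}));
        return;
    }

    res.writeHead(404).end();
}).listen(8080);
```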

When I deployed the service on Google Cloud Run, I could access the health check endpoint without any issues. However, when I called the analyze endpoint, the service was failing with a 503 error. Initially, I thought it was a configuration issue, so I tried all the steps mentioned here to fix it, but I had no luck.

Next, I tested the container's behavior on a different cloud provider by deploying it on AWS ECS Fargate. Unfortunately, the container was still failing. At that point, I went back to the logs of the Cloud Run service and noticed that the container was terminating with the warning "Container terminated on signal 4", which stands for Illegal Instruction (SIGILL). This indicates that the process attempted to execute an instruction that the host CPU does not support.
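
To see which instruction sets the runtime CPU actually exposes, a small diagnostic like the one below can be run inside the container (a sketch; it reads /proc/cpuinfo, which is Linux-only and therefore works in the bookworm-based image):

```typescript
// Print the SIMD-related CPU flags visible to the container, so they can be
// compared against the instruction sets the llama.cpp binary was built for.
import {readFileSync} from "node:fs";
import os from "node:os";

const cpuinfo = readFileSync("/proc/cpuinfo", "utf8");
const flags = new Set(cpuinfo.match(/^flags\s*:\s*(.+)$/m)?.[1].split(/\s+/) ?? []);

console.log("arch:", os.arch(), "| model:", os.cpus()[0]?.model);
for (const flag of ["avx", "avx2", "fma", "f16c", "avx512f"])
    console.log(flag.padEnd(8), flags.has(flag) ? "supported" : "missing");
```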

Since I'm using node-llama-cpp to download and build llama.cpp binaries, I think we may be doing something wrong there that is not aligned with what Cloud Run expects. I'm not sure how to interpret this, but at this point, I'm exhausted.

Additional Notes:

  1. The Docker image uses the node:iron-bookworm-slim base image, which targets the amd64 architecture.
  2. The container works fine locally.
  3. Both node-llama-cpp v2 and v3 fail on Cloud Run.

Steps to reproduce

Repo

My Environment

Dependency              Version
Operating System        Ubuntu Linux 20.04
CPU                     12th Gen Intel i7-1260P
Node.js version         20.x
Typescript version      5.x
node-llama-cpp version  2.x and 3.x

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

MarioSimou added the bug (Something isn't working) and requires triage (Requires triaging) labels on Jul 30, 2024
@giladgd
Contributor

giladgd commented Jul 30, 2024

I have a few suggestions for things you can try:

  • Don’t use the :slim or :alpine tags; use :22 or :20 instead, as the slim images don’t include all the libraries needed to compile correctly when the hardware has some types of GPUs or NPUs.
  • Try running npx node-llama-cpp download inside the container before your code runs, just to make sure the problem has nothing to do with the build process that happens before deploying the container (see the sketch after this list).
  • In my experience, the Illegal Instruction issue happens when the container runs under virtualization (for example, an x64 container on an arm64 machine). llama.cpp uses some less common instructions to maximize the performance of your hardware, and not all of them are supported by the virtualization layer Docker uses, for example.
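
Expanding on the second point, one way to surface the crash earlier (my suggestion here, not something node-llama-cpp requires) is a boot-time smoke test: generate a single token before the HTTP server starts, so a SIGILL shows up in the container's startup logs instead of as a 503 on the first /api/v1/analyze call. A rough sketch, assuming the v3-style API and a hypothetical MODEL_PATH environment variable:

```typescript
// Startup smoke test: load the model and generate one token before serving
// traffic. If the llama.cpp binary uses instructions this CPU doesn't support,
// the process crashes here with a clear log line rather than mid-request.
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: process.env.MODEL_PATH!}); // hypothetical env var
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

await session.prompt("ping", {maxTokens: 1});
console.log("llama.cpp smoke test passed, starting the HTTP server");
```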

@MarioSimou
Author

I tried all the above cases, and none of them worked. However, while I was trying to create a repo for you to use, I noticed a couple of things:

  • When I deployed the service from an amd64 machine with an AMD Ryzen 7 PRO 6850U with Radeon Graphics processor, the service didn't return a 503 error.
  • When I deployed the service from an amd64 machine with a 12th Gen Intel(R) Core(TM) i7-1260P processor, the service returned a 503 error.

So, the issue is definitely related to the CPU of the machine the image is built on.

I have also created the same service using the llama-cpp-python SDK, and I encountered the same problem there. At this point, the issue is not related to this repository, so I will be closing it soon. However, if you have any suggestions or ideas on how to solve this issue, feel free to share them with me.

@giladgd
Contributor

giladgd commented Jan 8, 2025

Closing due to inactivity.
If you still encounter issues with node-llama-cpp, let me know and I'll try to help.

giladgd closed this as completed on Jan 8, 2025