When I use node-llama-cpp to run inference, the Cloud Run service fails with a 503 error.
Expected Behavior
Inference runs in Cloud Run without any issues.
Actual Behavior
I have a simple microservice that exposes two HTTP endpoints. One endpoint checks the health of the service (/api/v1/healthcheck), and the other runs inference (/api/v1/analyze) using node-llama-cpp and a Hugging Face model.
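For context, the analyze endpoint looks roughly like this (a simplified sketch, assuming Express; the model path, request shape, and file names are placeholders rather than my exact code):

```js
import express from "express";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const app = express();
app.use(express.json());

app.get("/api/v1/healthcheck", (_req, res) => res.json({status: "ok"}));

// the llama.cpp bindings and the model are only loaded on the first analyze call,
// so the healthcheck endpoint responds even if loading them would crash
let model;
async function loadModel() {
    if (model == null) {
        const llama = await getLlama();
        model = await llama.loadModel({modelPath: "./models/model.gguf"}); // placeholder path
    }
    return model;
}

app.post("/api/v1/analyze", async (req, res) => {
    const loadedModel = await loadModel();
    const context = await loadedModel.createContext();
    const session = new LlamaChatSession({contextSequence: context.getSequence()});
    const answer = await session.prompt(req.body.text ?? "");
    await context.dispose();
    res.json({answer});
});

app.listen(process.env.PORT ?? 8080);
```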
When I deployed the service on Google Cloud Run, I could access the health check endpoint without any issues. However, when I called the analyze endpoint, the service failed with a 503 error. Initially, I thought it was a configuration issue, so I tried all the steps mentioned here to fix it, but I had no luck.
Next, I tested the container's behavior on a different cloud provider by deploying it on AWS ECS Fargate. Unfortunately, the container was still failing there. When I went back to the logs of the Cloud Run service, I noticed that the container was terminating with the warning Container terminated on signal 4. Signal 4 is SIGILL (Illegal Instruction), which means the CPU was asked to execute an instruction it does not support.
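A quick way to see which CPU the container actually gets is to log the CPU model and feature flags from inside it (a throwaway, Linux-only sketch, not part of the actual service code):

```js
import {readFileSync} from "node:fs";
import os from "node:os";

// CPU model as reported by Node, plus the raw feature flags from the kernel
console.log("cpu model:", os.cpus()[0].model);
const cpuinfo = readFileSync("/proc/cpuinfo", "utf8");
console.log(cpuinfo.split("\n").find((line) => line.startsWith("flags")));
```

Comparing this output between the local machine and the Cloud Run instance shows which instruction set extensions (AVX2, AVX-512, and so on) are missing at runtime.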
Since I'm using node-llama-cpp to download and build llama.cpp binaries, I think we may be doing something wrong there that is not aligned with what Cloud Run expects. I'm not sure how to interpret this, but at this point, I'm exhausted.
Additional Notes:
The Docker image uses the node:iron-bookworm-slim base image and targets the amd64 architecture.
The container works fine locally.
Both node-llama-cpp v2 and v3 fail in Cloud Run.
Don't use the :slim or :alpine tags; use :22 or :20 instead, since the slim images don't include all the libraries needed to compile correctly when the hardware has certain types of GPUs or NPUs.
Try running npx node-llama-cpp download inside the container before your code starts, just to rule out the build process that happens before deploying the container (see the Dockerfile sketch below).
From my experience, the Illegal Instruction issue happens when the container runs inside virtualization (for example, when running an x64 container on an arm64 machine). llama.cpp uses some instructions that are not commonly used but help maximize the performance of the hardware you have, and not all of those instructions are supported by the virtualization implementation Docker uses, for example.
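Something along these lines in the Dockerfile should be enough to test that (a rough sketch; it assumes node-llama-cpp is listed in package.json and that the entry point is server.js, so adjust the names to your project):

```dockerfile
FROM node:22

WORKDIR /app
COPY package*.json ./
RUN npm ci

# download/build the llama.cpp binaries at image build time,
# so nothing gets compiled when the container starts on Cloud Run
RUN npx node-llama-cpp download

COPY . .
CMD ["node", "server.js"]
```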
I tried all the above cases, and none of them worked. However, while I was trying to create a repo for you to use, I noticed a couple of things:
When I deployed the service from an amd64 machine with an AMD Ryzen 7 PRO 6850U with Radeon Graphics processor, the service didn't return a 503 error.
When I deployed the service from an amd64 machine with a 12th Gen Intel(R) Core(TM) i7-1260P processor, the service returned a 503 error.
So, the issue is definitely CPU-related. My guess is that the llama.cpp binaries are compiled with instruction sets available on the machine that builds the image, and the CPU that Cloud Run schedules the container on doesn't support all of them.
I have also created the same service using the llama-cpp-python SDK, and I encountered the same problem there. At this point, it's clear the issue is not specific to this repository, so I will be closing it soon. However, if you have any suggestions or ideas on how to solve this issue, feel free to share them with me.
Steps to reproduce
Repo
My Environment
node-llama-cpp version
Additional Context
No response
Relevant Features Used
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, but I don't know how to start. I would need guidance.