When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error #277

Closed · 1 of 3 tasks
MarioSimou opened this issue Jul 30, 2024 · 3 comments
Labels: bug (Something isn't working), requires triage (Requires triaging)

Comments

@MarioSimou

MarioSimou commented Jul 30, 2024

Issue description

When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error.

Expected Behavior

Inference runs on Cloud Run without any issues.

Actual Behavior

I have a simple microservice that exposes two HTTP endpoints. One endpoint is used to check the health of the service (/api/v1/healthcheck), and the other endpoint is used to run inference (/api/v1/analyze) using node-llama-cpp and a Hugging Face model.
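
For context, here is a minimal sketch of the shape of the service, assuming the node-llama-cpp v3-style API, Node's built-in http module, and a placeholder model path (the real service uses a Hugging Face model and its own setup, which are not shown here):

```typescript
import http from "node:http";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Load the model once at startup (placeholder path; the real model differs).
const llama = await getLlama();
const model = await llama.loadModel({modelPath: "./model.gguf"});
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

http.createServer(async (req, res) => {
    if (req.url === "/api/v1/healthcheck") {
        // This endpoint responds fine on Cloud Run.
        res.writeHead(200, {"content-type": "text/plain"}).end("ok");
        return;
    }

    if (req.url === "/api/v1/analyze") {
        // This is the endpoint that fails on Cloud Run: the process dies
        // during inference, and Cloud Run answers the request with a 503.
        const answer = await session.prompt("Analyze the request payload here");
        res.writeHead(200, {"content-type": "application/json"})
            .end(JSON.stringify({answer}));
        return;
    }

    res.writeHead(404).end();
}).listen(8080);
```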

When I deployed the service on Google Cloud Run, I could access the health check endpoint without any issues. However, when I called the analyze endpoint, the service was failing with a 503 error. Initially, I thought it was a configuration issue, so I tried all the steps mentioned here to fix it, but I had no luck.

Next, I tested the container's behavior on a different cloud provider by deploying it on AWS ECS Fargate. Unfortunately, the container was still failing. At that point, I went back to the logs of the Cloud Run service and noticed that the container was terminating with the warning "Container terminated on signal 4", which stands for Illegal Instruction (SIGILL). This indicates that the process attempted to execute an instruction that the host CPU does not support.
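
To see which instruction sets the runtime CPU actually exposes, a small diagnostic like the one below can be run inside the container (a sketch; it reads /proc/cpuinfo, which is Linux-only and therefore works in the bookworm-based image):

```typescript
// Print the SIMD-related CPU flags visible to the container, so they can be
// compared against the instruction sets the llama.cpp binary was built for.
import {readFileSync} from "node:fs";
import os from "node:os";

const cpuinfo = readFileSync("/proc/cpuinfo", "utf8");
const flags = new Set(cpuinfo.match(/^flags\s*:\s*(.+)$/m)?.[1].split(/\s+/) ?? []);

console.log("arch:", os.arch(), "| model:", os.cpus()[0]?.model);
for (const flag of ["avx", "avx2", "fma", "f16c", "avx512f"])
    console.log(flag.padEnd(8), flags.has(flag) ? "supported" : "missing");
```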

Since I'm using node-llama-cpp to download and build llama.cpp binaries, I think we may be doing something wrong there that is not aligned with what Cloud Run expects. I'm not sure how to interpret this, but at this point, I'm exhausted.

Additional Notes:

  1. The Docker image uses the node:iron-bookworm-slim base image, which targets the amd64 architecture.
  2. The container works fine locally.
  3. Both node-llama-cpp v2 and v3 fail on Cloud Run.

Steps to reproduce

Repo

My Environment

Dependency              Version
Operating System        Ubuntu Linux 20.04
CPU                     12th Gen Intel i7-1260P
Node.js version         20.x
Typescript version      5.x
node-llama-cpp version  2.x and 3.x

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

MarioSimou added the bug (Something isn't working) and requires triage (Requires triaging) labels on Jul 30, 2024
@giladgd
Contributor

giladgd commented Jul 30, 2024

I have a few suggestions for things you can try:

  • Don’t use the :slim or :alpine tags; use :22 or :20 instead, as the slim images don’t include all the libraries needed to compile correctly when the hardware has some types of GPUs or NPUs.
  • Try running npx node-llama-cpp download inside the container before your code runs, just to make sure the problem has nothing to do with the build process that happens before deploying the container (see the sketch after this list).
  • In my experience, the Illegal Instruction issue happens when the container runs under virtualization (for example, an x64 container on an arm64 machine). llama.cpp uses some less common instructions to maximize the performance of your hardware, and not all of them are supported by the virtualization layer Docker uses, for example.
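
Expanding on the second point, one way to surface the crash earlier (my suggestion here, not something node-llama-cpp requires) is a boot-time smoke test: generate a single token before the HTTP server starts, so a SIGILL shows up in the container's startup logs instead of as a 503 on the first /api/v1/analyze call. A rough sketch, assuming the v3-style API and a hypothetical MODEL_PATH environment variable:

```typescript
// Startup smoke test: load the model and generate one token before serving
// traffic. If the llama.cpp binary uses instructions this CPU doesn't support,
// the process crashes here with a clear log line rather than mid-request.
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: process.env.MODEL_PATH!}); // hypothetical env var
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

await session.prompt("ping", {maxTokens: 1});
console.log("llama.cpp smoke test passed, starting the HTTP server");
```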

@MarioSimou
Author

I tried all the above cases, and none of them worked. However, while I was trying to create a repo for you to use, I noticed a couple of things:

  • When I deployed the service from an amd64 machine with an AMD Ryzen 7 PRO 6850U with Radeon Graphics processor, the service didn't return a 503 error.
  • When I deployed the service from an amd64 machine with a 12th Gen Intel(R) Core(TM) i7-1260P processor, the service returned a 503 error.

So, the issue is definitely related to the CPU of the machine the image is built on.

I have also created the same service using the llama-cpp-python SDK, and I encountered the same problem there. At this point, the issue is not related to this repository, so I will be closing it soon. However, if you have any suggestions or ideas on how to solve this issue, feel free to share them with me.

@giladgd
Contributor

giladgd commented Jan 8, 2025

Closing due to inactivity.
If you still encounter issues with node-llama-cpp, let me know and I'll try to help.

giladgd closed this as completed on Jan 8, 2025