
[need help] a simple python implementation of parallel.cpp #930

Closed


littlebai3618

I need an HTTP API that supports continuous batching, so I decided to implement it myself.

I ran into some issues while implementing continuous batching with the low-level llama.cpp API provided by this project, so I am posting my implementation here to ask for help. It is mainly based on: https://github.com/ggerganov/llama.cpp/blob/master/examples/parallel/parallel.cpp
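
For reference, the core idea in parallel.cpp is a single shared llama_batch in which every token is tagged with the sequence it belongs to, so several requests are decoded by one llama_decode call. A minimal sketch of the prefill step through the llama-cpp-python low-level bindings might look like the following; the model path, sizes, token ids, and the batch_add helper are placeholders of mine (batch_add mirrors the llama_batch_add helper from llama.cpp's common code), and exact signatures such as llama_backend_init vary between llama-cpp-python versions:

```python
import llama_cpp

# Backend / model / context setup (paths and sizes are placeholders;
# llama_backend_init's signature differs between llama-cpp-python versions).
llama_cpp.llama_backend_init(False)
model = llama_cpp.llama_load_model_from_file(
    b"model.gguf", llama_cpp.llama_model_default_params())
cparams = llama_cpp.llama_context_default_params()
cparams.n_ctx = 4096
ctx = llama_cpp.llama_new_context_with_model(model, cparams)

# One llama_batch shared by all requests: capacity 512 tokens,
# no embeddings, up to 4 sequence ids per token slot.
batch = llama_cpp.llama_batch_init(512, 0, 4)

def batch_add(batch, token, pos, seq_id, want_logits):
    """Append one token to the shared batch, tagged with its sequence id."""
    i = batch.n_tokens
    batch.token[i] = token
    batch.pos[i] = pos
    batch.n_seq_id[i] = 1
    batch.seq_id[i][0] = seq_id
    batch.logits[i] = 1 if want_logits else 0
    batch.n_tokens += 1

# Hypothetical pending requests: seq_id -> already-tokenized prompt.
requests = {0: [1, 15043, 3186], 1: [1, 29871, 31482]}

# Prefill: pack every pending prompt into the same batch, then decode
# them all with a single llama_decode call.
batch.n_tokens = 0
for seq_id, prompt in requests.items():
    for pos, tok in enumerate(prompt):
        # Only the last prompt token of each sequence needs logits.
        batch_add(batch, tok, pos, seq_id, pos == len(prompt) - 1)

if llama_cpp.llama_decode(ctx, batch) != 0:
    raise RuntimeError("llama_decode failed")

# Generation would then loop: for each live sequence, read its logits via
# llama_cpp.llama_get_logits_ith(ctx, i), sample the next token, and add it
# back into a fresh batch at that sequence's next position (sampling elided).
```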

I am experiencing a possible memory leak when performing continuous batch processing with large contexts and batches.
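
Two cleanup steps are easy to miss when porting parallel.cpp to Python, and both show up as steadily growing memory, so they may be worth checking. A hedged sketch against the llama-cpp-python low-level bindings of this era (seq_id, n_system, batch, ctx, and model are placeholders carried over from the sketch above; function names follow the llama.h of late 2023):

```python
# When request `seq_id` finishes, free its KV cells so the cache does not
# keep filling up with dead sequences. parallel.cpp does the equivalent,
# keeping only the shared system prompt (the first n_system positions);
# p1 = -1 means "to the end of the sequence".
llama_cpp.llama_kv_cache_seq_rm(ctx, seq_id, n_system, -1)

# The low-level objects are plain C allocations, not garbage collected,
# so they must be freed explicitly at shutdown:
llama_cpp.llama_batch_free(batch)
llama_cpp.llama_free(ctx)
llama_cpp.llama_free_model(model)
llama_cpp.llama_backend_free()
```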

I have raised two separate issues, one in this repository (llama.cpp) and another in llama-cpp-python, to provide more information about the problem.

  1. llama.cpp issue: #4086
  2. llama-cpp-python issue: #924

You are welcome to point out any errors, and I will fix them as soon as possible.

Note: this demo does not support grammar, terminal arguments, or prompt files.

littlebai3618 marked this pull request as draft on November 21, 2023 at 06:09.