Model repo #91
Comments
Thank you for your question. PyTriton works well for simple use cases where a model is bound directly to a server for deployment, but it has limited feature support and does not integrate with external model stores. For scenarios that require more complex operations, such as dynamic loading and unloading of models, it is recommended to use the Triton Inference Server instead. Triton supports a Python backend, which lets you serve models through Python scripts. For further optimization, you might also explore the Triton Model Navigator, a utility that helps convert models from frameworks like PyTorch to TensorRT to boost performance. For more details, refer to the Python backend documentation and the Triton Model Navigator GitHub repository. Is there anything else you'd like to know or any specific details you need assistance with?
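For readers unfamiliar with the Python backend mentioned above, here is a minimal sketch of a `model.py` that would sit in a Triton model repository next to a `config.pbtxt`. The tensor names `INPUT0`/`OUTPUT0` and the computation are illustrative placeholders, not from this thread:

```python
# model.py -- minimal Triton Python-backend sketch.
# Tensor names below are illustrative and must match the model's config.pbtxt.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model name, config, instance kind, etc.
        self.model_name = args["model_name"]

    def execute(self, requests):
        responses = []
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            result = data.astype(np.float32) * 2.0  # placeholder computation
            out_tensor = pb_utils.Tensor("OUTPUT0", result)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass
```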
@piotrm-nvidia Hello, I am looking for a Python library that lets me perform end-to-end inference with Triton Server without the traditional client-server communication, which consumes a large amount of network I/O. When using the aio gRPC client under modest concurrency (just 100), the communication time is much longer than the ensemble ONNX model's inference time (averaging 0.08 ms): from grpc_send to model INITIALIZED averages 200+ ms, and from model RELEASED to grpc_recv of the result averages 184.27+ ms. I used FastAPI to wrap the aio gRPC client (with client reuse and gzip compression enabled) to call Triton Server. Both the client and Triton Server run inside the same Docker container, and this is the result I observed. Finally, I found this library, but I was disappointed because its functionality is too limited. Do you have any better suggestions?
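For context, a minimal sketch of the setup described above, assuming `tritonclient.grpc.aio` wrapped in FastAPI with one long-lived, reused client. The endpoint URL, model name, tensor names, and payload are placeholders:

```python
# Sketch: FastAPI wrapper around Triton's asyncio gRPC client with client reuse.
# Model/tensor names and the payload below are placeholders.
import numpy as np
import tritonclient.grpc.aio as grpcclient
from tritonclient.grpc import InferInput, InferRequestedOutput
from fastapi import FastAPI

app = FastAPI()
client = None  # one client per process, created at startup and reused


@app.on_event("startup")
async def startup() -> None:
    global client
    # Reusing a single client avoids per-request channel setup overhead.
    client = grpcclient.InferenceServerClient(url="localhost:8001")


@app.post("/infer")
async def infer() -> dict:
    data = np.random.rand(1, 16).astype(np.float32)  # placeholder payload
    inp = InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = InferRequestedOutput("OUTPUT0")
    result = await client.infer(model_name="ensemble_model", inputs=[inp], outputs=[out])
    return {"output": result.as_numpy("OUTPUT0").tolist()}
```

Even with this pattern, every request still pays gRPC serialization and scheduling costs, which is what the latency numbers in the comment above reflect.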
@631068264, if you need to perform inference on Triton without client-server communication, you may find the Triton Python API useful.
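A rough sketch of what the in-process Triton Python API (the `tritonserver` package) looks like, based on its published tutorials; the repository path, model name, tensor names, and the exact output-conversion call are assumptions and may differ from the current API, so check the official docs before relying on them:

```python
# Sketch: in-process inference with the tritonserver Python package,
# pointed at an existing model repository. No gRPC/HTTP hop is involved.
import numpy as np
import tritonserver

server = tritonserver.Server(model_repository="/models")  # placeholder path
server.start(wait_until_ready=True)  # assumed keyword; plain start() also works

model = server.model("my_onnx_model")  # placeholder model name
responses = model.infer(inputs={"INPUT0": np.zeros((1, 16), dtype=np.float32)})

for response in responses:
    # Output tensors support DLPack; exact conversion helper may vary by version.
    out = np.from_dlpack(response.outputs["OUTPUT0"])
    print(out)

server.stop()
```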
@pziecina-nv The throughput using the aio gRPC client is only about 10% of what perf_analyzer achieves (see triton-inference-server/client#815). I don't know how to solve this.
Is it possible to use pytriton to load a full model repository that would otherwise require the full Triton Server Docker container? One of the things I love about pytriton is how easy it is to install on new machines without needing a container. It could be a great go-between.
I imagine projects evolving like this as they mature:
1/ Start with pytriton and no models folder
2/ Add a models folder and still use pytriton
3/ Deploy production with the full Triton container, but continue development with pytriton when containers are not desired
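For reference, PyTriton's current workflow binds a Python callable rather than loading an existing model repository, which is the gap this request describes. A minimal sketch of that pattern; the model name, tensor names, shapes, and the doubling function are illustrative:

```python
# Sketch: serving a Python callable with PyTriton (no model repository needed).
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(INPUT0):
    # Placeholder computation; inputs arrive as batched NumPy arrays.
    return {"OUTPUT0": INPUT0 * 2.0}


with Triton() as triton:
    triton.bind(
        model_name="Doubler",  # illustrative name
        infer_func=infer_fn,
        inputs=[Tensor(name="INPUT0", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="OUTPUT0", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=128),
    )
    triton.serve()  # blocks, exposing the bound model over Triton's endpoints
```

Loading a ready-made repository of ONNX/TensorRT models, as proposed in the list above, would instead require the full Triton container or the in-process Python API shown earlier in the thread.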