Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider removing direct link-time dependency on ORT on Linux/macOS #693

Closed
skyline75489 opened this issue Jul 12, 2024 · 2 comments
Closed

Comments

@skyline75489
Copy link
Contributor

skyline75489 commented Jul 12, 2024

Introduction

During the work of #687, I notice some issues that blocks GenAI NuGet from working on Linux. After investigation and discussion with @baijumeswani , I'll like to propose removing direct link-time dependency on ORT, to improve the overall dynamic library loading scenarios. @skottmckay has already implemented it on Android platform. This proposal is to further extend it on Linux/macOS.

The problem with link-time dependency

Again, @skottmckay has a detailed writeup at

// If the GenAI library links against the onnxruntime library, it will have a dependency on a specific

I suggest checking this out first to have a basic understanding of the issue.

TLDR: ort-genai C++ library depends on a specific version of ORT. Simply having libonnxruntime.so isn't enough, It needs specifically libonnxruntime.so.1.18.0. This is the ldd output:

ldd -v libonnxruntime-genai.so
        linux-vdso.so.1 (0x00007ffecc25e000)
        libonnxruntime.so.1.18.0 => not found
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f59e6ab0000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f59e6aab000)
......

Traditionally, this should be handled by symlinks, if those libraries are installed system-wide. But we need ort-genai to be distributed in wheels and NuGets.

Wheel scenario

To make GenAI work with direct linking against ORT, a exact copy of libonnxruntime.so currently exists and renamed to libonnxruntime.so.1.18.0:

(genai) skyline@workstation ~/miniconda3/envs/genai/lib/python3.10/site-packages/onnxruntime_genai
>ls -al
total 33804
drwxr-xr-x  4 skyline skyline     4096 Jul 12 11:29 .
drwxr-xr-x 43 skyline skyline     4096 Jul 12 12:08 ..
-rw-r--r--  1 skyline skyline      711 Jul 12 11:29 __init__.py
drwxr-xr-x  2 skyline skyline     4096 Jul 12 11:29 __pycache__
-rw-r--r--  1 skyline skyline      617 Jul 12 11:29 _dll_directory.py
-rw-r--r--  1 skyline skyline 15285336 Jul 12 11:29 libonnxruntime.so
-rw-r--r--  1 skyline skyline 15285336 Jul 12 11:29 libonnxruntime.so.1.18.0
drwxr-xr-x  3 skyline skyline     4096 Jul 12 11:29 models
-rwxr-xr-x  1 skyline skyline  4015888 Jul 12 11:29 onnxruntime_genai.cpython-310-x86_64-linux-gnu.so

This is introduced by a pipeline step during packaging:

# TODO: Find out why do we need to to have libonnxruntime.so.ort_stable_version

I don't think this should be considered correct, but at least the ort-genai wheel works on Linux for now.

NuGet scenario

With NuGet, the same issue exists. But this time the libonnxruntime.so is directly from ORT NuGet package and the naming is still just libonnxruntime.so without any version. So ort-genai will crash because it can't find the correct library.

Solution

Proven possible by Scott's work, removing the link-time dependency and switch to dlopen will not only allow us to remove the embedded ORT libraries and fix both the wheel and NuGet scenarios mentioned above, but also provide much more flexibility that can benefit #662 and future work. The rpath or search path tweaks will be no longer needed. The searching for correct library can be handled entirely in C++.

Limitation

With the power of dlopen, it comes with price. Using dlopen means that we should only use public APIs provided by ORT headers, to prevent possible ABI breakage. We probably also need to detect if the runtime ORT version matches the one we use for building ort-genai, and warn users if a version mismatch happens.

@snnn
Copy link
Member

snnn commented Jul 12, 2024

Related: microsoft/onnxruntime#21281

@skyline75489
Copy link
Contributor Author

Related: #273

skyline75489 added a commit that referenced this issue Jul 24, 2024
A major overhaul of the NuGet packaging pipeline, including:

* Restructure capi & NuGet packaging pipeline to make multi-platform
nupkg possible. The NuGet artifacts are separated by EP(CPU, Cuda and
DirectML). A single nupkg contains native libraries for different os &
arch.
* Make win-arm64 CPU packages use DLLs from NuGet package, just like
win-x64.
* Enable win-arm64 DirectML NuGet build & packaging (Internal work item:
https://task.ms/aii/40037)
* Enable linux-x64 CPU & CUDA NuGet build & packaging (Internal work
item: https://task.ms/aii/34234)
* Continue the work of #683 to make it work with ARM64.

Run:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_build/results?buildId=507547&view=results

Prior work: #498 
Supports: #637

 Future work: Needs #693 to fix Linux NuGet.

---------

Co-authored-by: Baiju Meswani <[email protected]>
Co-authored-by: kunal-vaishnavi <[email protected]>
skyline75489 added a commit that referenced this issue Sep 12, 2024
* Implement #693 on macOS for non-framework usage.
* Enable macOS x64 & ARM64 wheel build & packaging
* Enable macOS x64 & ARM64 NuGet build & packaging

Python validation run:
*
https://aiinfra.visualstudio.com/ONNX%20Runtime/_build/results?buildId=548627&view=results

NuGet validation run: 
*
https://aiinfra.visualstudio.com/ONNX%20Runtime/_build/results?buildId=548631&view=results
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants