[Build] SIGSEGV calling into native library from Java on MacOS on M2 Mac #19512

Open
garthhenning opened this issue Feb 13, 2024 · 10 comments
Labels
build build issues; typically submitted using template stale issues that have not been addressed in a while; categorized by a bot

Comments

@garthhenning

Describe the issue

A call to Map<String, NodeInfo> inputsInfo = model.getInputInfo(); shortly after loading an ONNX model from Java on an M2 Mac causes an immediate SIGSEGV. Calls to model.getInputNames() and model.getOutputNames() succeed and give correct results, so the native library did link and the ONNX model did load.

The SIGSEGV:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000010d179800, pid=7303, tid=108291
#
# JRE version: OpenJDK Runtime Environment (21.0+37) (build 21+37-LTS)
# Java VM: OpenJDK 64-Bit Server VM (21+37-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-amd64)
# Problematic frame:
# V  [libjvm.dylib+0x620800]  jni_NewObject+0x1e0
#

The Java code:

        this.model = env.createSession(modelPath.toString());
        assert this.model.getInputNames().equals(
                new HashSet<>(List.of("x"))
        );
        assert this.model.getOutputNames().equals(
                new HashSet<>(List.of("image_embeddings"))
        );
        Map<String, NodeInfo> inputsInfo = model.getInputInfo();
        final NodeInfo imageNodeInfo = inputsInfo.get("x");
        final TensorInfo imageTensorInfo = (TensorInfo) imageNodeInfo.getInfo();
        this.imageSize = (int) imageTensorInfo.getShape()[2];

The Java debugger shows that the Java code crashes while trying to throw a NoSuchMethodError because it cannot find an expected Tensor constructor in the native code.

Tested on an M2 Mac using ONNX Runtime versions 1.10.0 through 1.17.0. Versions 1.14 through 1.17 crashed with the SIGSEGV, and versions 1.10 through 1.13 refused to load the ONNX model at all. We are testing the SegmentAnything (polygonal shape) and HAWP (line segment) detection models from here:

Urgency

No

Target platform

Java 21 ONNX on ARM-based Mac M2

Visual Studio Version

Apache Netbeans 19

GCC / Compiler Version

Java 21

@garthhenning garthhenning added the build build issues; typically submitted using template label Feb 13, 2024
@Craigacp
Contributor

Craigacp commented Feb 14, 2024

Have you got multiple versions of ONNX Runtime on the classpath? It sounds like it has loaded an incompatible version of the native library that conflicts with the class files that are loaded.

For the earlier versions which didn't load the model, is that due to an unsupported opset error? Those are quite old and segment anything is quite a new model.
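One way to check for a duplicate on the classpath from inside the application is to ask the class loader for every copy of a class file it can see. The sketch below is only an illustration: `ClasspathDuplicates` and `locations` are hypothetical names, not part of ONNX Runtime, and it assumes both conflicting jars ship the same `ai/onnxruntime/OrtEnvironment.class` resource.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of ONNX Runtime): lists every classpath copy of a resource. */
final class ClasspathDuplicates {

    private ClasspathDuplicates() {}

    /** Returns every location from which the named resource can be loaded. */
    static List<URL> locations(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            // getResources returns one URL per jar or directory that contains the resource.
            return new ArrayList<>(Collections.list(loader.getResources(resourceName)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `ClasspathDuplicates.locations("ai/onnxruntime/OrtEnvironment.class")` and printing the result should show one URL per jar; more than one entry suggests two ONNX Runtime artifacts are being loaded side by side. In a Maven build, `mvn dependency:tree` can then show which dependency pulls in the extra one.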

@Craigacp
Contributor

Running the following through jshell on an M1 MBP works fine for me. Can you check that it works in your environment?

$ jshell --class-path ~/.m2/repository/com/microsoft/onnxruntime/onnxruntime/1.17.0/onnxruntime-1.17.0.jar
|  Welcome to JShell -- Version 21.0.2
|  For an introduction type: /help intro

jshell> import ai.onnxruntime.*;

jshell> var env = OrtEnvironment.getEnvironment();
env ==> OrtEnvironment(name=ort-java,logLevel=ORT_LOGGING_LEVEL_WARNING,version=1.17.0)

jshell> var session = env.createSession("encoder-vit_b.quant.onnx")
session ==> OrtSession(numInputs=1,numOutputs=1)

jshell> var info = session.getInputInfo()
info ==> {x=NodeInfo(name=x,info=TensorInfo(javaType=FLOAT ... mNames=[batch,"","",""]))}

jshell> info
info ==> {x=NodeInfo(name=x,info=TensorInfo(javaType=FLOAT,onnxType=ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT,shape=[-1, 3, 1024, 1024],dimNames=[batch,"","",""]))}

Also it looks like that segment anything implementation is using multidimensional Java arrays for passing things back and forth, and that's going to be very slow. I recommend taking a look at the stable diffusion example which shows how to work with images using java.nio.Buffer instances - https://github.com/oracle/sd4j.
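As a rough illustration of the buffer-based approach, the sketch below packs a `BufferedImage` into a single direct `FloatBuffer` in NCHW order. It is not taken from sd4j: `ImageToBuffer` and `toNchw` are hypothetical names, and the scaling to [-1, 1] is an assumption that has to be checked against what the model expects.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical helper: packs an RGB image into one flat buffer of shape [1, 3, height, width]. */
final class ImageToBuffer {

    private ImageToBuffer() {}

    /** Returns a direct, native-order buffer with channel values scaled to [-1, 1] (an assumption). */
    static FloatBuffer toNchw(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int plane = width * height;
        FloatBuffer buffer = ByteBuffer.allocateDirect(3 * plane * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        // getRGB returns packed 0xAARRGGBB pixels in row-major order, whatever the image type.
        int[] argb = image.getRGB(0, 0, width, height, null, 0, width);
        for (int i = 0; i < plane; i++) {
            int pixel = argb[i];
            // Absolute puts leave the buffer position at 0, so it is ready to be read.
            buffer.put(i, scale((pixel >> 16) & 0xFF));         // red plane
            buffer.put(plane + i, scale((pixel >> 8) & 0xFF));  // green plane
            buffer.put(2 * plane + i, scale(pixel & 0xFF));     // blue plane
        }
        return buffer;
    }

    /** Maps an unsigned byte value in [0, 255] to a float in [-1, 1]. */
    private static float scale(int value) {
        return value / 127.5f - 1.0f;
    }
}
```

The resulting buffer can then be handed to `OnnxTensor.createTensor(env, buffer, new long[] {1, 3, height, width})` without building any nested Java arrays.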

@garthhenning
Author

Very good call. I only have an explicit dependency on onnxruntime 1.17.0 in my POM, but it turns out there was an extra transitive dependency lurking on onnxruntime-gpu 1.10.0. I traced it to my explicit dependency on Apache OpenNLP-DL version 2.1.0, which had a stray dependency on both onnxruntime and onnxruntime-gpu. I updated OpenNLP to version 2.3.2, which depends only on onnxruntime. Thank you for the suggestion to go check that!

Also, I'll pass this along to the developer working on that SAM interoperability library. He is working actively to get the rough code complete and this will be a good tip. Many of us are very excited about the ONNX runtime for Java and hungry for more such tips and guidance. Thank you.

@Craigacp
Contributor

The stable diffusion example is current as of ORT v1.14.0. I plan to update it to v1.17.0 when I get the chance, which will involve pinning the various buffers as inputs & outputs to minimize Java side allocations. At some point after that I'll do a write-up which walks through the design choices and considerations for efficient use in Java.

@garthhenning
Author

I know this is now drifting quickly off topic from my initial ONNX issue, but I just borrowed and adapted your code for BxCxHxW conversion per your suggestion. It seems like many models would benefit greatly from a solid, performant BxCxHxW to BufferedImage converter (and the inverse). HAWP and Segment Anything take BxCxHxW as inputs and Segment Anything and Stable Diffusion have BxCxHxW as outputs. I'm not sure if they all have pixel float values in the range [-1.0f,1.0f], though.

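import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.util.ArrayList;
import java.util.List;

public class BxCxHxWxImage {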

    private float[][][][] BxCxHxW;
    
    private float pixelrange = 2f;
    
    public BxCxHxWxImage(float[][][][] BxCxHxW) {
        this.BxCxHxW = BxCxHxW;
    }
    
    public BxCxHxWxImage(float[][][][] BxCxHxW, float pixelrange) {
        this.BxCxHxW = BxCxHxW;
        this.pixelrange = pixelrange;
    }
    
    public List<BufferedImage> toBufferedImages() {
        int batches = BxCxHxW.length;
        int channels = BxCxHxW[0].length;
        int height = BxCxHxW[0][0].length;
        int width = BxCxHxW[0][0][0].length;

        var output = new ArrayList<BufferedImage>();
        for (int b = 0; b < batches; b++) {
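            // With only 3 channels, TYPE_4BYTE_ABGR would leave alpha at 0 (fully transparent), so use an opaque type.
            int type = (channels == 4) ? BufferedImage.TYPE_4BYTE_ABGR : BufferedImage.TYPE_3BYTE_BGR;
            BufferedImage image = new BufferedImage(width, height, type);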
            WritableRaster raster = image.getRaster();

            for (var y = 0; y < height; y++) {
                for (var x = 0; x < width; x++) {
                    for (var c = 0; c<channels; c++) {
                        raster.setSample(x, y, c, convertValue(BxCxHxW[b][c][y][x]));                    
                    }
                }
            }
            output.add(image);
        }
        return output;
    }
    
    /**
     * Converts a colour value from the range [-pixelrange/2, pixelrange/2]
     * (by default [-1, 1]) to [0, 255], clamping values outside that range.
     * @param colourValue The colour value to convert.
     * @return The colour value as an int in [0, 255].
     */
    private int convertValue(float colourValue) {
        float scaled = (colourValue / pixelrange) + 0.5f;
        float clamped = Math.min(1.0f, Math.max(scaled, 0.0f));
        int round = Math.round(clamped * 255);
        return round;
    }    
}

@Craigacp
Contributor

Craigacp commented Feb 15, 2024

I agree it might be useful to have some utility routines for this kind of thing, particularly as channels first vs channels last is a consistent problem for people porting their Python image processing code to Java, but I want to minimize the number of entry points in this code which operate on Java arrays. A Java 4d array of size float[8][3][512][512] is actually a nested structure of 12,321 array objects, 12,288 of them separate 512-element float[] rows, rather than a single 24 MB region of memory like the numpy equivalent would be. The amount of pointer chasing and the poor cache behaviour of the arrays being scattered all over the heap means we don't want to encourage that for efficient numerical work. So if we did add such a thing it would operate on java.nio.Buffer objects, or possibly even the new MemorySegment type which is finalizing in Java 22.
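To make the contrast concrete, here is a minimal sketch of reading a contiguous NCHW buffer with plain index arithmetic instead of nested arrays. The names `BufferToImage`, `offset`, `toImage` and `toByte` are hypothetical, and the code assumes three channels with values in [-1, 1]; it is not an existing ONNX Runtime utility.

```java
import java.awt.image.BufferedImage;
import java.nio.FloatBuffer;

/** Hypothetical helper: reads one image out of a flat [batch, 3, height, width] buffer. */
final class BufferToImage {

    private BufferToImage() {}

    /** Flat offset of element [b][c][y][x] in a contiguous NCHW buffer. */
    static int offset(int b, int c, int y, int x, int channels, int height, int width) {
        return ((b * channels + c) * height + y) * width + x;
    }

    /** Converts image b of the batch to an opaque RGB BufferedImage. */
    static BufferedImage toImage(FloatBuffer nchw, int b, int height, int width) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Absolute gets: the buffer position is never modified.
                int r = toByte(nchw.get(offset(b, 0, y, x, 3, height, width)));
                int g = toByte(nchw.get(offset(b, 1, y, x, 3, height, width)));
                int bl = toByte(nchw.get(offset(b, 2, y, x, 3, height, width)));
                image.setRGB(x, y, (r << 16) | (g << 8) | bl);
            }
        }
        return image;
    }

    /** Clamps a value to [-1, 1] (assumed range) and maps it to [0, 255]. */
    static int toByte(float value) {
        float clamped = Math.min(1.0f, Math.max(-1.0f, value));
        return Math.round((clamped + 1.0f) * 127.5f);
    }
}
```

For the float[8][3][512][512] case, all 8 * 3 * 512 * 512 floats sit in one block and the last element is at offset 8 * 3 * 512 * 512 - 1, so iterating over a channel plane is a linear walk through memory.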

@garthhenning
Author

Absolutely! A single, performant implementation would be a huge benefit to the community so that everyone isn't rolling their own. Having the Java ONNX runtime is critical to Java, but good utilities that connect it to the rest of the Java ecosystem, such as BufferedImage, are also crucial. Most of us using ONNX are going to have our expertise elsewhere.

@skottmckay
Contributor

You could also consider adding the layout conversion to the model using the python tools in onnxruntime-extensions instead of implementing in Java.

There are a lot of examples (we need to simplify a little as they tend to be model driven), but basically you can add pre-processing steps to the model that handle NCHW to NHWC, normalization, centered crop/letterbox, etc.

This script handles a few models: https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/add_pre_post_processing_to_model.py

But the 'steps' can be assembled as needed.

e.g. something like this can do basic pre-processing of uint8 RGB input.

    steps = [
        ImageBytesToFloat(),  
        ChannelsLastToChannelsFirst(),
        Normalize([(0.5, 0.5)], layout="CHW"),  # normalize to range -1..1
        Unsqueeze([0])  # add batch dim
    ]

Post-processing can also be added (e.g. FloatToImageBytes to convert from float back to 0..255)

As they're added to the model as ONNX operators the ORT C++ implementation does the work, which should be very performant.

Some basic docs for each Step in https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/pre_post_processing/docs/.
Generally easiest to open each .md file in https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/pre_post_processing/docs/pre_post_processing/steps/.
The vision steps are probably most interesting given your context: https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/pre_post_processing/docs/pre_post_processing/steps/vision.md

@Craigacp
Contributor

Sure, adding the transform to the model will help, but something that gets the data out of a Java BufferedImage into an OnnxTensor is still useful. The stable diffusion port I wrote has its own ONNX image resize pipeline mirroring a CLIPImageProcessor for similar reasons - https://github.com/oracle/sd4j/blob/main/src/test/java/com/oracle/labs/mlrg/sd4j/ResizeModelGenerator.java#L104.

Contributor

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Mar 21, 2024