
CoreDump bad_alloc while GetShape #19991

Closed

greyamber opened this issue Mar 20, 2024 · 3 comments

@greyamber commented Mar 20, 2024

Describe the issue

I migrated my code and model from CentOS to Ubuntu and noticed something strange:
on Ubuntu, reading the output information from the graph causes a core dump. After debugging, I found that when GetShape was called, the reported number of output (dynamic batch) dimensions became a very large number (94768101806096), whereas it should have been 2 (shape [-1, 1]).


#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffb201da859 in __GI_abort () at abort.c:79
#2  0x00007ffb205b28d1 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffb205be37c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffb205be3e7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffb205be699 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffb205b24e2 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00005630e9d7b368 in __gnu_cxx::new_allocator<long>::allocate (this=<synthetic pointer>,
    __n=<optimized out>) at /usr/include/c++/9/ext/new_allocator.h:102
#8  std::allocator_traits<std::allocator<long> >::allocate (__a=<synthetic pointer>...,
    __n=<optimized out>) at /usr/include/c++/9/bits/alloc_traits.h:443
#9  std::_Vector_base<long, std::allocator<long> >::_M_allocate (this=<synthetic pointer>,
    __n=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:343
#10 std::_Vector_base<long, std::allocator<long> >::_M_create_storage (__n=<optimized out>,
    this=<synthetic pointer>) at /usr/include/c++/9/bits/stl_vector.h:358
#11 std::_Vector_base<long, std::allocator<long> >::_Vector_base (__a=..., __n=<optimized out>,
    this=<synthetic pointer>) at /usr/include/c++/9/bits/stl_vector.h:302
#12 std::vector<long, std::allocator<long> >::vector (__a=..., __value=<optimized out>,
    __n=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/9/bits/stl_vector.h:521
#13 Ort::detail::TensorTypeAndShapeInfoImpl<Ort::detail::Unowned<OrtTensorTypeAndShapeInfo const> >::GetShape (this=<synthetic pointer>)
    at /root/cbd/predict-server/onnxruntime/include/onnxruntime_cxx_inline.h:1142


I can also load the model on the same machine with Python's onnx package; the model outputs:

import onnx
import sys

onnx_model = onnx.load(sys.argv[1])
graph = onnx_model.graph
#print(graph.input)
print(graph.output)

[name: "score"
type {
tensor_type {
elem_type: 1
shape {
dim {
dim_param: "Gemmscore_dim_0"
}
dim {
dim_value: 1
}
}
}
}
]

I can even run the model normally from C++ if I skip checking the output shape.
Can anyone help?

To reproduce

model:
model_onnx.zip

CMakeLists.txt

cmake_minimum_required(VERSION 3.15)
project(predict_server)
set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib CACHE STRING "")
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib CACHE STRING "")
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin CACHE STRING "")
set(CMAKE_CXX_STANDARD 17)
set(BUILD_SHARED_LIBS OFF)
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb -Wall")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb")

find_path(GFLAGS_INCLUDE_PATH gflags/gflags.h)
find_library(GFLAGS_LIBRARY NAMES gflags libgflags)
if((NOT GFLAGS_INCLUDE_PATH) OR (NOT GFLAGS_LIBRARY))
    message(FATAL_ERROR "Fail to find gflags")
endif()
include_directories(${GFLAGS_INCLUDE_PATH})

execute_process(
    COMMAND bash -c "grep \"namespace [_A-Za-z0-9]\\+ {\" ${GFLAGS_INCLUDE_PATH}/gflags/gflags_declare.h | head -1 | awk '{print $2}' | tr -d '\n'"
    OUTPUT_VARIABLE GFLAGS_NS
)
if(${GFLAGS_NS} STREQUAL "GFLAGS_NAMESPACE")
    execute_process(
        COMMAND bash -c "grep \"#define GFLAGS_NAMESPACE [_A-Za-z0-9]\\+\" ${GFLAGS_INCLUDE_PATH}/gflags/gflags_declare.h | head -1 | awk '{print $3}' | tr -d '\n'"
        OUTPUT_VARIABLE GFLAGS_NS
    )
endif()
set(CMAKE_CPP_FLAGS "${DEFINE_CLOCK_GETTIME} -DGFLAGS_NS=${GFLAGS_NS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CPP_FLAGS} -DNDEBUG -O2 -D__const__=__unused__ -pipe -W -Wall -Wno-unused-parameter -fPIC -fno-omit-frame-pointer")

find_path(LEVELDB_INCLUDE_PATH NAMES leveldb/db.h)
find_library(LEVELDB_LIB NAMES leveldb)
if ((NOT LEVELDB_INCLUDE_PATH) OR (NOT LEVELDB_LIB))
    message(FATAL_ERROR "Fail to find leveldb")
endif()
include_directories(${LEVELDB_INCLUDE_PATH})

find_package(OpenSSL)
include_directories(${OPENSSL_INCLUDE_DIR})

set(DYNAMIC_LIB
    ${CMAKE_THREAD_LIBS_INIT}
    ${GFLAGS_LIBRARY}
    ${LEVELDB_LIB}
    ${OPENSSL_CRYPTO_LIBRARY}
    ${OPENSSL_SSL_LIBRARY}
    ${THRIFT_LIB}
    z
    dl
    )

#onnx
include_directories(${PROJECT_SOURCE_DIR}/onnxruntime/include)
link_directories(${PROJECT_SOURCE_DIR}/onnxruntime/lib)
file(GLOB ONNX_SO
    ${PROJECT_SOURCE_DIR}/onnxruntime/lib/libonnxruntime.*
    )

add_executable("test_bin" ${PROJECT_SOURCE_DIR}/test.cpp)
target_link_options(test_bin PRIVATE "-lz")
target_compile_options(test_bin PRIVATE "-pthread")
target_link_libraries(test_bin
                      onnxruntime
)

test.cpp

#include <memory>
#include <vector>
#include <unordered_map>
#include <iostream>
#include "onnxruntime_cxx_api.h"

int main() {
    const std::string model_path = "./model.onnx";
    uint32_t threads_num = 4;
    Ort::Env env;
    Ort::SessionOptions session_option;
    Ort::AllocatorWithDefaultOptions allocator;
    session_option.SetIntraOpNumThreads(threads_num);
    session_option.SetGraphOptimizationLevel(ORT_ENABLE_ALL);
    auto session = std::make_shared<Ort::Session>(env, model_path.c_str(), session_option);
    if (session == nullptr || session->GetOutputCount() == 0) {
        std::cout << "[debug] session init error" << std::endl;
        return -1;
    }
    size_t outputCount = session->GetOutputCount();
    for (size_t i = 0; i < outputCount; ++i) {
        std::shared_ptr<char> output_name = session->GetOutputNameAllocated(i, allocator);
        std::string name(output_name.get());
        std::cout << "[debug][output]" << name << " " << std::to_string(i) << std::endl;
        auto type_and_shape = session->GetOutputTypeInfo(i).GetTensorTypeAndShapeInfo();


        // core dump here!!!! I can even run the model normally from C++ if I skip GetShape.
        auto shape = type_and_shape.GetShape();
        //


        if (shape.size() <= 0 or shape[0] != -1) {
            std::cout << "[debug] batch dim is not -1" << std::endl;
            return -2;
        }
    }
    return 0;
}


Urgency

No response

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

@baijumeswani
Contributor
It is hard to tell what the problem might be without the onnx models. Are you able to share the onnx model?

Also, is my understanding correct that the execution on CentOS was ok, but as soon as you moved to Ubuntu, you started noticing this failure?

@greyamber
Author

> It is hard to tell what the problem might be without the onnx models. Are you able to share the onnx model?
>
> Also, is my understanding correct that the execution on CentOS was ok, but as soon as you moved to Ubuntu, you started noticing this failure?

Thanks!
Python & C++ on CentOS are OK.
Python on Ubuntu is OK.
BUT C++ on Ubuntu is NOT.
I uploaded a demo model, CMakeLists.txt, and test.cpp.

@greyamber
Author

I solved this problem with gdb single-step debugging. The cause was that I hadn't noticed that GetOutputTypeInfo returns a temporary object, not a reference, so the later GetShape() call accessed memory that had already been released.
Running successfully on CentOS was just a coincidence (undefined behavior).
The right way:

auto oti = session->GetOutputTypeInfo(i);
auto type_and_shape = oti.GetTensorTypeAndShapeInfo();
auto shape = type_and_shape.GetShape();
