
CoreDump bad_alloc while GetShape #19991

Closed

greyamber opened this issue Mar 20, 2024 · 3 comments

@greyamber commented Mar 20, 2024

Describe the issue

I migrated my code and model from CentOS to Ubuntu and noticed something strange:
on Ubuntu, reading the output information from the graph causes a core dump. After debugging, I found that when GetShape was called, the reported number of output (dynamic batch) dimensions became a very large number (94768101806096), whereas it should have been 2 (shape [-1, 1]).


#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffb201da859 in __GI_abort () at abort.c:79
#2  0x00007ffb205b28d1 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffb205be37c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffb205be3e7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffb205be699 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffb205b24e2 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00005630e9d7b368 in __gnu_cxx::new_allocator<long>::allocate (this=<synthetic pointer>,
    __n=<optimized out>) at /usr/include/c++/9/ext/new_allocator.h:102
#8  std::allocator_traits<std::allocator<long> >::allocate (__a=<synthetic pointer>...,
    __n=<optimized out>) at /usr/include/c++/9/bits/alloc_traits.h:443
#9  std::_Vector_base<long, std::allocator<long> >::_M_allocate (this=<synthetic pointer>,
    __n=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:343
#10 std::_Vector_base<long, std::allocator<long> >::_M_create_storage (__n=<optimized out>,
    this=<synthetic pointer>) at /usr/include/c++/9/bits/stl_vector.h:358
#11 std::_Vector_base<long, std::allocator<long> >::_Vector_base (__a=..., __n=<optimized out>,
    this=<synthetic pointer>) at /usr/include/c++/9/bits/stl_vector.h:302
#12 std::vector<long, std::allocator<long> >::vector (__a=..., __value=<optimized out>,
    __n=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/9/bits/stl_vector.h:521
#13 Ort::detail::TensorTypeAndShapeInfoImpl<Ort::detail::Unowned<OrtTensorTypeAndShapeInfo const> >::GetShape (this=<synthetic pointer>)
    at /root/cbd/predict-server/onnxruntime/include/onnxruntime_cxx_inline.h:1142


I can also load the model on the same machine with Python's onnx package; the model outputs:

import onnx
import sys

onnx_model = onnx.load(sys.argv[1])
graph = onnx_model.graph
#print(graph.input)
print(graph.output)

[name: "score"
type {
tensor_type {
elem_type: 1
shape {
dim {
dim_param: "Gemmscore_dim_0"
}
dim {
dim_value: 1
}
}
}
}
]

I can even run the model normally from C++ if I skip checking the output shape.
Can anyone help?

To reproduce

model:
model_onnx.zip

CMakeLists.txt

cmake_minimum_required(VERSION 3.15)
project(predict_server)
set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib CACHE STRING "")
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib CACHE STRING "")
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin CACHE STRING "")
set(CMAKE_CXX_STANDARD 17)
set(BUILD_SHARED_LIBS OFF)
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb -Wall")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb")

find_path(GFLAGS_INCLUDE_PATH gflags/gflags.h)
find_library(GFLAGS_LIBRARY NAMES gflags libgflags)
if((NOT GFLAGS_INCLUDE_PATH) OR (NOT GFLAGS_LIBRARY))
    message(FATAL_ERROR "Fail to find gflags")
endif()
include_directories(${GFLAGS_INCLUDE_PATH})

execute_process(
    COMMAND bash -c "grep \"namespace [_A-Za-z0-9]\\+ {\" ${GFLAGS_INCLUDE_PATH}/gflags/gflags_declare.h | head -1 | awk '{print $2}' | tr -d '\n'"
    OUTPUT_VARIABLE GFLAGS_NS
)
if(${GFLAGS_NS} STREQUAL "GFLAGS_NAMESPACE")
    execute_process(
        COMMAND bash -c "grep \"#define GFLAGS_NAMESPACE [_A-Za-z0-9]\\+\" ${GFLAGS_INCLUDE_PATH}/gflags/gflags_declare.h | head -1 | awk '{print $3}' | tr -d '\n'"
        OUTPUT_VARIABLE GFLAGS_NS
    )
endif()
set(CMAKE_CPP_FLAGS "${DEFINE_CLOCK_GETTIME} -DGFLAGS_NS=${GFLAGS_NS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CPP_FLAGS} -DNDEBUG -O2 -D__const__=__unused__ -pipe -W -Wall -Wno-unused-parameter -fPIC -fno-omit-frame-pointer")

find_path(LEVELDB_INCLUDE_PATH NAMES leveldb/db.h)
find_library(LEVELDB_LIB NAMES leveldb)
if ((NOT LEVELDB_INCLUDE_PATH) OR (NOT LEVELDB_LIB))
    message(FATAL_ERROR "Fail to find leveldb")
endif()
include_directories(${LEVELDB_INCLUDE_PATH})

find_package(OpenSSL)
include_directories(${OPENSSL_INCLUDE_DIR})

set(DYNAMIC_LIB
    ${CMAKE_THREAD_LIBS_INIT}
    ${GFLAGS_LIBRARY}
    ${LEVELDB_LIB}
    ${OPENSSL_CRYPTO_LIBRARY}
    ${OPENSSL_SSL_LIBRARY}
    ${THRIFT_LIB}
    z
    dl
    )

#onnx
include_directories(${PROJECT_SOURCE_DIR}/onnxruntime/include)
link_directories(${PROJECT_SOURCE_DIR}/onnxruntime/lib)
file(GLOB ONNX_SO
    ${PROJECT_SOURCE_DIR}/onnxruntime/lib/libonnxruntime.*
    )

add_executable("test_bin" ${PROJECT_SOURCE_DIR}/test.cpp)
target_link_options(test_bin PRIVATE "-lz")
target_compile_options(test_bin PRIVATE "-pthread")
target_link_libraries(test_bin
                      onnxruntime
)

test.cpp

#include <memory>
#include <vector>
#include <unordered_map>
#include <iostream>
#include "onnxruntime_cxx_api.h"

int main() {
    const std::string model_path = "./model.onnx";
    uint32_t threads_num = 4;
    Ort::Env env;
    Ort::SessionOptions session_option;
    Ort::AllocatorWithDefaultOptions allocator;
    session_option.SetIntraOpNumThreads(threads_num);
    session_option.SetGraphOptimizationLevel(ORT_ENABLE_ALL);
    auto session = std::make_shared<Ort::Session>(env, model_path.c_str(), session_option);
    if (session == nullptr || session->GetOutputCount() == 0) {
        std::cout << "[debug] session init error" << std::endl;
        return -1;
    }
    size_t outputCount = session->GetOutputCount();
    for (size_t i = 0; i < outputCount; ++i) {
        std::shared_ptr<char> output_name = session->GetOutputNameAllocated(i, allocator);
        std::string name(output_name.get());
        std::cout << "[debug][output]" << name << " " << std::to_string(i) << std::endl;
        auto type_and_shape = session->GetOutputTypeInfo(i).GetTensorTypeAndShapeInfo();


        // core dump here!!!! I can even run the model normally from C++ if I skip GetShape.
        auto shape = type_and_shape.GetShape();
        //


        if (shape.size() <= 0 or shape[0] != -1) {
            std::cout << "[debug] batch dim is not -1" << std::endl;
            return -2;
        }
    }
    return 0;
}


Urgency

No response

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

@baijumeswani
Contributor
It is hard to tell what the problem might be without the onnx models. Are you able to share the onnx model?

Also, is my understanding correct that the execution on CentOS was ok, but as soon as you moved to Ubuntu, you started noticing this failure?

@greyamber
Author

> It is hard to tell what the problem might be without the onnx models. Are you able to share the onnx model?
>
> Also, is my understanding correct that the execution on CentOS was ok, but as soon as you moved to Ubuntu, you started noticing this failure?

Thanks!
Python & C++ on CentOS are OK.
Python on Ubuntu is OK.
BUT C++ on Ubuntu is NOT.
I uploaded a demo model, CMakeLists.txt, and test.cpp.

@greyamber
Author

I solved this problem with gdb single-step debugging. The cause was that I hadn't noticed that GetOutputTypeInfo returns a temporary object, not a reference, so the later GetShape() call accessed memory that had already been released.
Running successfully on CentOS was just a coincidence (undefined behavior).
The right way:

auto oti = session->GetOutputTypeInfo(i);
auto type_and_shape = oti.GetTensorTypeAndShapeInfo();
auto shape = type_and_shape.GetShape();
