Export pack_strings() and unpack_strings() #2

Merged
@@ -66,6 +66,8 @@ target_include_directories(${TARGET_NAME} PRIVATE
"${CMAKE_BINARY_DIR}/third_party/dart/src/extern_dart/include/"
"${CMAKE_BINARY_DIR}/third_party/install/re2/include/")

target_include_directories(${TARGET_NAME} PUBLIC .)

Typically, OV extensions are not expected to provide any extra C++ API.

Author

It's @slyalin's request to call pack_strings() from OV extensions.

Please bear in mind that we are going to remove pack_strings/unpack_strings once string support is added to OV, although we are not certain when that will happen. I think it makes sense to export them now so that we have a functional distribution and a single point of reference (people have started copying this function from one place to another because we do not distribute these methods). I hope it won't cross an OV release boundary.


if(CMAKE_CL_64)
target_compile_definitions(sentencepiece-static PRIVATE _CRT_SECURE_NO_WARNINGS _SCL_SECURE_NO_WARNINGS)
endif()
35 changes: 0 additions & 35 deletions modules/custom_operations/user_ie_extensions/tokenizer/utils.cpp
@@ -223,41 +223,6 @@ std::shared_ptr<Node> string_attribute_to_constant (const ov::frontend::NodeCont
#endif
}


// Pack any container with string to ov::Tensor with element type u8
// Requirements for BatchOfStrings: .size() with size and .begin(), .end() as iterators, elements with .begin(), .end() and .length()
// so basically any STL container with std::string is compatible
// Tensor destination will be reshaped according the input data
template <typename BatchOfStrings>
void pack_strings (const BatchOfStrings& strings, ov::Tensor& destination) {
auto batch_size = strings.size();

// First run over all elements: calculate total memory required to hold all strings
auto symbols_size = std::accumulate(
strings.begin(), strings.end(), size_t(0),
[](size_t accum, typename BatchOfStrings::const_reference s)
{ return accum + s.length(); });

auto total_size = 4*(1 + 1 + batch_size) + symbols_size;
destination.set_shape({total_size});

auto data = destination.data<uint8_t>();
auto pbatch_size = reinterpret_cast<int32_t*>(data);
auto pindices = pbatch_size + 1;
auto psymbols = reinterpret_cast<char*>(pindices + 1 + batch_size);
size_t current_symbols_pos = 0;

*pbatch_size = batch_size;
*pindices = 0;

for(auto s: strings) {
psymbols = std::copy(s.begin(), s.end(), psymbols);
current_symbols_pos += s.length();
*++pindices = current_symbols_pos;
}
}


std::vector<std::string> unpack_strings (const ov::Tensor& source) {
auto strings = source.data<const uint8_t>();
auto length = source.get_byte_size();
32 changes: 31 additions & 1 deletion modules/custom_operations/user_ie_extensions/tokenizer/utils.hpp
@@ -69,7 +69,37 @@ bool evaluate_normalization_helper (

std::shared_ptr<ov::Node> string_attribute_to_constant (const ov::frontend::NodeContext& node, const std::string& name);

// Pack any container with string to ov::Tensor with element type u8
// Requirements for BatchOfStrings: .size() with size and .begin(), .end() as iterators, elements with .begin(), .end() and .length()
// so basically any STL container with std::string is compatible
// Tensor destination will be reshaped according the input data
template <typename BatchOfStrings>
void pack_strings (const BatchOfStrings& strings, ov::Tensor& destination);
void pack_strings (const BatchOfStrings& strings, ov::Tensor& destination) {
auto batch_size = strings.size();

// First run over all elements: calculate total memory required to hold all strings
auto symbols_size = std::accumulate(
strings.begin(), strings.end(), size_t(0),
[](size_t accum, typename BatchOfStrings::const_reference s)
{ return accum + s.length(); });

auto total_size = 4*(1 + 1 + batch_size) + symbols_size;
destination.set_shape({total_size});

auto data = destination.data<uint8_t>();
auto pbatch_size = reinterpret_cast<int32_t*>(data);
auto pindices = pbatch_size + 1;
auto psymbols = reinterpret_cast<char*>(pindices + 1 + batch_size);
size_t current_symbols_pos = 0;

*pbatch_size = batch_size;
*pindices = 0;

for(auto s: strings) {
psymbols = std::copy(s.begin(), s.end(), psymbols);
current_symbols_pos += s.length();
*++pindices = current_symbols_pos;
}
}

std::vector<std::string> unpack_strings(const ov::Tensor& source);
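
For readers who want to see the byte layout that pack_strings() produces, here is a minimal standalone sketch. It is not the code from this PR. It writes into a std::vector<uint8_t> instead of an ov::Tensor so that it compiles without OpenVINO, and the names pack_to_bytes and unpack_from_bytes are hypothetical. The layout follows the diff above: one int32 batch size, then batch_size + 1 int32 offsets, then the concatenated string bytes. The body of unpack_strings() is not shown in this diff, so the unpacking half is an assumption that mirrors the packed layout and does no validation beyond a size check on the header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for pack_strings(): same layout, plain byte vector as destination.
std::vector<uint8_t> pack_to_bytes(const std::vector<std::string>& strings) {
    const int32_t batch_size = static_cast<int32_t>(strings.size());

    // First pass: total number of bytes needed for all string contents.
    const size_t symbols_size = std::accumulate(
        strings.begin(), strings.end(), size_t(0),
        [](size_t accum, const std::string& s) { return accum + s.length(); });

    // Header: 1 int32 (batch size) + (batch_size + 1) int32 offsets.
    const size_t header_size = sizeof(int32_t) * (1 + 1 + strings.size());
    std::vector<uint8_t> buffer(header_size + symbols_size);

    std::memcpy(buffer.data(), &batch_size, sizeof(batch_size));

    int32_t offset = 0;                  // running end offset into the symbols area
    size_t index_pos = sizeof(int32_t);  // byte position of the next offset slot
    size_t symbol_pos = header_size;     // byte position of the next string
    std::memcpy(buffer.data() + index_pos, &offset, sizeof(offset));

    for (const auto& s : strings) {
        std::copy(s.begin(), s.end(), buffer.begin() + symbol_pos);
        symbol_pos += s.size();
        offset += static_cast<int32_t>(s.size());
        index_pos += sizeof(int32_t);
        std::memcpy(buffer.data() + index_pos, &offset, sizeof(offset));
    }
    return buffer;
}

// Assumed inverse of the layout above; the real unpack_strings() may differ.
std::vector<std::string> unpack_from_bytes(const std::vector<uint8_t>& buffer) {
    assert(buffer.size() >= 2 * sizeof(int32_t));

    int32_t batch_size = 0;
    std::memcpy(&batch_size, buffer.data(), sizeof(batch_size));
    const size_t header_size = sizeof(int32_t) * (1 + 1 + batch_size);
    assert(buffer.size() >= header_size);

    std::vector<std::string> result;
    result.reserve(batch_size);
    for (int32_t i = 0; i < batch_size; ++i) {
        int32_t begin = 0;
        int32_t end = 0;
        std::memcpy(&begin, buffer.data() + sizeof(int32_t) * (1 + i), sizeof(begin));
        std::memcpy(&end, buffer.data() + sizeof(int32_t) * (2 + i), sizeof(end));
        result.emplace_back(buffer.begin() + header_size + begin,
                            buffer.begin() + header_size + end);
    }
    return result;
}
```

Storing batch_size + 1 offsets means each string's length is the difference between two neighbouring entries, so empty strings need no special handling.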