Fixes issue #535: fix hex-encoded 1-char tokens in ASR output. (#550)
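- Avoid output like `[' K', '<0x64>', '<0x79>', 'ť', ' a', '<0x75>',
  'to', 'bu', '<0x73>', '<0x75>', ... ]` with a regular (non-byte) BPE
  model of 500 units, where 1-char tokens such as `d` (0x64) were
  rewritten as hex bytes.
- Don't rewrite 1-char tokens in the printable ASCII range
  0x20 (space) .. 0x7E (tilde).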
KarelVesely84 authored Jan 26, 2024
1 parent e7b18a2 commit 3f2a17e
Showing 4 changed files with 13 additions and 5 deletions.
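
The rule from the commit message can be sketched as a standalone function. This is only an illustration: `TokenForOutput` is a hypothetical name that does not appear in the commit, and the sketch assumes the formatted `<0x..>` string replaces the token, which the visible hunks do not show. The condition and the formatting expression are copied from the diff.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the commit): returns the token string
// that results from applying the patched condition to `sym`.
std::string TokenForOutput(const std::string &sym) {
  // Only single-byte symbols outside printable ASCII (0x20..0x7e) are
  // treated as raw bytes coming from a byte-level BPE model.
  if (sym.size() == 1 && (sym[0] < 0x20 || sym[0] > 0x7e)) {
    std::ostringstream os;
    os << "<0x" << std::hex << std::uppercase
       << (static_cast<int32_t>(sym[0]) & 0xff) << ">";
    return os.str();
  }
  // Printable 1-char tokens and all multi-char tokens pass through unchanged.
  return sym;
}
```

With the previous condition (`sym[0] != ' '`), a plain `d` token became `<0x64>`, which is the artifact shown in the commit message. With the new condition it stays `d`, while a byte such as 0xC5 is still rendered as `<0xC5>`.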
4 changes: 3 additions & 1 deletion sherpa-onnx/csrc/offline-recognizer-ctc-impl.h
@@ -45,8 +45,10 @@ static OfflineRecognitionResult Convert(const OfflineCtcDecoderResult &src,
auto sym = sym_table[src.tokens[i]];
text.append(sym);

- if (sym.size() == 1 && sym[0] != ' ') {
+ if (sym.size() == 1 && (sym[0] < 0x20 || sym[0] > 0x7e)) {
  // for byte bpe models
+ // (but don't rewrite printable characters 0x20..0x7e,
+ // which collide with standard BPE units)
std::ostringstream os;
os << "<0x" << std::hex << std::uppercase
<< (static_cast<int32_t>(sym[0]) & 0xff) << ">";
6 changes: 4 additions & 2 deletions sherpa-onnx/csrc/offline-recognizer-transducer-impl.h
@@ -46,8 +46,10 @@ static OfflineRecognitionResult Convert(
auto sym = sym_table[i];
text.append(sym);

- if (sym.size() == 1 && sym[0] != ' ') {
- // for byte bpe models
+ if (sym.size() == 1 && (sym[0] < 0x20 || sym[0] > 0x7e)) {
+ // for byte bpe models,
+ // (but don't rewrite printable characters 0x20..0x7e,
+ // which collide with standard BPE units)
std::ostringstream os;
os << "<0x" << std::hex << std::uppercase
<< (static_cast<int32_t>(sym[0]) & 0xff) << ">";
4 changes: 3 additions & 1 deletion sherpa-onnx/csrc/online-recognizer-ctc-impl.h
@@ -38,8 +38,10 @@ static OnlineRecognizerResult Convert(const OnlineCtcDecoderResult &src,

r.text.append(sym);

- if (sym.size() == 1 && sym[0] != ' ') {
+ if (sym.size() == 1 && (sym[0] < 0x20 || sym[0] > 0x7e)) {
  // for byte bpe models
+ // (but don't rewrite printable characters 0x20..0x7e,
+ // which collide with standard BPE units)
std::ostringstream os;
os << "<0x" << std::hex << std::uppercase
<< (static_cast<int32_t>(sym[0]) & 0xff) << ">";
4 changes: 3 additions & 1 deletion sherpa-onnx/csrc/online-recognizer-transducer-impl.h
@@ -50,8 +50,10 @@ static OnlineRecognizerResult Convert(const OnlineTransducerDecoderResult &src,

r.text.append(sym);

- if (sym.size() == 1 && sym[0] != ' ') {
+ if (sym.size() == 1 && (sym[0] < 0x20 || sym[0] > 0x7e)) {
  // for byte bpe models
+ // (but don't rewrite printable characters 0x20..0x7e,
+ // which collide with standard BPE units)
std::ostringstream os;
os << "<0x" << std::hex << std::uppercase
<< (static_cast<int32_t>(sym[0]) & 0xff) << ">";
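
The same condition is repeated in all four files. One subtlety that the commit does not discuss: where `char` is signed, bytes 0x80..0xFF compare as negative values, so they are caught by the `sym[0] < 0x20` branch rather than by `sym[0] > 0x7e`. The result is the same either way. A possible way to make the intent explicit is sketched below. Both `IsPrintableAscii` and `NeedsHexRewrite` are hypothetical names, not part of sherpa-onnx.

```cpp
#include <string>

// Hypothetical helper: true when `c` is printable ASCII, 0x20..0x7e.
// Casting to unsigned char makes the range check independent of whether
// plain `char` is signed on the target platform.
bool IsPrintableAscii(char c) {
  const unsigned char u = static_cast<unsigned char>(c);
  return u >= 0x20 && u <= 0x7e;
}

// Hypothetical helper: same truth table as the condition in the diff,
// i.e. a 1-char symbol that is not printable ASCII.
bool NeedsHexRewrite(const std::string &sym) {
  return sym.size() == 1 && !IsPrintableAscii(sym[0]);
}
```

Each of the four `Convert` functions could then test `NeedsHexRewrite(sym)` instead of spelling out the range, assuming a shared header is an acceptable place for such a helper.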