extend german normalizer #59

emphasize · 2023-06-03T12:13:16Z

Problem:

With Whisper (I assume most German users relied on Google until now) we get a lot of hyphenated word combinations such as "15-Minuten-Timer".

This is unexpected behaviour and causes problems at the intent stage, where it can no longer be corrected.

Will be brought to classifiers after review.
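For illustration, a minimal sketch of the kind of hyphen splitting the problem calls for. The pattern and the helper name `split_hyphenated` are assumptions made for this example, not the code in this PR:

```python
import re

# Hypothetical pattern: a hyphen that has a word character on both sides.
_HYPHEN_BETWEEN_WORDS = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(utterance: str) -> str:
    """Replace hyphens inside word combinations with spaces.

    Hyphens that are not surrounded by word characters (for example a
    leading minus sign) are left untouched.
    """
    return _HYPHEN_BETWEEN_WORDS.sub(" ", utterance)


print(split_hyphenated("stelle einen 15-Minuten-Timer"))
# stelle einen 15 Minuten Timer
```

Splitting into separate tokens, rather than deleting the hyphen, keeps "15", "Minuten" and "Timer" available to the number and intent parsers.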

codecov · 2023-06-03T12:16:04Z

Codecov Report

❗ No coverage uploaded for pull request base (dev@d8596ee).
The diff coverage is n/a.

@@          Coverage Diff          @@
##             dev     #59   +/-   ##
=====================================
  Coverage       ?   0.00%           
=====================================
  Files          ?      69           
  Lines          ?   18332           
  Branches       ?       0           
=====================================
  Hits           ?       0           
  Misses         ?   18332           
  Partials       ?       0

JarbasAl · 2023-06-03T13:47:03Z

lingua_franca/lang/parse_de.py

@@ -1156,6 +1156,14 @@ class GermanNormalizer(Normalizer):
 with open(resolve_resource_file("text/de-de/normalize.json")) as f:
 _default_config = json.load(f)

+ def remove_symbols(self, utterance):


This method should go into the base normalizer, with a flag in each language's .json to enable it.

A more performant way would be:

 import string

 text = "15-minute-timer"
 translating = str.maketrans('', '', string.punctuation)
 new_string = text.translate(translating)

emphasize · 2023-06-05

~~str.maketrans(''.join(symbols), len(symbols)*" ")~~
Gotcha (although I have doubts about _/=).

Will add this tomorrow.
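To make the two variants from this thread concrete, here is a small sketch using only the standard library. The `symbols` string is a placeholder chosen for the example; the real symbol list is not shown in this thread:

```python
import string

text = "15-minute-timer"

# Variant 1: the third argument of str.maketrans lists characters to delete.
# Deleting the hyphen joins the parts into a single token.
delete_table = str.maketrans("", "", string.punctuation)
print(text.translate(delete_table))  # 15minutetimer

# Variant 2: map every symbol to a space, which keeps the word boundaries.
# The two-argument form needs both strings to have the same length.
symbols = "-"  # placeholder symbol list, assumed for this example
space_table = str.maketrans(symbols, " " * len(symbols))
print(text.translate(space_table))  # 15 minute timer

# string.punctuation also covers the characters doubted above.
print(all(char in string.punctuation for char in "_/="))  # True
```

Note that the deletion variant turns "15-minute-timer" into one word, so whether symbols are deleted or replaced by spaces matters for the hyphen case this PR is about.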

emphasize · 2023-08-21T21:45:40Z

Rebased, adjusted to the latest developments, and set remove_symbols: true in every language's normalize.json (where possible).
Changed the default in the Normalizer class to True:

ovos-lingua-franca/lingua_franca/lang/parse_common.py

Lines 52 to 54 in f16bb4d

    @property
    def should_remove_symbols(self):
        return self.config.get("remove_symbols", True)
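As a rough sketch of how such a flag could gate the symbol removal: `SketchNormalizer` and its `normalize` method are hypothetical stand-ins written for this example, and only the property body is taken from the snippet above.

```python
import string


class SketchNormalizer:
    """Hypothetical stand-in for the base Normalizer, not the library class."""

    def __init__(self, config=None):
        # In the library the config comes from normalize.json; here it is
        # passed in directly as a dict (an assumption for this sketch).
        self.config = config or {}

    @property
    def should_remove_symbols(self):
        # Same lookup as the snippet above: enabled unless switched off.
        return self.config.get("remove_symbols", True)

    def normalize(self, utterance):
        if self.should_remove_symbols:
            table = str.maketrans("", "", string.punctuation)
            utterance = utterance.translate(table)
        return utterance


print(SketchNormalizer().normalize("hello, world!"))
# hello world
print(SketchNormalizer({"remove_symbols": False}).normalize("hello, world!"))
# hello, world!
```

With the default set to True, a language only needs an entry in its normalize.json when it wants to opt out.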

Replaced the regex pattern for German hyphen splitting with a less error-prone one.

Should be ready to go.

extend german normalizer

061fb5f

emphasize requested review from ChanceNCounter and JarbasAl June 3, 2023 12:15

JarbasAl reviewed Jun 3, 2023

emphasize and others added 8 commits June 4, 2023 14:32

add symbols

c1c356f

fix nice date utils (#60)

e87ffa4

* fix nice date utils * remove doubled import * added tests

Increment Version

cd13bb5

move to quebra tokenizer (#61)

bb4b02c

Increment Version

5367c5c

add symbols

61fec0b

Merge branch 'dev' into add/extended_ger_normalizing

f16bb4d

replace regex

1ba2581

emphasize requested a review from JarbasAl August 21, 2023 21:48

JarbasAl merged commit 711c9e2 into dev Aug 21, 2023
10 checks passed

JarbasAl deleted the add/extended_ger_normalizing branch August 21, 2023 21:48

JarbasAl added the bug and enhancement labels Aug 21, 2023
