
extend german normalizer #59

Merged: 9 commits into dev from add/extended_ger_normalizing
Aug 21, 2023

Conversation

Member

@emphasize emphasize commented Jun 3, 2023

Problem:

With Whisper in use (I assume most German users were on Google until now), we get a lot of hyphenated word combinations like "15-Minuten-Timer".

This is unexpected behaviour, and it causes a problem in the intent stage that can't be dealt with afterwards.

Will be brought to the classifiers after review.


codecov bot commented Jun 3, 2023

Codecov Report

❗ No coverage uploaded for pull request base (dev@d8596ee).
The diff coverage is n/a.

@@          Coverage Diff          @@
##             dev     #59   +/-   ##
=====================================
  Coverage       ?   0.00%           
=====================================
  Files          ?      69           
  Lines          ?   18332           
  Branches       ?       0           
=====================================
  Hits           ?       0           
  Misses         ?   18332           
  Partials       ?       0           

@@ -1156,6 +1156,14 @@ class GermanNormalizer(Normalizer):
    with open(resolve_resource_file("text/de-de/normalize.json")) as f:
        _default_config = json.load(f)

    def remove_symbols(self, utterance):
Member


This method should go into the base normalizer and get a flag for all languages in the .json.

The more performant way would be:

import string

text = "15-minute-timer"
# Build a translation table that deletes every ASCII punctuation character.
translating = str.maketrans('', '', string.punctuation)
new_string = text.translate(translating)

Member Author

@emphasize emphasize Jun 5, 2023


str.maketrans(''.join(symbols), len(symbols) * " ")
Gotcha (although I have doubts about _/=).

Will add this tomorrow.
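
For readers following the thread, here is a minimal sketch of the space-mapping variant discussed above. It assumes `symbols` is a list of single characters; the `SYMBOLS` list and the whitespace cleanup are illustrative assumptions, not the code merged in this PR.

```python
import string

# Hypothetical symbol list for illustration; the list used in the PR is not shown here.
SYMBOLS = list(string.punctuation)

# Map every symbol to a space so that hyphenated compounds split into words.
# str.maketrans with two arguments requires both strings to have equal length.
_TABLE = str.maketrans("".join(SYMBOLS), " " * len(SYMBOLS))


def remove_symbols(utterance: str) -> str:
    """Replace each symbol with a space, then collapse repeated whitespace."""
    return " ".join(utterance.translate(_TABLE).split())


print(remove_symbols("15-Minuten-Timer"))  # 15 Minuten Timer
```

Mapping to a space rather than deleting keeps "15-Minuten-Timer" from collapsing into "15MinutenTimer". The same table also turns `_`, `/` and `=` into spaces, which may be where the doubts about those characters come from.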

Member Author

emphasize commented Aug 21, 2023

Rebased, adjusted to the latest developments, and set remove_symbols: true for every language (in normalize.json, where possible).
Changed the default in the Normalizer class to True:

@property
def should_remove_symbols(self):
    return self.config.get("remove_symbols", True)
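
A rough sketch of how such a flag could gate the step, under stated assumptions: the `Normalizer` class below is a hypothetical stand-in (not the library's real base class), and its `normalize` and `remove_symbols` bodies are invented for illustration. Only the `should_remove_symbols` property mirrors the snippet above.

```python
class Normalizer:
    """Hypothetical stand-in for the base normalizer; only the flag logic is sketched."""

    def __init__(self, config=None):
        # In the real project the config comes from a per-language normalize.json.
        self.config = config or {}

    @property
    def should_remove_symbols(self):
        # Defaults to True when the key is missing from the config.
        return self.config.get("remove_symbols", True)

    def remove_symbols(self, utterance):
        # Illustrative symbol set (assumption); each symbol becomes a space.
        table = str.maketrans("-_/=", "    ")
        return " ".join(utterance.translate(table).split())

    def normalize(self, utterance):
        if self.should_remove_symbols:
            utterance = self.remove_symbols(utterance)
        return utterance


print(Normalizer().normalize("15-Minuten-Timer"))  # 15 Minuten Timer
print(Normalizer({"remove_symbols": False}).normalize("15-Minuten-Timer"))  # 15-Minuten-Timer
```

With a default of True, a language opts out only by setting the key to false explicitly in its config.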

Replaced the regex pattern for German hyphen splitting with a less error-prone one.

Should be ready to go.
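
The merged pattern is not quoted in this conversation. As an illustration only, one conservative way to split hyphenated compounds is to treat a hyphen as a joiner only when it sits directly between two word characters; the pattern and the `split_hyphenated` name below are assumptions, not the PR's code.

```python
import re

# Assumed pattern, NOT the one merged in this PR: match a hyphen only when
# it is immediately preceded and followed by a word character.
_HYPHEN_JOINER = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(utterance: str) -> str:
    """Replace joining hyphens with spaces and leave other hyphens alone."""
    return _HYPHEN_JOINER.sub(" ", utterance)


print(split_hyphenated("15-Minuten-Timer"))  # 15 Minuten Timer
print(split_hyphenated("minus -5 Grad"))     # minus -5 Grad
```

Because the lookarounds require word characters on both sides, a leading minus sign or a free-standing dash is left untouched, which is one kind of error a plain replacement of every "-" would make.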

@JarbasAl JarbasAl merged commit 711c9e2 into dev Aug 21, 2023
10 checks passed
@JarbasAl JarbasAl deleted the add/extended_ger_normalizing branch August 21, 2023 21:48
@JarbasAl JarbasAl added the bug and enhancement labels Aug 21, 2023