Don't clobber a token's text in the event only a single Word is created for a supposedly MWT Token. This came up while training the Albanian MWT processor
AngledLuffa committed Nov 18, 2024
1 parent f534d73 commit 215c69e
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions stanza/models/common/doc.py
@@ -366,6 +366,12 @@ def set_mwt_expansions(self, expansions,
                         word.id = idx_w
                 elif perform_mwt_processing == MWTProcessingType.PROCESS:
                     expanded = [x for x in expansions[idx_e].split(' ') if len(x) > 0]
+                    # in the event the MWT annotator only split the
+                    # Token into a single Word, we preserve its text
+                    # otherwise the Token's text is different from its
+                    # only Word's text
+                    if len(expanded) == 1:
+                        expanded = [token.text]
                     idx_e += 1
                     idx_w_end = idx_w + len(expanded) - 1
                     if token.misc: # None can happen when using a prebuilt doc
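
The effect of the added guard can be seen in isolation with a minimal sketch. The helper name `expand_token` and the input strings are hypothetical, chosen only for illustration; this is not Stanza's API, just the same split-and-guard logic from the hunk above lifted into a standalone function.

```python
from typing import List


def expand_token(token_text: str, expansion: str) -> List[str]:
    """Return the Word texts for one Token (hypothetical helper, not part of Stanza).

    `expansion` stands in for the space-separated string predicted by the
    MWT model for this token.
    """
    # Same split as in the diff: drop empty pieces left by repeated spaces.
    expanded = [x for x in expansion.split(' ') if len(x) > 0]
    # If the model produced exactly one piece, keep the Token's own text,
    # so the Token and its only Word cannot disagree.
    if len(expanded) == 1:
        expanded = [token_text]
    return expanded


# Illustrative inputs only: a real two-way split is left untouched ...
print(expand_token("del", "de el"))
# ... while a one-piece "expansion" falls back to the original token text.
print(expand_token("Cafe", "cafe"))
```

With the guard in place, a model that rewrites a token without actually splitting it (here, changing its casing) no longer changes the resulting Word's text.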