A few Terms bugs in our huge corpus #457

Open
jan-niestadt opened this issue Sep 28, 2023 · 0 comments

jan-niestadt commented Sep 28, 2023

In chn-intern, running the TermSerialization tool finds terms that don't "round-trip" correctly (i.e. get the id for the term, then get the term for that id again). There are not many of them for a 3B word corpus. All of them contain either a dash or an unusual Unicode character. The problem with the dash may actually involve another "dash-like" character, as there are a few of those (e.g. en dash, em dash, soft hyphen).

This should be investigated further. It may be a bug in the Terms code, a problem during indexing, or something else.
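
For anyone wanting to reproduce the idea outside the tool, here is a minimal Python sketch of a round-trip check. It is not the TermSerialization code: `find_round_trip_failures`, `toy_index_of` and the toy data are hypothetical, and the lookup is broken on purpose (it leaves out terms containing U+200E) just to show what a failure looks like. The `-1` "not found" value is an assumption taken from the `termId2 == -1` lines in the output.

```python
from typing import Callable, List, Tuple

NOT_FOUND = -1  # assumption: mirrors the "termId2 == -1" lines in the tool output


def find_round_trip_failures(
    terms: List[str], index_of: Callable[[str], int]
) -> List[Tuple[int, str, int]]:
    """Return (term_id, term, term_id2) for every id that does not round-trip."""
    failures = []
    for term_id, term in enumerate(terms):  # id -> term
        term_id2 = index_of(term)           # term -> id again
        if term_id2 != term_id:
            failures.append((term_id, term, term_id2))
    return failures


# Toy data (hypothetical): the position in the list plays the role of the term id.
toy_terms = ["make-up", "make\u200e-up", "try-out"]

# Deliberately broken lookup table: terms containing U+200E are left out,
# so they cannot be found again by string.
toy_lookup = {t: i for i, t in enumerate(toy_terms) if "\u200e" not in t}


def toy_index_of(term: str) -> int:
    return toy_lookup.get(term, NOT_FOUND)


for term_id, term, term_id2 in find_round_trip_failures(toy_terms, toy_index_of):
    print(f"termId2 == {term_id2}: {term!r}")
```

Using `repr` in the output keeps invisible characters visible as escapes, which may be easier to read than the raw term.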

<0xfeff>  ZERO WIDTH NO-BREAK SPACE
<0x200e>  LEFT-TO-RIGHT MARK

termId2 == -1: '-teken'
termId2 == -1: '-jes'
termId2 == -1: '-mail'
termId2 == -1: '-uitgave'
termId2 == -1: 'NVP-directeur'
termId2 == -1: 'mai-tres'
termId2 == -1: '-teken'
termId2 == -1: 'Bene-decreten'
termId2 == -1: '-day'
termId2 == -1: '-de'
termId2 == -1: '-of'
termId2 == -1: 'frai-che'
termId2 == -1: '<0xfeff>Uiteindelijk'
termId2 == -1: 'DNA<0x200e>-sporen<0x200e>'
termId2 == -1: 'Media<0x200e>-aandacht'
termId2 == -1: 'KGB<0x200e>-agente'
termId2 == -1: 'KGB<0x200e>-zaken<0x200e>'
termId2 == -1: 'Luitenant<0x200e>-kolonel'
termId2 == -1: 'NAVO<0x200e>-wiendelijk<0x200e>'
termId2 == -1: 'Play<0x200e>-off'
termId2 == -1: 'Vol'<0x200e>-licht'
termId2 == -1: 'je<0x200e>-weet<0x200e>-wel<0x200e>-wie<0x200e>'
termId2 == -1: 'make<0x200e>-up'
termId2 == -1: 'try<0x200e>-out'
termId2 == -1: 'Dunlap-mokkel'
termId2 == -1: 'pocus-gezeik'
termId2 == -1: 'priv-leven'
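
To check whether the plain-looking dashes above are really ASCII hyphens, one option is to dump the code points of each failing term. The sketch below uses only Python's standard `unicodedata` module; the helper names `describe` and `suspicious` are made up for this example, and treating "format characters plus non-ASCII dashes" as suspicious is an assumption, not something confirmed as the cause.

```python
import unicodedata
from typing import List


def describe(term: str) -> List[str]:
    """List every character of a term as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in term]


def suspicious(term: str) -> List[str]:
    """Return format characters (category Cf, e.g. U+200E, U+FEFF, soft hyphen)
    and dash punctuation (category Pd) other than the ASCII hyphen-minus."""
    found = []
    for ch in term:
        category = unicodedata.category(ch)
        if category == "Cf" or (category == "Pd" and ch != "-"):
            found.append(ch)
    return found


# Example terms; the second one uses an en dash on purpose (hypothetical input).
for term in ["KGB\u200e-agente", "\u2013teken", "-teken"]:
    print(repr(term), [f"U+{ord(ch):04X}" for ch in suspicious(term)])
    for line in describe(term):
        print("   ", line)
```

If the failing terms with a visible dash turn out to contain something other than U+002D, that would point towards the "dash-like character" explanation; if they are plain ASCII hyphens, the cause is more likely elsewhere.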