Hyphenation error on German word "Fortschritt" #24

Open
DottoreG opened this issue Jun 15, 2019 · 7 comments

Comments

@DottoreG

DottoreG commented Jun 15, 2019

import pyphen
dic = pyphen.Pyphen(lang='de_DE')
dic.inserted('Fortschritt')

results in: 'Fort-s-chritt'
The correct answer would be: 'Fort-schritt'

Although LibreOffice uses the same dictionary, the result seems to be correct there.

@Skill3t

Skill3t commented Jul 8, 2019

Same thing with "medizinische": it is not "medizini-sche", it is "me-di-zi-nisch".

@wimmuskee

A less maintenance-heavy solution would be to use system-installed (myspell/hunspell) hyphenations, if available. You can pass a filename when calling Pyphen:

import pyphen
dic = pyphen.Pyphen(filename='/usr/share/hyphen/hyph_de_DE.dic')
dic.inserted('Fortschritt')

@liZe would you be interested in a PR for a fallback on system-installed hyphenations? That way distro packagers could also opt not to install the dictionary files and rely fully on more up-to-date system hyphenations.
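A minimal sketch of what such a fallback could look like, not the actual patch. It assumes dictionaries follow the hyph_<lang>.dic naming seen in the path above; the helper names and the extra search directories are hypothetical and will differ between distributions.

```python
import os

# /usr/share/hyphen comes from the example above; the other entries are
# assumptions about where a distribution might install dictionaries.
SYSTEM_DICT_DIRS = ("/usr/share/hyphen", "/usr/share/myspell", "/usr/share/hunspell")


def find_system_dictionary(lang, search_dirs=SYSTEM_DICT_DIRS):
    """Return the path of hyph_<lang>.dic in the first directory that has it, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, f"hyph_{lang}.dic")
        if os.path.isfile(candidate):
            return candidate
    return None


def make_hyphenator(lang, search_dirs=SYSTEM_DICT_DIRS):
    """Prefer a dictionary bundled with pyphen, fall back to a system one."""
    import pyphen  # imported lazily so the lookup helper works without pyphen

    if lang in pyphen.LANGUAGES:  # dictionaries shipped with pyphen
        return pyphen.Pyphen(lang=lang)
    path = find_system_dictionary(lang, search_dirs)
    if path is None:
        raise LookupError(f"no hyphenation dictionary found for {lang!r}")
    return pyphen.Pyphen(filename=path)
```

With a package that ships no bundled dictionaries, make_hyphenator('de_DE') would end up loading the system file, which is the packaging scenario described above.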

@FelixSchwarz

A less maintenance-heavy solution would be to use system-installed (myspell/hunspell) hyphenations, if available.

We have a patch in Fedora which does something similar. The Fedora package does not ship any dictionaries from pyphen but that has its own drawbacks:

  • system dicts may not support locale inheritance ("en_US" -> "en"). At least Fedora's setup does not.
  • system dicts may be outdated: Fedora seems to package the old OpenOffice dicts in a very "decentralized" manner. Each language is shipped in a completely separate package (not subpackage) and has its own maintainer. Many language dicts were not updated within the last decade.
  • If the patch enables either pyphen's dictionaries or the system dicts, user-observable behavior may change. For example, due to Fedora's usage of system dicts, the WeasyPrint test suite does not pass on Fedora.

Therefore I'm planning to use pyphen's dictionaries in a future update (assuming I get the privileges to update pyphen - finally).

Personally, if you would support system-provided dicts, I'd like to see a way for callers to choose the dictionary source, to prevent test failures due to outdated system dicts.
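One way to picture that request is an explicit source argument. This is only a sketch: the function name, the source values and the two mappings are hypothetical, and nothing like this exists in pyphen as far as this thread shows.

```python
def resolve_dictionary(lang, source="bundled", bundled=None, system=None):
    """Pick a dictionary path for lang from the source the caller asked for.

    bundled and system map language names to dictionary file paths.
    """
    bundled = bundled or {}
    system = system or {}
    if source == "bundled":
        order = [bundled]
    elif source == "system":
        order = [system]
    elif source == "prefer-system":
        order = [system, bundled]
    else:
        raise ValueError(f"unknown dictionary source: {source!r}")
    for mapping in order:
        if lang in mapping:
            return mapping[lang]
    raise LookupError(f"no dictionary for {lang!r} with source {source!r}")
```

A test suite could then pin source="bundled" and keep passing regardless of which system dictionaries happen to be installed.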

@DottoreG
Author

A less maintenance-heavy solution would be to use system-installed (myspell/hunspell) hyphenations, if available. You can pass a filename when calling Pyphen:

import pyphen
dic = pyphen.Pyphen(filename='/usr/share/hyphen/hyph_de_DE.dic')
dic.inserted('Fortschritt')

I can't see how it would make it any better. On my system I get the same (wrong) result. I'm using Gentoo with hunspell 1.7.0 and pyphen 0.9.4.

@wimmuskee

system dicts may not support locale inheritance ("en_US" -> "en"). At least Fedora's setup does not.

Encountered the same issue while making a patch for Gentoo. I was thinking of rewriting the fallback mechanism so that a request for lang would default to lang_LANG. This would work for "de". For "en", I would pick the largest available territory dictionary.
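The idea can be sketched roughly as follows. The function name is hypothetical, and "largest" is interpreted here as largest file size, which is an assumption; the comment does not say how size would be measured.

```python
def pick_dictionary(lang, sizes):
    """Choose a dictionary name for lang.

    sizes maps available dictionary names (e.g. 'en_US') to file sizes in bytes.
    """
    if lang in sizes:  # exact match, e.g. 'de_DE' was requested
        return lang
    doubled = f"{lang}_{lang.upper()}"  # 'de' -> 'de_DE'
    if doubled in sizes:
        return doubled
    # No lang_LANG dictionary (e.g. 'en'): take the biggest territory file.
    territories = [name for name in sizes if name.split("_")[0] == lang]
    if territories:
        return max(territories, key=lambda name: sizes[name])
    return None
```

For example, with only en_GB and en_US installed, a request for "en" returns whichever of the two files is larger.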

system dicts may be outdated

Other distros seem to have the same issues. However, now the burden of keeping all dictionaries up to date falls on the Pyphen maintainers. Also, some dictionaries are not updated at the upstream level.

If the patch enables either pyphen's dictionaries or the system dicts, user-observable behavior may change

I imagine changing the behaviour of pyphen will result in a version update, and perhaps in incompatibilities with other applications. For one part, this would be similar to introducing pyphen exceptions (where applications using pyphen would expect default Python exceptions).
For another part, basing unit tests on content that is provided from other sources can get tricky. Also, would you not rather mock pyphen behaviour when unit testing from another application?

Personally, if you would support system-provided dicts, I'd like to see a way for callers to choose the dictionary source, to prevent test failures due to outdated system dicts.

Continuing on the previous point, if you have to test from another application, I would always use the filename= argument to specify a static dictionary file which can be controlled from the testing application.
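As a rough illustration of the mocking approach, an application test can replace the hyphenator with a stand-in so that no dictionary content is involved. The function hyphenate_words is a hypothetical piece of application code; the only pyphen detail assumed is the inserted() method already used in this thread.

```python
from unittest import mock


def hyphenate_words(text, dic):
    """Application code under test: hyphenate each whitespace-separated word."""
    return " ".join(dic.inserted(word) for word in text.split())


def test_hyphenate_words_uses_dictionary():
    # The fake stands in for a pyphen.Pyphen instance; its answers are fixed
    # by the test, so outdated or missing dictionaries cannot break it.
    fake = mock.Mock()
    fake.inserted.side_effect = {"ein": "ein", "Fortschritt": "Fort-schritt"}.get
    assert hyphenate_words("ein Fortschritt", fake) == "ein Fort-schritt"
    assert fake.inserted.call_count == 2


test_hyphenate_words_uses_dictionary()
```

If the real hyphenation output matters to the test, the alternative from the comment applies: ship a dictionary file with the test suite and load it with pyphen.Pyphen(filename=...).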

@mark-kubacki

A word of caution, as I see this often done wrong: language tag substitution and expansion don't work like that, by simply stripping ll_LL down to ll or expanding ll to ll_LL. In most cases you will get away with it, but it's superficial mimicry nonetheless, should someone want to go down that path.

For example, both Swedish and Finnish are spoken in Suomi/Finland. When removing, you'd run into changing the language completely; when expanding, you'd face a non-trivial choice between (here: at least) two.

https://tools.ietf.org/html/bcp47
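To make the pitfall concrete, here is a small sketch of a resolver that refuses to guess. It is not an implementation of BCP 47 matching; the function names and the policy of raising on ambiguity are assumptions chosen for illustration.

```python
def split_tag(tag):
    """Split 'sv_FI' or 'sv-FI' into ('sv', 'FI'); the region may be empty."""
    language, _, region = tag.replace("-", "_").partition("_")
    return language.lower(), region.upper()


def resolve(requested, available):
    """Return a matching tag from available, never switching language."""
    wanted = split_tag(requested)
    parsed = {tag: split_tag(tag) for tag in available}
    for tag, parts in parsed.items():
        if parts == wanted:  # same language and same region
            return tag
    same_language = sorted(t for t, (lang, _) in parsed.items() if lang == wanted[0])
    if len(same_language) == 1:
        return same_language[0]
    if not same_language:
        raise LookupError(f"no dictionary for language {wanted[0]!r}")
    raise LookupError(f"ambiguous request {requested!r}: {same_language}")
```

With fi_FI, sv_FI and sv_SE available, a request for "sv" is reported as ambiguous instead of silently picking one, and a request for sv_FI never falls back to fi_FI just because both share the region.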

@rubenmoor

fort-s-chreib-ba-r
fort-s-chreib-ba-re
fort-s-chreib-ba-rem
fort-s-chreib-ba-ren
fort-s-chreib-ba-re-r
fort-s-chreib-ba-res
fort-s-chrei-be
fort-s-chrei-ben
fort-s-chrei-ben-d
fort-s-chrei-ben-de
fort-s-chrei-ben-dem
fort-s-chrei-ben-den
fort-s-chrei-ben-der
fort-s-chrei-ben-des
Fort-s-chrei-bens
fort-s-chreibst
fort-s-chreib-t
Fort-s-chrei-bung
Fort-s-chrei-bun-gen
Fort-s-chrei-bungs-da-tei
fort-s-chrei-te
fort-s-chrei-ten
fort-s-chrei-ten-d
fort-s-chrei-ten-de
fort-s-chrei-ten-dem
fort-s-chrei-ten-den
fort-s-chrei-ten-der
fort-s-chrei-ten-des
Fort-s-chrei-tens
fort-s-chrei-tes-t
fort-s-chrei-tet
fort-s-chrie-b
fort-s-chrie-ben
fort-s-chriebst
fort-s-chrieb-t
fort-s-chrit-t
fort-s-chrit-te
fort-s-chrit-ten
Fort-s-chrit-tes
fort-s-chrit-tes-t
fort-s-chrit-tet
fort-s-chritt-lich
fort-s-chritt-li-che
fort-s-chritt-li-chem
fort-s-chritt-li-chen
fort-s-chritt-li-cher
fort-s-chritt-li-che-re
fort-s-chritt-li-che-rem
fort-s-chritt-li-che-ren
fort-s-chritt-li-che-rer
fort-s-chritt-li-che-res
fort-s-chritt-li-ches
Fort-s-chritt-lich-keit
fort-s-chritt-lichst
fort-s-chritt-lichs-te
fort-s-chritt-lichs-tem
fort-s-chritt-lichs-ten
fort-s-chritt-lichs-ter
fort-s-chritt-lichs-tes
Fort-s-chritts
Fort-s-chritts-an-zei-ge
Fort-s-chritts-an-zei-gen
Fort-s-chritts-bal-ken
Fort-s-chritts-bal-kens
fort-s-chritts-be-geis-ter-t
fort-s-chritts-be-geis-ter-te
fort-s-chritts-be-geis-ter-tem
fort-s-chritts-be-geis-ter-ten
fort-s-chritts-be-geis-ter-ter
fort-s-chritts-be-geis-ter-tes
Fort-s-chritts-be-geis-te-rung
Fort-s-chritts-be-griff
Fort-s-chritts-be-grif-fe
Fort-s-chritts-be-grif-fen
Fort-s-chritts-be-griffs
Fort-s-chritts-be-richt
Fort-s-chritts-be-rich-te
Fort-s-chritts-be-rich-ten
Fort-s-chritts-be-richts
Fort-s-chritts-be-we-gung
Fort-s-chritts-be-we-gun-gen
Fort-s-chritt-s-club
Fort-s-chritt-s-clubs
Fort-s-chritts-den-ken
Fort-s-chritts-den-kens
Fort-s-chritts-dok-trin
Fort-s-chritts-ef-fek-t
Fort-s-chritt-s-ei-fer
Fort-s-chritt-s-ent-wick-lung
Fort-s-chritt-s-ent-wick-lun-gen
Fort-s-chritts-er-zäh-lung
Fort-s-chritts-er-zäh-lun-gen
Fort-s-chritts-fak-tor
Fort-s-chritts-fak-to-ren
Fort-s-chritts-fak-tor-s
fort-s-chritts-feind-lich
fort-s-chritts-feind-li-che
fort-s-chritts-feind-li-chem
fort-s-chritts-feind-li-chen
fort-s-chritts-feind-li-cher
fort-s-chritts-feind-li-ches
Fort-s-chritts-feind-lich-keit
Fort-s-chritts-feind-lich-kei-ten
Fort-s-chritts-för-de-rung
Fort-s-chritts-freun-d
Fort-s-chritts-freun-des
fort-s-chritts-freund-lich
fort-s-chritts-freund-li-che
fort-s-chritts-freund-li-chem
fort-s-chritts-freund-li-chen
fort-s-chritts-freund-li-cher
fort-s-chritts-freund-li-che-re
fort-s-chritts-freund-li-che-rem
fort-s-chritts-freund-li-che-ren
fort-s-chritts-freund-li-che-rer
fort-s-chritts-freund-li-che-res
fort-s-chritts-freund-li-ches
Fort-s-chritts-funk-ti-o-n
Fort-s-chritts-funk-ti-o-nen
Fort-s-chritts-ga-ran-tie
Fort-s-chritts-ga-ran-ti-en
Fort-s-chritts-ge-dan-ke
Fort-s-chritts-ge-dan-ken
Fort-s-chritts-ge-dan-ken-s
Fort-s-chritts-ge-schich-te
Fort-s-chritts-ge-schich-ten
Fort-s-chritts-glau-be
Fort-s-chritts-glau-ben
Fort-s-chritts-glau-bens
fort-s-chritts-gläu-big
fort-s-chritts-gläu-bi-ge
fort-s-chritts-gläu-bi-gem
fort-s-chritts-gläu-bi-gen
fort-s-chritts-gläu-bi-ger
fort-s-chritts-gläu-bi-ge-s
Fort-s-chritts-gläu-big-keit
Fort-s-chritts-gra-d
Fort-s-chritts-gra-de
Fort-s-chritts-gra-den
Fort-s-chritts-hy-po-the-se
Fort-s-chritts-hy-po-the-sen
Fort-s-chritts-ide-e
Fort-s-chritts-ide-en
Fort-s-chritt-s-ideo-lo-gie
Fort-s-chritt-sil-lu-sion
Fort-s-chritts-kar-te
Fort-s-chritts-kar-ten
Fort-s-chritts-klei-d
Fort-s-chritts-klub
Fort-s-chritts-klubs
Fort-s-chritts-kon-trol-le
Fort-s-chritts-kon-trol-len
Fort-s-chritts-kon-zep-t
Fort-s-chritts-kon-zep-te
Fort-s-chritts-kri-ti-k
Fort-s-chritts-kri-ti-ken
Fort-s-chritts-kri-ti-ker
Fort-s-chritts-kri-ti-ke-rin
Fort-s-chritts-kri-ti-ke-rin-nen
Fort-s-chritts-kri-ti-kern
Fort-s-chritts-kri-ti-ker-s
Fort-s-chritts-kur-ve
Fort-s-chritts-kur-ven
Fort-s-chritts-leis-te
Fort-s-chritts-mes-sung
Fort-s-chritts-mes-sun-gen
Fort-s-chritts-mo-dell
Fort-s-chritts-mo-del-le
Fort-s-chritts-mo-dells
Fort-s-chritts-my-then
Fort-s-chritts-my-thos
Fort-s-chritt-s-op-ti-mis-mus
fort-s-chritt-s-o-ri-en-tier-t
fort-s-chritt-s-o-ri-en-tier-te
fort-s-chritt-s-o-ri-en-tier-tem
fort-s-chritt-s-o-ri-en-tier-ten
fort-s-chritt-s-o-ri-en-tier-ter
fort-s-chritt-s-o-ri-en-tier-tes
Fort-s-chritts-par-tei
Fort-s-chritts-par-tei-en
Fort-s-chrittspes-si-mis-mus
Fort-s-chritts-pro-jek-t
Fort-s-chritts-pro-jek-te
Fort-s-chritts-pro-jek-ten
Fort-s-chritts-pro-jek-tes
Fort-s-chritts-pro-jekt-s
Fort-s-chritts-pro-zess
Fort-s-chritts-pro-zes-se
Fort-s-chritts-pro-zes-sen
Fort-s-chritts-pro-zes-ses
Fort-s-chritts-punk-t
Fort-s-chritts-punk-te
Fort-s-chritts-punk-ten
Fort-s-chritts-punk-tes
Fort-s-chritts-quo-te
Fort-s-chritts-quo-ten
Fort-s-chritts-re-ak-ti-o-n
Fort-s-chritts-re-ak-ti-o-nen
Fort-s-chritts-schwei-ne-hun-de
Fort-s-chritts-schwei-ne-hun-den
Fort-s-chritts-schwei-ne-hun-des
Fort-s-chritts-sucht
fort-s-chritt-s-t
Fort-s-chritts-ten-denz
Fort-s-chritts-ten-den-zen
Fort-s-chritts-the-o-rie
Fort-s-chritts-the-o-ri-en
Fort-s-chritts-trau-ma
Fort-s-chritts-trau-mas
Fort-s-chritts-über-wa-chung
Fort-s-chritts-uni-o-n
Fort-s-chritts-uni-o-nen
Fort-s-chritt-s-u-to-pie
Fort-s-chritts-ver-fol-gung
Fort-s-chritts-ver-wei-ge-rer
Fort-s-chritts-ver-wei-ge-rern
Fort-s-chritts-ver-wei-ge-rer-s
Fort-s-chritts-ver-wei-ge-rung
Fort-s-chritts-ver-wei-ge-run-gen
Fort-s-chritts-vor-stel-lung
Fort-s-chritts-vor-stel-lun-gen
Fort-s-chritts-vor-ur-teil
Fort-s-chritts-vor-ur-tei-le
Fort-s-chritts-vor-ur-tei-len
Fort-s-chritts-vor-ur-teils
Fort-s-chritts-werk
Fort-s-chritts-wer-ke
Fort-s-chritts-wer-ken
Fort-s-chritts-werks
Fort-s-chritts-wer-tung
Fort-s-chritts-wer-tun-gen
Fort-s-chritt-szahl
Fort-s-chritt-szah-len
Fort-s-chritt-szah-len-kon-zep-t
Fort-s-chritt-szah-len-kon-zep-te
Fort-s-chritt-szah-len-kon-zep-ten
Fort-s-chritt-szah-len-kon-zept-s
Fort-s-chritt-szeit-ver-fah-ren
Fort-s-chritt-szeit-ver-fah-rens
Fort-s-chritt-szif-fer
Fort-s-chritt-szif-fern
Fort-s-chrittszu-stan-d
Fort-s-chrittszu-stan-des
Fort-s-chritt-szweif-ler
Fort-s-chritt-szweif-le-rin
Fort-s-chritt-szweif-le-rin-nen
Fort-s-chritt-szweif-lern
Fort-s-chritt-szweif-ler-s

(faulty output from dic.inserted)

I would like to understand how the wrong hyphenation comes about. This doesn't seem to be about the .dic file, really. The single s as a syllable doesn't make too much sense to me.
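As general background, not a diagnosis of this bug: hyph_*.dic files contain TeX-style (Liang) patterns, where digits between letters vote on break points and an odd winning value allows a break. The toy implementation below shows how a lone "s" can fall out of such voting. The patterns used here are hypothetical and are not taken from hyph_de_DE.dic, and pyphen's real code handles more cases than this sketch.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 't1s' into letters 'ts' and values [0, 1, 0]."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            values[position] = int(char)  # value for the gap before the next letter
        else:
            position += 1
    return letters, values


def hyphenate(word, patterns, left=2, right=2):
    """Insert '-' wherever the highest pattern value for a gap is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] is the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values:
                for offset, value in enumerate(values):
                    scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    # Keep at least `left` letters before and `right` letters after a break.
    for index in range(left, len(word) - right + 1):
        if scores[index + 1] % 2 == 1:
            pieces.append(word[last:index])
            last = index
    pieces.append(word[last:])
    return "-".join(pieces)


# Two toy patterns each allow a break, and nothing forbids the second one.
print(hyphenate("Fortschritt", ["t1s", "s1ch"]))
# A higher even value from a longer toy pattern overrides the break before 'ch'.
print(hyphenate("Fortschritt", ["t1s", "s1ch", "ts2ch"]))
```

In this toy setup the isolated "s" appears when two break-allowing patterns overlap and the inhibiting pattern that should outvote one of them is missing or not applied. Whether that is what happens with the real German dictionary is not established in this thread.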
