Merge pull request #6 from twardoch/master

Version 2.0.4
kbatsuren · Aug 11, 2021 · 4a65d03
2 parents 839f4a4 + 6133236

Showing 53 changed files with 29,421 additions and 66 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
+_priv/
 .idea/
 .dccache
 *.code-workspace

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,28 @@
+
 # Changelog
 
-## Version 0.1 (development)
+## Version 2.0.4 (2021-08-10)
+
+- added `--stats` CLI option to list supported scripts and orthographies
+
+## Version 2.0.3 (2021-08-10)
+
+- allows for auto-detection or explicit input of ISO 15924 script
+- allows for fuzzy or explicit input of ISO 639-2/Wiktionary language code
+- new CLI options
+
+## Version 2.0.2 (2021-08-10)
+
+- added `wiktra/wikt/data/data.json`
+
+## Version 2.0.1 (2021-08-09)
+
+- added `languages/extradata*` Lua modules
+
+## Version 2.0.0 (2021-08-08)
 
-- initial
+- completely refactored
+- added `wiktrapy` CLI tool for transliteration
+- added `wiktrapy_update` CLI tool for updating the built-in Wiktionary modules
+- updated the built-in Wiktionary modules
+- works more robustly with Lua 5.4.3 and more Python 3 environments
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1 +1,2 @@
-recursive-include wiktra/wikt *.lua
+recursive-include wiktra/wikt *.lua
+recursive-include wiktra/wikt *.json
diff --git a/README.md b/README.md
@@ -4,15 +4,20 @@
 
 Internally, it uses transliteration modules [from Wiktionary](https://en.wiktionary.org/wiki/Category:Transliteration_modules). These modules are written in Lua by the Wiktionary linguists and developers. Therefore, Wiktra offers the highest quality of rule-based transliterations.
 
-This is version 2 of Wiktra, maintained by [Adam Twardoch](https://twardoch.github.io/). It’s based on [Wiktra](https://github.com/kbatsuren/wiktra/) by [Khuyagbaatar Batsuren](https://github.com/kbatsuren).
+Wiktra 1.0 was originally developed by [Khuyagbaatar Batsuren](https://github.com/kbatsuren). Wiktra 2 was rewritten by [Adam Twardoch](https://twardoch.github.io/).
 
-Wiktra 2 supports nearly all of languages supported by Wiktionary, except Korean, Japanese and Thai. Wiktra 1 supported 181 languages and its 60 orthographies. Wiktra 2 currently has a legacy Python function which uses the language codes supplied by the original developer, and also lets you use Wiktionary’s codes directly.
+Locations:
 
-**This is work in progress**.
+- [kbatsuren/wiktra](https://github.com/kbatsuren/wiktra/) — the upstream location, slower releases
+- [twardoch/wiktra2](https://github.com/twardoch/wiktra2/) — active development
+
+Wiktra 2 supports 514 orthographies in 102 scripts with the new API (nearly all of the languages supported by Wiktionary, except Korean, Japanese and Thai), and 181 languages and their 60 orthographies in the legacy API.
 
 ## Installation
 
-### macOS
+### Version 2
+
+_(This has been tested on macOS 11.)_
 
 In Terminal, `cd` to the main folder and run:
 
@@ -21,11 +26,11 @@ In Terminal, `cd` to the main folder and run:
 python3 -m pip install --upgrade .
 ```
 
-This will install `brew` if needed, the installs `lua`, `luarocks`, `lua-format`, `luajit` and `python3`. Finally, it installs the Python dependencies `lupa` and `pywikiapi`.
+This will install `brew` if needed, then install `lua`, `luarocks`, `lua-format`, `luajit` and `python3`. Finally, it installs some Python dependencies, such as `lupa` and `pywikiapi`.
 
-### Other systems
+### Other systems, version 1
 
-_This is from the original developer:_
+_This is from the original version 1. Quite possibly the Version 2 instructions (see above) should work instead._
 
 As much as you want to use your favorite version of Python, it is recommended to employ 3.5 version on the grounds that the module utilizes lupa-1.8. Lupa enables Python to adopt functionalities of Lua language, in which most of the transliteration modules are written.
 
@@ -49,21 +54,21 @@ $ python
 
 ### Troubleshooting
 
+_This should no longer be an issue with version 2._
+
 If you get `LuaError: module 'wikt.mw' not found`, try:
 
 - create a folder `lua` in `C:\ProgramData\Miniconda3\`
 - copy the entire folder of wikt from this project and paste it into `C:\ProgramData\Miniconda3\lua`
 
 ## Usage
 
-### Command-line
+### Command-line, version 2
 
 ```sh
-wiktrapy -h
-```
+$ wiktrapy -h
 
-```
-usage: wiktrapy [-h] [-t TEXT] [-i FILE] [-l LANG] [-s SCRIPT] [-v] [-V]
+usage: wiktrapy [-h] [-t TEXT] [-i FILE] [-l LANG] [-s SCRIPT] [-o SCRIPT] [-x] [--stats] [-v] [-V]
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -72,6 +77,10 @@ optional arguments:
   -l LANG, --lang LANG  Input language as ISO 639-2 code
   -s SCRIPT, --script SCRIPT
                         Input script as ISO 15924 code
+  -o SCRIPT, --to-script SCRIPT
+                        Output script as ISO 15924 code
+  -x, --explicit        Explicit language/script, no fuzzy matching
+  --stats               List supported scripts and orthographies
   -v, --verbose         -v show progress, -vv show debug
   -V, --version         show version and exit
 ```
@@ -83,15 +92,22 @@ $ wiktrapy -t "Привет" -l ru -s Cyrl
 Privet
 ```
 
-### Python (new interface)
+### Python, version 2 new API
 
 ```python
 from wiktra.Wiktra import Transliterator
 tr = Transliterator()
-print(tr.tr("Привет", "ru", "Cyrl")
+
+print(tr.tr("Привет", lang='ru', sc='Cyrl', to_sc='Latn', explicit=True))
 ```
 
-### Python (legacy `translite` function)
+- If `explicit` is `True`, you need to specify `lang` as the input language (using Wiktionary/ISO codes), `sc` as the input script (using ISO 15924 codes), and optionally `to_sc` as the output script (`Latn` is assumed if absent).
+
+- If `explicit` is `False` or omitted, Wiktra will guess the `sc` if it’s not specified, and will assume the `und` (undefined) input language for that script. Sometimes Wiktionary provides a generic script transliterator. If Wiktionary has multiple script transliterators, the language with the largest speaking population also serves as "undefined". For example, for `Cyrl` (Cyrillic script), `ru` (Russian language) serves as `und` (undefined) and is used if you only specify the script or Wiktra guesses `Cyrl` as the script.
+
+Use `wiktrapy --stats` to list all supported script and language codes, or see the [`data.yaml`](wiktra/wikt/data/data.yaml). The YAML file also lists the Wiktionary transliteration modules used.
+
+### Python, legacy `translite` function
 
 ```python
 from wiktra.Wiktra import translite as tr

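The README text above says that Wiktra guesses the script when `sc` is not given. The sketch below illustrates the majority-vote idea behind that guess, as `auto_script_lang` applies it in this PR. It is only an approximation under stated assumptions: `script_of` and `dominant_script` are hypothetical names, and the Unicode-name prefix lookup is a rough stand-in for `fontTools.unicodedata.script`, which the actual code calls.

```python
from collections import Counter
import unicodedata

# Hypothetical, deliberately tiny stand-in for fontTools.unicodedata.script():
# map the first word of a character's Unicode name to an ISO 15924 code.
_NAME_PREFIX_TO_SCRIPT = {"LATIN": "Latn", "CYRILLIC": "Cyrl", "GREEK": "Grek"}


def script_of(char):
    """Return an ISO 15924 code for one character, or 'Zyyy' (common) if unknown."""
    name = unicodedata.name(char, "")
    prefix = name.split(" ", 1)[0]
    return _NAME_PREFIX_TO_SCRIPT.get(prefix, "Zyyy")


def dominant_script(text):
    """Return the most frequent script in text, or None for empty text."""
    counts = Counter(script_of(c) for c in text)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(dominant_script("Привет"))  # Cyrl
```

Spaces and punctuation count as `Zyyy` in this sketch, so text made mostly of such characters would be reported as `Zyyy` rather than as the script of its few letters.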
diff --git a/requirements.txt b/requirements.txt
@@ -1,2 +1,4 @@
 lupa
 pywikiapi
+langcodes[data]
+fonttools[unicode]
diff --git a/setup.py b/setup.py
@@ -43,16 +43,16 @@ def get_requirements(*args):
 
 setup(
     name=f"{NAME}",
-    version="2.0.0",
+    version=get_version(),
     description="Transliteration tool using Wiktionary transliteration modules",
     long_description=long_description,
     long_description_content_type="text/markdown",
     author="Khuyagbaatar Batsuren",
     author_email="[email protected]",
-    url=f"https://twardoch.github.io/{NAME}2/",
-    project_urls={"Source": f"https://github.com/twardoch/{NAME}2/"},
+    url=f"https://github.com/kbatsuren/{NAME}/",
+    project_urls={"Source": f"https://github.com/kbatsuren/{NAME}/"},
     license="GPLv2",
-    download_url=f"https://github.com/twardoch/{NAME}2",
+    download_url=f"https://github.com/kbatsuren/{NAME}/",
     python_requires=">=3.9",
     install_requires=get_requirements("requirements.txt"),
     packages=find_packages(),

diff --git a/wiktra/Wiktra.py b/wiktra/Wiktra.py
@@ -4,19 +4,23 @@
 import os
 from pathlib import Path
 from lupa import LuaRuntime
+import logging
+import json
+import langcodes
+from fontTools import unicodedata as ucd
+from collections import Counter
 
 lua_folder = str(Path(Path(__file__).parent))
 
 os.environ["LUA_PATH"] = ";".join([
     f"{lua_folder}/?.lua",
     f"{lua_folder}/wikt/?.lua",
     f"{lua_folder}/wikt/translit/?.lua",
-    f"{lua_folder}/wikt/legacy/?.lua",
-    f"{lua_folder}/wikt/legacy/translit/?.lua",
+    f"{lua_folder}/wikt/data/?.lua",
+    f"{lua_folder}/wikt/data/translit/?.lua",
     f"{os.environ.get('LUA_PATH','')}",
 ])
 
-
 lang_map = {
     "inc-mas": ("inc-mas", ""),
     "amh": ("ethi", "Ethi"),
@@ -223,39 +227,67 @@ class Transliterator(object):
     def __init__(self):
         self.lua = LuaRuntime(unpack_returned_tuples=True)
         self.lua.execute("mw = require('wikt.mw')")
+        with open(Path(lua_folder, "wikt", "data", "data.json"), "r", encoding="utf-8") as f:
+            self.mod_map = json.load(f)
+        self.lang_tags = []
+        for sc, langs in self.mod_map.items():
+            for lang in langs:
+                self.lang_tags.append(f'''{lang}-{sc}''')
 
     def e(self, lua_str):
         self.lua.execute(lua_str)
         return self.lua.globals().res
 
+    def auto_script_lang(self, text, lang, sc):
+        in_lang_tag = lang
+        if not sc:
+            sc_count = Counter([ucd.script(c) for c in text])
+            sc = sc_count.most_common(1)[0][0] if sc_count else None
+        if sc:
+            in_lang_tag += f'-{sc}'
+        langrec = langcodes.Language.get(
+            langcodes.closest_match(in_lang_tag, self.lang_tags)[0])
+        lang = langrec.language
+        if not lang:
+            lang = 'und'
+        sc = langrec.script
+        if not sc:
+            sc = 'Zyyy'
+        return lang, sc
+
     def tr_legacy(self, text, lang):
         lang, sc = lang_map[lang.lower()]
         lua_str = f"""res = require("wikt.translit.{lang}-translit").tr("{text}", "{lang}", "{sc}")"""
         return self.e(lua_str)
 
-    def tr(self, text, lang, sc):
+    def tr(self, text, lang='und', sc=None, to_sc='Latn', explicit=False):
+        if not lang:
+            lang='und'
+        if explicit:
+            if not sc:
+                sc = 'Latn'
+        else:
+            lang, sc = self.auto_script_lang(text, lang, sc)
+        mod = self.mod_map.get(sc, {}).get(lang, {}).get(to_sc, {}).get('translit')
+        logging.debug({
+            'lang': lang, 'script': sc, 'to_script': to_sc, 'explicit': explicit, 'module': mod
+        })
+        if not mod:
+            return text
         res = None
-        res = self.e(
-            f"""res = require("wikt.translit.translit-redirect").tr("{text}", "{lang}", "{sc}")"""
-        )
-        if not res:
+        if mod == 'pi-Latn-translit':
             res = self.e(
-                f"""res = require("wikt.translit.{lang}-translit").tr("{text}", "{lang}", "{sc}")"""
+                f"""res = require("wikt.translit.{mod}").tr("{text}", "{to_sc}")"""
             )
+        else:
+            res = self.e(
+                f"""res = require("wikt.translit.{mod}").tr("{text}", "{lang}", "{sc}")"""
+            )
+        if not res:
+            logging.debug('Problem: not transliterated')
+            res = text
         return res
 
-    def test_load(self):
-        reqs = []
-        mods = [str(Path(p.parent, p.stem)) for p in Path("wikt","translit").glob("**/*.lua")]
-        for mod in mods:
-            reqs.append(f"""require("{mod}")""")
-        sreqs = "\n".join(reqs)
-        l = f"""
-        {sreqs}
-        res = "OK"
-        """
-        return self.e(l)
-
 
 
 def translite(text, lang):

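The new `tr()` picks a transliteration module by walking `self.mod_map` with chained `.get()` calls. The sketch below shows that lookup in isolation. The nesting (script, then language, then output script, then a `translit` key) is inferred from the code in this diff; the sample entries, the module name `ru-translit` and the helper `find_module` are assumptions made for illustration and are not taken from `data.json`.

```python
# Hypothetical excerpt shaped like the mapping that tr() walks:
# script -> language -> output script -> {"translit": module name}.
MOD_MAP = {
    "Cyrl": {
        "ru": {"Latn": {"translit": "ru-translit"}},
        "und": {"Latn": {"translit": "ru-translit"}},
    },
}


def find_module(mod_map, sc, lang="und", to_sc="Latn"):
    """Return the module name for (sc, lang, to_sc), or None if any level is missing."""
    # Each .get() falls back to an empty dict, so a missing key at any
    # level yields None instead of raising KeyError.
    return mod_map.get(sc, {}).get(lang, {}).get(to_sc, {}).get("translit")


print(find_module(MOD_MAP, "Cyrl", "ru"))  # ru-translit
print(find_module(MOD_MAP, "Grek", "el"))  # None
```

When the lookup yields `None`, the `tr()` in this diff returns the input text unchanged, so an unsupported combination passes through silently unless debug logging is enabled.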
diff --git a/wiktra/__init__.py b/wiktra/__init__.py
@@ -2,5 +2,5 @@
 # -*- coding: utf-8 -*-
 from .Wiktra import *
 
-__version__ = "2.0.0"
+__version__ = "2.0.4"
 __all__ = ['translite']
diff --git a/wiktra/__main__.py b/wiktra/__main__.py
@@ -20,15 +20,38 @@ def cli():
         "--lang",
         metavar="LANG",
         dest="in_lang",
+        default=None,
         help="Input language as ISO 639-2 code",
     )
     parser.add_argument(
         "-s",
         "--script",
         metavar="SCRIPT",
         dest="in_script",
+        default=None,
         help="Input script as ISO 15924 code",
     )
+    parser.add_argument(
+        "-o",
+        "--to-script",
+        metavar="SCRIPT",
+        dest="out_script",
+        default="Latn",
+        help="Output script as ISO 15924 code",
+    )
+    parser.add_argument(
+        "-x",
+        "--explicit",
+        action="store_true",
+        dest="explicit",
+        help="""Explicit language/script, no fuzzy matching""",
+    )
+    parser.add_argument(
+        "--stats",
+        action="store_true",
+        dest="stats",
+        help="""List supported scripts and orthographies""",
+    )
     parser.add_argument(
         "-v",
         "--verbose",
@@ -64,8 +87,18 @@ def main(*args, **kwargs):
     else:
         text = opts["text"]
     tr = wiktra.Wiktra.Transliterator()
-    res = tr.tr(text, opts["in_lang"], opts["in_script"])
-    print(res)
+    if opts.get("stats", False):
+        print(f'{len(tr.mod_map.keys())} scripts: {" ".join(tr.mod_map.keys())}')
+        print(f'{len(tr.lang_tags)} orthographies: {" ".join(tr.lang_tags)}')
+    else:
+        res = tr.tr(
+            text,
+            lang=opts["in_lang"],
+            sc=opts["in_script"],
+            to_sc=opts["out_script"],
+            explicit=opts["explicit"],
+        )
+        print(res)
 
 
 if __name__ == "__main__":
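The CLI changes above add `-o`, `-x` and `--stats` and pass the parsed options to `tr()` as keyword arguments. The sketch below reproduces that mapping with `argparse` alone, so the defaults can be checked without installing Wiktra. The `dest` names for the options shown in the diff are copied from it; the `-t`/`--text` definition and the helpers `build_parser` and `tr_kwargs` are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a reduced parser with the options relevant to tr()."""
    parser = argparse.ArgumentParser(prog="wiktrapy")
    # Assumed definition: the diff only shows that opts["text"] is read.
    parser.add_argument("-t", "--text", dest="text", default=None)
    parser.add_argument("-l", "--lang", dest="in_lang", default=None)
    parser.add_argument("-s", "--script", dest="in_script", default=None)
    parser.add_argument("-o", "--to-script", dest="out_script", default="Latn")
    parser.add_argument("-x", "--explicit", dest="explicit", action="store_true")
    parser.add_argument("--stats", dest="stats", action="store_true")
    return parser


def tr_kwargs(argv):
    """Translate command-line arguments into the keyword arguments for tr()."""
    opts = vars(build_parser().parse_args(argv))
    return {
        "lang": opts["in_lang"],
        "sc": opts["in_script"],
        "to_sc": opts["out_script"],
        "explicit": opts["explicit"],
    }


print(tr_kwargs(["-t", "Привет", "-l", "ru"]))
# {'lang': 'ru', 'sc': None, 'to_sc': 'Latn', 'explicit': False}
```

With no `-s` and no `-x`, `sc` stays `None` and `explicit` stays `False`, which is the combination that sends `tr()` down the auto-detection path.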