Merge pull request #6 from twardoch/master
Version 2.0.4
kbatsuren authored Aug 11, 2021
2 parents 839f4a4 + 6133236 commit 4a65d03
Showing 53 changed files with 29,421 additions and 66 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
_priv/
.idea/
.dccache
*.code-workspace
27 changes: 25 additions & 2 deletions CHANGELOG.md
@@ -1,5 +1,28 @@

# Changelog

## Version 0.1 (development)
## Version 2.0.4 (2021-08-10)

- added `--stats` CLI option to list supported scripts and orthographies

## Version 2.0.3 (2021-08-10)

- allows for auto-detection of explicit input of ISO 15924 script
- allows for fuzzy or explicit input of ISO 639-2/Wiktionary language code
- new CLI options

## Version 2.0.2 (2021-08-10)

- added `wiktra/wikt/data/data.json`

## Version 2.0.1 (2021-08-09)

- added `languages/extradata*` Lua modules

## Version 2.0.0 (2021-08-08)

- initial
- completely refactored
- added `wiktrapy` CLI tool for transliteration
- added `wiktrapy_update` CLI tool for updating the built-in Wiktionary modules
- updated the built-in Wiktionary modules
- works more robustly with Lua 5.4.3 and more Python 3 environments
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1 +1,2 @@
recursive-include wiktra/wikt *.lua
recursive-include wiktra/wikt *.lua
recursive-include wiktra/wikt *.json
46 changes: 31 additions & 15 deletions README.md
@@ -4,15 +4,20 @@

Internally, it uses transliteration modules [from Wiktionary](https://en.wiktionary.org/wiki/Category:Transliteration_modules). These modules are written in Lua by the Wiktionary linguists and developers. Therefore, Wiktra offers the highest quality of rule-based transliterations.

This is version 2 of Wiktra, maintained by [Adam Twardoch](https://twardoch.github.io/). It’s based on [Wiktra](https://github.com/kbatsuren/wiktra/) by [Khuyagbaatar Batsuren](https://github.com/kbatsuren).
Wiktra 1.0 was originally developed by [Khuyagbaatar Batsuren](https://github.com/kbatsuren). Wiktra 2 was rewritten by [Adam Twardoch](https://twardoch.github.io/).

Wiktra 2 supports nearly all of languages supported by Wiktionary, except Korean, Japanese and Thai. Wiktra 1 supported 181 languages and its 60 orthographies. Wiktra 2 currently has a legacy Python function which uses the language codes supplied by the original developer, and also lets you use Wiktionary’s codes directly.
Locations:

**This is work in progress**.
- [kbatsuren/wiktra](https://github.com/kbatsuren/wiktra/) — the upstream location, slower releases
- [twardoch/wiktra2](https://github.com/twardoch/wiktra2/) — active development

Wiktra 2 supports 514 orthographies in 102 scripts with the new API (nearly all of languages supported by Wiktionary, except Korean, Japanese and Thai), and 181 languages and its 60 orthographies in the legacy API.

## Installation

### macOS
### Version 2

_(This has been tested on macOS 11.)_

In Terminal, `cd` to the main folder and run:

@@ -21,11 +26,11 @@ In Terminal, `cd` to the main folder and run:
python3 -m pip install --upgrade .
```

This will install `brew` if needed, the installs `lua`, `luarocks`, `lua-format`, `luajit` and `python3`. Finally, it installs the Python dependencies `lupa` and `pywikiapi`.
This will install `brew` if needed, the installs `lua`, `luarocks`, `lua-format`, `luajit` and `python3`. Finally, it installs some Python dependencies, such as `lupa` or `pywikiapi`.

### Other systems
### Other systems, version 1

_This is from the original developer:_
_This is from the original version 1. Quite possibly the Version 2 instructions (see above) should work instead._

As much as you want to use your favorite version of Python, it is recommended to employ 3.5 version on the grounds that the module utilizes lupa-1.8. Lupa enables Python to adopt functionalities of Lua language, in which most of the transliteration modules are written.

@@ -49,21 +54,21 @@ $ python

### Troubleshooting

_This should no longer be an issue with version 2._

If you get `LuaError: module 'wikt.mw' not found`, try:

- create a folder `lua` in `C:\ProgramData\Miniconda3\`
- copy the entire folder of wikt from this project and paste it into `C:\ProgramData\Miniconda3\lua`

## Usage

### Command-line
### Command-line, version 2

```sh
wiktrapy -h
```
$ wiktrapy -h

```
usage: wiktrapy [-h] [-t TEXT] [-i FILE] [-l LANG] [-s SCRIPT] [-v] [-V]
usage: wiktrapy [-h] [-t TEXT] [-i FILE] [-l LANG] [-s SCRIPT] [-o SCRIPT] [-x] [--stats] [-v] [-V]

optional arguments:
-h, --help show this help message and exit
@@ -72,6 +77,10 @@ optional arguments:
-l LANG, --lang LANG Input language as ISO 639-2 code
-s SCRIPT, --script SCRIPT
Input script as ISO 15924 code
-o SCRIPT, --to-script SCRIPT
Output script as ISO 15924 code
-x, --explicit Explicit language/script, no fuzzy matching
--stats List supported scripts and orthographies
-v, --verbose -v show progress, -vv show debug
-V, --version show version and exit
```
@@ -83,15 +92,22 @@ $ wiktrapy -t "Привет" -l ru -s Cyrl
Privet
```

### Python (new interface)
### Python, version 2 new API

```python
from wiktra.Wiktra import Transliterator
tr = Transliterator()
print(tr.tr("Привет", "ru", "Cyrl")

print(tr.tr("Привет", lang='ru', sc='Cyrl', to_sc='Latn', explicit=True)
```

### Python (legacy `translite` function)
- If `explicit` is `True`, you need to specify `lang` as the input language (using Wiktionary/ISO codes), `sc` as the input script (using ISO 15924 codes), and optionally `to_sc` as the output script (`Latn` is assumed if absent).

- If `explicit` is `False` or omitted, Wiktra will guess the `sc` if it’s not specified, and will assume the `und` (undefined) input language for that script. Sometimes Wiktionary provides a generic script transliterator. If Wiktionary has multiple script transliterators, the language with the largest speaking population also serves as "undefined". For example, for `Cyrl` (Cyrillic script), `ru` (Russian language) serves as `und` (undefined) and is used if you only specify the script or Wiktra guesses `Cyrl` as the script.

Use `wiktrapy --stats` to list all supported script and language codes, or see the [`data.yaml`](wiktra/wikt/data/data.yaml). The YAML file also lists the Wiktionary transliteration modules used.
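
The lookup described above can be pictured as a nested dictionary keyed by input script, then language, then output script. The following is a minimal sketch under that assumption: the `MOD_MAP` contents and module names are placeholders made up for illustration (they are not copied from `data.json`), and `find_module` is a hypothetical helper, not part of Wiktra's API.

```python
# Hypothetical excerpt shaped like mod_map[script][language][target script].
# The module names are placeholders, not taken from the real data file.
MOD_MAP = {
    "Cyrl": {
        "ru": {"Latn": {"translit": "ru-translit"}},
        # "und" points at the script-wide default (Russian in the README's example).
        "und": {"Latn": {"translit": "ru-translit"}},
        "bg": {"Latn": {"translit": "bg-translit"}},
    },
}


def find_module(lang, sc, to_sc="Latn", mod_map=MOD_MAP):
    """Return the module name for (sc, lang, to_sc), or None if unsupported."""
    if not lang:
        lang = "und"  # no language given: fall back to the script-wide entry
    # Chained .get() calls avoid a KeyError at any missing level.
    return mod_map.get(sc, {}).get(lang, {}).get(to_sc, {}).get("translit")


print(find_module("bg", "Cyrl"))   # explicit language
print(find_module(None, "Cyrl"))   # only the script is known
print(find_module("ru", "Grek"))   # unsupported combination
```

Returning `None` for an unsupported combination lets the caller decide what to do; in this commit, `tr()` returns the input text unchanged in that case.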

### Python, legacy `translite` function

```python
from wiktra.Wiktra import translite as tr
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,2 +1,4 @@
lupa
pywikiapi
langcodes[data]
fonttools[unicode]
8 changes: 4 additions & 4 deletions setup.py
@@ -43,16 +43,16 @@ def get_requirements(*args):

setup(
name=f"{NAME}",
version="2.0.0",
version=get_version(),
description="Transliteration tool using Wiktionary transliteration modules",
long_description=long_description,
long_description_content_type="text/markdown",
author="Khuyagbaatar Batsuren",
author_email="[email protected]",
url=f"https://twardoch.github.io/{NAME}2/",
project_urls={"Source": f"https://github.com/twardoch/{NAME}2/"},
url=f"https://github.com/kbatsuren/{NAME}/",
project_urls={"Source": f"https://github.com/kbatsuren/{NAME}/"},
license="GPLv2",
download_url=f"https://github.com/twardoch/{NAME}2",
download_url=f"https://github.com/kbatsuren/{NAME}/",
python_requires=">=3.9",
install_requires=get_requirements("requirements.txt"),
packages=find_packages(),
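
The hunk switches to `version=get_version()`, but the body of `get_version` lies outside the lines shown. One common way to write such a helper is to read `__version__` out of the package's `__init__.py` without importing it. The sketch below assumes that approach; `parse_version`, the default path and the regular expression are illustrative guesses, not the commit's actual code.

```python
import re
from pathlib import Path


def parse_version(source):
    """Extract the value of a `__version__ = "..."` assignment from source text."""
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', source, re.MULTILINE
    )
    if not match:
        raise RuntimeError("No __version__ assignment found")
    return match.group(1)


def get_version(init_path="wiktra/__init__.py"):
    """Read the version from a package __init__.py (path is an assumption)."""
    return parse_version(Path(init_path).read_text(encoding="utf-8"))


print(parse_version('from .Wiktra import *\n\n__version__ = "2.0.4"\n'))
```

Parsing the file as text avoids importing the package during installation, when its dependencies (here `lupa` and others) may not be installed yet.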
74 changes: 53 additions & 21 deletions wiktra/Wiktra.py
@@ -4,19 +4,23 @@
import os
from pathlib import Path
from lupa import LuaRuntime
import logging
import json
import langcodes
from fontTools import unicodedata as ucd
from collections import Counter

lua_folder = str(Path(Path(__file__).parent))

os.environ["LUA_PATH"] = ";".join([
f"{lua_folder}/?.lua",
f"{lua_folder}/wikt/?.lua",
f"{lua_folder}/wikt/translit/?.lua",
f"{lua_folder}/wikt/legacy/?.lua",
f"{lua_folder}/wikt/legacy/translit/?.lua",
f"{lua_folder}/wikt/data/?.lua",
f"{lua_folder}/wikt/data/translit/?.lua",
f"{os.environ.get('LUA_PATH','')}",
])


lang_map = {
"inc-mas": ("inc-mas", ""),
"amh": ("ethi", "Ethi"),
@@ -223,39 +227,67 @@ class Transliterator(object):
def __init__(self):
self.lua = LuaRuntime(unpack_returned_tuples=True)
self.lua.execute("mw = require('wikt.mw')")
with open(Path(lua_folder, "wikt", "data", "data.json"), "r", encoding="utf-8") as f:
self.mod_map = json.load(f)
self.lang_tags = []
for sc, langs in self.mod_map.items():
for lang in langs:
self.lang_tags.append(f'''{lang}-{sc}''')

def e(self, lua_str):
self.lua.execute(lua_str)
return self.lua.globals().res

def auto_script_lang(self, text, lang, sc):
in_lang_tag = lang
if not sc:
sc_count = Counter([ucd.script(c) for c in text])
sc = sc_count.most_common(1)[0][0]
if sc:
in_lang_tag += f'-{sc}'
langrec = langcodes.Language.get(
langcodes.closest_match(in_lang_tag, self.lang_tags)[0])
lang = langrec.language
if not lang:
lang = 'und'
sc = langrec.script
if not sc:
sc = 'Zyyy'
return lang, sc

def tr_legacy(self, text, lang):
lang, sc = lang_map[lang.lower()]
lua_str = f"""res = require("wikt.translit.{lang}-translit").tr("{text}", "{lang}", "{sc}")"""
return self.e(lua_str)

def tr(self, text, lang, sc):
def tr(self, text, lang='und', sc=None, to_sc='Latn', explicit=False):
if not lang:
lang='und'
if explicit:
if not sc:
sc = 'Latn'
else:
lang, sc = self.auto_script_lang(text, lang, sc)
mod = self.mod_map.get(sc, {}).get(lang, {}).get(to_sc, {}).get('translit')
logging.debug({
'lang': lang, 'script': sc, 'to_script': to_sc, 'explicit': explicit, 'module': mod
})
if not mod:
return text
res = None
res = self.e(
f"""res = require("wikt.translit.translit-redirect").tr("{text}", "{lang}", "{sc}")"""
)
if not res:
if mod == 'pi-Latn-translit':
res = self.e(
f"""res = require("wikt.translit.{lang}-translit").tr("{text}", "{lang}", "{sc}")"""
f"""res = require("wikt.translit.{mod}").tr("{text}", "{to_sc}")"""
)
else:
res = self.e(
f"""res = require("wikt.translit.{mod}").tr("{text}", "{lang}", "{sc}")"""
)
if not res:
logging.debug('Problem: not transliterated')
res = text
return res

def test_load(self):
reqs = []
mods = [str(Path(p.parent, p.stem)) for p in Path("wikt","translit").glob("**/*.lua")]
for mod in mods:
reqs.append(f"""require("{mod}")""")
sreqs = "\n".join(reqs)
l = f"""
{sreqs}
res = "OK"
"""
return self.e(l)



def translite(text, lang):
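
The new `auto_script_lang` method guesses the input script by counting the script of each character and taking the most common one. The sketch below illustrates that majority-vote idea using only the standard library. It is an approximation: instead of `fontTools.unicodedata.script`, it uses the first word of each character's Unicode name (for example `CYRILLIC`) as a stand-in for the script, and `dominant_script` is a hypothetical name.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the most frequent script among the letters of `text`.

    Approximation: the first word of the Unicode character name
    (e.g. "CYRILLIC", "GREEK", "LATIN") stands in for the script.
    """
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()  # skip digits, spaces and punctuation
    )
    if not counts:
        return None  # nothing to vote on (empty or letter-free input)
    return counts.most_common(1)[0][0]


print(dominant_script("Привет, world!"))
```

Unlike the method in the diff, this sketch ignores non-letters and returns `None` for empty input rather than indexing into an empty `most_common` result. Whether the real method needs such guards depends on how callers use it.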
2 changes: 1 addition & 1 deletion wiktra/__init__.py
@@ -2,5 +2,5 @@
# -*- coding: utf-8 -*-
from .Wiktra import *

__version__ = "2.0.0"
__version__ = "2.0.4"
__all__ = ['translite']
37 changes: 35 additions & 2 deletions wiktra/__main__.py
@@ -20,15 +20,38 @@ def cli():
"--lang",
metavar="LANG",
dest="in_lang",
default=None,
help="Input language as ISO 639-2 code",
)
parser.add_argument(
"-s",
"--script",
metavar="SCRIPT",
dest="in_script",
default=None,
help="Input script as ISO 15924 code",
)
parser.add_argument(
"-o",
"--to-script",
metavar="SCRIPT",
dest="out_script",
default="Latn",
help="Output script as ISO 15924 code",
)
parser.add_argument(
"-x",
"--explicit",
action="store_true",
dest="explicit",
help="""Explicit language/script, no fuzzy matching""",
)
parser.add_argument(
"--stats",
action="store_true",
dest="stats",
help="""List supported scripts and orthographies""",
)
parser.add_argument(
"-v",
"--verbose",
@@ -64,8 +87,18 @@ def main(*args, **kwargs):
else:
text = opts["text"]
tr = wiktra.Wiktra.Transliterator()
res = tr.tr(text, opts["in_lang"], opts["in_script"])
print(res)
if opts.get("stats", False):
print(f'{len(tr.mod_map.keys())} scripts: {" ".join(tr.mod_map.keys())}')
print(f'{len(tr.lang_tags)} orthographies: {" ".join(tr.lang_tags)}')
else:
res = tr.tr(
text,
lang=opts["in_lang"],
sc=opts["in_script"],
to_sc=opts["out_script"],
explicit=opts["explicit"],
)
print(res)


if __name__ == "__main__":
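
To see how the options added in this file behave together, the following standalone sketch rebuilds only the arguments visible in the hunk above (`-l`, `-s`, `-o`, `-x`, `--stats`) and parses a sample command line. `build_parser` is a hypothetical name. The real `cli()` also defines other options, such as `-t` and `-v`, which are omitted here because their definitions are not shown in the diff.

```python
import argparse


def build_parser():
    """Rebuild the language/script options shown in the diff (partial sketch)."""
    parser = argparse.ArgumentParser(prog="wiktrapy")
    parser.add_argument("-l", "--lang", metavar="LANG", dest="in_lang",
                        default=None, help="Input language as ISO 639-2 code")
    parser.add_argument("-s", "--script", metavar="SCRIPT", dest="in_script",
                        default=None, help="Input script as ISO 15924 code")
    parser.add_argument("-o", "--to-script", metavar="SCRIPT", dest="out_script",
                        default="Latn", help="Output script as ISO 15924 code")
    parser.add_argument("-x", "--explicit", action="store_true", dest="explicit",
                        help="Explicit language/script, no fuzzy matching")
    parser.add_argument("--stats", action="store_true", dest="stats",
                        help="List supported scripts and orthographies")
    return parser


# Parse a sample command line instead of sys.argv.
opts = vars(build_parser().parse_args(["-l", "ru", "-x"]))
print(opts)
```

With `action="store_true"`, `explicit` and `stats` default to `False`, so `main()` falls through to transliteration with fuzzy matching unless a flag is passed.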