Updated module to Unicode 15.1 #4

Open
wants to merge 36 commits into
base: master
f75e77b
Update .gitignore
gaspardpetit Jan 4, 2024
1d298ea
Added unittests
gaspardpetit Jan 4, 2024
5931ff6
Create python-package.yml
gaspardpetit Jan 4, 2024
dfb1899
Create pylint.yml
gaspardpetit Jan 4, 2024
3bae7f9
Renamed unittests with proper extension
gaspardpetit Jan 4, 2024
5cf318c
Update pylint.yml
gaspardpetit Jan 4, 2024
320967f
pylint fixes
gaspardpetit Jan 4, 2024
d50a4c1
pylint fixes for setup.py
gaspardpetit Jan 4, 2024
7e7d25d
Merge pull request #2 from gaspardpetit/unittest_and_lint
gaspardpetit Jan 4, 2024
a3d5ed2
Reorganize lookup, providing ~300x performance improvements
gaspardpetit Jan 4, 2024
abcdcc9
Merge pull request #3 from gaspardpetit/optimize_performance
gaspardpetit Jan 4, 2024
1dcfd08
Sorted SCRIPT_ABBREVS and BUCKETS to facilitate update comparison
gaspardpetit Jan 4, 2024
352899c
Updated unidate to Unicode 15.1
gaspardpetit Jan 4, 2024
c3ec13e
Merge pull request #4 from gaspardpetit/update_unicode_15_1
gaspardpetit Jan 4, 2024
9842cdb
Update python-package.yml
gaspardpetit Jan 4, 2024
bf2c0f6
Update python-package.yml
gaspardpetit Jan 4, 2024
6d5d9ab
Merge pull request #5 from gaspardpetit/gaspardpetit-patch-1
gaspardpetit Jan 4, 2024
ee20325
enable doctests in pytest
gaspardpetit Jan 4, 2024
7e3c822
Fixed documentation sample
gaspardpetit Jan 4, 2024
ddad13a
Merge pull request #6 from gaspardpetit/enable-doctest-modules
gaspardpetit Jan 4, 2024
161b211
Enable coverage of README.md code in unittests
gaspardpetit Jan 4, 2024
ee5388d
Merge pull request #7 from gaspardpetit/enable_readme_tests
gaspardpetit Jan 4, 2024
ce6710b
Generating Scripts class with constants for each supported script name
gaspardpetit Jan 4, 2024
0d52387
Merge pull request #8 from gaspardpetit/generate_script_name_constants
gaspardpetit Jan 4, 2024
a49225b
Updated README - added badges and removed dumb and slow mention
gaspardpetit Jan 4, 2024
fd1f6e4
Merge pull request #9 from gaspardpetit/update_readme
gaspardpetit Jan 4, 2024
3b83c61
Updated setup.py
gaspardpetit Jan 4, 2024
f44d70b
Merge pull request #10 from gaspardpetit/update_setup
gaspardpetit Jan 4, 2024
a32c005
Added a get_scripts method to get the scripts on text
gaspardpetit Jan 4, 2024
ae4fdca
Merge pull request #11 from gaspardpetit/implement_get_scripts_method
gaspardpetit Jan 4, 2024
309069f
Pin version of the module to the version of Unicode
gaspardpetit Jan 4, 2024
7adfeda
pylint fix
gaspardpetit Jan 4, 2024
542f2ee
Merge pull request #12 from gaspardpetit/pin_version_to_unicode
gaspardpetit Jan 4, 2024
ff7170b
Added support for scripts in range 0x010000 to 0x100000
gaspardpetit Jan 4, 2024
84d19c5
pylint fix
gaspardpetit Jan 4, 2024
9bbb38c
Merge pull request #13 from gaspardpetit/support_range_0x110000
gaspardpetit Jan 4, 2024
27 changes: 27 additions & 0 deletions .github/workflows/pylint.yml
@@ -0,0 +1,27 @@
name: Pylint

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10"]
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pylint
    - name: Analysing the code with pylint
      run: |
        pylint $(git ls-files '*.py')
40 changes: 40 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,40 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest --doctest-modules
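
Reviewer note: the `pytest --doctest-modules` step collects the `>>>` examples found in docstrings and runs them as tests. The sketch below shows the general mechanism with a hypothetical function `double` (not part of this PR), and uses the standard-library `doctest` module directly instead of pytest so that it runs on its own.

```python
import doctest


def double(value):
    """Return twice the given value (hypothetical example function).

    The examples below are what a doctest runner would execute:

    >>> double(21)
    42
    >>> double('ab')
    'abab'
    """
    return value * 2


if __name__ == "__main__":
    # testmod() runs every docstring example in this module and
    # returns a summary with the number of failed and attempted examples.
    results = doctest.testmod()
    print(results)
```

With pytest, the same docstring would be picked up automatically when the module is collected with `--doctest-modules`, so no explicit `testmod()` call is needed.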
169 changes: 160 additions & 9 deletions .gitignore
@@ -1,9 +1,160 @@
__pycache__
doc
update/PropertyValueAliases.txt
update/ScriptExtensions.txt
update/Scripts.txt
dist
*.egg-info
build
files.txt
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
99 changes: 74 additions & 25 deletions README.md
@@ -1,41 +1,90 @@
[![Pylint](https://github.com/gaspardpetit/uniscripts/actions/workflows/pylint.yml/badge.svg)](https://github.com/gaspardpetit/uniscripts/actions/workflows/pylint.yml)
[![Python package](https://github.com/gaspardpetit/uniscripts/actions/workflows/python-package.yml/badge.svg)](https://github.com/gaspardpetit/uniscripts/actions/workflows/python-package.yml)
[![PyPI version](https://badge.fury.io/py/uniscripts.svg)](https://pypi.python.org/pypi/uniscripts/)
[![Python versions](https://img.shields.io/pypi/pyversions/uniscripts.svg)](https://pypi.org/project/uniscripts/)
[![Unicode versions](https://img.shields.io/badge/Unicode%20-15.1-blue.svg)](https://www.unicode.org/charts/)
[![License: CC0-1.0](https://img.shields.io/badge/License-CC0_1.0-lightgrey.svg)](http://creativecommons.org/publicdomain/zero/1.0/)

# Uniscripts

Simple Python 3 module to query Unicode UCD script metadata (see UAX #24).

This module is useful for querying if a text is made of Latin characters,
Arabic, hiragana, kanji (han), and so on. It works for all scripts supported
by the Unicode character database.

This module is dumb and slow. If you need speed, you probably want to
implement your own functions. See e.g. `man pcreunicode`, `man pcrepattern`
(`grep -P` supports `\p`). As of this writing, the next-generation of Python
regexpes, available as the pypi library `regex`, also supports `\p`.

Sample usage:

>>> import uniscripts
>>> uniscripts.is_script('A', 'Latin')
True
### Verify if a string is of a given script:

```python
>>> from uniscripts import is_script, Scripts

>>> is_script('A', Scripts.LATIN)
True

# if you pass it a string, all characters must match
>>> is_script('はるはあけぼの', Scripts.HIRAGANA)
True

>>> is_script('はるはAkebono', Scripts.HIRAGANA)
False

# ...but by default, it ignores 'Common' characters, such as punctuation.
>>> is_script('はるは:あけぼの', Scripts.HIRAGANA)
True

>>> is_script('中華人民共和国', Scripts.HAN) # 'Han' = kanji or hànzì
True

```
See docstrings for `is_script()`.


### Detect the script of a character:

```python
>>> from uniscripts import which_scripts

>>> which_scripts('z')
['Latin']

>>> which_scripts('は')
['Hiragana']

>>> which_scripts('ー') # U+30FC
['Bopomofo', 'Common', 'Han', 'Hangul', 'Hiragana', 'Katakana', 'Yi']

```
See docstrings for `which_scripts()`.


### Detect the script of a text

```python
>>> from uniscripts import get_scripts
>>> sorted(get_scripts("こんにちは"))
['Hiragana']

>>> sorted(get_scripts("チョコレート"))
['Bopomofo', 'Common', 'Han', 'Hangul', 'Hiragana', 'Katakana', 'Yi']

# if you pass it a string, all characters must match
>>> uniscripts.is_script('はるはあけぼの', 'Hiragana')
True
>>> sorted(get_scripts("ਚਾਕਲੇਟ"))
['Gurmukhi']

>>> uniscripts.is_script('はるはAkebono', 'Hiragana')
False
>>> sorted(get_scripts("초콜릿"))
['Hangul']

# ...but by default, it ignores 'Common' characters, such as punctuation.
>>> uniscripts.is_script('はるは:あけぼの', 'Hiragana')
True
>>> sorted(get_scripts("σοκολάτα"))
['Greek']

>>> uniscripts.is_script('中華人民共和国', 'Han') # 'Han' = kanji or hànzì
True
>>> sorted(get_scripts("شوكولاتة"))
['Arabic']

>>> uniscripts.which_scripts('z')
['Latin']
>>> sorted(get_scripts("chocolat"))
['Common', 'Latin']

>>> uniscripts.which_scripts('は')
['Hiragana']
```

>>> uniscripts.which_scripts('ー') # U+30FC
['Common', 'Katakana', 'Hiragana', 'Hangul', 'Han', 'Bopomofo', 'Yi']
See docstrings for `get_scripts()`.

See docstrings for `is_script()`, `which_scripts()`.
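
Reviewer note: for readers unfamiliar with how a script lookup of this kind can work, here is a minimal, self-contained sketch. It is not the module's implementation. The names `RANGES`, `script_of`, `scripts_in`, and `all_in_script` are hypothetical, and the table is a small hand-written subset of the ranges in the UCD `Scripts.txt`. It assumes a sorted range table searched with `bisect`, and it ignores Script_Extensions, so it cannot reproduce the multi-script results the README shows for U+30FC.

```python
from bisect import bisect_right

# Hypothetical sample of (start, end, script) ranges, sorted by start.
# Only a tiny subset of Scripts.txt; anything else is reported as "Unknown".
RANGES = [
    (0x0020, 0x0040, "Common"),    # space, ASCII digits and punctuation
    (0x0041, 0x005A, "Latin"),     # A-Z
    (0x0061, 0x007A, "Latin"),     # a-z
    (0x3041, 0x3096, "Hiragana"),
    (0x30A1, 0x30FA, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
]
STARTS = [start for start, _end, _name in RANGES]


def script_of(char):
    """Return the script of one character, or 'Unknown' if not in the table."""
    code = ord(char)
    # Find the last range whose start is <= code, then check its end.
    index = bisect_right(STARTS, code) - 1
    if index >= 0:
        start, end, name = RANGES[index]
        if start <= code <= end:
            return name
    return "Unknown"


def scripts_in(text):
    """Return the set of scripts found in a text."""
    return {script_of(char) for char in text}


def all_in_script(text, script, ignore=("Common",)):
    """Check that every character outside the ignored scripts is in `script`."""
    found = (script_of(char) for char in text)
    return all(name == script for name in found if name not in ignore)


if __name__ == "__main__":
    print(script_of("A"))                             # Latin
    print(all_in_script("はるは:あけぼの", "Hiragana"))  # True
    print(all_in_script("はるはAkebono", "Hiragana"))   # False
    print(sorted(scripts_in("chocolat")))             # ['Latin']
```

A binary search over sorted ranges costs O(log n) per character, compared with a linear scan over every range. Whether that matches the reorganization behind the "~300x" commit is not shown in this diff.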