-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Batteries included python package #310
Conversation
* Remove doc for moving into python * Include python.rst in toc * Use 18.04 cp3.7 wheel * Integrating python documentation
Default is a merge by next Monday in whatever state the code is. Should anyone have the time and courage to - please do look at the 1500 line diff and leave comments. I expect to address critical stuff. Non-critical comments will still be addressed, hopefully as smaller changes behind this PR. There are known rough edges, but what is here is a minimum-viable product. @jelmervdl finds the prospect of using python HTML experimentation and evaluations useful. Life would be easier with 2 repositories than the 3 (bergamot-translator, lemonade, tagtransfer) we're working with. @kpu has given a no-objection for python bindings sitting somewhere not slowing down the library source and has also kindly observed that this also happens to be python bindings for "marian", which bergamot is built on top of. macOS buildability and python target only builds are courtesy @jelmervdl. A lot of workflow code and model zoo stuff is thanks to XapaJIaMnu/translateLocally - @jelmervdl and @XapaJIaMnu. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes to self... and call for help.
bindings/python/__init__.py
Outdated
"browsermt", "https://translatelocally.com/models.json" | ||
), | ||
TranslateLocallyLike( | ||
"opus", "https://object.pouta.csc.fi/OPUS-MT-models/app/models.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attribution...? Help on howto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the bergamot.cpp file, I approve.
inventory = requests.get(self.url).text | ||
with open(self.models_file_path, "w+") as models_file: | ||
models_file.write(inventory) | ||
self.data = json.loads(inventory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the models.json and model_info.json to contain repository information. Eg: https://translatelocally.com/models.json
https://github.com/browsermt/students/blob/master/enfa/enfa.student.tiny11/model_info.json#L7
bindings/python/bergamot.cpp
Outdated
py::arg("numWorkers") = 1, py::arg("cacheEnabled") = false, | ||
py::arg("cacheSize") = 20000, py::arg("cacheMutexBuckets") = 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does cacheSize = 20000
but cacheEnabled = false
entail that it does allocate space for 20k entries, but never use it by default? Is cacheMutexBuckets=1
a good default if someone were to change the numWorkers
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, w.r.t cacheMutexBuckets
, I want to stop exposing them to the user, same as what translateLocally is doing. As for the cacheEnabled
what do you think of wrapping cache in std::optional
? It is also possible to set a per request enabled/disabled as well even if we've created the cache-store.
In any case, choosing to defer this for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, w.r.t cacheMutexBuckets, I want to stop exposing them to the user, same as what translateLocally is doing.
I think this is a good idea. Like, there has to be a theoretical optimum value with respect to the number of threads and the cacheSize. bergamot-translator can figure out that relation. That shouldn't be up to the user.
As for the cacheEnabled what do you think of wrapping cache in std::optional? It is also possible to set a per request enabled/disabled as well even if we've created the cache-store.
My problem is that you allocate a 20.000 ptr vector by default, even when cacheEnabled is false. Easiest would be to just add a ternary operator here to set cacheSize to 0 and save a dynamic allocation for it when cacheEnabled is off.
Other options would be making cache a Ptr (so you can manage it on your own, like clearing it when switching models in translateLocally since you know that old entries wont be very relevant at that point, or keep it between AsyncService changes when changing number of workers or something) or keep it like it is now, but having cacheSize=0 shortcut all calls into it or something. Or indeed wrap it in a std::optional to save it from being initialised when cacheEnabled=0.
Tested it locally, and after applying the macOS patch it compiles and installs like a charm! One thing I did notice as an annoyance while working in a jupyter notebook with it: You cannot recreate service due the logger issue (can't find the ticket for it anymore) where there can only be one instance of edit: found it again! #290 |
This PR imports python bindings from https://github.com/jerinphilip/lemonade along with a bit of python code that makes this repository a little easier to start with.
The additions bring about the following that is of value:
Summary of Changes
bindings/python
folder containing python bindings and additional python code for batteries included package.3rd_party/pybind11
, new dependency.doc/conf.py
anddoc/requirement.txt
changes for packages (argparse docs).doc/python.rst
, wired todoc/index.rst
to enable python documentation.CMakeLists.txt
changes to allow python to integrate into the existing source..github/workflows/python.yml
imports workflows for building wheels on python..github/workflows/doc.yml
job is moved into.github/workflows/python.yml
to be ran after the wheel is built to tap into python documentation.Python checks
All python code in this repository (including
doc/conf.py
) is made to adhere to the following:In addition, since we lack serious tests a static type checker from google (pytype) is used to fish out stuff. Addition of a proper testing framework will only happen after this PR if at all. For now, command line is tested through GitHub CI for no error runs which should cover a decent chunk of code.
Where all does it work?
Currently supports Linux and at least one Mac (although adding more should be easier).
One build targets Colab (Could be useful for #207). MKL etc appears to be linked statically thus making it easy to distribute. The pre-built wheels have been tested to be smoothly running on:
@jerinphilip has had GLIBC version issues when trying Ubuntu 18.04 py3.8 with Ubuntu 20.04 built wheels. The general solution seems to be to go for a manylinux wheel built on a CentOS image. (Related: ugermann/ssplit-cpp#30). @jerinphilip is also squatting on pypi package name, for until the (manylinux) wheel is ready.
Fixes #234.