Terminology #448

Open
wants to merge 45 commits into
base: main
Changes from 7 commits
45 commits
c43c57a
Basic terminology API
XapaJIaMnu May 9, 2023
06ec79b
reference to the code
XapaJIaMnu May 9, 2023
867fc6c
Update marian with gcc 12
kpu May 9, 2023
9153041
WiP python iface
XapaJIaMnu May 25, 2023
e5977e6
More WiP
XapaJIaMnu May 26, 2023
b1cb3bc
Works except stdin
XapaJIaMnu May 26, 2023
7a93f7c
Python interface
XapaJIaMnu May 26, 2023
5be7b96
Merge branch 'main' into terminology
XapaJIaMnu May 26, 2023
c1a659e
Small fixes, removes pybind submodule
XapaJIaMnu May 26, 2023
1f8ba76
Allow dictionary maps. Work in progress
XapaJIaMnu May 26, 2023
cc44014
Convert the map to python map
XapaJIaMnu May 26, 2023
6c7fe75
Allow dictionary terminology set up
XapaJIaMnu May 26, 2023
c586e09
Attempt to install pybind11 for the wheel build
XapaJIaMnu May 26, 2023
26529dc
Merge branch 'main' into terminology
XapaJIaMnu Jun 6, 2023
82cc687
Add support for different terminology format
XapaJIaMnu Jun 13, 2023
5c9161b
Try to update the workflows.
XapaJIaMnu Jun 14, 2023
7d6f4e5
Refactor terminology replace
jelmervdl Jun 15, 2023
f53879d
Fix formatting
jelmervdl Jun 15, 2023
a95001d
Update marian dev which should allow for compilation on newer platforms
XapaJIaMnu Jun 18, 2023
316c5dd
Fix for latest argparse
XapaJIaMnu Jun 28, 2023
58e5363
technology -> terminology
kpu Jun 28, 2023
0a6be45
Buffer input for efficiency
kpu Jun 28, 2023
ca37e8f
Pass terminology_form from CLI to Translator
graemenail Jul 4, 2023
4011f88
Leave USE_STATIC_LIBS off by default
kpu Jul 9, 2023
19ca40d
Enable cuda compilation
XapaJIaMnu Aug 1, 2023
1a8b90c
Merge branch 'main' into terminology
XapaJIaMnu Aug 1, 2023
1e80e79
Working, except in python
XapaJIaMnu Aug 2, 2023
3d37edf
Simplify invocation a bit
XapaJIaMnu Aug 2, 2023
e5d4ed0
Formatting fixes
XapaJIaMnu Aug 2, 2023
72ade1d
Update the terminology format
XapaJIaMnu Aug 4, 2023
5f9858f
Merge branch 'main' into terminology
XapaJIaMnu Aug 8, 2023
168d589
Use 0 GPU workers by default
XapaJIaMnu Aug 9, 2023
3eab045
Attempt to fix tests
XapaJIaMnu Aug 9, 2023
88e7f28
Fix error in workflow syntax
XapaJIaMnu Aug 9, 2023
1db9d09
Fix typing error
XapaJIaMnu Aug 9, 2023
537f4e1
I hate python linters
XapaJIaMnu Aug 9, 2023
042acc2
pytype can't access C++ modules
XapaJIaMnu Aug 9, 2023
e3b4a7c
Small fixes
XapaJIaMnu Aug 11, 2023
05a7379
Merge branch 'main' into terminology
XapaJIaMnu Oct 2, 2023
5479c20
Merge with main
XapaJIaMnu Oct 2, 2023
d2356a6
Merge branch 'main' into terminology
kpu Dec 7, 2023
97c8da4
Pull in submodule fixing clang compilation
kpu Dec 7, 2023
095d602
Update marian-dev with newer fbgemm for clang
kpu Dec 7, 2023
007b578
Merge branch 'main' into terminology
kpu Dec 7, 2023
2417225
Merge branch 'main' into terminology
kpu Dec 7, 2023
6 changes: 3 additions & 3 deletions 3rd_party/CMakeLists.txt
@@ -31,6 +31,6 @@ get_directory_property(CMAKE_CXX_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_CXX
set(CMAKE_C_FLAGS ${CMAKE_C_FLAGS} PARENT_SCOPE)
set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE)

-if(COMPILE_PYTHON)
-add_subdirectory(pybind11)
-endif(COMPILE_PYTHON)
+#if(COMPILE_PYTHON)
+# add_subdirectory(pybind11)
+#endif(COMPILE_PYTHON)
2 changes: 1 addition & 1 deletion 3rd_party/marian-dev
Submodule marian-dev updated 1 files
+1 −1 CMakeLists.txt
52 changes: 52 additions & 0 deletions README.md
@@ -85,3 +85,55 @@ A short example of how to use the APIs is provided in `app/bergamot.cpp` file.
### Using WASM version

Please follow the `README` inside the `wasm` folder of this repository that demonstrates how to use the translator in JavaScript.

### Using the Python API

Compile and install:
```
export CMAKE_BUILD_PARALLEL_LEVEL=8 # Use 8 cores to compile
pip install wheel
pip install .

# Desktop app
% bergamot-translator --help
bergamot-translator interface

options:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
Model YML configuration input.
--num-workers NUM_WORKERS, -n NUM_WORKERS
Number of CPU workers.
--logging LOGGING, -l LOGGING
Set verbosity level of logging: trace, debug, info, warn, err(or), critical, off. Default is off
--cache-size CACHE_SIZE
Cache size. 0 for caching is disabled
--terminology-tsv TERMINOLOGY_TSV, -t TERMINOLOGY_TSV
Path to a terminology file TSV
--force-terminology, -f
Force terminology to appear on the target side.
--path-to-input PATH_TO_INPUT, -i PATH_TO_INPUT
Path to input file. Uses stdin if empty
```
Using the Python interface:
```python
from bergamot.translator import Translator
print(Translator.__doc__)
Bergamot translator interfacing with the C++ code.

Attributes:
num_workers: Number of parallel CPU workers.
cache: Cache size. 0 to disable cache.
logging: Log level: trace, debug, info, warn, err(or), critical, off. Default is off
terminology: Path to a TSV terminology file
force_terminology: Force the terminology to appear on the target side. May affect translation quality negatively.

_config: Translation model config
_model: Translation model
_responseOpts: What to include in the response (alignment, html restoration, etc.)
_service: The translation service

translator = Translator("/path/to/model.npz.best-bleu.npz.decoder.brg.yml", terminology="/path/to/terminology.tsv")
translator.translate(["text"])
output
```
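The README text above does not show the terminology file itself. As a minimal sketch, assuming the format read by the loader in `src/translator/service.cpp` in this diff (one `source<TAB>replacement` pair per line), such a file could be produced from Python as follows. The helper name `write_terminology` and the example terms are hypothetical and not part of the PR.

```python
from typing import Dict


def write_terminology(path: str, terms: Dict[str, str]) -> None:
    """Write one `source<TAB>replacement` pair per line (hypothetical helper)."""
    lines = []
    for source, replacement in terms.items():
        # A tab or newline inside a term would corrupt the two-column layout.
        if any(ch in source + replacement for ch in "\t\r\n"):
            raise ValueError(f"term contains tab or newline: {source!r}")
        lines.append(f"{source}\t{replacement}\n")
    # newline="\n" keeps Unix line endings on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.writelines(lines)
```

The resulting path is what would be passed as `terminology=` to `Translator` or as `--terminology-tsv` on the command line.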
1 change: 1 addition & 0 deletions bindings/python/CMakeLists.txt
@@ -1,3 +1,4 @@
find_package(pybind11 REQUIRED)
find_package(Python COMPONENTS Interpreter Development.Module REQUIRED)

message("Using Python: " ${Python_EXECUTABLE})
12 changes: 9 additions & 3 deletions bindings/python/bergamot.cpp
@@ -198,16 +198,22 @@ PYBIND11_MODULE(_bergamot, m) {
.def("pivot", &ServicePyAdapter::pivot);

py::class_<Service::Config>(m, "ServiceConfig")
-.def(py::init<>([](size_t numWorkers, size_t cacheSize, std::string logging) {
+.def(py::init<>([](size_t numWorkers, size_t cacheSize, std::string logging,
+std::string pathToTerminologyFile, bool terminologyForce) {
Service::Config config;
config.numWorkers = numWorkers;
config.cacheSize = cacheSize;
config.logger.level = logging;
+config.terminologyFile = pathToTerminologyFile;
+config.terminologyForce = terminologyForce;
return config;
}),
-py::arg("numWorkers") = 1, py::arg("cacheSize") = 0, py::arg("logLevel") = "off")
+py::arg("numWorkers") = 1, py::arg("cacheSize") = 0, py::arg("logLevel") = "off",
+py::arg("pathToTerminologyFile") = "", py::arg("terminologyForce") = false)
.def_readwrite("numWorkers", &Service::Config::numWorkers)
-.def_readwrite("cacheSize", &Service::Config::cacheSize);
+.def_readwrite("cacheSize", &Service::Config::cacheSize)
+.def_readwrite("pathToTerminologyFile", &Service::Config::terminologyFile)
+.def_readwrite("terminologyForce", &Service::Config::terminologyForce);

py::class_<_Model, std::shared_ptr<_Model>>(m, "TranslationModel");
}
109 changes: 109 additions & 0 deletions bindings/python/translator.py
@@ -0,0 +1,109 @@
#!/usr/bin/env python3
import bergamot
import argparse
from sys import stdin
from typing import List

class Translator:
"""Bergamot translator interfacing with the C++ code.

Attributes:
num_workers: Number of parallel CPU workers.
cache: Cache size. 0 to disable cache.
logging: Log level: trace, debug, info, warn, err(or), critical, off. Default is off
terminology: Path to a TSV terminology file
force_terminology: Force the terminology to appear on the target side. May affect translation quality negatively.

_config: Translation model config
_model: Translation model
_responseOpts: What to include in the response (alignment, html restoration, etc.)
_service: The translation service
"""
num_workers: int
cache: int
logging: str
terminology: str
force_terminology: bool

_config: bergamot.ServiceConfig
_model: bergamot.TranslationModel
_responseOpts: bergamot.ResponseOptions
_service: bergamot.Service

def __init__(self, model_conifg_path: str, num_workers: int=1, cache: int=0, \
logging="off", terminology: str="", force_terminology: bool=False):
"""Initialises the translator class

:param model_conifg_path: Path to the configuration file for the translation model.
:param num_workers: Number of CPU workers.
:param cache: cache size. 0 means no cache.
:param logging: Log level: trace, debug, info, warn, err(or), critical, off.
:param terminology: Path to terminology file, TSV format
:param force_terminology: Force terminology to appear on the target side. May impact translation quality.
"""
self.num_workers = num_workers
self.cache = cache
self.logging = logging
self.terminology = terminology
self.force_terminology = force_terminology

self._config = bergamot.ServiceConfig(self.num_workers, self.cache, self.logging, self.terminology, self.force_terminology)
self._service = bergamot.Service(self._config)
self._responseOpts = bergamot.ResponseOptions() # Default false for all, if we want to enable HTML later, from here
self._model = self._service.modelFromConfigPath(model_conifg_path)

def resetTerminology(self, terminology: str="", force_terminology: bool=False) -> None:
"""Resets the terminology of the model
:param terminology: path to the terminology file.
:param force_terminology: force terminology
:return: None
"""
self.terminology = terminology
self.force_terminology = force_terminology
self._config = bergamot.ServiceConfig(self.num_workers, self.cache, self.logging, self.terminology, self.force_terminology)
self._service = bergamot.Service(self._config)

def resetNumWorkers(self, num_workers) -> None:
"""Resets the number of workers
:param num_workers: number of parallel CPU threads.
:return: None
"""
self.num_workers = num_workers
self._config = bergamot.ServiceConfig(self.num_workers, self.cache, self.logging, self.terminology, self.force_terminology)
self._service = bergamot.Service(self._config)

def translate(self, sentences: List[str]) -> str:
"""Translates a list of strings
:param sentences: A List of strings to be translated.
:return: Translation output.
"""
responses = self._service.translate(self._model, bergamot.VectorString(sentences), self._responseOpts)
ret = ""
for response in responses:
ret = ret + response.target.text
return ret
#@TODO add async translate with futures

def main():
parser = argparse.ArgumentParser(description="bergamot-translator interface")
parser.add_argument("--config", '-c', required=True, type=str, help='Model YML configuration input.')
parser.add_argument("--num-workers", '-n', type=int, default=1, help='Number of CPU workers.')
parser.add_argument("--logging", '-l', type=str, default="off", help='Set verbosity level of logging: trace, debug, info, warn, err(or), critical, off. Default is off')
parser.add_argument("--cache-size", type=int, default=0, help='Cache size. 0 for caching is disabled')
parser.add_argument("--terminology-tsv", '-t', default="", type=str, help='Path to a terminology file TSV')
parser.add_argument("--force-terminology", '-f', action="store_true", help='Force terminology to appear on the target side.')
parser.add_argument("--path-to-input", '-i', default=None, type=str, help="Path to input file. Uses stdin if empty")
args = parser.parse_args()

translator = Translator(args.config, args.num_workers, args.cache_size, args.logging, args.terminology_tsv, args.force_terminology)

if args.path_to_input is not None:
with open(args.path_to_input, 'r', encoding='utf-8') as infile:
lines = infile.readlines()
print(translator.translate(lines))
else:
for line in stdin:
print(translator.translate([line.strip()]))

if __name__ == '__main__':
main()
7 changes: 4 additions & 3 deletions setup.py
@@ -195,8 +195,8 @@ def run(self):
setup(
name="bergamot",
version=version,
-author="Jerin Philip",
-author_email="[email protected]",
+author=["Jerin Philip", "Nikolay Bogoychev"],
+author_email=["[email protected]", "[email protected]"],
url="https://github.com/browsermt/bergamot-translator/",
description="Translate text-content locally in your machine across languages.",
long_description=long_description,
@@ -209,10 +209,11 @@ def run(self):
python_requires=">=3.6",
packages=["bergamot"],
package_dir={"bergamot": "bindings/python"},
-install_requires=["requests", "pyyaml>=5.1", "appdirs"],
+install_requires=["requests", "pyyaml>=5.1", "appdirs", "pybind11"],
entry_points={
"console_scripts": [
"bergamot = bergamot.__main__:main",
+"bergamot-translator = bergamot.translator:main"
],
},
# Classifiers help users find your project by categorizing it.
116 changes: 116 additions & 0 deletions src/translator/service.cpp
@@ -6,12 +6,86 @@
#include "batch.h"
#include "byte_array_util.h"
#include "definitions.h"
#include <regex>

namespace marian {
namespace bergamot {

namespace {

// Replacement_fn taken from https://stackoverflow.com/questions/3418231/replace-part-of-a-string-with-another-string
// Sue me.
size_t CountOccurrences(std::string_view s, std::string_view needle) {
size_t res = 0;
size_t pos = 0;
while ((pos = s.find(needle, pos)) != std::string_view::npos) {
++res;
pos += needle.size();
}
return res;
}

std::string ReplaceNotLonger(std::string s, std::string_view what, std::string_view with) {
assert(what.size() >= with.size());
std::string_view::size_type wpos = 0;
std::string_view::size_type rpos = 0;
while (true) {
auto new_rpos = s.find(what, rpos);
if (new_rpos == std::string::npos) {
new_rpos = s.size();
}
auto n = new_rpos - rpos;
std::copy(s.begin() + rpos, s.begin() + new_rpos, s.begin() + wpos);
wpos += n;
rpos = new_rpos;
if (rpos == s.size()) {
break;
}
std::copy(with.begin(), with.end(), s.begin() + wpos);
wpos += with.size();
rpos += what.size();
}
s.resize(wpos);
return s;
}

std::string ReplaceLonger(std::string s, std::string_view what, std::string_view with) {
assert(what.size() < with.size());
auto occurrences = CountOccurrences(s, what);
auto rpos = s.size();
auto wpos = rpos + occurrences * (with.size() - what.size());
s.resize(wpos);

while (wpos != rpos) {
auto new_rpos = s.rfind(what, rpos - what.size());
if (new_rpos == std::string::npos) {
new_rpos = 0;
} else {
new_rpos += what.size();
}
auto n = rpos - new_rpos;
std::copy_backward(s.begin() + new_rpos, s.begin() + rpos, s.begin() + wpos);
wpos -= n;
rpos = new_rpos;
if (wpos == rpos) {
break;
}
std::copy_backward(with.begin(), with.end(), s.begin() + wpos);
wpos -= with.size();
rpos -= what.size();
}
return s;
}

std::string Replace(std::string s, std::string_view what, std::string_view with) {
assert(!what.empty());
if (what.size() >= with.size()) {
return ReplaceNotLonger(std::move(s), what, with);
}
return ReplaceLonger(std::move(s), what, with);
}
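
The `ReplaceNotLonger` helper above rewrites the string in place using a read cursor and a write cursor. A rough Python port, offered only as a sketch of that idea: Python strings are immutable, so a list of characters stands in for the mutable buffer, and the name `replace_not_longer` is hypothetical.

```python
def replace_not_longer(s: str, what: str, with_: str) -> str:
    """Replace every `what` by `with_`, assuming len(what) >= len(with_)."""
    assert what and len(what) >= len(with_)
    buf = list(s)  # mutable stand-in for the C++ std::string
    wpos = 0       # next position to write in buf
    rpos = 0       # next position to read from the original text
    while True:
        new_rpos = s.find(what, rpos)
        if new_rpos == -1:
            new_rpos = len(s)
        # Shift the unchanged run left so it follows what was written so far.
        n = new_rpos - rpos
        buf[wpos:wpos + n] = s[rpos:new_rpos]
        wpos += n
        rpos = new_rpos
        if rpos == len(s):
            break
        # Write the replacement; it never overtakes the read cursor
        # because it is no longer than the text it replaces.
        buf[wpos:wpos + len(with_)] = with_
        wpos += len(with_)
        rpos += len(what)
    return "".join(buf[:wpos])
```

In everyday Python code `str.replace` does the same job; the port only illustrates why the write cursor can safely trail the read cursor when the replacement is not longer than the match.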


// Combines two responses with first.target == second.source mapping alignments etc accordingly.
// There are several constraints which are matched by only the pivoting workflow in <>Service source, therefore this
// function is not for external use and in a hidden namespace.
@@ -137,6 +211,43 @@ AsyncService::AsyncService(const AsyncService::Config &config)
logger_(config.logger) {
ABORT_IF(config_.numWorkers == 0, "Number of workers should be at least 1 in a threaded workflow");
workers_.reserve(config_.numWorkers);
// Initiate terminology map if present
if (!config_.terminologyFile.empty()) {
// Create an input filestream
std::ifstream myFile(config_.terminologyFile);

// Make sure the file is open
if(!myFile.is_open()) throw std::runtime_error("Could not open file: " + config_.terminologyFile);
std::string line;
while(std::getline(myFile, line)) {
// Create a stringstream of the current line
std::stringstream ss(line);

std::string srcword;
std::string replacementword;
getline(ss, srcword, '\t');
getline(ss, replacementword, '\n');
Member

Make sure people don't use Windows line endings, hehe. Incidentally, if we could do this terminology map loading in Python you'd get their line ending stripping code for free.
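
To illustrate the suggestion in this comment, here is a minimal sketch of loading the terminology map in Python. It assumes the same two-column `source<TAB>replacement` layout that the C++ loader in this diff reads; `load_terminology` is a hypothetical helper, not code from the PR.

```python
from typing import Dict


def load_terminology(path: str) -> Dict[str, str]:
    """Read `source<TAB>replacement` lines into a dict (hypothetical helper)."""
    terms: Dict[str, str] = {}
    # Text mode with the default newline=None translates "\r\n" and "\r"
    # to "\n", so Windows line endings never reach the replacement string.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            source, sep, replacement = line.partition("\t")
            if not sep:
                raise ValueError(f"missing tab separator: {line!r}")
            terms[source] = replacement  # a later duplicate overwrites an earlier one
    return terms
```

A dictionary built this way could then be handed to the bindings; a later commit in this PR's commit list mentions allowing dictionary maps, though that code is not part of the diff shown here.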

// @TODO it seems like removing the tags forces the model to copy which is
// I guess just as good and more reliable. In that case we just don't tell the model
// what the original source is and it just has no choice BUT to generate the target.
Member

Edit: ah you say it here explicitly. Copy is a copy with the assumption it won't try to translate because it doesn't know the translation. For Chinese <-> English I can imagine this working, but no way that English <-> French would accept something like that… right?

Collaborator Author

Typically the model just learns to copy when there's accidentally target text on the source side. I expected it would work just fine for English <-> French. Would be a problem with multilingual models, potentially.
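
The two modes discussed in this thread can be sketched in Python as follows. This mirrors the string handling visible in this diff (the `<tag0>` wrapping in the `AsyncService` constructor and the replacement loop in `AsyncService::translate`); the function names are hypothetical, and the sketch says nothing about how the model reacts to the rewritten source.

```python
from typing import Dict


def build_replacement(source: str, target: str, force: bool) -> str:
    """Build the text that is substituted for a source term."""
    if force:
        # Forced mode: the source term is dropped and only the target
        # remains, leaving the model nothing to do but copy it.
        return target
    # Tagged mode: keep the source term and append the target inside tags.
    return source + " <tag0> " + target + " </tag0> "


def apply_terminology(text: str, replacements: Dict[str, str]) -> str:
    """Apply every replacement to the input text, one term after another."""
    for source, replacement in replacements.items():
        text = text.replace(source, replacement)
    return text
```

Because matching is plain substring search, a term such as `cat` would also match inside `concatenate`; the same holds for the `find`-based C++ helpers in this diff.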

if (!config_.terminologyForce) {
replacementword = srcword + " <tag0> " + replacementword + " </tag0> ";
}
this->terminologyMap_.insert({srcword, replacementword});
}

// Close file
myFile.close();

//Testing
if (config.logger.level == "debug") {
std::cerr << "Printing out terminology...:" << std::endl;
for (auto&& item : terminologyMap_) {
std::cerr << item.first << " " << item.second << std::endl;
}
}
}

for (size_t cpuId = 0; cpuId < config_.numWorkers; cpuId++) {
workers_.emplace_back([cpuId, this] {
// Consumer thread main-loop. Note that this is an infinite-loop unless the monitor is explicitly told to
@@ -202,6 +313,11 @@ void AsyncService::pivot(std::shared_ptr<TranslationModel> first, std::shared_pt
void AsyncService::translate(std::shared_ptr<TranslationModel> translationModel, std::string &&source,
CallbackType callback, const ResponseOptions &responseOptions) {
// Producer thread, a call to this function adds new work items. If batches are available, notifies workers waiting.
// Tagging
for (auto&& terminologyPair : terminologyMap_) {
source = Replace(source, terminologyPair.first, terminologyPair.second);
}

Ptr<HTML> html = std::make_shared<HTML>(std::move(source), responseOptions.HTML);
auto internalCallback = [html, callback](Response &&response) {
html->restore(response);