how to parallel articles #3
Comments
@wammar any updates on this? Since the titles are too short in most cases, they wouldn't suffice to build a well-crafted MT system.
Automatically aligning sentence pairs is a non-trivial task, perhaps a few orders of magnitude larger than this repo. For getting sentence pairs automatically aligned from within articles, I recommend WikiMatrix. See https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix and https://ai.facebook.com/blog/wikimatrix/ The main downside, in my view, is that it doesn't cover low-resource languages/pairs. (There is also the newer and larger CCMatrix in the same repo, but its extraction script is not ready yet, and it covers even fewer language pairs.)
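To give a feel for what "automatically aligning" sentences from two articles involves, here is a minimal sketch, not the WikiMatrix pipeline itself. It assumes each sentence has already been mapped to a vector by some multilingual sentence encoder (the toy vectors below are made up), and it pairs sentences that are each other's nearest neighbour by plain cosine similarity. The function names and the `threshold` value are hypothetical, chosen only for illustration; real mining systems use more robust scoring than this.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_nearest_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for sentences that pick each other.

    A pair is kept only if the target is the source's best match, the source
    is the target's best match, and the score reaches the (assumed) threshold.
    """
    if not src_vecs or not tgt_vecs:
        return []
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target index for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Best source index for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


# Toy embeddings standing in for encoder output (illustrative values only).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

pairs = mutual_nearest_pairs(src, tgt)
for i, j, score in pairs:
    print(f"source sentence {i} <-> target sentence {j} (score {score:.3f})")
```

In this toy run the third source sentence and the third target sentence stay unaligned, which is the common case for article bodies: most sentences in one language version have no counterpart in the other.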
Sorry for the slow reply and thanks for the suggestion @Tellyang7! Expanding this to cover article content will definitely give richer text, at the expense of higher complexity in deciding which parts are parallel. I'm not actively working on this, but feel free to contribute a new script and I'd be happy to merge.
There is no doubt that this work is very powerful. I successfully extracted Chinese-to-English title pairs with it. My question is that the titles contain too little text. Is there any way to extract the content of the articles as well? How should I go about it?