how to parallel articles #3

Open
Tellyang7 opened this issue Nov 11, 2019 · 3 comments


@Tellyang7

This work is very powerful, and I successfully ran it for the Chinese-English pair. My problem is that the titles contain too little text. Is there any way to extract the content of the articles as well? How should I go about it?

@VP007-py

VP007-py commented Jan 9, 2020

@wammar any updates on this?

Since the titles are too short in most cases, they would not suffice to build a well-crafted MT system.

@bittlingmayer

bittlingmayer commented Feb 19, 2020

Automatically aligning sentence pairs is a non-trivial task, perhaps a few orders of magnitude larger than this repo.

For getting sentence pairs automatically aligned from within articles, I recommend WikiMatrix.

See https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix and https://ai.facebook.com/blog/wikimatrix/

The main downside in my view is that it doesn't cover low-resource languages/pairs.

(There is also the newer and larger CCMatrix in the same repo, but its extraction script is not ready yet, and it covers even fewer language pairs.)
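
For anyone who wants to try WikiMatrix, here is a minimal sketch of reading a downloaded bitext file and keeping only the higher-scoring pairs. It assumes the file is a gzipped TSV with three tab-separated fields per line (margin score, source sentence, target sentence), which is the layout as I understand it from the WikiMatrix README. The file name `WikiMatrix.en-zh.tsv.gz`, the function name `read_wikimatrix_pairs`, and the default threshold of 1.04 are illustrative assumptions, so check them against the README and tune the threshold for your language pair.

```python
import gzip
from typing import Iterator, Tuple


def read_wikimatrix_pairs(path: str, threshold: float = 1.04) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs whose margin score is >= threshold.

    Assumed line layout (not verified here): score <TAB> source <TAB> target.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue  # skip lines that do not match the assumed layout
            try:
                score = float(fields[0])
            except ValueError:
                continue  # skip lines whose first field is not a number
            if score >= threshold:
                yield fields[1], fields[2]


if __name__ == "__main__":
    # Hypothetical file name; replace with the file you actually downloaded.
    for source, target in read_wikimatrix_pairs("WikiMatrix.en-zh.tsv.gz"):
        print(source, "|||", target)
```

A higher threshold keeps fewer but cleaner pairs; a lower one keeps more pairs at the cost of more misaligned sentences.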

@wammar
Member

wammar commented Feb 19, 2020

Sorry for the slow reply and thanks for the suggestion @Tellyang7! Expanding this to cover article content will definitely give richer text, at the expense of higher complexity in deciding which parts are parallel. I'm not actively working on this but feel free to contribute a new script and I'd be happy to merge.
