
Parsing bug with multiple workers #74

Open
itaiperi opened this issue Jan 29, 2019 · 6 comments

@itaiperi (Contributor) commented Jan 29, 2019

Hey,
I've found a bug occurring when using multiple workers.
Take for example the tinywiki dataset.

When I run the following code:

const dumpster = require('dumpster-dive');
const options = {
  file: process.argv[2],
  db: 'tinywiki',
  skip_redirects: false,
  skip_disambig: false,
  batch_size: 1000,
  workers: 4,
  // log each article's title and parsed text length, store nothing extra
  custom: function(doc) {
    console.log(doc.title(), doc.text().length);
    return {};
  }
};
dumpster(options, () => console.log('Parsing is Done!'));

I pass the script the path to the tinywiki XML file through argv[2], which here is ./tests/tinywiki-latest-pages-articles.xml.
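For reference, assuming the snippet above is saved as parse.js (just a name I'm using here for illustration, not something in the repo), I run it as:

node parse.js ./tests/tinywiki-latest-pages-articles.xml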

When I run it with 1 worker, I get the following output:

Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 788
Redirect page 0
Disambiguation page 238
Bodmin 7921

This contradicts what I get when I run it with 4 workers (note what happens to the Big Page and Bodmin text lengths):

Redirect page 0
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 0
Disambiguation page 238
Bodmin 0

I haven't looked at how the work is divided among the workers, but my guess is that the file is getting chopped in the middle of pages, which makes their text unreadable to the parser.

Thanks!

@spencermountain (Owner) commented:

Yeah, I've seen this too. I think it's an artifact of the file being small.
When we split the file, we probably split it in the middle of an article, so to save the article, we bump the margins a little bit each time.

If you can think of a smarter method for this, I'm all for it. I've seen this before and also thought it was a bug.
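Roughly, the idea is something like this sketch (not the real dumpster-dive code, and the margin size here is made up):

const fs = require('fs');

// sketch only: split the dump into `workers` byte ranges by percentage,
// then pad each range with an overlap ("margin") so an article cut at a
// boundary is hopefully still fully contained in one of the chunks
function splitByPercent(path, workers, marginBytes) {
  const size = fs.statSync(path).size;
  const ranges = [];
  for (let i = 0; i < workers; i += 1) {
    const start = Math.floor((i / workers) * size);
    const end = Math.floor(((i + 1) / workers) * size);
    ranges.push({
      start: Math.max(0, start - marginBytes), // bump the start back a bit
      end: Math.min(size, end + marginBytes)   // and the end forward a bit
    });
  }
  return ranges;
}

console.log(splitByPercent('./tests/tinywiki-latest-pages-articles.xml', 4, 10000));

If the overlap is smaller than a long article like Bodmin, a page straddling a boundary could still come out incomplete in both chunks, which would line up with the zero lengths above.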

@itaiperi (Contributor, Author) commented:

Hmm, let me take a look. Can you please point me to the file and line of the function that does this? It would save me some time.

@spencermountain (Owner) commented:

Sorry, I looked briefly and couldn't find it. I may be wrong.

I don't believe, though, that this affects a file larger than a few pages. Please let me know if you discover anything.

The file-reader is here, and dumpster-dive is using percentages, so it could be a rounding error too.
Cheers

@itaiperi (Contributor, Author) commented:

From my understanding, it picks a specific line in the file, say at the 25% mark (for the 2nd of 4 workers). So it is very possible that it will not land exactly on a <page> line but on some other line, and the entry may be lost, because the worker doesn't have all the data it needs and cannot detect the beginning of the entry, which is indicated by the <page> line. Maybe the split even falls in the middle of a text XML tag.

I did find this occurring in a large wikidump, the Simple English Wikipedia.
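One way around it might be to nudge each worker's start offset forward to the next <page> tag before reading, so every chunk begins on a page boundary. Just a sketch of the idea (the function name and scan size are mine, not dumpster-dive's):

const fs = require('fs');

// sketch only: move a raw byte offset forward to the start of the next
// <page> tag; if none is found within scanBytes, return the offset unchanged
function alignToNextPage(path, offset, scanBytes) {
  const fd = fs.openSync(path, 'r');
  const buf = Buffer.alloc(scanBytes);
  const read = fs.readSync(fd, buf, 0, scanBytes, offset);
  fs.closeSync(fd);
  const i = buf.slice(0, read).indexOf('<page>');
  return i === -1 ? offset : offset + i;
}

// e.g. align the 25% split point for the 2nd of 4 workers:
// const start = alignToNextPage(file, Math.floor(fs.statSync(file).size * 0.25), 1 << 20);

If both the start and end of every range are aligned like this, no page gets cut in half, and the margin bumping shouldn't be needed at all.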

@spencermountain (Owner) commented:

ah, ok. shoot. I didn't think it was happening, because duplicate pages throw errors on mongo-writes, and I didn't see any.
let's try to isolate it.

@itaiperi (Contributor, Author) commented:

I don't think it's a matter of duplicates, but rather of a page being split between two different workers: neither worker gets all the information it needs, so each just skips it (moving on to the next <page> tag), and the entry ends up missing from both workers.
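Here's a tiny self-contained illustration of what I mean (toy data, not dumpster-dive's code): split a string of three pages at a point inside the second page, and count how many complete <page>...</page> blocks each half can see.

// toy dump with three pages; the split lands inside page B
const xml =
  '<page><title>A</title><text>aaa</text></page>' +
  '<page><title>B</title><text>bbbbbbbb</text></page>' +
  '<page><title>C</title><text>ccc</text></page>';

const split = xml.indexOf('bbbb'); // somewhere in the middle of page B
const chunks = [xml.slice(0, split), xml.slice(split)];

// a worker that only keeps pages it can see in full skips the partial one
const countPages = chunk => (chunk.match(/<page>[\s\S]*?<\/page>/g) || []).length;

console.log(chunks.map(c => countPages(c))); // [ 1, 1 ]  -> page B is missing from both
console.log(countPages(xml));                // 3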
