
Parsing bug with multiple workers #74

Open
itaiperi opened this issue Jan 29, 2019 · 6 comments

@itaiperi (Contributor) commented Jan 29, 2019

Hey,
I've found a bug occurring when using multiple workers.
Take for example the tinywiki dataset.

When I run the following code:

const dumpster = require('dumpster-dive');
const options = {
  file: process.argv[2],
  db: 'tinywiki',
  skip_redirects: false,
  skip_disambig: false,
  batch_size: 1000,
  workers: 4,
  // log each article's title and parsed text length, store nothing extra
  custom: function(doc) {
    console.log(doc.title(), doc.text().length);
    return {};
  }
};
dumpster(options, () => console.log('Parsing is Done!'));

I pass the script the path to the tinywiki XML file through argv[2], which here is ./tests/tinywiki-latest-pages-articles.xml.
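For reference, assuming the snippet above is saved as parse.js (just a name I'm using here for illustration, not something in the repo), I run it as:

node parse.js ./tests/tinywiki-latest-pages-articles.xml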

When I run it with 1 worker, I get the following output:

Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 788
Redirect page 0
Disambiguation page 238
Bodmin 7921

This contradicts what I get when I run it with 4 workers (note what happens to the Big Page and Bodmin text lengths):

Redirect page 0
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 0
Disambiguation page 238
Bodmin 0

I haven't looked at how the work is divided among the workers, but my guess is that the file is getting chopped in the middle of pages, which makes their text unreadable to the parser.

Thanks!

@spencermountain (Owner) commented:

Yeah, I've seen this too. I think it's an artifact of the file being small.
When we split the file, we probably split it in the middle of an article, so to save the article, we bump the margins a little bit each time.

If you can think of a smarter method for this, I'm all for it. I've seen this before and also thought it was a bug.
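Roughly, the idea is something like this sketch (not the real dumpster-dive code, and the margin size here is made up):

const fs = require('fs');

// sketch only: split the dump into `workers` byte ranges by percentage,
// then pad each range with an overlap ("margin") so an article cut at a
// boundary is hopefully still fully contained in one of the chunks
function splitByPercent(path, workers, marginBytes) {
  const size = fs.statSync(path).size;
  const ranges = [];
  for (let i = 0; i < workers; i += 1) {
    const start = Math.floor((i / workers) * size);
    const end = Math.floor(((i + 1) / workers) * size);
    ranges.push({
      start: Math.max(0, start - marginBytes), // bump the start back a bit
      end: Math.min(size, end + marginBytes)   // and the end forward a bit
    });
  }
  return ranges;
}

console.log(splitByPercent('./tests/tinywiki-latest-pages-articles.xml', 4, 10000));

If the overlap is smaller than a long article like Bodmin, a page straddling a boundary could still come out incomplete in both chunks, which would line up with the zero lengths above.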

@itaiperi (Contributor, Author) commented:

Hmm, let me take a look. Can you please point me to the file and line of the function that does this? It would save me some time.

@spencermountain (Owner) commented:

Sorry, I looked briefly and couldn't find it. I may be wrong.

I don't believe, though, that this affects a file larger than a few pages. Please let me know if you discover anything.

The file-reader is here, and dumpster-dive is using percentages, so it could be a rounding error too.
Cheers

@itaiperi (Contributor, Author) commented:

From my understanding, it picks a specific line in the file, say at the 25% mark (for the 2nd of 4 workers). So it is very possible that it will not land exactly on a <page> line but on some other line, and the entry may be lost, because the worker doesn't have all the data it needs and cannot detect the beginning of the entry, which is indicated by the <page> line. Maybe the split even falls in the middle of a text XML tag.

I did find this occurring in a large wikidump, the Simple English Wikipedia.
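One way around it might be to nudge each worker's start offset forward to the next <page> tag before reading, so every chunk begins on a page boundary. Just a sketch of the idea (the function name and scan size are mine, not dumpster-dive's):

const fs = require('fs');

// sketch only: move a raw byte offset forward to the start of the next
// <page> tag; if none is found within scanBytes, return the offset unchanged
function alignToNextPage(path, offset, scanBytes) {
  const fd = fs.openSync(path, 'r');
  const buf = Buffer.alloc(scanBytes);
  const read = fs.readSync(fd, buf, 0, scanBytes, offset);
  fs.closeSync(fd);
  const i = buf.slice(0, read).indexOf('<page>');
  return i === -1 ? offset : offset + i;
}

// e.g. align the 25% split point for the 2nd of 4 workers:
// const start = alignToNextPage(file, Math.floor(fs.statSync(file).size * 0.25), 1 << 20);

If both the start and end of every range are aligned like this, no page gets cut in half, and the margin bumping shouldn't be needed at all.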

@spencermountain (Owner) commented:

ah, ok. shoot. I didn't think it was happening, because duplicate pages throw errors on mongo-writes, and I didn't see any.
let's try to isolate it.

@itaiperi (Contributor, Author) commented:

I don't think it's a matter of duplicates, but rather of a page being split between two different workers: neither worker gets all the information it needs, so each just skips it (moving on to the next <page> tag), and the entry ends up missing from both workers.
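Here's a tiny self-contained illustration of what I mean (toy data, not dumpster-dive's code): split a string of three pages at a point inside the second page, and count how many complete <page>...</page> blocks each half can see.

// toy dump with three pages; the split lands inside page B
const xml =
  '<page><title>A</title><text>aaa</text></page>' +
  '<page><title>B</title><text>bbbbbbbb</text></page>' +
  '<page><title>C</title><text>ccc</text></page>';

const split = xml.indexOf('bbbb'); // somewhere in the middle of page B
const chunks = [xml.slice(0, split), xml.slice(split)];

// a worker that only keeps pages it can see in full skips the partial one
const countPages = chunk => (chunk.match(/<page>[\s\S]*?<\/page>/g) || []).length;

console.log(chunks.map(c => countPages(c))); // [ 1, 1 ]  -> page B is missing from both
console.log(countPages(xml));                // 3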
