Error messages when loading wikipedia pages into mongodb #111

Open
e501 opened this issue Jul 4, 2023 · 3 comments

e501 commented Jul 4, 2023

Many thanks again for all your great work with Wikipedia dumpster-dive!

While updating my previous download of Wikipedia articles, I frequently hit a "TypeError: Cannot read properties of undefined (reading '0')" error.

The following is a snippet of the terminal output:

---Error on "Marlon Brando"
TypeError: Cannot read properties of undefined (reading '0')
at Object.blockquote (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:6225:41)
at parseTemplate (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7375:34)
at parseNested (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7651:26)
at /usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7672:30
at Array.forEach (<anonymous>)
at allTemplates (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7672:12)
at process (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7683:26)
at new Section (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:8095:7)
at parseSections (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:8681:21)
at new Document (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:8776:24)
─── worker #3 ───:
+109,000 pages
-168,559 redirects
-5,436 disambig
0 ns
#2 +500 pages - 327ms - "Yu Lan"
#1 +500 pages - 388ms - "Rosanna Cabot"
#0 +500 pages - 1s - "Myasthenia gravis"
#3 +500 pages - 281ms - "Rohan Rangarajan"

 current: 111,000 pages - "Zieria fraseri"     

#1 +500 pages - 389ms - "Northern Ireland Schools Debating Competition"
#3 +500 pages - 374ms - "Dolno Konjari"
#2 +500 pages - 357ms - "Chapman Taylor"

---Error on "Messerschmitt Me 262"
TypeError: Cannot read properties of undefined (reading '0')
at Object.blockquote (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:6225:41)
at parseTemplate (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7375:34)
at parseNested (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7651:26)
at /usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7672:30
at Array.forEach (<anonymous>)
at allTemplates (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7672:12)
at process (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:7683:26)
at new Section (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:8095:7)
at parseSections (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:8681:21)
at new Document (/usr/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/builds/wtf_wikipedia.cjs:8776:24)
─── worker #0 ───:
+112,500 pages
-177,470 redirects
-5,613 disambig
0 ns
─── worker #1 ───:
+112,500 pages
-177,608 redirects
-5,621 disambig
0 ns
─── worker #2 ───:
+112,500 pages
-177,616 redirects
-5,621 disambig
0 ns

 current: 112,500 pages - "Tom Mackie"     

#1 +500 pages - 322ms - "Louis Secretan"
─── worker #3 ───:
+113,000 pages
-177,895 redirects
-5,632 disambig

Please let me know if there may be a configuration issue for my download.

Greatly appreciate your help in tracking down this issue.
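
For readers unfamiliar with this class of error, here is a minimal sketch of how a "Cannot read properties of undefined (reading '0')" TypeError can arise and be guarded against. The handler and the `list` field below are hypothetical and are not taken from wtf_wikipedia's source; the trace above only shows that something undefined was indexed with `[0]` inside the blockquote template handler.

```javascript
// Hypothetical handler, NOT wtf_wikipedia's code: it assumes every parsed
// template object carries a `list` array of positional parameters.
function unsafeFirstParam(tmpl) {
  return tmpl.list[0] // throws a TypeError when `list` is undefined
}

// Guarded variant: optional chaining plus a fallback value.
function safeFirstParam(tmpl) {
  return tmpl.list?.[0] ?? ''
}

// Made-up inputs: one with a positional parameter, one with only a named one.
const positional = { template: 'blockquote', list: ['To be or not to be'] }
const namedOnly = { template: 'blockquote', text: 'named parameter only' }

// Returns the error's class name if `fn(arg)` throws, otherwise null.
function errorName(fn, arg) {
  try {
    fn(arg)
    return null
  } catch (err) {
    return err.constructor.name
  }
}

console.log(errorName(unsafeFirstParam, namedOnly)) // the unguarded access throws
console.log(safeFirstParam(namedOnly)) // the guarded access falls back to ''
console.log(safeFirstParam(positional))
```

Whether the real failure has this exact shape would need to be checked against the library source at the reported line.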

Owner

spencermountain commented Jul 4, 2023

hey thanks @e501 this is a good issue.

I haven't been able to reproduce this directly, but I've updated this lib, which may do the trick. Can you try 5.6.3 and see if these errors disappear?

Otherwise, I'd love some help getting to the bottom of it:

import wtf from 'wtf_wikipedia'
let doc = await wtf.fetch("Messerschmitt Me 262", "en")
console.log(doc.json())
console.log(doc.text())

cheers
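
If several pages fail, a small harness that records every failing title can make the report easier to narrow down. The sketch below is an assumption-laden illustration: `findFailingPages`, `fakeParse`, and the sample data are hypothetical names invented here. The parser is passed in as an argument so the sketch runs without any dependency.

```javascript
// Run `parse` over each page and collect the titles that throw,
// instead of stopping at the first error.
function findFailingPages(pages, parse) {
  const failures = []
  for (const { title, wikitext } of pages) {
    try {
      parse(wikitext)
    } catch (err) {
      failures.push({ title, message: err.message })
    }
  }
  return failures
}

// Stand-in parser for demonstration only: it throws on a made-up marker.
const fakeParse = (text) => {
  if (text.includes('{{broken')) {
    throw new TypeError('simulated parser failure')
  }
  return { text }
}

// Made-up sample pages.
const samplePages = [
  { title: 'Page A', wikitext: 'plain text' },
  { title: 'Page B', wikitext: 'intro {{broken}} outro' },
]

const failures = findFailingPages(samplePages, fakeParse)
console.log(failures)
```

In practice the `parse` argument could be the `wtf` function imported above, which to my knowledge accepts raw wikitext directly; the wikitext for each page would come from the dump or from the wiki itself.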

Author

e501 commented Jul 5, 2023

Many thanks for the quick turnaround on this. The updated library appears to have fixed the errors: the new install of dumpster-dive has already worked through 5.9M+ articles with none reported!

My thinking is that I may rerun with all of the options activated: "dumpster ./my-wiki-dump.xml --infoboxes=true --citations=true --categories=true --links=true"

If possible, it would be nice to also include all the templates (i.e. "--templates=true"), or at least the navigation boxes (i.e. "--navboxes").

Please let me know if these additional options may be available as command-line arguments/options.

Thanks again !
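
For anyone looking at how flags of this shape might be wired up, here is a minimal sketch of turning `--name=value` and bare `--name` arguments into an options object. It does not reflect how dumpster-dive actually parses its arguments, `parseFlags` is a hypothetical helper, and `--navboxes` is only the flag requested above, not a confirmed option.

```javascript
// Convert `--name=value` and bare `--name` arguments into an options object.
// 'true' and 'false' become booleans; any other value stays a string.
function parseFlags(argv) {
  const options = {}
  for (const arg of argv) {
    if (!arg.startsWith('--')) continue // skip positionals such as the dump path
    const body = arg.slice(2)
    const eq = body.indexOf('=')
    const name = eq === -1 ? body : body.slice(0, eq)
    const raw = eq === -1 ? 'true' : body.slice(eq + 1) // a bare flag means true
    options[name] = raw === 'true' ? true : raw === 'false' ? false : raw
  }
  return options
}

// Example arguments; `--navboxes` is the requested, unconfirmed flag.
const exampleArgs = [
  './my-wiki-dump.xml',
  '--infoboxes=true',
  '--citations=true',
  '--links=false',
  '--navboxes',
]

const parsed = parseFlags(exampleArgs)
console.log(parsed)
```

In a real CLI the input would typically be `process.argv.slice(2)`.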

@spencermountain
Owner

yeah sure - any PRs welcomed. The api really needs some work!
Feel free to make any changes you'd like
