Update XML dump file namespace version #288
The bz2 file at https://github.com/tatuylonen/wiktextract/blob/master/tests/test-pages-articles.xml.bz2 also needs to be updated after this PR is merged. |
If the only change we make here is updating the namespace string, it feels like we shouldn't break older dump files. Is it possible to dynamically determine whether the dump file is 0.10 or 0.11 and pick between them in |
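One way to detect the dump version dynamically is to read the namespace from the root element before processing any pages. This is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than lxml. The function name `detect_export_namespace` and the sample document are hypothetical, and the namespace URI is assumed to follow the `http://www.mediawiki.org/xml/export-0.11/` pattern; it is not taken from the wiktextract code.

```python
import io
import xml.etree.ElementTree as ET


def detect_export_namespace(stream):
    """Return the namespace URI of the root element, or "" if it has none."""
    for _event, elem in ET.iterparse(stream, events=("start",)):
        # The first "start" event always belongs to the root element.
        tag = elem.tag
        if tag.startswith("{"):
            return tag[1:].partition("}")[0]
        return ""
    return ""


# Hypothetical miniature dump, used only for illustration.
SAMPLE_DUMP = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    b"<page><title>A</title></page>"
    b"</mediawiki>"
)

detected_ns = detect_export_namespace(io.BytesIO(SAMPLE_DUMP))
print(detected_ns)
```

The detected string could then be used to build the qualified tag names for whichever version the file declares.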
I'll check how to use multiple xml namespaces in lxml's functions. |
Use `*` wildcards to remove the namespace limitation. New dump files starting from 20240601 use version 0.11: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038392
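The wildcard idea can be illustrated roughly as follows. This is a sketch with the standard library's `xml.etree.ElementTree` (the `{*}` wildcard in `find` needs Python 3.8+), not the lxml code from this PR. `iter_page_titles` and the two sample documents are hypothetical names introduced for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_page_titles(stream):
    """Yield the title of every <page>, whatever namespace the dump uses."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Compare only the local name so any export-0.x namespace matches.
        if elem.tag.rpartition("}")[2] == "page":
            title = elem.find("{*}title")  # "{*}" matches any namespace
            yield title.text if title is not None else None
            elem.clear()  # free the finished page to keep memory use low


# Hypothetical miniature dumps that differ only in the namespace version.
TEMPLATE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-{v}/">'
    "<page><title>A</title></page><page><title>B</title></page>"
    "</mediawiki>"
)
titles_010 = list(iter_page_titles(io.BytesIO(TEMPLATE.format(v="0.10").encode())))
titles_011 = list(iter_page_titles(io.BytesIO(TEMPLATE.format(v="0.11").encode())))
print(titles_010, titles_011)
```

Because no namespace string is hard-coded, the same loop handles both older and newer dump files.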
The code works fine on the new 20240601 zh edition dump file and is ready to be merged. |
Thank you! If the dump file works on your side, I'll switch back to using |
I didn't extract all pages; I only checked whether there are any empty pages, and I think the Wikimedia developers have fixed the empty page bug. |
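A check like this could be sketched as below. It assumes that "empty" means a page whose revision text is missing or whitespace-only, which may differ from the check actually run here. `count_empty_pages` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_empty_pages(stream):
    """Return (total_pages, empty_pages) for a MediaWiki XML dump stream."""
    total = empty = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag.rpartition("}")[2] != "page":
            continue
        total += 1
        text = elem.find("{*}revision/{*}text")
        # Assumption: a missing or whitespace-only <text> counts as empty.
        if text is None or not (text.text or "").strip():
            empty += 1
        elem.clear()
    return total, empty


# Hypothetical sample: one normal page, one with empty text, one with no revision.
SAMPLE_PAGES = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    b"<page><title>A</title><revision><text>hello</text></revision></page>"
    b"<page><title>B</title><revision><text></text></revision></page>"
    b"<page><title>C</title></page>"
    b"</mediawiki>"
)
page_counts = count_empty_pages(io.BytesIO(SAMPLE_PAGES))
print(page_counts)
```

For a real dump, the stream could come from `bz2.open(path, "rb")`, so the file never has to be decompressed to disk.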
I notice that all 20240601 dump files are larger than the 0501 files. en: 1.1G -> 1.3G, fr: 588.7M -> 669.7M. These are compressed .bz2 files, so the extracted files will be even larger. I hope the server has enough disk space... |
The 0501 files were the corrupted ones, so we're returning to the state that was in April, so it should (the most dangerous word) be fine. |
The 0520 files were corrupted and have been removed from dumps.wikimedia.org; the 0501 files are fine. |
New dump files starting from 20240601 use version 0.11: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038392
Please note that with this change, older dump files will not be extracted. I have checked the en, zh and de editions, and none of these dump files have empty pages.