KeyError 'datatype' when preprocessing the latest wikidata dump (as of April 16) #7

Open
phucdoitoan opened this issue Apr 18, 2024 · 2 comments

Comments

@phucdoitoan

Hi,

Thank you for the useful github code.

When I run preprocess_dump.py on the latest Wikidata dump (as of April 16) with 28 processes, I get the following error in process 28. However, the code still seems to be running and producing processed tables.

Do you know whether this error is something I should care about, or can I just ignore it?

Thank you a lot!

Process Process-28:
Traceback (most recent call last):
  File "**/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "**/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "**/simple-wikidata-db/simple_wikidata_db/preprocess_utils/worker_process.py", line 151, in process_data
    out_queue.put(process_json(ujson.loads(json_obj), language_id))
  File "**/simple-wikidata-db/simple_wikidata_db/preprocess_utils/worker_process.py", line 91, in process_json
    datatype = claim['mainsnak']['datatype']
KeyError: 'datatype'

@neelguha
Owner

I haven't had a chance to reproduce the error yet, but it looks like at least one of the claim objects has no datatype key in its mainsnak. I haven't seen this error before, so I wonder if it's something new in the most recent dump?

One small fix would be to disregard all claims which don't have a datatype key, and then count how many you drop (or write them to some error log file)?
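
A minimal sketch of that idea, assuming the entity JSON keeps its claims as a dict mapping property IDs to lists of claims, each with a mainsnak. The helper name claims_with_datatype, the logger name, and the sample entity are hypothetical and not part of simple-wikidata-db; the real process_json would need to be adapted to call something like this.

```python
import logging
from collections import Counter

# Hypothetical logger name; point it at a file handler to get an error log.
logger = logging.getLogger("preprocess_dump")


def claims_with_datatype(entity, dropped):
    """Yield (property_id, claim, datatype) for claims whose mainsnak has a datatype.

    Claims without a datatype are skipped, counted per property in `dropped`,
    and logged as warnings instead of raising KeyError.
    """
    for property_id, claims in entity.get("claims", {}).items():
        for claim in claims:
            datatype = claim.get("mainsnak", {}).get("datatype")
            if datatype is None:
                dropped[property_id] += 1
                logger.warning(
                    "no datatype: entity=%s property=%s claim=%s",
                    entity.get("id"), property_id, claim.get("id"),
                )
                continue
            yield property_id, claim, datatype


# Made-up entity for illustration: one well-formed claim, one without a datatype.
entity = {
    "id": "Q1",
    "claims": {
        "P31": [{"id": "Q1$a", "mainsnak": {"snaktype": "value", "datatype": "wikibase-item"}}],
        "P999": [{"id": "Q1$b", "mainsnak": {"snaktype": "value"}}],
    },
}

dropped = Counter()
kept = list(claims_with_datatype(entity, dropped))
print(len(kept), "kept,", sum(dropped.values()), "dropped")
```

With multiple worker processes, each worker would hold its own Counter, so the counts would have to be sent back to the parent (for example through the output queue) or summed from the log afterwards.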

@phucdoitoan
Author

Hi there,

Thanks a lot for your reply.
I do not know much about Wikidata, so I'm not sure whether the missing datatype key is something recent.
I'll try your suggestion.
However, even with the error reported, the code seems to work fine and all the output tables look okay.
