Parallel blocking for pgsql_big_dedupe_example.py #114
predicates_without_index.append(full_predicate)

# Use only predicates WITHOUT indexes for parallel blocking
deduper.fingerprinter.predicates = predicates_without_index
I've found a problem here... due to these lines:
The `enumerate` is causing the predicate to have a different `pred_id` compared to the serial version. This only happens if there's an index predicate. What would be the best way to solve this? Changing Dedupe internally or doing some workaround here?
Hmm, that is interesting. `enumerate` is just identifying the predicate. We could accomplish the same thing by making the predicates hashable and using the hash.
that looks good to me.
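The `pred_id` mismatch discussed above can be illustrated with a minimal sketch. It does not use Dedupe's real classes: the predicate strings, `enumerate_ids`, and `stable_id` are hypothetical stand-ins, and deriving the id from a SHA-1 digest of the predicate's string form is an assumption about one way to "use the hash", not what Dedupe does internally.

```python
import hashlib

# Hypothetical stand-ins for predicates; real Dedupe predicates are objects.
all_predicates = [
    "index_pred(name)",
    "sameThreeCharStart(name)",
    "wholeField(city)",
]

# The parallel path keeps only the predicates without an index.
predicates_without_index = [
    p for p in all_predicates if not p.startswith("index_")
]


def enumerate_ids(predicates):
    """Position-based ids: they depend on what else is in the list."""
    return {pred: i for i, pred in enumerate(predicates)}


def stable_id(pred):
    """Content-based id (assumed scheme): digest of the string form."""
    return hashlib.sha1(str(pred).encode("utf-8")).hexdigest()[:8]


def stable_ids(predicates):
    return {pred: stable_id(pred) for pred in predicates}


# With enumerate, dropping the index predicate shifts every later id.
serial = enumerate_ids(all_predicates)
parallel = enumerate_ids(predicates_without_index)
mismatched = [
    p for p in predicates_without_index if serial[p] != parallel[p]
]

# With content-based ids, filtering the list does not change any id.
serial_stable = stable_ids(all_predicates)
parallel_stable = stable_ids(predicates_without_index)
stable_mismatched = [
    p
    for p in predicates_without_index
    if serial_stable[p] != parallel_stable[p]
]

print("enumerate mismatches:", mismatched)
print("stable-id mismatches:", stable_mismatched)
```

One design note: Python's built-in `hash()` on strings is randomized per interpreter process unless `PYTHONHASHSEED` is fixed, so if ids need to agree across worker processes, a deterministic digest like the one sketched here may be a safer basis than `hash()` alone.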
Related to dedupeio/dedupe#831
Tested with Python 3.7.7 on macOS 10.15.6 (19G2021) and PostgreSQL 12.3.
Running times for blocking only are:
Both versions are available at commit c1c8384 along with a script for testing (`pgsql_big_dedupe_example/test_parallel_vs_serial.sh`).
The settings file and the training file are here: settings-and-training.zip. The no-index one was trained with 0.99 recall. The index one with 0.90.