Parallel blocking for pgsql_big_dedupe_example.py #114

Open: wants to merge 6 commits into main


fjsj commented Sep 19, 2020

Related to dedupeio/dedupe#831

Tested with Python 3.7.7 on macOS 10.15.6 (19G2021) and PostgreSQL 12.3.

Running times for blocking only are:

Parallel pgsql_big_dedupe_example_settings.no-indexes:
real    0m58.265s
user    4m42.656s
sys     0m3.991s

Serial pgsql_big_dedupe_example_settings.no-indexes:
real    2m32.944s
user    2m1.261s
sys     0m1.676s

Parallel pgsql_big_dedupe_example_settings.with-indexes:
real    0m44.619s
user    1m58.348s
sys     0m1.527s

Serial pgsql_big_dedupe_example_settings.with-indexes:
real    1m0.310s
user    0m57.828s
sys     0m0.393s
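
For reference, the wall-clock speedups implied by the `real` times above can be worked out with a short sketch like the one below. The helper `to_seconds` and the dictionary layout are illustrative only and not part of this PR; the numbers are copied from the timings listed above.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a `time`-style 'XmY.YYYs' reading into seconds."""
    return minutes * 60 + seconds


# "real" (wall-clock) times copied from the runs reported above.
real_times = {
    "no-indexes": {
        "parallel": to_seconds(0, 58.265),
        "serial": to_seconds(2, 32.944),
    },
    "with-indexes": {
        "parallel": to_seconds(0, 44.619),
        "serial": to_seconds(1, 0.310),
    },
}

# Speedup = serial wall-clock time / parallel wall-clock time.
speedups = {
    name: times["serial"] / times["parallel"]
    for name, times in real_times.items()
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

This gives a speedup of about 2.6x for the no-indexes settings and about 1.35x for the with-indexes settings, on this single machine and dataset.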

Both versions are available at commit c1c8384 along with a script for testing (pgsql_big_dedupe_example/test_parallel_vs_serial.sh).

The settings files and the training file are here: settings-and-training.zip. The no-indexes settings were trained with a recall of 0.99, and the with-indexes settings with a recall of 0.90.

predicates_without_index.append(full_predicate)

# Use only predicates WITHOUT indexes for parallel blocking
deduper.fingerprinter.predicates = predicates_without_index

Author (fjsj):

I've found a problem here, caused by these lines:

https://github.com/dedupeio/dedupe/blob/c250911590a72e77612bf9549c78c90e2fb01705/dedupe/blocking.py#L89-L91

The enumerate there gives each predicate a different pred_id than it gets in the serial version. This only happens when there is an index predicate. What would be the best way to solve this: changing Dedupe internally, or doing some workaround here?
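
A minimal sketch of the mismatch, assuming ids are assigned purely by list position. The predicate names and the `assign_ids` helper are hypothetical and only mimic the effect of `enumerate`; this is not Dedupe's actual code.

```python
# Hypothetical predicate names, for illustration only.
all_predicates = ["index_pred_a", "simple_pred_b", "simple_pred_c"]


def assign_ids(predicates):
    """Mimic position-based ids: each predicate gets its list index."""
    return {pred: pred_id for pred_id, pred in enumerate(predicates)}


# Serial path: every predicate is enumerated together.
serial_ids = assign_ids(all_predicates)

# Parallel path: index predicates are filtered out before enumerating,
# so the remaining predicates shift to lower positions.
predicates_without_index = [
    pred for pred in all_predicates if not pred.startswith("index_")
]
parallel_ids = assign_ids(predicates_without_index)

# Collect predicates whose id differs between the two paths.
mismatched = {
    pred: (serial_ids[pred], parallel_ids[pred])
    for pred in predicates_without_index
    if serial_ids[pred] != parallel_ids[pred]
}
print(mismatched)
```

In this sketch both remaining predicates end up with a different id once the index predicate is removed, and with no index predicate in the list the two mappings would be identical, which matches the behaviour described above.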

Contributor:

Hmm, that is interesting. enumerate is just identifying the predicate. We could accomplish the same thing by making the predicates hashable and using the hash.
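
One way this idea could look, as a rough sketch only. The `Predicate` class, its fields, and `stable_id` are hypothetical stand-ins, not Dedupe's real predicate classes. The sketch also assumes the id must match across worker processes, so it derives the id from a `hashlib` digest rather than the built-in `hash()`, whose value for strings can vary between interpreter runs.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical stand-in for a blocking predicate.

    frozen=True makes instances immutable and hashable.
    """

    func_name: str
    field: str

    def stable_id(self) -> str:
        # Derive the id from the predicate's content, not its list position.
        key = f"{self.func_name}:{self.field}".encode("utf-8")
        return hashlib.sha1(key).hexdigest()[:12]


index_pred = Predicate("tfidf_index", "name")
simple_pred = Predicate("first_token", "name")

# The serial path sees all predicates; the parallel path sees a subset.
serial_ids = {p: p.stable_id() for p in [index_pred, simple_pred]}
parallel_ids = {p: p.stable_id() for p in [simple_pred]}

# The shared predicate keeps the same id in both paths.
print(serial_ids[simple_pred] == parallel_ids[simple_pred])
```

Because the id depends only on the predicate itself, filtering out index predicates before parallel blocking no longer changes the ids of the predicates that remain.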

Author (fjsj):

that looks good to me.
