Create a script-type plugin to detect duplicate archives using thumbnails #338
Comments
I wrote a simple Perl plugin based on your suggestion:
package LANraragi::Plugin::Scripts::DuplicateFinder;
use strict;
use warnings;
no warnings 'uninitialized';
use LANraragi::Utils::Logging qw(get_plugin_logger);
use LANraragi::Model::Config;
sub plugin_info {
    return (
        name        => "Duplicate Finder",
        type        => "script",
        namespace   => "duplfind",
        author      => "dixonym",
        version     => "1.0",
        description => "Find potential duplicate archives by comparing thumbnail hashes using Hamming distance.",
        icon        => "",
        oneshot_arg => "hamming distance threshold (defaults to 5)"
    );
}
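# Hamming distance function
sub hammingdistance {
    # Not named $a/$b: those are the package variables sort() uses
    my ($hash_a, $hash_b) = @_;
    # Assuming the thumbhashes are hex strings, convert them to bit strings
    my $binary_a = unpack("B*", pack("H*", $hash_a));
    my $binary_b = unpack("B*", pack("H*", $hash_b));
    # Bits beyond the end of the shorter string all count as differences
    my $common   = length($binary_a) < length($binary_b) ? length($binary_a) : length($binary_b);
    my $distance = abs(length($binary_a) - length($binary_b));
    for my $i (0 .. $common - 1) {
        $distance++ if substr($binary_a, $i, 1) ne substr($binary_b, $i, 1);
    }
    return $distance;
}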
sub run_script {
    shift;
    my $lrr_info = shift;
    my $logger = get_plugin_logger();
    my $threshold = $lrr_info->{oneshot_param};
    # Check if the threshold is not set or is an empty string, use default value of 5
    $threshold = 5 if (!defined($threshold) || $threshold eq '');
    # Convert the threshold to an integer
    $threshold = int($threshold);
    $logger->info("Set Hamming distance threshold to " . $threshold);
    my $redis = LANraragi::Model::Config->get_redis;
    # Get all archive IDs (40-character long keys only)
    my @keys = $redis->keys('????????????????????????????????????????');
    # Store thumbhashes
    my %thumbhashes;
    # Collect thumbhashes for all archives
    foreach my $id (@keys) {
        my %hash = $redis->hgetall($id);
        my $thumbhash = $hash{'thumbhash'};
        # Only consider entries that have a thumbhash
        if ($thumbhash) {
            $thumbhashes{$id} = $thumbhash;
        }
    }
    # Array to store pairs of duplicates
    my @duplicates;
    # Compare each archive thumbhash with others
    foreach my $id1 (keys %thumbhashes) {
        foreach my $id2 (keys %thumbhashes) {
            # Skip self-comparison and visit each pair only once,
            # so (A, B) is not compared and reported again as (B, A)
            next if $id1 ge $id2;
            # Calculate Hamming distance
            my $distance = hammingdistance($thumbhashes{$id1}, $thumbhashes{$id2});
            # Compare distance to the threshold for considering two hashes as duplicates
            if ($distance <= $threshold) {
                # Log the potential duplicate
                $logger->info("Found potential duplicate: $id1 and $id2 with distance $distance");
                # Add the pair to the duplicates list
                push @duplicates, [$id1, $id2];
            }
        }
    }
    # Return list of pairs of potential duplicates
    return \@duplicates;
}
It seems to work. I manually checked some of the detected galleries, and they are actual duplicates. (full plugin log attached)
However, there seems to be a problem with the Minion job: after ~4h the Minion worker "went away". I'm not sure why that happens.
My library stats:
All in all, it seems quite a hassle to check for duplicates this way. IMO a better approach would be integrating the duplicate checking directly in the UI, similar to stashapp:
Good to know the approach works! It was mostly theoretical. I wouldn't necessarily mind integrating this within the server directly, but that wouldn't solve the underlying problem: checking for dupes across an entire library compares every archive against every other one, so it will always get more expensive the more files you have on record. It might be worth trying to multi-thread the loop a bit further, though.
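Before reaching for threads, it may be worth making each comparison cheaper. The following is only a rough sketch, not the plugin's actual code: `hamming_hex` and `close_pairs` are hypothetical names, and it assumes the thumbhashes are hex strings of equal length. It XORs the packed bytes and counts the set bits with `unpack`'s checksum form instead of walking the bit string character by character, and it visits each unordered pair once, which is n(n-1)/2 comparisons for n archives.

```perl
use strict;
use warnings;

# Hamming distance between two hex strings (hypothetical helper).
# XOR the packed bytes as strings, then count the set bits: "%32b*"
# makes unpack return the sum of all bits instead of the bit string.
sub hamming_hex {
    my ( $hex1, $hex2 ) = @_;
    return unpack( "%32b*", pack( "H*", $hex1 ) ^ pack( "H*", $hex2 ) );
}

# Return [id1, id2, distance] for every unordered pair of entries in
# the { id => hex hash } map whose distance is within the threshold.
sub close_pairs {
    my ( $hashes, $threshold ) = @_;
    my @ids = sort keys %$hashes;
    my @pairs;
    for my $i ( 0 .. $#ids - 1 ) {
        # Start at $i + 1 so each pair is visited exactly once
        for my $j ( $i + 1 .. $#ids ) {
            my $d = hamming_hex( $hashes->{ $ids[$i] }, $hashes->{ $ids[$j] } );
            push @pairs, [ $ids[$i], $ids[$j], $d ] if $d <= $threshold;
        }
    }
    return \@pairs;
}
```

The work is still quadratic in the number of archives, so this only lowers the constant factor. Note that the string XOR relies on the `bitwise` feature not being enabled in the enclosing scope.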
Certainly, integrating it would enhance usability rather than boost performance. Duplicate checking is always resource-intensive and requires balancing storage with processing effort. Performance isn’t my main concern here; I don’t mind how long it takes as long as I can initiate the job and return when it’s completed. However, I'm unsure if the minion worker can support that, as it appears to stop after 4 hours. I could add multi-threading to the loop to improve performance if needed. I’m still uncertain about how to manage duplicates found by the plugin. Should I add an option to automatically delete duplicates based on the number of tags?
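As a rough illustration of the tag-count idea, here is a minimal sketch that only decides which archive of a pair to keep and deletes nothing. `tag_count` and `pick_keeper` are hypothetical helpers, and the sketch assumes each archive's tags are available as a single comma-separated string; that format is an assumption rather than something shown in this thread.

```perl
use strict;
use warnings;

# Count the non-empty entries in a comma-separated tag string
# (assumed format, e.g. "artist:foo, parody:bar").
sub tag_count {
    my ($tags) = @_;
    return 0 unless defined $tags;
    return scalar grep { /\S/ } split /,/, $tags;
}

# Given two archives and their tag strings, return ($keep, $drop).
# The archive with more tags is kept; on a tie the first one wins.
sub pick_keeper {
    my ( $id1, $tags1, $id2, $tags2 ) = @_;
    return tag_count($tags2) > tag_count($tags1)
        ? ( $id2, $id1 )
        : ( $id1, $id2 );
}
```

Returning the candidate for removal instead of deleting it would leave room for a manual review step, since a low Hamming distance only marks a pair as a potential duplicate.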
This one could be pretty fun, I think.
The script should go through the entire archive list and return a list of potential duplicate pairs at the end.
I see two potential ways to detect dupes:
The hashes already exist in the database since they're used for reverse image searches. This would be the easiest and fastest way to go. Here's some example code I got from who knows where:
This would be super expensive computationally speaking, but if the first way doesn't yield decent results I don't see any other solution.