Find exact and near-duplicate images in your local folders using Flow 5 as analysis and search backend. Detected duplicates are exported as interactive HTML graph visualization and as JSON with filepaths for further processing.
- Local Processing: Analyze and process images directly on your device.
- Advanced Duplicate Detection: Identify duplicates with variations in saturation, brightness, color, size, format, orientation, clarity, cropping, and text addition.
- Adjustable Sensitivity: Customize detection levels for exact and near-duplicates.
- Graphical Result Inspection: Utilize graph visualization for easy examination of duplicates.
- JSON Output: Receive organized results as JSON file, listing groups of duplicate images.
List of system requirements:
- OS: Linux
- CPU
x86_64
/AMD64
architecture - Docker
- Python >= 3.9
- In project root dir create a Python virtual environment
python3 -m venv venv
- Activate venv
source venv/bin/activate
- Install packages in the virtual env
pip3 install --upgrade -r requirements.txt
- Start a local Flow 5 server (image analysis and search engine) in background.
docker run -d --rm -p8983:8983 --name local-flow-server pixolution/flow
-
Ensure that setup is ok and local Flow service is accessible.
python3 scanner.py status
-
First you have to index local images to the Flow collection. The images are analyzed by and indexed to the running Flow server.
python3 scanner.py index --dir <PATH-TO-IMAGE-FOLDER>
Depending on your CPU and the number of images this process takes a while (each image is analyzed by multiple AI models).
-
Once the images are indexed you can scan the collection for duplicate images.
python3 scanner.py scan
This is an example scan output:
(venv) $ python3 scanner.py scan Process 544 images. Start scanning (this may take a while)... 100%|████████████████████████████████████| 544/544 [00:07<00:00, 74.19scans/s] Export 81 duplicate groups to duplicates.json... Detected 290 duplicate images. Export results to duplicates.html ... 100%|████████████████████| 290/290 [00:01<00:00, 220.92thumbnails generated/s]
The scan results are exported in two files:
|- duplicates.html
|- duplicates.json
The duplicates.html
contains an interactive visualization of duplicate images as graph network.
This HTML file is standalone and can be easily shared, since all images are embedded as thumbnails.
Zoom in to inspect the results visually. Hover over an edge to see the similarity score.
Hover over an image to see its file path.
The color of the edges between the nodes (images) display the similarity of the two connected images.
- Green = exact duplicates with no or minor changes
- Yellow = near-duplicates
- Red = near-duplicates with major changes
The duplicates.json
file can be easily processed further and contains the file paths per duplicate group.
{
"group-1": [
"/home/user/images/page.png",
"/home/user/images/work/page-1.jpg"
],
"group-2": [
"/home/user/images/201002.jpg",
"/home/user/images/201100021-2.png",
"/home/user/images/copy.jpg",
],
[...]
}
We are the team behind the image analysis and search product Flow 5 and we hope that this Python tool is useful to you.
The free Starter version of Flow 5 is limited to 5,000 images. If you need to process high-volumes and want to integrate de-duplication in your production workflows, please see Flow 5 pricing.
Also, if you are interested in more sophisticated de-duplication or image analysis services feel free to contact us.
To stop the Flow Docker once finished, be aware that indexed data will not be retained.
docker stop local-flow-server