Find and remove image duplicates #802

Merged
merged 22 commits into from
Nov 3, 2023

Conversation

frimpongopoku
Contributor

@frimpongopoku frimpongopoku commented Sep 22, 2023

Related ticket : massenergize/frontend-admin#1073

This PR is going to allow us to identify duplicate images all over the platform.

HOW?

Hashes! We just use hashes. If you generate the hashes of two images and the images are the same in terms of visual content, the generated hash strings will be the same. It also means that if you take the exact same image and give it a different name, the generated hash will still be the same, which allows us to clock it as a duplicate.

What counts as a duplicate?

  • The images are the exact same in terms of visual content, dimensions, aspect ratio, etc. Everything!

We are currently only able to identify exact duplicates, not images that are merely similar. This means that if a user uploads an image and then uploads a cropped version of it, or the same image with different dimensions, the two are not treated as the same! (Which I think is fair, and reliable ✅ )
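A minimal sketch of the idea, assuming the hash is a SHA-256 digest of the raw file bytes. The PR description does not say which hash function is used, so `hash_image_bytes` and the sample bytes below are hypothetical. It shows that the file name plays no part, while any change to the content changes the hash.

```python
import hashlib


def hash_image_bytes(data: bytes) -> str:
    """Return a hex digest of the image content (assumed: SHA-256 over raw bytes)."""
    return hashlib.sha256(data).hexdigest()


# Stand-in bytes; in practice these would be read from the uploaded files.
original = b"\x89PNG\r\n\x1a\n...pixel data..."
uploads = {
    "beach.png": original,
    "beach-copy-renamed.png": bytes(original),  # same content, different name
    "beach-cropped.png": original[:-5],         # different content
}

# The file name is never part of the input, only the bytes are.
hashes = {name: hash_image_bytes(data) for name, data in uploads.items()}
```

Here the first two entries share a hash and the third does not, which matches the "exact duplicates only" rule described above.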

Code Implementation (⚠️ DB Modification)

Since the whole process depends on the hash values of our images, I have added a DB field to the "Media" model -- "hash".
This is the field that is queried to identify duplicates.

  • For future uploads, I have modified the "save" function for the "Media" model to automatically generate the hash value for any image that is uploaded via "media.save()" as we do on the B.E.

  • And for all existing media that do not have their hash values calculated, I have made a route that we will need to hit once after deployment to make sure that the hash values of all our images are calculated and saved.

Meaning after deploying to DEV/CAN/PROD:
You need to

  1. Migrate
  2. Hit the route /gallery.generate.hashes ( And you won't have to do it ever again )
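The two pieces above (hash on save, plus a one-time backfill) could look roughly like this. This is a plain-Python sketch, not the actual Django code: the `Media` class here is a hypothetical stand-in for the real model, `generate_missing_hashes` is an invented name for what the backfill route might do, and SHA-256 is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Media:
    """Hypothetical stand-in for the real "Media" model."""
    name: str
    content: bytes
    hash: Optional[str] = None

    def save(self) -> None:
        # Compute the hash on save if it is missing (assumed: SHA-256 of the bytes).
        if not self.hash:
            self.hash = hashlib.sha256(self.content).hexdigest()
        # A real model would persist the record here.


def generate_missing_hashes(records: Iterable[Media]) -> int:
    """Backfill hashes for existing records; returns how many were updated."""
    updated = 0
    for media in records:
        if not media.hash:
            media.save()
            updated += 1
    return updated


existing = [Media("a.png", b"aaa"), Media("b.png", b"bbb", hash="already-set")]
updated_count = generate_missing_hashes(existing)
second_run_count = generate_missing_hashes(existing)  # nothing left to do
```

Because records that already have a hash are skipped, running the backfill a second time changes nothing, which is why the route only needs to be hit once.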

New Routes

  1. /gallery.generate.hashes : Lets you generate hashes for all existing images
  2. /gallery.duplicates.summarize: This route finds all images that are duplicates, groups them and returns a response of all the duplicates
  3. /gallery.duplicates.summary.print: Lets you download a CSV of all groups of duplicates available
  4. /gallery.duplicates.clean: Identifies all use cases for duplicates, transfers them to only one media record, and then deletes the remaining duplicates.
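As a rough illustration of what the summarize and print routes do, here is a sketch that groups records by hash and writes the groups to CSV. The function names, the `(media_id, hash)` input shape, and the CSV columns are all assumptions for illustration; the real routes may return different fields.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def group_duplicates(records: List[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group media ids by hash and keep only hashes shared by 2+ records."""
    groups = defaultdict(list)
    for media_id, hash_value in records:
        groups[hash_value].append(media_id)
    return {h: ids for h, ids in groups.items() if len(ids) > 1}


def duplicates_to_csv(groups: Dict[str, List[int]]) -> str:
    """Render one CSV row per group of duplicates."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["hash", "count", "media_ids"])
    for hash_value, ids in groups.items():
        writer.writerow([hash_value, len(ids), " ".join(str(i) for i in ids)])
    return buffer.getvalue()


records = [(1, "abc"), (2, "def"), (3, "abc"), (4, "abc")]
groups = group_duplicates(records)
report = duplicates_to_csv(groups)
```

In this sample, only the hash "abc" forms a group, since "def" belongs to a single record.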

How do we clean?

For every group of duplicate images, we look at each image and check whether it is being used in (action, events, teams, vendors, community, community homepage) and note that down. After this is done for each, we simply choose one ( I chose the first one 🤣 ) of the records, and re-attach all those relationships to that "one". Then, we delete the remaining duplicates.
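The merge step can be sketched as follows. This is an in-memory illustration only: `merge_duplicates`, the `Usage` tuple, and the `usages` mapping are hypothetical, and the real code works on database relationships (and S3 objects) rather than dictionaries.

```python
from typing import Dict, List, Set, Tuple

Usage = Tuple[str, int]  # (kind of record, record id), e.g. ("event", 7)


def merge_duplicates(
    duplicate_ids: List[int],
    usages: Dict[int, Set[Usage]],
) -> Tuple[int, List[int]]:
    """Move every usage onto the first media id and drop the other ids."""
    keeper, *rest = duplicate_ids  # "I chose the first one"
    merged = usages.setdefault(keeper, set())
    for media_id in rest:
        # Re-attach the duplicate's relationships to the keeper, then remove it.
        merged |= usages.pop(media_id, set())
    return keeper, rest


usages = {
    10: {("action", 1)},
    11: {("event", 7), ("team", 3)},
    12: set(),  # a duplicate that is not used anywhere
}
keeper, deleted = merge_duplicates([10, 11, 12], usages)
```

After the call, only the keeper remains in `usages` and it carries the union of all relationships, so nothing that pointed at a deleted duplicate is left dangling.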

How it's used

  • This cleaning routine is set up here as a task. You will see that it has been added as "Remove Duplicate Images" when you check the task page as a super admin.
  • The whole routine is also covered by a feature flag. So, depending on what time frame you set (@BradHN1), it will clean up duplicates only for communities that are subscribed to the feature flag. I named the flag Remove Duplicate Images so that the resulting FF key computes to remove-duplicate-images-feature-flag. So simply create a new feature flag with the name Remove Duplicate Images and that's it.
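For reference, the name-to-key mapping mentioned above might be computed like this. The rule (lowercase, spaces to hyphens, "-feature-flag" suffix) is inferred only from the single example in this description, and `feature_flag_key` is a hypothetical helper, so check the real feature flag code before relying on it.

```python
def feature_flag_key(name: str) -> str:
    """Assumed rule: lowercase the name, hyphenate it, append a fixed suffix."""
    return "-".join(name.lower().split()) + "-feature-flag"


key = feature_flag_key("Remove Duplicate Images")
```

If the flag is created under a different name, the computed key changes as well, and the routine would no longer find it.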

Every time the task runs, it will transfer all the relationships for each of the duplicates to only one of them, then delete the other occurrences of the image (both media records & S3 objects where necessary).

The admins of the subscribed communities will then receive an email that looks something like this:

Screenshot 2023-10-19 at 11 32 22
The Postmark template has the alias "media-library-cleanup" and it's called "Media Library Cleanup Admin Notification"

Attached to the admin email, there will also be a summary in the form of a CSV that looks something like this:
summary_of_duplicates_2023-09-22T02_49_33.069Z.csv

⚠️ Tests to follow soon!

closes massenergize/frontend-admin#1073

@frimpongopoku frimpongopoku marked this pull request as draft October 16, 2023 10:22
@frimpongopoku frimpongopoku force-pushed the find-and-remove-image-duplicates branch from 923f70f to dac1ecf Compare October 19, 2023 09:51
@frimpongopoku frimpongopoku force-pushed the find-and-remove-image-duplicates branch from dac1ecf to 9638491 Compare October 19, 2023 10:40
@frimpongopoku frimpongopoku marked this pull request as ready for review October 19, 2023 11:50
@BradHN1
Contributor

BradHN1 commented Oct 20, 2023 via email

@frimpongopoku
Contributor Author

@frimpongopoku - looking better. This task will now send a separate message about each community to the superadmin who creates the task. That would be overwhelming and I’d need to combine all those files in order to see the results of running the task. The other database cleanup tasks send one message summarizing the database cleanup for all the communities, which would be preferable.

On Oct 20, 2023, at 5:00 AM, Frimpong Opoku Agyemang @.***> wrote:

@frimpongopoku commented on this pull request.

In src/task_queue/database_tasks/media_library_cleanup.py <#802 (comment)>:

> +    to the flag, duplicates will be removed from their libraries.
> +    """
> +    try:
> +        flag = FeatureFlag.objects.filter(key=REMOVE_DUPLICATE_IMAGE_FLAG_KEY).first()
> +
> +        for community in flag.communities.all():
> +            ids = [community.id]
> +            grouped_dupes = find_duplicate_items(False, community_ids=ids)
> +            num_of_dupes_in_all = get_duplicate_count(grouped_dupes)
> +            csv_file = summarize_duplicates_into_csv(grouped_dupes)
> +            admins = get_admins_of_communities(ids)
> +
> +            for hash_value in grouped_dupes.keys():
> +                remove_duplicates_and_attach_relations(hash_value)
> +
> +            for admin in admins:

Added this, but now you will have to name the task "Media Library Cleanup Routine" when you create it. Or make the code match what name you choose if you do change it.

Yes, I saw that too. But I thought that was the point 🤣 . To handle each of the communities separately and let you know what's happened where. I will switch to a general summary.
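One way the "general summary" could be built is to collect the duplicate groups for every community first and write a single CSV at the end, instead of one file per community. This is only a sketch under assumptions: `build_combined_summary`, the input shape, the community names, and the column names are hypothetical and are not taken from the PR's code.

```python
import csv
import io
from typing import Dict, List


def build_combined_summary(dupes_by_community: Dict[str, Dict[str, List[int]]]) -> str:
    """Write one CSV covering every community, one row per duplicate group."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["community", "hash", "duplicates_removed"])
    for community, groups in sorted(dupes_by_community.items()):
        for hash_value, media_ids in groups.items():
            # One record per group is kept, so the rest count as removed.
            writer.writerow([community, hash_value, len(media_ids) - 1])
    return buffer.getvalue()


results = {
    "Community B": {"def": [8, 9]},
    "Community A": {"abc": [1, 3, 4]},
}
summary = build_combined_summary(results)
```

The superadmin would then get one message with one attachment, and the community column still shows what happened where.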

@frimpongopoku frimpongopoku requested a review from BradHN1 October 20, 2023 10:52
@BradHN1
Contributor

BradHN1 commented Oct 20, 2023 via email

@frimpongopoku
Contributor Author

I’ve fixed it since the code blows up if the routine doesn’t accept the task argument. Did you test today’s changes locally? .enabled_communities also blows up; I’ve fixed that too. Maybe we’re running different versions of Python? I’m on 3.8.8

On Oct 20, 2023, at 7:55 AM, Frimpong Opoku Agyemang @.***> wrote:

@frimpongopoku commented on this pull request.

In src/task_queue/database_tasks/media_library_cleanup.py <#802 (comment)>:

> +from task_queue.models import Task
> +
> +
> +REMOVE_DUPLICATE_IMAGE_FLAG_KEY = "remove-duplicate-images-feature-flag"
> +
> +
> +def remove_duplicate_images():
> +    """
> +    This checks all media on the platform and removes all duplicates.
> +    Its based on the "Remove Duplicate Images" feature flag. For communities that are subscribed
> +    to the flag, duplicates will be removed from their libraries.
> +    """
> +    try:
> +        flag = FeatureFlag.objects.filter(key=REMOVE_DUPLICATE_IMAGE_FLAG_KEY).first()
> +        communities = flag.enabled_communities()
> +        task = Task.objects.filter(name = "Media Library Cleanup Routine").first()

I guess I will call him 🤣 🤣 🤣 cos I quickly looked around to see if there was a chance that I could access the task directly in there and didn't find any example.

I have tested now, and fixed the bugs so you can try again when you are ready.
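To illustrate the two failure modes discussed in this thread, here is a small sketch of a routine that accepts the task passed in by the runner and tolerates a missing feature flag. Everything in it is hypothetical: the `FeatureFlag` stand-in, passing the flag as a parameter (the real code looks it up in the database), and the `run_task` helper are assumptions made to keep the example self-contained.

```python
from typing import List, Optional


class FeatureFlag:
    """Hypothetical stand-in; the real model and its methods may differ."""

    def __init__(self, communities: List[str]):
        self.communities = communities


def remove_duplicate_images(task=None, flag: Optional[FeatureFlag] = None) -> List[str]:
    # `task` is accepted because the task runner passes it in; a routine
    # without this parameter would raise a TypeError when called.
    # A lookup such as `.first()` yields None when no flag exists, so guard for it.
    if flag is None:
        return []
    return [f"cleaned {name}" for name in flag.communities]


def run_task(task: dict, flag: Optional[FeatureFlag]) -> List[str]:
    """Assumed runner behaviour: call the routine with the task itself."""
    return task["routine"](task, flag=flag)


task = {"name": "Media Library Cleanup Routine", "routine": remove_duplicate_images}
with_flag = run_task(task, FeatureFlag(["Community A"]))
without_flag = run_task(task, None)
```

Receiving the task as an argument also avoids looking it up by name inside the routine, which is what ties the code to the exact task name.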

@frimpongopoku frimpongopoku requested a review from BradHN1 October 20, 2023 12:17
Contributor

@BradHN1 BradHN1 left a comment

LGTM - test in dev

@BradHN1 BradHN1 merged commit 2b1ce97 into development Nov 3, 2023
1 check failed
Development

Successfully merging this pull request may close these issues.

[Feature] - Identify image duplicates & remove
2 participants