Find and remove image duplicates #802
Conversation
@frimpongopoku - looking better.
This task will now send a separate message about each community to the superadmin who creates the task. That would be overwhelming and I’d need to combine all those files in order to see the results of running the task. The other database cleanup tasks send one message summarizing the database cleanup for all the communities, which would be preferable.
… On Oct 20, 2023, at 5:00 AM, Frimpong Opoku Agyemang ***@***.***> wrote:
@frimpongopoku commented on this pull request.
In src/task_queue/database_tasks/media_library_cleanup.py <#802 (comment)>:
> + to the flag, duplicates will be removed from their libraries.
+ """
+ try:
+ flag = FeatureFlag.objects.filter(key=REMOVE_DUPLICATE_IMAGE_FLAG_KEY).first()
+
+ for community in flag.communities.all():
+ ids = [community.id]
+ grouped_dupes = find_duplicate_items(False, community_ids=ids)
+ num_of_dupes_in_all = get_duplicate_count(grouped_dupes)
+ csv_file = summarize_duplicates_into_csv(grouped_dupes)
+ admins = get_admins_of_communities(ids)
+
+ for hash_value in grouped_dupes.keys():
+ remove_duplicates_and_attach_relations(hash_value)
+
+ for admin in admins:
Added this, but now you will have to name the task "Media Library Cleanup Routine" when you create it. Or, if you do change the name, make the code match whatever name you choose.
Yes, I saw that too. But I thought that was the point 🤣: to handle each community separately and let you know what's happened where. I will switch to a general summary.
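For what it's worth, the "general summary" could be assembled along these lines. This is a sketch only: the function name, the input shape, and the column names are all made up for illustration; the real task builds its CSV via `summarize_duplicates_into_csv`.

```python
import csv
import io


def combine_summaries(per_community):
    # Merge per-community duplicate summaries into ONE CSV, so the
    # superadmin gets a single message instead of one per community.
    # (Input shape and column names are hypothetical.)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["community", "image_hash", "num_duplicates"])
    for community, rows in per_community.items():
        for image_hash, count in rows:
            writer.writerow([community, image_hash, count])
    return buf.getvalue()
```

The point of the shape change is just that accumulation happens across the community loop, with one send at the end, rather than one send per iteration.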
I’ve fixed it, since the code blows up if the routine doesn’t accept the task argument. Did you test today’s changes locally? .enabled_communities also blows up; I’ve fixed that too.
Maybe we’re running different versions of Python? I’m on 3.8.8.
… On Oct 20, 2023, at 7:55 AM, Frimpong Opoku Agyemang ***@***.***> wrote:
@frimpongopoku commented on this pull request.
In src/task_queue/database_tasks/media_library_cleanup.py <#802 (comment)>:
> +from task_queue.models import Task
+
+
+REMOVE_DUPLICATE_IMAGE_FLAG_KEY = "remove-duplicate-images-feature-flag"
+
+
+def remove_duplicate_images():
+ """
+ This checks all media on the platform and removes all duplicates.
+ It's based on the "Remove Duplicate Images" feature flag. For communities that are subscribed
+ to the flag, duplicates will be removed from their libraries.
+ """
+ try:
+ flag = FeatureFlag.objects.filter(key=REMOVE_DUPLICATE_IMAGE_FLAG_KEY).first()
+ communities = flag.enabled_communities()
+ task = Task.objects.filter(name="Media Library Cleanup Routine").first()
I guess I will call him 🤣 🤣 🤣 because I quickly looked around to see if there was a chance I could access the task directly in there, and didn't find any example.
I have tested now, and fixed the bugs, so you can try again when you are ready.
LGTM - test in dev
Related ticket : massenergize/frontend-admin#1073
This PR is going to allow us to identify duplicate images all over the platform.
HOW?
Hashes! We just use hashes. If you generate the hashes of two images that are the same in terms of visual content, the hash strings generated will be the same. It also means that if the exact same image is uploaded under a different name, the generated hash will still be the same, which allows us to clock it as a duplicate.
What counts as a duplicate?
We are currently only able to identify exact duplicates, not images that are merely close in similarity. This means that if a user uploads an image and then uploads a cropped version of it, or the same image at different dimensions, the two are not treated as the same. (Which I think is fair, and reliable ✅)
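Conceptually, the check looks like this. A sketch only: `compute_media_hash` is a hypothetical name, and I'm assuming a plain content hash (e.g. SHA-256) over the raw file bytes, which is exactly why renaming doesn't matter but cropping or resizing does.

```python
import hashlib


def compute_media_hash(image_bytes: bytes) -> str:
    # Hash the raw file content. The filename plays no part, so the
    # same image saved under two different names hashes identically.
    return hashlib.sha256(image_bytes).hexdigest()


# Identical bytes under different names -> same hash (a duplicate);
# a crop or resize changes the bytes -> different hash (not a duplicate).
original = b"\x89PNG fake image bytes"
renamed_copy = b"\x89PNG fake image bytes"
cropped = b"\x89PNG different bytes"

assert compute_media_hash(original) == compute_media_hash(renamed_copy)
assert compute_media_hash(original) != compute_media_hash(cropped)
```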
Code Implementation (⚠️ DB Modification)
Since the whole process depends on the hash values of our images, I have added a DB field to the "Media" model -- "hash".
This is the field that is queried to identify duplicates.
For future uploads, I have modified the "save" function of the "Media" model to automatically generate the hash value for any image uploaded via "media.save()", as we do on the B.E.
And for all existing media that do not yet have their hash values calculated, I have added a route that we will need to hit once after deployment, to make sure that the hash values of all the images we have are calculated and saved.
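The save-time hashing described above boils down to "compute once, on upload, then query the column". Here is a plain-Python stand-in (not the real Django model, whose fields and override live in the MassEnergize codebase):

```python
import hashlib


class MediaSketch:
    """Plain-Python stand-in for the Django "Media" model, for illustration."""

    def __init__(self, file_bytes: bytes):
        self.file_bytes = file_bytes
        self.hash = ""  # stand-in for the new "hash" DB field

    def save(self):
        # Mirror of the overridden Media.save(): compute the content hash
        # once, on first save, so duplicate lookups later become a simple
        # equality query on the "hash" column.
        if self.file_bytes and not self.hash:
            self.hash = hashlib.sha256(self.file_bytes).hexdigest()
        # (the real override would call super().save() here)
```

The guard on `not self.hash` is what makes the backfill route cheap to re-run: records that already have a hash are left alone.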
Meaning after deploying to DEV/CAN/PROD, you need to hit
/gallery.generate.hashes
once (and you won't have to do it ever again).
New Routes
/gallery.generate.hashes : Lets you generate hashes for all existing images
/gallery.duplicates.summarize : Finds all images that are duplicates, groups them, and returns a response of all the duplicates
/gallery.duplicates.summary.print : Lets you download a CSV of all groups of duplicates available
/gallery.duplicates.clean : Identifies all use cases for duplicates, transfers them to only one media record, and then deletes the remaining duplicates
How do we clean?
For every group of duplicate images, we look at each of them, check whether they are being used anywhere (actions, events, teams, vendors, community, community homepage), and note that down. After this is done for each, we simply choose one of the records (I chose the first one 🤣) and re-attach all those relationships to that "one". Then we delete the remaining duplicates.
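The steps above can be sketched as follows. Everything here is illustrative: `group_duplicates`, `clean_group`, and the dict shape are made up; the PR's real helpers are `find_duplicate_items` and `remove_duplicates_and_attach_relations`.

```python
from collections import defaultdict


def group_duplicates(media_items):
    # Group media records by content hash; any group with more than
    # one record is a set of exact duplicates.
    groups = defaultdict(list)
    for item in media_items:
        groups[item["hash"]].append(item)
    return {h: items for h, items in groups.items() if len(items) > 1}


def clean_group(duplicates):
    # Keep the first record, re-attach every relation from the rest
    # onto it, and report which records are now safe to delete.
    keeper, *extras = duplicates
    for extra in extras:
        keeper["used_in"].extend(extra["used_in"])  # re-attach relations
    return keeper, extras  # extras can now be deleted
```

The key invariant is that no relationship is lost: everything the extras pointed at ends up on the keeper before anything is deleted.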
How it's used
Create a new feature flag with the name
Remove Duplicate Images
so that the resulting FF key is computed to
remove-duplicate-images-feature-flag
and that's it. Every time the task runs, it will transfer all the relationships for each of the duplicates to only one of them, then delete the other occurrences of the image (both media records & S3 objects where necessary).
The admins of the subscribed communities will then receive an email that looks something like this:
The Postmark template has the alias "media-library-cleanup" and is called "Media Library Cleanup Admin Notification".
Attached to the admin email, there will also be a summary in the form of a CSV that looks something like this:
summary_of_duplicates_2023-09-22T02_49_33.069Z.csv
closes massenergize/frontend-admin#1073