Being able to identify media not uploaded #102

Closed
Tracked by #1635
mahalakshme opened this issue May 27, 2024 · 6 comments

mahalakshme commented May 27, 2024

https://avni.freshdesk.com/a/tickets/3932

Need:

Currently we have no mechanism to confirm that media referenced by synced URLs has actually been uploaded to S3. This matters given the media anomalies we have frequently encountered so far, such as media URLs present in the database with no corresponding media in S3, or thumbnails not generated. Automating the check will let us set up monitoring that catches these issues before a user reports them; without an automatic check, we tend to miss them.

We are not doing the same for data other than images, for the reasons below:

  • With data it is straightforward. For example, when some mandatory fields alone are not updated, we find out easily from Bugsnag. With media, when the URLs are present but the media is missing, we currently have no way of knowing.
  • When no data has been saved (synced), the reports set up for organisations to track daily statistics (such as registrations and due visits done) will tell us. For media, any report based on database entries would make it appear that the media is synced even though the images are actually missing in S3.

Context:

Media anomaly: One of the below:

  • Invalid url in db
  • Valid url in db but not present in S3 bucket
  • Present in S3 bucket but thumbnail not generated

AC:

  • Automate a process along the lines of the one described here: https://avni.freshdesk.com/support/solutions/articles/36000504548-comparing-image-files-in-s3-bucket-and-media-table. The process can be changed to whatever the developers find easier or better.
  • Run it once a week as a background job and dump the results into a table. Sample column names for the table: column_referencing_media_table(media.id), Valid url (Y/N), Present in S3 (Y/N), Thumbnail generated (Y/N), audit_info
  • Anomalies need to be determined for invalid URLs, actual image missing in the S3 bucket, and thumbnail missing in the S3 bucket
  • Every week after the job has run, the table should have the updated information. If we update the table incrementally, it should not retain anomalies that were fixed in the past week.
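
A minimal sketch, in Python, of how the three checks listed above could be expressed against a set of object keys already listed from S3. It is not the ETL implementation. The names `MediaCheck`, `object_key`, `thumbnail_key` and `check_media` are hypothetical, and two things are assumed rather than taken from this issue: that URLs are virtual-hosted style (the bucket is in the hostname, so the URL path is the object key), and that a thumbnail sits in a `thumbnails/` folder next to the original.

```python
from dataclasses import dataclass
from typing import Optional, Set
from urllib.parse import urlparse


@dataclass(frozen=True)
class MediaCheck:
    """One row of the proposed result table (column names are illustrative)."""
    media_id: int
    is_valid_url: bool
    is_present_in_storage: bool
    is_thumbnail_generated: bool


def object_key(url: str) -> Optional[str]:
    """Return the object key for a URL, or None if the URL is malformed.

    Assumes virtual-hosted style URLs, where the path is the key.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    key = parsed.path.lstrip("/")
    return key or None


def thumbnail_key(key: str) -> str:
    """Hypothetical layout: <folder>/thumbnails/<file name>."""
    head, _, name = key.rpartition("/")
    return f"{head}/thumbnails/{name}" if head else f"thumbnails/{name}"


def check_media(media_id: int, url: str, s3_keys: Set[str]) -> MediaCheck:
    """Classify one media table entry against the keys listed from S3."""
    key = object_key(url)
    if key is None:
        # Invalid URL: nothing else can be checked.
        return MediaCheck(media_id, False, False, False)
    return MediaCheck(
        media_id,
        True,
        key in s3_keys,
        thumbnail_key(key) in s3_keys,
    )
```

Here `s3_keys` would be filled from a paginated listing of the organisation's prefix; keeping the classification separate from the S3 calls makes it easy to test without network access.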

Post deployment steps

  • Link the table to a report so that new anomalies are surfaced. In the report, set a threshold based on the existing anomaly count; raise an alert, and consequently a ticket on support, only when the count rises above that threshold.
  • For now, enable this only for Goonj on production, since they use media heavily and we will get to know of media issues through them anyway.

Tech details if needed for reference: Ignore if not relevant

  • To get s3 contents after a particular date: Link
  • to check for presence of media and thumbnail on s3: Link
  • to fetch the entire list: Link
  • Code to be in ETL repo

Old:

AC:

  • In ETL when creating each new entry in media table check in s3 for media anomaly
  • table entry/update in media_anomaly table with columns: column_referencing_media_table, Valid url (Y/N), Present in S3 (Y/N), Thumbnail generated (Y/N), audit_info
  • Once in a week update media_anomaly table
  • To be done for all orgs for which ETL is enabled

Out of scope:

To check if there are media anomalies for the existing data in the media table

Added above based on suggestions from here

Old: Ignore

Use cases

  • Determine current list of media anomalies for an org
  • Check if action taken has fixed an anomaly

Definitions

Media anomaly: One of the below:

  • Invalid url in db
  • Valid url in db but not present in S3 bucket
  • Present in S3 bucket but thumbnail not generated

AC: (Based on the suggestions from here)

If never run for org,

  • Use s3 listObjectsV2 api with prefix as org s3 prefix to fetch all file names (with pagination) for the org in s3 and store temporarily.
  • If ETL is enabled for org, compare media table entries with list of filenames retrieved in step 1 to detect missing images. Store anomalies in db. Else return 501 HTTP status code.
  • Check list output from s3 for missing thumbnails and store anomalies in db

If previously run for org,

  • For each previously stored anomaly + media last_modified_date_time (from ETL or from actual tx table) > last anomaly job run time, use s3 headObject api to check for presence of media and thumbnail on s3 and update anomalies.
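
The choice between the full pass and the incremental pass described above can be sketched as follows. This is only one reading of the bullets: the "+" is taken to mean the union of previously stored anomalies and media modified since the last run. The function name `ids_to_recheck` and its parameters are hypothetical.

```python
from datetime import datetime
from typing import Dict, Optional, Set


def ids_to_recheck(
    last_modified: Dict[int, datetime],
    previous_anomaly_ids: Set[int],
    last_run: Optional[datetime],
) -> Set[int]:
    """Pick the media ids the job should verify against S3.

    last_modified maps media id -> last_modified_date_time.
    last_run is None when the job has never run for the org.
    """
    if last_run is None:
        # First run: everything is compared against the full S3 listing.
        return set(last_modified)
    # Later runs: only media touched since the last run ...
    changed = {mid for mid, ts in last_modified.items() if ts > last_run}
    # ... plus anomalies recorded earlier, to see whether they were fixed.
    return changed | previous_anomaly_ids
```

Only the ids returned on later runs would need an individual headObject call, which keeps the number of S3 requests proportional to recent activity rather than to the size of the media table.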

To get s3 contents after a particular date:

Input:

  • In /admin/organisationDetails page, add a toggle saying 'Determine media anomalies'
  • If enabled for an org, run the bg job once in a day
  • Once disabled the bg job should not run

Output:

  • table entry/update in media_anomaly table with columns: column_referencing_media_table, Valid url (Y/N), Present in S3 (Y/N), Thumbnail generated (Y/N), audit_info

Input:

  • from scratch - some ETL running will take more time
  • himesh wrote aws shell script
  • monitoring job - regularly running -
  • check when sync itself
  • thumbnails - different issue

Input issues:

  • We need an easier automatic monitoring mechanism that sends alerts
  • checking in mobile db for issue in mobile sync - doesn't seem reliable
  • what is the concern if ETL takes longer time
  • support team or scheduled job - weekend
@mahalakshme mahalakshme converted this from a draft issue May 27, 2024
@mahalakshme mahalakshme moved this from In Analysis to In Analysis Review in Avni Product May 27, 2024
@mahalakshme mahalakshme moved this from In Analysis Review to In Analysis in Avni Product May 29, 2024
@mahalakshme mahalakshme moved this from In Analysis to In Analysis Review in Avni Product Jun 18, 2024
@mahalakshme mahalakshme moved this from In Analysis Review to Ready in Avni Product Jul 11, 2024
@himeshr himeshr self-assigned this Jul 12, 2024
@himeshr himeshr moved this from Ready to In Progress in Avni Product Jul 12, 2024
@himeshr himeshr moved this from In Progress to Ready in Avni Product Jul 15, 2024
@himeshr himeshr removed their assignment Jul 15, 2024
@himeshr himeshr moved this from Ready to In Progress in Avni Product Jul 16, 2024
@himeshr himeshr self-assigned this Jul 16, 2024

himeshr commented Jul 16, 2024

Discussion point for later implementation:

  • Find duplicate media urls for same entity_concept

himeshr added a commit that referenced this issue Jul 16, 2024
…s for each organisation: Sync and MediaAnalysis
himeshr added a commit that referenced this issue Jul 16, 2024
@himeshr himeshr moved this from In Progress to Hold in Avni Product Jul 17, 2024
himeshr added a commit that referenced this issue Jul 17, 2024
himeshr added a commit that referenced this issue Jul 19, 2024
 - Consolidate all Create Job requests validation
 - Truncate jobDetail description
himeshr added a commit that referenced this issue Jul 19, 2024
…g for UUID and excluding Mobile and Adhoc entries
@himeshr himeshr moved this from Hold to In Progress in Avni Product Jul 19, 2024
himeshr added a commit that referenced this issue Jul 22, 2024

himeshr commented Jul 24, 2024

Additionally introduced an isHavingDuplicates column in the mediaAnalysis table, which specifies whether duplicates exist in the media table. The count isn't stored as a separate column but can be determined by joining with the media table.

select * from media m join media_analysis ma on m.uuid = ma.uuid and m.image_url = ma.image_url;
-- where m.uuid  = '60b99f84-37cf-4873-b1a6-5e3f7f88af3b';


himeshr commented Jul 24, 2024

SQL query to use when setting up an alert on Metabase for new anomalies.

-- set role goonj;
with counts as (select count(*) AS Total,
    sum(case when is_valid_url = false then 1 else 0 end) AS InvalidURLsCount,
    sum(case when is_present_in_storage =  false then 1 else 0 end) AS MissingMediaCount,
    sum(case when is_thumbnail_generated =  false then 1 else 0 end) AS MissingThumbnailCount,
    sum(case when is_having_duplicates =  true then 1 else 0 end) AS MediaWhichHasAtleastOneDuplicateCount
from media_analysis) select * from counts
where Total > 0 and (InvalidURLsCount > 0 or MissingMediaCount > 138 or MissingThumbnailCount > 372 or MediaWhichHasAtleastOneDuplicateCount > 291);
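
For readers who want to reason about the alert condition outside Metabase, the same threshold logic can be sketched in Python. The baselines are copied from the query above; the names `BASELINE`, `summarise` and `should_alert` are hypothetical, and rows are assumed to be dictionaries keyed by the media_analysis column names used in the query.

```python
from typing import Dict, Iterable, Mapping, Optional

# Counts of already-known anomalies, taken from the query above.
BASELINE: Dict[str, int] = {
    "invalid_urls": 0,
    "missing_media": 138,
    "missing_thumbnails": 372,
    "with_duplicates": 291,
}


def summarise(rows: Iterable[Mapping[str, Optional[bool]]]) -> Dict[str, int]:
    """Aggregate media_analysis rows into the counts used by the alert."""
    counts = {"total": 0, **{name: 0 for name in BASELINE}}
    for row in rows:
        counts["total"] += 1
        # "is False" / "is True" skip missing values, like "= false" in SQL.
        counts["invalid_urls"] += row["is_valid_url"] is False
        counts["missing_media"] += row["is_present_in_storage"] is False
        counts["missing_thumbnails"] += row["is_thumbnail_generated"] is False
        counts["with_duplicates"] += row["is_having_duplicates"] is True
    return counts


def should_alert(counts: Mapping[str, int],
                 baseline: Mapping[str, int] = BASELINE) -> bool:
    """Alert only when some count rises above its known baseline."""
    if counts["total"] == 0:
        return False
    return any(counts[name] > limit for name, limit in baseline.items())
```

Because the baselines are hard-coded, they would need to be lowered whenever existing anomalies are fixed; otherwise new anomalies could go unnoticed until the count climbs back past the old value.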


himeshr commented Jul 24, 2024

For reference and dev testing, use the Postman collection to trigger ETL jobs for Sync and/or MediaAnalysis.

AVNI ETL.postman_collection.json

@himeshr himeshr moved this from In Progress to Code Review Ready in Avni Product Jul 24, 2024
@1t5j0y 1t5j0y moved this from Code Review Ready to In Code Review in Avni Product Jul 24, 2024

1t5j0y commented Jul 24, 2024

Add sensible values for staging/prerelease/prod for AVNI_MEDIA_ANALYSIS_JOB_REPEAT_INTERVAL in avni-infra. Also isn't the default of 2 minutes too low (if env var not configured)?

MediaAnalysisTableRegenerateAction.process

  • Create a separate MediaDTO that has only the fields we are interested in for analysis (url and uuid), to reduce memory footprint?
  • We should also be able to get rid of some of the local list objects to reduce memory footprint, i.e. generate the Map directly from the unpartitioned List instead of via intermediate Lists.
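
The two suggestions above can be illustrated with a small Python sketch. It is not the code of MediaAnalysisTableRegenerateAction (which is not shown in this thread); `MediaRef` and `index_by_uuid` are hypothetical names. The idea is to carry only uuid and url per row and to build the lookup map in a single pass, without intermediate lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class MediaRef(NamedTuple):
    """Slim record holding only the fields the analysis needs."""
    uuid: str
    url: str


def index_by_uuid(rows: Iterable[MediaRef]) -> Dict[str, List[str]]:
    """Build uuid -> urls in one pass over the input.

    Accepts any iterable, so rows can be streamed from a cursor or
    generator instead of being collected into a list first.
    """
    by_uuid: Dict[str, List[str]] = defaultdict(list)
    for ref in rows:
        by_uuid[ref.uuid].append(ref.url)
    return dict(by_uuid)
```

A uuid that maps to more than one url in the result is also a natural place to derive a duplicates flag, although how duplicates are actually defined is left to the implementation.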

@1t5j0y 1t5j0y moved this from In Code Review to Code Review with Comments in Avni Product Jul 24, 2024
himeshr added a commit that referenced this issue Jul 24, 2024

himeshr commented Jul 24, 2024

@1t5j0y Absorbed the review comments

@himeshr himeshr moved this from Code Review with Comments to Code Review Ready in Avni Product Jul 24, 2024
@1t5j0y 1t5j0y moved this from Code Review Ready to In Code Review in Avni Product Jul 25, 2024
@1t5j0y 1t5j0y moved this from In Code Review to QA Ready in Avni Product Jul 25, 2024
@AchalaBelokar AchalaBelokar moved this from QA Ready to In QA in Avni Product Jul 30, 2024
@AchalaBelokar AchalaBelokar moved this from In QA to Done in Avni Product Jul 30, 2024