
Fix duplicate images with respect to data #634

Closed
mahalakshme opened this issue Oct 20, 2023 · 7 comments
@mahalakshme
Contributor

mahalakshme commented Oct 20, 2023

As a user, I don't want any data created by me to be uploaded multiple times, leading to duplicates.

As part of the issue mentioned in this card, there are duplicate images in the data, created as a result of duplicate syncs (simultaneous automatic and manual syncs).

Analysis:

select distinct (concept_uuid)
from (select distinct o.name      as org_name,
                      o.id        as org_id,
                      c.uuid      as concept_uuid,
                      f.form_type as f_type
      from concept c
               inner join form_element fe on c.id = fe.concept_id
               inner join organisation o on c.organisation_id = o.id
               inner join form_element_group feg on fe.form_element_group_id = feg.id
               inner join form f on feg.form_id = f.id
      where c.data_type in ('Image', 'Video', 'Audio', 'File')
        and fe.type = 'MultiSelect'
        and c.is_voided is false
        and f.form_type = 'Encounter'
        and fe.is_voided is false) sample;

Using variations of the above query, the concept_uuids for media data types and the organisations that use them were identified. It was found that such concepts are used only in the form types Encounter, IndividualProfile, ProgramEncounter and ProgramEnrolment.

A plpgsql function was created to find duplicates, given the table names and concept_uuids:

CREATE OR REPLACE FUNCTION find_duplicate_array_values_updated(table_names text[], jsonb_column_name text, keys text[])
    RETURNS TABLE(id integer, organisation_id integer, table_name text) AS $$
DECLARE
    key_element text;
    table_name  text;
BEGIN
    FOREACH table_name IN ARRAY table_names
    LOOP
        FOREACH key_element IN ARRAY keys
        LOOP
            -- Built dynamically per table/concept: find rows whose jsonb array
            -- column holds the same value more than once.
            RETURN QUERY EXECUTE format('
                SELECT id, organisation_id, %L AS table_name
                FROM (SELECT id, organisation_id, %L AS key,
                             duplicate_value,
                             COUNT(*)::integer AS count
                      FROM (SELECT id, organisation_id,
                                   jsonb_array_elements_text(%I->%L)::text AS duplicate_value
                            FROM %I
                            WHERE jsonb_typeof(%I->%L) = ''array''
                              AND organisation_id IN (''42'', ''60'', ''131'', ''132'', ''148'', ''149'', ''150'', ''186'', ''215'', ''217'', ''219'', ''235'', ''258'', ''279'', ''285'', ''289'', ''298'', ''301'', ''310'', ''314'', ''317'', ''318'', ''325'', ''327'', ''347'', ''350'', ''356'', ''357'', ''368'', ''373'', ''390'', ''391'', ''392'')
                           ) subquery
                      GROUP BY key, duplicate_value, organisation_id, id
                      HAVING COUNT(*) > 1) final',
                table_name, key_element, jsonb_column_name, key_element, table_name, jsonb_column_name, key_element);
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
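For readers who don't want to trace the dynamic SQL, the per-row check the function performs can be mirrored in plain Python. This is illustrative only; the function and field names below are made up for the sketch, not Avni code:

```python
# Illustrative mirror of the plpgsql duplicate finder: for each media
# concept key, flag rows whose observation array stores the same value
# more than once (the SQL's GROUP BY ... HAVING COUNT(*) > 1 per row).
from collections import Counter

def find_duplicate_array_values(rows, keys):
    """rows: list of (id, organisation_id, observations-dict).
    Returns (id, organisation_id, key, value, count) for every array
    value repeated within a single row."""
    hits = []
    for row_id, org_id, observations in rows:
        for key in keys:
            value = observations.get(key)
            if not isinstance(value, list):  # jsonb_typeof(...) = 'array'
                continue
            for item, count in Counter(value).items():
                if count > 1:
                    hits.append((row_id, org_id, key, item, count))
    return hits

rows = [
    (1, 310, {"a7348bc3": ["img1.jpg", "img1.jpg", "img2.jpg"]}),
    (2, 391, {"a7348bc3": ["img3.jpg"]}),
]
print(find_duplicate_array_values(rows, ["a7348bc3"]))
# [(1, 310, 'a7348bc3', 'img1.jpg', 2)]
```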

The function was then used in the queries below to find the duplicates:

SELECT count(*) FROM find_duplicate_array_values_updated(ARRAY['individual'], 'observations', ARRAY['1c4eb3f2-4d1e-4618-aaae-74475c5d8e08', '23641a68-fa4f-4f6d-96c1-52d351fc225a', '2d9f1f02-d5ba-485b-a2b0-06a68cd28e41', '2eb0ad24-83bd-437e-bcde-fcbf2c4559b4', '33bfd120-e0c1-4d3a-abd5-54bef7795b1d', '673c880a-fee8-4abf-890c-50eb703aed98', '9ced444d-b29d-44e5-88e7-11e48cdd4351', 'a7348bc3-5d8d-4196-be40-664e1754793d', 'bcdcc8bb-5308-4348-bf4b-02303ae6a853', 'c1801d0c-1d84-46f7-9a62-daaf12e22336', 'cb456ef2-f20f-4927-ac4b-a3efdddd5680', 'e14d53b2-7183-4489-b241-3f24b9a6338b', 'e4a59e54-058f-4634-93dd-3d36a5439b01', 'ed133199-dbae-4ba7-8c5c-2f23f119edf5']);
--- 723
SELECT count(*) FROM find_duplicate_array_values_updated(ARRAY['encounter'], 'observations', ARRAY['3e601a14-6b47-4745-97a9-87d350da4851', '40c4abed-73dc-4ad3-ac08-0150cc812778', '50adba2e-8345-40ac-8c9c-e4d8f86f6846', '59dca7f7-0be7-4b19-a601-1f1938ee1691', '5b9e52d5-6918-4b9d-a612-7916a58ca589', '8e10f101-eb20-4f80-8176-2bbc25eb5f59', 'aafdf1f6-46c2-4c4a-b69e-eb8489c9ebde', 'ae101a9a-7271-4bfc-8b02-6afaf6334712', 'b48274f3-e71f-4e0e-9ec3-7f7fbe2a40b0', 'f303372a-28ae-49bf-afb2-67d81b254309']);
-- 241
SELECT count(*) FROM find_duplicate_array_values_updated(ARRAY['program_encounter'], 'observations', ARRAY['15bf280c-a9d5-43c6-8fb7-b208e99df9d3', '67aa6045-b158-466e-ae42-107f738d3500', 'aafdf1f6-46c2-4c4a-b69e-eb8489c9ebde', 'd1b1a8a1-cfb5-4222-83bf-1fbe1028ec60']);
-- 0
SELECT count(*) FROM find_duplicate_array_values_updated(ARRAY['program_enrolment'], 'observations', ARRAY['f6d3405b-a946-4a39-9992-196c6205b2bc']);
-- 0

Results of analysis:

  • 723 duplicates in individual table in the orgs GDGSGOM(org id - 310) and Goonj(org id - 391)
  • 241 duplicates in encounter table in the orgs with ids 314(RWB NGOs 2023), 298(RWB 2023), 310(GDGSGOM), 347(Ward Implementation and Management Committee - AKRSP), 219(RWB NITI Aayog UAT), 301(AKRSP UAT)

Acceptance criteria:

    • Delete the 964 duplicate media values found in the above result.
    • Make sure the duplicates are deleted from the media table as well.
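The cleanup implied by the acceptance criteria amounts to collapsing each media array to its first occurrences while preserving order. A minimal sketch in Python, assuming order must be kept (the helper name is hypothetical; the actual fix was an SQL update):

```python
# Hedged sketch of the dedup step: keep the first occurrence of each
# media value, preserve array order. Not the production migration.
def dedupe_preserving_order(values):
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

print(dedupe_preserving_order(["img1.jpg", "img1.jpg", "img2.jpg"]))
# ['img1.jpg', 'img2.jpg']
```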
@mahalakshme mahalakshme converted this from a draft issue Oct 20, 2023
@mahalakshme mahalakshme moved this from In Analysis to In Analysis Review in Avni Product Oct 20, 2023
@mahalakshme mahalakshme moved this from In Analysis Review to In Analysis in Avni Product Oct 20, 2023
@mahalakshme mahalakshme moved this from In Analysis to In Analysis Review in Avni Product Oct 22, 2023
@vinayvenu vinayvenu moved this from In Analysis Review to Ready in Avni Product Oct 23, 2023
@mahalakshme mahalakshme moved this from Ready to Finalized for the upcoming release in Avni Product Oct 24, 2023
@mahalakshme mahalakshme moved this from Finalized for the upcoming release to Ready in Avni Product Oct 24, 2023
@1t5j0y 1t5j0y moved this from Ready to In Progress in Avni Product Oct 24, 2023
@1t5j0y 1t5j0y self-assigned this Oct 24, 2023
@1t5j0y 1t5j0y moved this from In Progress to Code Review Ready in Avni Product Oct 26, 2023
@1t5j0y
Contributor

1t5j0y commented Oct 26, 2023

Media table deletes not done as ETL will update this automatically.

@petmongrels petmongrels moved this from Code Review Ready to In Code Review in Avni Product Oct 27, 2023
@petmongrels petmongrels moved this from In Code Review to Code Review Ready in Avni Product Oct 27, 2023
@ashusvnath ashusvnath moved this from Code Review Ready to QA Ready in Avni Product Oct 30, 2023
@AchalaBelokar AchalaBelokar moved this from QA Ready to Done in Avni Product Oct 31, 2023
@mahalakshme mahalakshme moved this from Done to Ready in Avni Product Oct 31, 2023
@1t5j0y 1t5j0y removed their assignment Oct 31, 2023
@1t5j0y
Contributor

1t5j0y commented Oct 31, 2023

Queries have been created here

Needs to be executed on prod.

@himeshr
Contributor

himeshr commented Oct 31, 2023

> Queries have been created here
>
> Needs to be executed on prod.

@1t5j0y, when I ran this on the pre-release environment, I had to run the update queries multiple times for individuals and encounters to bring the count down to 0. Is this expected?

@himeshr himeshr moved this from Ready to QA Failed in Avni Product Oct 31, 2023
@himeshr
Contributor

himeshr commented Oct 31, 2023

Entries found on pre-release:

  • individuals : 750
  • encounters: 301
  • Others: 0

Moving to QA Failed as the update queries had to be run multiple times for individuals and encounters to bring the count down to 0.

@1t5j0y 1t5j0y moved this from QA Failed to In Progress in Avni Product Nov 1, 2023
@1t5j0y
Contributor

1t5j0y commented Nov 1, 2023

From the updates, it looks like the query for individual had to be executed 3 times and for encounter 2 times. @himeshr could you confirm?

Checked the rows that needed to be processed multiple times: they contained multiple image concepts, so I guess this is an edge case and might have been caused by row locking from the update statement.

Do we need to spend any more time on this, given that the number of times the script needs to run is minimal and changing the approach to handle this edge case will take time?

@1t5j0y 1t5j0y self-assigned this Nov 1, 2023
@himeshr
Contributor

himeshr commented Nov 1, 2023

> From the updates, it looks like the query for individual had to be executed 3 times and for encounter 2 times. @himeshr could you confirm?

Yes, the query for individual had to be executed 3 times and for encounter 2 times.

> Do we need to spend any more time on this given that the number of times the script needs to run is minimal and changing the approach to handle this edge case will take time?

No, let's make a note that this has to be run multiple times until the select queries return 0.
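The agreed procedure ("run until the select queries return 0") can be sketched as a simple loop. `count_duplicates` and `run_update_pass` below are hypothetical stand-ins for the real count/update SQL, driven by a simulated backend that mimics the multi-pass behaviour seen on prod:

```python
# Sketch of the operational procedure: re-run the update pass until
# the duplicate count query returns 0, with a safety cap.
def run_until_clean(count_duplicates, run_update_pass, max_passes=10):
    passes = 0
    while count_duplicates() > 0:
        if passes >= max_passes:
            raise RuntimeError("duplicates remain after max passes")
        run_update_pass()
        passes += 1
    return passes

# Simulated backend: each pass clears only some rows, as observed on
# prod where individual took 3 passes and encounter 2.
state = {"remaining": [513, 26, 14, 0]}

def count_duplicates():
    return state["remaining"][0]

def run_update_pass():
    state["remaining"].pop(0)

print(run_until_clean(count_duplicates, run_update_pass))  # 3
```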

@1t5j0y
Contributor

1t5j0y commented Nov 1, 2023

Prod run log:

openchs.public> --individual count query
[2023-11-01 11:13:11] 1 row retrieved starting from 1 in 34 s 20 ms (execution: 33 s 976 ms, fetching: 44 ms)
--755
openchs.public> --encounter count query
[2023-11-01 11:17:02] 1 row retrieved starting from 1 in 2 m 24 s 519 ms (execution: 2 m 24 s 498 ms, fetching: 21 ms)
--303
openchs.public> --program_encounter count query
[2023-11-01 11:18:20] 1 row retrieved starting from 1 in 53 s 57 ms (execution: 53 s 34 ms, fetching: 23 ms)
--0
openchs.public> --program_enrolment count query
[2023-11-01 11:18:29] 1 row retrieved starting from 1 in 2 s 60 ms (execution: 2 s 21 ms, fetching: 39 ms)
--0


openchs.public> --individual update query
[2023-11-01 11:20:06] 513 rows affected in 41 s 114 ms
openchs.public> --individual count query
[2023-11-01 11:20:50] 1 row retrieved starting from 1 in 12 s 645 ms (execution: 12 s 634 ms, fetching: 11 ms)
--26
openchs.public> --individual update query
[2023-11-01 11:21:11] 4 rows affected in 15 s 869 ms
openchs.public> --individual count query
[2023-11-01 11:21:29] 1 row retrieved starting from 1 in 5 s 949 ms (execution: 5 s 907 ms, fetching: 42 ms)
--14
openchs.public> --individual update query
[2023-11-01 11:21:48] 3 rows affected in 8 s 119 ms
openchs.public> --individual count query
[2023-11-01 11:22:01] 1 row retrieved starting from 1 in 5 s 733 ms (execution: 5 s 691 ms, fetching: 42 ms)
--0
openchs.public> --encounter update query
[2023-11-01 11:31:21] Cancelling…
[2023-11-01 11:31:22] [57014] ERROR: canceling statement due to user request
[2023-11-01 11:31:22] Where: PL/pgSQL function find_duplicate_array_values_updated(text[],text,text[]) line 10 at RETURN QUERY
openchs.public> --encounter update query
[2023-11-01 11:34:22] 206 rows affected in 2 m 35 s 990 ms
openchs.public> --encounter count query
[2023-11-01 11:34:41] 1 row retrieved starting from 1 in 6 s 43 ms (execution: 5 s 998 ms, fetching: 45 ms)
--12
openchs.public> --encounter update query
[2023-11-01 11:34:55] 10 rows affected in 7 s 552 ms
openchs.public> --encounter count query
[2023-11-01 11:35:18] 1 row retrieved starting from 1 in 7 s 916 ms (execution: 7 s 906 ms, fetching: 10 ms)
--0

@1t5j0y 1t5j0y closed this as completed Nov 1, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Avni Product Nov 1, 2023