
EVA-3409 Count rsid in release files #420

Merged: @tcezard merged 8 commits into EBIvariation:master on Oct 9, 2023

Conversation

@tcezard (Member) commented Sep 28, 2023:

count_rs_for_all_files.sh:
This PR adds a script that extracts the rsIDs from the release files and annotates each RS with its species/assembly/RS type. The annotations are then aggregated per rsID and counted. This gives a generic count of rsIDs across species, assemblies and types, from which an aggregate can be constructed along each of these axes.

gather_release_counts.py:
Determines the species and assemblies that should be counted together, runs count_rs_for_all_files.sh, then parses and aggregates the results.
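As an illustration of the aggregation step (a sketch only, not the script itself; the one-pair-per-line "rsid assembly-species-type" format is assumed from the awk command quoted further down):

from collections import Counter

def aggregate(annotated_lines):
    # collect, for each rsID, the set of annotations it was seen with
    per_rs = {}
    for line in annotated_lines:
        rsid, annotation = line.split()
        per_rs.setdefault(rsid, set()).add(annotation)
    # count how many rsIDs share each exact combination of annotations
    return Counter(frozenset(annotations) for annotations in per_rs.values())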

try:
    response = requests.get(ENA_XML_API_URL)
Member Author:

This is a fix applied during the release

OUTPUT=tmp_${SC_NAME}_${ASSEMBLY}_${TYPE}.txt
if [[ ${INPUT} == *.vcf.gz ]]
then
    zcat "${INPUT}" | grep -v '^#' | awk -v annotation="${ASSEMBLY}-${SC_NAME}-${TYPE}" '{print $3" "annotation}' > "${OUTPUT}"
Member:

Optional: Given the normalization and stability issues with scientific names, I am wondering if we should do a "reverse lookup" (from a static table or EVAPRO) to use the taxonomy instead of SC_NAME.

Member Author:

I don't think that's necessary. The bash script deals with files and returns annotations based on the file path. As long as the file path reflects the information we want to gather, we should be fine. Lookups should be handled in the Python layer.

Member Author:

Alternatively, we could also pass the annotation along with the file name, so that there is no guesswork in the bash script; see the sketch below.
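For instance, the Python layer could supply the annotation explicitly when invoking the bash script (a sketch only; the command-line interface and the variable names here are assumptions, not the script's actual interface):

import subprocess

# hypothetical invocation: the annotation is computed in the Python layer and
# passed as an argument, so the bash script never derives it from the file path
annotation = f"{assembly}-{scientific_name}-{rs_type}"
subprocess.run(['bash', 'count_rs_for_all_files.sh', input_vcf, annotation], check=True)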

value (in our case, the list of assemblies linked to each taxonomy and the list of taxonomies linked to each assembly), this recursive function starts from one of the values and finds all the related keys and values, providing them in two frozen sets. For any key that belongs to a relationship, this should provide the same pair of frozensets regardless of the starting key.
Member:

Optional: A small example could be helpful here for understanding this function.

Contributor:

Agreed. You could maybe even include this example in the form of a test case...
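A minimal sketch of what such a test case could look like, assuming find_link is called as in the diff further down (a starting key set plus the two mapping dicts) and returns a pair of frozensets; the data here is made up for illustration:

def test_find_link():
    # sp1 and sp2 are linked through asmA; sp3 stands alone
    species_to_assemblies = {'sp1': ['asmA'], 'sp2': ['asmA', 'asmB'], 'sp3': ['asmC']}
    assemblies_to_species = {'asmA': ['sp1', 'sp2'], 'asmB': ['sp2'], 'asmC': ['sp3']}
    expected = (frozenset({'sp1', 'sp2'}), frozenset({'asmA', 'asmB'}))
    # the same pair should come back regardless of the starting key
    assert find_link({'sp1'}, species_to_assemblies, assemblies_to_species) == expected
    assert find_link({'sp2'}, species_to_assemblies, assemblies_to_species) == expected
    assert find_link({'sp3'}, species_to_assemblies, assemblies_to_species) == \
        (frozenset({'sp3'}), frozenset({'asmC'}))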

set -e

OUTPUT_FILE=$1
FILE_WITH_ALL_INPUTS=$2
Member:

Optional: As a matter of style, this order could be swapped...

all_files = []
for species in set_of_species:
    species_dir = os.path.join(release_directory, species)
    assembly_directories = glob.glob(os.path.join(species_dir, "GCA_*"))
Member:

Optional: I am wondering if we should use the release tracker table as the source of truth instead of the release directory... This way we can also validate that all the intended species were released, and update that table if not.

f"select distinct c.taxonomy, t.scientific_name "
f"from eva_progress_tracker.clustering_release_tracker c "
f"join evapro.taxonomy t on c.taxonomy=t.taxonomy_id "
f"where release_version={self.release_version} AND release_folder_name='{species_folder}'"
Member:

Hope this works with release folders with weird names or trailing underscores in them...

Member Author:

I think it should. It is currently working for all folder names.
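Should it ever become an issue, bind parameters would sidestep quoting problems with unusual folder names; a sketch assuming a psycopg2-style cursor (the cursor variable is hypothetical):

query = (
    "select distinct c.taxonomy, t.scientific_name "
    "from eva_progress_tracker.clustering_release_tracker c "
    "join evapro.taxonomy t on c.taxonomy=t.taxonomy_id "
    "where release_version=%s AND release_folder_name=%s"
)
# the driver handles quoting and escaping, so odd folder names pass through safely
cursor.execute(query, (self.release_version, species_folder))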

Contributor @apriltuesday left a comment:

Mostly just suggestions, this looks great! Hopefully this is a counting script we can actually stick with for a while...

        ]) + '\n')


class ReleaseCounter(AppLogger):
Contributor:

I'd put this in a separate file; this one is pretty big, and this is a natural way to break it up.

Contributor:

I'd suggest either adding a DEPRECATED comment to, or removing entirely, the qc_release_counts script as a part of this PR. If we leave it and don't correct it, future poor souls (AKA us) are liable to run it and get confused all over again.

            results[assembly][metric] = row[index + 1]
        return results

    def parse_logs(self, all_logs):
Contributor:

Which logs is this parsing? I guess it's the output of the bash script; maybe add a docstring here and/or at the class level stating this.
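A sketch of such a docstring, assuming the inputs are indeed the per-file outputs of count_rs_for_all_files.sh (the wording is illustrative):

def parse_logs(self, all_logs):
    """Parse the output files produced by count_rs_for_all_files.sh listed in
    all_logs and aggregate the rsID counts per species/assembly/type annotation."""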

    return all_files


def calculate_all_logs(release_dir, output_dir, species_directories=None):
Contributor:

Maybe calculate_all_counts? "Calculating logs" sounds weird to me...

        linked_set1.update(dict2.get(value1))
    # if one of the sets is still growing, we check again
    if linked_set1 != source_linked_set1 or linked_set2 != source_linked_set2:
        tmp_linked_set1, tmp_linked_set2 = find_link(linked_set1, dict1, dict2, linked_set1, linked_set2)
Contributor:

I'm interpreting this algorithm (in computer-science speak) as finding connected components in a bipartite graph; if so, that's pretty cool!

Not sure if efficiency is an issue, but if you maintained a list of "visited" vertices you could ensure that you don't repeat work in the recursion (as I think linked_set1 is always a superset of key_set).
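A sketch of that idea as an iterative traversal (a hedged rewrite, not the PR's implementation; names are illustrative):

def find_link_iterative(start_keys, dict1, dict2):
    # dict1 maps keys to values, dict2 maps values back to keys
    linked_keys, linked_values = set(), set()
    to_visit = set(start_keys)
    while to_visit:
        key = to_visit.pop()
        linked_keys.add(key)  # mark as visited before expanding
        for value in dict1.get(key, []):
            linked_values.add(value)
            # only queue keys that have not been expanded yet
            to_visit.update(k for k in dict2.get(value, []) if k not in linked_keys)
    return frozenset(linked_keys), frozenset(linked_values)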

species_to_search = all_species_2_assemblies.keys()
logger.info(f'Process {len(species_to_search)} species')
for species in species_to_search:
    set_of_species, set_of_assemblies = find_link({species}, all_species_2_assemblies, all_assemblies_2_species)
Contributor:

Similar to the above comment: it would be slightly more efficient to keep track of which species have already been added to any of the all_sets_of_species sets and not redo the work, as sketched below.
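A minimal sketch of that bookkeeping, assuming all_sets_of_species collects the frozensets produced by find_link (names follow the diff above):

already_grouped = set()
for species in species_to_search:
    # skip species already pulled into a group by an earlier find_link call
    if species in already_grouped:
        continue
    set_of_species, set_of_assemblies = find_link({species}, all_species_2_assemblies, all_assemblies_2_species)
    already_grouped.update(set_of_species)
    all_sets_of_species.add(set_of_species)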

for annotation1 in dict_of_counter:
    for annotation2 in dict_of_counter[annotation1]:
        open_file.write("\t".join([
            str(annotation1), str(annotation2), str(dict_of_counter[annotation1][annotation2])
Contributor:

Would prefer more descriptive variable names if possible...
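For illustration, one possible renaming of the loop above (purely a suggestion, behaviour unchanged; the axis semantics are assumed):

# assuming the outer key is one annotation axis and the inner key another
for outer_annotation, counts_per_inner in dict_of_counter.items():
    for inner_annotation, count in counts_per_inner.items():
        open_file.write("\t".join([
            str(outer_annotation), str(inner_annotation), str(count)
        ]) + '\n')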

@tcezard merged commit c72ff13 into EBIvariation:master on Oct 9, 2023