
EVA-3409 Count rsid in release files #420

Merged: @tcezard merged 8 commits into EBIvariation:master on Oct 9, 2023

Conversation

@tcezard (Member) commented Sep 28, 2023:

count_rs_for_all_files.sh:
This PR adds a script that extracts the rsIDs from the release files and annotates each RS with its species/assembly/RS type. The annotations are then aggregated per rsID and counted. This gives a generic count of rsIDs across species, assemblies and types, from which an aggregate can be constructed along each of these axes.

gather_release_counts.py:
Determines the species and assemblies that should be counted together, runs count_rs_for_all_files.sh, then parses and aggregates the results.
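As an illustration of the aggregation step (a sketch only, not the script itself; the one-pair-per-line "rsid assembly-species-type" format is assumed from the awk command quoted further down):

from collections import Counter

def aggregate(annotated_lines):
    # collect, for each rsID, the set of annotations it was seen with
    per_rs = {}
    for line in annotated_lines:
        rsid, annotation = line.split()
        per_rs.setdefault(rsid, set()).add(annotation)
    # count how many rsIDs share each exact combination of annotations
    return Counter(frozenset(annotations) for annotations in per_rs.values())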

try:
    response = requests.get(ENA_XML_API_URL)
Member Author:

This is a fix applied during the release

OUTPUT=tmp_${SC_NAME}_${ASSEMBLY}_${TYPE}.txt
if [[ ${INPUT} == *.vcf.gz ]]
then
    zcat "${INPUT}" | grep -v '^#' | awk -v annotation="${ASSEMBLY}-${SC_NAME}-${TYPE}" '{print $3" "annotation}' > "${OUTPUT}"
Member:

Optional: Given the normalization and stability issues with scientific names, I am wondering if we should do a "reverse lookup" (from a static table or EVAPRO) to use the taxonomy instead of SC_NAME.

Member Author:

I don't think that's necessary. The bash script deals with files and returns annotations based on the file path. As long as the file path reflects the information we want to gather, we should be fine. Lookups should be handled in the Python layer.

Member Author:

Alternatively, we could also pass the annotation along with the file name, so that there is no guesswork in the bash script; see the sketch below.
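For instance, the Python layer could supply the annotation explicitly when invoking the bash script (a sketch only; the command-line interface and the variable names here are assumptions, not the script's actual interface):

import subprocess

# hypothetical invocation: the annotation is computed in the Python layer and
# passed as an argument, so the bash script never derives it from the file path
annotation = f"{assembly}-{scientific_name}-{rs_type}"
subprocess.run(['bash', 'count_rs_for_all_files.sh', input_vcf, annotation], check=True)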

value (in our case, the list of assemblies linked to each taxonomy and the list of taxonomies linked to each assembly), this recursive function starts from one of the values and finds all the related keys and values, providing them in two frozen sets. For any key that belongs to a relationship, this should provide the same pair of frozensets regardless of the starting key.
Member:

Optional: A small example could be helpful here for understanding this function.

Contributor:

Agreed. You could maybe even include this example in the form of a test case...
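A minimal sketch of what such a test case could look like, assuming find_link is called as in the diff further down (a starting key set plus the two mapping dicts) and returns a pair of frozensets; the data here is made up for illustration:

def test_find_link():
    # sp1 and sp2 are linked through asmA; sp3 stands alone
    species_to_assemblies = {'sp1': ['asmA'], 'sp2': ['asmA', 'asmB'], 'sp3': ['asmC']}
    assemblies_to_species = {'asmA': ['sp1', 'sp2'], 'asmB': ['sp2'], 'asmC': ['sp3']}
    expected = (frozenset({'sp1', 'sp2'}), frozenset({'asmA', 'asmB'}))
    # the same pair should come back regardless of the starting key
    assert find_link({'sp1'}, species_to_assemblies, assemblies_to_species) == expected
    assert find_link({'sp2'}, species_to_assemblies, assemblies_to_species) == expected
    assert find_link({'sp3'}, species_to_assemblies, assemblies_to_species) == \
        (frozenset({'sp3'}), frozenset({'asmC'}))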

set -e

OUTPUT_FILE=$1
FILE_WITH_ALL_INPUTS=$2
Member:

Optional: As a matter of style, this order could be swapped...

all_files = []
for species in set_of_species:
    species_dir = os.path.join(release_directory, species)
    assembly_directories = glob.glob(os.path.join(species_dir, "GCA_*"))
Member:

Optional: I am wondering if we should use the release tracker table as the source of truth instead of the release directory... This way we can also validate that all the intended species were released, and update that table if not.

f"select distinct c.taxonomy, t.scientific_name "
f"from eva_progress_tracker.clustering_release_tracker c "
f"join evapro.taxonomy t on c.taxonomy=t.taxonomy_id "
f"where release_version={self.release_version} AND release_folder_name='{species_folder}'"
Member:

Hope this works with release folders with weird names or trailing underscores in them...

Member Author:

I think it should. It is currently working for all folder names.
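Should it ever become an issue, bind parameters would sidestep quoting problems with unusual folder names; a sketch assuming a psycopg2-style cursor (the cursor variable is hypothetical):

query = (
    "select distinct c.taxonomy, t.scientific_name "
    "from eva_progress_tracker.clustering_release_tracker c "
    "join evapro.taxonomy t on c.taxonomy=t.taxonomy_id "
    "where release_version=%s AND release_folder_name=%s"
)
# the driver handles quoting and escaping, so odd folder names pass through safely
cursor.execute(query, (self.release_version, species_folder))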

Contributor @apriltuesday left a comment:

Mostly just suggestions, this looks great! Hopefully this is a counting script we can actually stick with for a while...

        ]) + '\n')


class ReleaseCounter(AppLogger):
Contributor:

I'd put this in a separate file; this one is pretty big, and this is a natural way to break it up.

Contributor:

I'd suggest either adding a DEPRECATED comment to, or removing entirely, the qc_release_counts script as a part of this PR. If we leave it and don't correct it, future poor souls (AKA us) are liable to run it and get confused all over again.

            results[assembly][metric] = row[index + 1]
        return results

    def parse_logs(self, all_logs):
Contributor:

Which logs is this parsing? I guess it's the output of the bash script; maybe add a docstring here and/or at the class level stating this.
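A sketch of such a docstring, assuming the inputs are indeed the per-file outputs of count_rs_for_all_files.sh (the wording is illustrative):

def parse_logs(self, all_logs):
    """Parse the output files produced by count_rs_for_all_files.sh listed in
    all_logs and aggregate the rsID counts per species/assembly/type annotation."""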

    return all_files


def calculate_all_logs(release_dir, output_dir, species_directories=None):
Contributor:

Maybe calculate_all_counts? "Calculating logs" sounds weird to me...

        linked_set1.update(dict2.get(value1))
    # if one of the sets is still growing, we check again
    if linked_set1 != source_linked_set1 or linked_set2 != source_linked_set2:
        tmp_linked_set1, tmp_linked_set2 = find_link(linked_set1, dict1, dict2, linked_set1, linked_set2)
Contributor:

I'm interpreting this algorithm (in computer-science speak) as finding connected components in a bipartite graph; if so, that's pretty cool!

Not sure if efficiency is an issue, but if you maintained a list of "visited" vertices you could ensure that you don't repeat work in the recursion (as I think linked_set1 is always a superset of key_set).
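A sketch of that idea as an iterative traversal (a hedged rewrite, not the PR's implementation; names are illustrative):

def find_link_iterative(start_keys, dict1, dict2):
    # dict1 maps keys to values, dict2 maps values back to keys
    linked_keys, linked_values = set(), set()
    to_visit = set(start_keys)
    while to_visit:
        key = to_visit.pop()
        linked_keys.add(key)  # mark as visited before expanding
        for value in dict1.get(key, []):
            linked_values.add(value)
            # only queue keys that have not been expanded yet
            to_visit.update(k for k in dict2.get(value, []) if k not in linked_keys)
    return frozenset(linked_keys), frozenset(linked_values)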

species_to_search = all_species_2_assemblies.keys()
logger.info(f'Process {len(species_to_search)} species')
for species in species_to_search:
    set_of_species, set_of_assemblies = find_link({species}, all_species_2_assemblies, all_assemblies_2_species)
Contributor:

Similar to the above comment: it would be slightly more efficient to keep track of which species have already been added to any of the all_sets_of_species sets and not redo the work, as sketched below.
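A minimal sketch of that bookkeeping, assuming all_sets_of_species collects the frozensets produced by find_link (names follow the diff above):

already_grouped = set()
for species in species_to_search:
    # skip species already pulled into a group by an earlier find_link call
    if species in already_grouped:
        continue
    set_of_species, set_of_assemblies = find_link({species}, all_species_2_assemblies, all_assemblies_2_species)
    already_grouped.update(set_of_species)
    all_sets_of_species.add(set_of_species)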

for annotation1 in dict_of_counter:
    for annotation2 in dict_of_counter[annotation1]:
        open_file.write("\t".join([
            str(annotation1), str(annotation2), str(dict_of_counter[annotation1][annotation2])
Contributor:

Would prefer more descriptive variable names if possible...
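For illustration, one possible renaming of the loop above (purely a suggestion, behaviour unchanged; the axis semantics are assumed):

# assuming the outer key is one annotation axis and the inner key another
for outer_annotation, counts_per_inner in dict_of_counter.items():
    for inner_annotation, count in counts_per_inner.items():
        open_file.write("\t".join([
            str(outer_annotation), str(inner_annotation), str(count)
        ]) + '\n')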

@tcezard merged commit c72ff13 into EBIvariation:master on Oct 9, 2023