EVA-3409 Count rsid in release files #420
Conversation
…f species, assembly and type
try:
    response = requests.get(ENA_XML_API_URL)
This is a fix applied during the release
eva-accession-release-automation/publish_release_to_ftp/publish_release_files_to_ftp.py
OUTPUT=tmp_${SC_NAME}_${ASSEMBLY}_${TYPE}.txt
if [[ ${INPUT} == *.vcf.gz ]]
then
    zcat "${INPUT}" | grep -v '^#' | awk -v annotation="${ASSEMBLY}-${SC_NAME}-${TYPE}" '{print $3" "annotation}' > ${OUTPUT}
Optional: Given the normalization and stability issues with scientific names, I am wondering if we should do a "reverse lookup" (from a static table or EVAPRO) to use the taxonomy instead of SC_NAME.
I don't think that's necessary. The bash script deals with files and returns annotations based on the file path. As long as the file path reflects the information we want to gather, we should be fine. We should handle lookups in the Python layer.
Alternatively, we could also pass the annotation with the file name so that there is no guesswork in the bash script.
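A minimal sketch of that alternative, assuming the annotation is built on the Python side and handed to the script as an explicit argument (the variable values and the two-argument script invocation below are hypothetical, not from this PR):

```python
# Hypothetical inputs; in the real pipeline these would come from the
# lookup layer (e.g. EVAPRO), not be hard-coded.
assembly = 'GCA_000001405.15'
scientific_name = 'homo_sapiens'
rs_type = 'current'
input_vcf = 'homo_sapiens/GCA_000001405.15/example_current_ids.vcf.gz'

# Build the annotation once in Python, where lookups are easy...
annotation = f'{assembly}-{scientific_name}-{rs_type}'

# ...and pass it to the bash script explicitly, so the script no longer
# infers anything from the file path.
command = ['bash', 'count_rs_for_all_files.sh', input_vcf, annotation]
```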
value, (in our case this is the list of assemblies linked to a taxonomy and the list of taxonomies linked to an assembly)
, this recursive function starts from one of the values, finds all the related keys and values, and provides them
in 2 frozen sets. For any key that belongs to a relationship, this should provide the same pair of frozensets
regardless of the starting key.
Optional: A small example could be helpful here in understanding this function.
Agreed. You could maybe even include this example in the form of a test case...
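A sketch of what such a test case might look like. The `find_link` below is a minimal fixed-point stand-in written for this example (not the implementation from this PR), and the species/assembly names are made up:

```python
def find_link(key_set, dict1, dict2):
    """Minimal fixed-point version: repeatedly expand the key and value
    sets until neither grows, i.e. the connected component is complete."""
    keys, values = set(key_set), set()
    while True:
        new_values = {v for k in keys for v in dict1.get(k, ())}
        new_keys = keys | {k for v in new_values for k in dict2.get(v, ())}
        if new_keys == keys and new_values == values:
            return frozenset(keys), frozenset(values)
        keys, values = new_keys, new_values

# sp1 and sp2 share assembly asmA, so they must be counted together;
# sp3 is independent.
species_2_assemblies = {'sp1': ['asmA'], 'sp2': ['asmA', 'asmB'], 'sp3': ['asmC']}
assemblies_2_species = {'asmA': ['sp1', 'sp2'], 'asmB': ['sp2'], 'asmC': ['sp3']}

# The same pair of frozensets comes back regardless of the starting key.
assert find_link({'sp1'}, species_2_assemblies, assemblies_2_species) == \
       find_link({'sp2'}, species_2_assemblies, assemblies_2_species) == \
       (frozenset({'sp1', 'sp2'}), frozenset({'asmA', 'asmB'}))
```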
set -e

OUTPUT_FILE=$1
FILE_WITH_ALL_INPUTS=$2
Optional: As a matter of style, this order could be swapped...
all_files = []
for species in set_of_species:
    species_dir = os.path.join(release_directory, species)
    assembly_directories = glob.glob(os.path.join(species_dir, "GCA_*"))
Optional: I am wondering if we should use the release tracker table as the source of truth instead of the release directory... This way we can also validate if all the intended species were released and update that table if not.
f"select distinct c.taxonomy, t.scientific_name "
f"from eva_progress_tracker.clustering_release_tracker c "
f"join evapro.taxonomy t on c.taxonomy=t.taxonomy_id "
f"where release_version={self.release_version} AND release_folder_name='{species_folder}'"
Hope this works with release folders with weird names or trailing underscores in them...
I think it should. It is currently working for all folder names.
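One hedged way to make this robust regardless of folder naming is to use query placeholders instead of f-string interpolation, assuming the query helper ultimately calls something like psycopg2's `cursor.execute(query, params)` (the variable values below are illustrative):

```python
release_version = 5  # illustrative values, not from the PR
species_folder = 'weird_name_with_trailing_underscore_'

# Placeholders let the database driver handle quoting, so folder names
# containing quotes, dashes or trailing underscores cannot break the SQL.
query = (
    "select distinct c.taxonomy, t.scientific_name "
    "from eva_progress_tracker.clustering_release_tracker c "
    "join evapro.taxonomy t on c.taxonomy=t.taxonomy_id "
    "where release_version=%s AND release_folder_name=%s"
)
params = (release_version, species_folder)
# cursor.execute(query, params)  # driver-side substitution, not string formatting
```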
Mostly just suggestions, this looks great! Hopefully this is a counting script we can actually stick with for a while...
]) + '\n')


class ReleaseCounter(AppLogger):
I'd put this in a separate file, this one is pretty big and this is a natural way to break it up.
I'd suggest either adding a DEPRECATED comment to, or removing entirely, the qc_release_counts script as part of this PR. If we leave it and don't correct it, future poor souls (AKA us) are liable to run it and get confused all over again.
eva-accession-release-automation/gather_clustering_counts/gather_release_counts.py
        results[assembly][metric] = row[index + 1]
    return results

    def parse_logs(self, all_logs):
Which logs is this parsing? I guess it's the output of the bash script, maybe add a docstring here and/or at the class level stating this.
    return all_files


def calculate_all_logs(release_dir, output_dir, species_directories=None):
Maybe calculate_all_counts? "Calculating logs" sounds weird to me...
eva-accession-release-automation/gather_clustering_counts/gather_release_counts.py
linked_set1.update(dict2.get(value1))
# if one of the sets is still growing we check again
if linked_set1 != source_linked_set1 or linked_set2 != source_linked_set2:
    tmp_linked_set1, tmp_linked_set2 = find_link(linked_set1, dict1, dict2, linked_set1, linked_set2)
I'm interpreting this algorithm (in computer science-speak) as finding connected components in a bipartite graph, if so that's pretty cool!
Not sure if efficiency is an issue, but if you maintained a list of "visited" vertices you could ensure that you don't repeat work in the recursion (as I think linked_set1 is always a superset of key_set).
species_to_search = all_species_2_assemblies.keys()
logger.info(f'Process {len(species_to_search)} species')
for species in species_to_search:
    set_of_species, set_of_assemblies = find_link({species}, all_species_2_assemblies, all_assemblies_2_species)
Similar to above comment, would be slightly more efficient to keep track of which species have been added in any all_sets_of_species sets and not re-do the work.
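A sketch of that bookkeeping, under the assumption that find_link has the call shape used in this PR. The grouping function and the tiny `_toy_find_link` stand-in are written for this example only:

```python
def group_species(species_2_assemblies, assemblies_2_species, find_link):
    """Run find_link once per connected group: any species already placed
    in an earlier group is skipped instead of recomputing its component."""
    groups, already_grouped = [], set()
    for species in species_2_assemblies:
        if species in already_grouped:
            continue  # this species was pulled in by an earlier group
        set_of_species, set_of_assemblies = find_link(
            {species}, species_2_assemblies, assemblies_2_species)
        already_grouped.update(set_of_species)
        groups.append((set_of_species, set_of_assemblies))
    return groups

# Tiny fixed-point stand-in for find_link, just to make the sketch runnable.
def _toy_find_link(key_set, dict1, dict2):
    keys, values = set(key_set), set()
    while True:
        new_values = {v for k in keys for v in dict1.get(k, ())}
        new_keys = keys | {k for v in new_values for k in dict2.get(v, ())}
        if new_keys == keys and new_values == values:
            return frozenset(keys), frozenset(values)
        keys, values = new_keys, new_values

groups = group_species(
    {'sp1': ['asmA'], 'sp2': ['asmA'], 'sp3': ['asmB']},
    {'asmA': ['sp1', 'sp2'], 'asmB': ['sp3']},
    _toy_find_link,
)
```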
for annotation1 in dict_of_counter:
    for annotation2 in dict_of_counter[annotation1]:
        open_file.write("\t".join([
            str(annotation1), str(annotation2), str(dict_of_counter[annotation1][annotation2])
Would prefer more descriptive variable names if possible...
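For example, something along these lines; the function name, argument names and the data shape are illustrative guesses, not the PR's code:

```python
from collections import Counter
import os
import tempfile

def write_counts(output_path, counts_per_annotation):
    """Write one tab-separated line per (annotation, linked annotation, count).

    counts_per_annotation maps an annotation string to a Counter of the
    annotations it co-occurs with.
    """
    with open(output_path, 'w') as open_file:
        for annotation, linked_counts in counts_per_annotation.items():
            for linked_annotation, count in linked_counts.items():
                open_file.write('\t'.join(
                    [str(annotation), str(linked_annotation), str(count)]) + '\n')

# Illustrative data only
counts = {'GCA_01-homo_sapiens-current': Counter({'GCA_01-homo_sapiens-merged': 3})}
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.close()
write_counts(tmp.name, counts)
with open(tmp.name) as check:
    demo_lines = check.read().splitlines()
os.unlink(tmp.name)
```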
Co-authored-by: April Shen <[email protected]>
count_rs_for_all_files.sh: This PR adds a script that extracts the rsids in release files and annotates each RS with the species/assembly/rs_type.
Then the annotations are aggregated per rsid and counted.
This allows for a generic count of rsids across species, assemblies and types, to be able to construct aggregates for each of these axes.
gather_release_counts.py: Determines the species and assemblies that should be counted together, runs count_rs_for_all_files.sh, then parses and aggregates the results.