Clarification on NGSpID output #24

omarkr8 · 2022-12-07T08:03:54Z

I'm looking at the output from the pipeline and referring to the text on the main page:

"The final cluster information is given in a tsv file final_clusters.tsv present in the specified output folder.
In the cluster TSV-file, the first column is the cluster ID and the second column is the read accession.
if there are n reads there will be n rows. Some reads might be singletons. The rows are ordered with respect to the size of the cluster (largest first)."

Please do correct me if my understanding is lacking here.

So let's say my final_clusters.tsv has 31k rows, so that's 31k reads from the fastq.
The first column goes from 0-22, so there are 23 clusters. Clusters 0 and 1 have far more rows than the others, so more reads were assigned to them. The remaining columns are just sequencing run details.
Supposedly each cluster is processed in its own medaka_cl_id_# folder, but I only see 17 folders. Why not 23, one for each cluster? This sample has 9 singleton clusters.
Another sample has clusters 0-36 but only 28 medaka folders; 12 of those clusters are singletons.
I assume that the first medaka folder in the directory corresponds to the first cluster (cluster 0). Is there a way to verify this?
Or perhaps it is practical to parse the read number from the consensus.fasta headers?
Does anyone have a loop handy to cat all the consensus.fasta files from a single sample, and to loop through multiple samples?

I guess I need more explanation of how these clusters are designated. But it's a very cool tool for sure!
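
The row counts described above can be checked directly from final_clusters.tsv. The following is a minimal sketch, assuming only what the quoted README text states: the file is tab-separated, the first column is the cluster ID, and everything after the first tab belongs to the read. The function names `cluster_sizes` and `singletons` are made up for illustration and are not part of NGSpeciesID.

```python
from collections import Counter


def cluster_sizes(lines):
    """Count rows (reads) per cluster ID in final_clusters.tsv-style lines."""
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only: the read field may itself contain
        # spaces or extra run details.
        cluster_id, _read = line.split("\t", 1)
        sizes[int(cluster_id)] += 1
    return sizes


def singletons(sizes):
    """Return the sorted IDs of clusters that contain exactly one read."""
    return sorted(cid for cid, n in sizes.items() if n == 1)


# Hypothetical usage on a real output folder:
# with open("output_folder/final_clusters.tsv") as handle:
#     sizes = cluster_sizes(handle)
# print(len(sizes), "clusters,", len(singletons(sizes)), "singletons")
```

Comparing the cluster IDs found this way with the numbers in the medaka_cl_id_# folder names shows which clusters did not get a consensus.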

ksahlin · 2022-12-07T17:32:55Z

Correct, but column 2 refers to the actual read id, see https://github.com/ksahlin/isONclust#output
NGSpeciesID does several more steps between clustering and the final consensus: (i) using only clusters with more than a given fraction of the total reads, (ii) reverse complement removal, and (iii) barcode removal. Therefore, some clusters (and their consensuses) may be merged, or removed because of their low fraction. This is why you don't see all 23 consensuses (see Figure 1 in the paper for the whole pipeline).
Yes, the medaka folder id corresponds to the cluster id.
I think it suffices to cat folder/medaka_cl_id*/consensus.fasta > all_consensi.fa but maybe you're looking for something more advanced.
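
For looping over several samples, the same idea can be sketched in Python. This assumes each sample's output folder contains medaka_cl_id_<N>/consensus.fasta, as in the cat command above, and that all sample folders sit under one parent directory. That layout, the function names, and the choice to order the files by cluster id are assumptions for illustration, not something NGSpeciesID provides.

```python
import re
from pathlib import Path


def cluster_id_from_folder(folder_name):
    """Extract N from a folder named medaka_cl_id_N, or None if it does not match."""
    match = re.fullmatch(r"medaka_cl_id_(\d+)", folder_name)
    return int(match.group(1)) if match else None


def collect_consensus(sample_dir, out_name="all_consensi.fa"):
    """Concatenate every medaka_cl_id_*/consensus.fasta of one sample.

    Writes sample_dir/out_name and returns the cluster ids that had a
    consensus, sorted numerically.
    """
    sample_dir = Path(sample_dir)
    found = []
    for fasta in sample_dir.glob("medaka_cl_id_*/consensus.fasta"):
        cid = cluster_id_from_folder(fasta.parent.name)
        if cid is not None:
            found.append((cid, fasta))
    found.sort()  # numeric order, so cluster 2 comes before cluster 10

    with (sample_dir / out_name).open("w") as out:
        for _cid, fasta in found:
            text = fasta.read_text()
            if not text:
                continue  # skip empty consensus files
            out.write(text if text.endswith("\n") else text + "\n")
    return [cid for cid, _fasta in found]


def collect_all(parent_dir):
    """Run collect_consensus on every sample folder directly under parent_dir."""
    return {
        sample.name: collect_consensus(sample)
        for sample in sorted(Path(parent_dir).iterdir())
        if sample.is_dir()
    }
```

The returned cluster ids can be compared against the first column of final_clusters.tsv to see which clusters have no medaka folder.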

omarkr8 · 2022-12-08T01:19:00Z

Thanks. Some follow-ups:
2) So between final_clusters.tsv and the actual medaka content, some clusters will be dropped or merged. Is the fate of the clusters recorded somewhere? I assume that high-abundance clusters are unlikely to be dropped this way.
3) In terms of finding out the size of each cluster, is it correct to parse the headers? For example, would ">consensus_cl_id_414_total_supporting_reads_31235_segment0" mean this cluster has 31235 reads associated with it?
4) Thanks, I ended up writing something similar.

ksahlin · 2022-12-08T08:33:26Z

Not really, but you can see from the output 'how many passed the abundance threshold'. Also, there is a parameter to set this lower if you want more clusters through. Similarly, the output will say how many reverse complements it detected and merged.
Yes.
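
Parsing the header as discussed could look like the sketch below. The pattern is inferred only from the single example header quoted in this thread (">consensus_cl_id_414_total_supporting_reads_31235_segment0"), so treat it as an assumption about the format. The names `parse_consensus_header` and `supporting_reads` are made up for illustration.

```python
import re

# Pattern inferred from the one example header in this thread.
HEADER_RE = re.compile(r"^>?consensus_cl_id_(\d+)_total_supporting_reads_(\d+)")


def parse_consensus_header(header):
    """Return (cluster_id, supporting_reads), or None if the header does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supporting_reads(fasta_lines):
    """Map cluster id to supporting read count for every header line in a FASTA."""
    counts = {}
    for line in fasta_lines:
        if line.startswith(">"):
            parsed = parse_consensus_header(line)
            if parsed is not None:
                cluster_id, n_reads = parsed
                counts[cluster_id] = n_reads
    return counts


# Hypothetical usage:
# with open("folder/medaka_cl_id_414/consensus.fasta") as handle:
#     print(supporting_reads(handle))
```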
