Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification on NGSpID output #24

Open
omarkr8 opened this issue Dec 7, 2022 · 3 comments
Open

Clarification on NGSpID output #24

omarkr8 opened this issue Dec 7, 2022 · 3 comments

Comments

@omarkr8
Copy link

omarkr8 commented Dec 7, 2022

I'm looking at the output from the pipeline and referring to text on the main page :

"The final cluster information is given in a tsv file final_clusters.tsv present in the specified output folder.
In the cluster TSV-file, the first column is the cluster ID and the second column is the read accession.
if there are n reads there will be n rows. Some reads might be singletons. The rows are ordered with respect to the size of the cluster (largest first)."

please do correct me if my understanding is lacking here..

  1. so let's say my final_clusters.tsv has 31k rows. so that's 31k reads from the fastq.
    the first column goes from 0-22, so there's 23 clusters. 0 and 1 has far more rows than the others. so more reads were used for those. the remaining rows are just sequencing run details.

  2. supposedly each cluster is processed in its own medaka_cl_id_# folder. but I only see 17 folders. why not 23, one for each cluster? this sample has 9 singleton clusters.
    another sample has 0-36 clusters, but 28 medaka folders, 12 of those clusters are singletons.

  3. I assume that the first medaka folder in the directory corresponds to the first cluster (cluster 0). Is there a way to verify this?
    or perhaps it is practical to parse the read number from the consensus.fasta headers?

  4. does anyone have a loop handy to cat all the consensus.fastas from a single sample, and loop through multiple samples?

I guess im needing more explanation on how these clusters are designated. but its a very cool tool for sure!

@ksahlin
Copy link
Owner

ksahlin commented Dec 7, 2022

  1. Correct, but column 2 refer to the actual read id, see https://github.com/ksahlin/isONclust#output
  2. NGSpeciesID does several more steps in between clustering and final consensus; (i) use only clusters with more that a fraction of total reads, (ii) reverse complement removal, and (iii) barcode removal. Therefore, some clusters(/consensuses) may be further merged and removed based on low fraction. This is why you don't see all 23 consensus. (see Figure 1 in the paper for the whole pipeline)
  3. Yes, the medaka folder id corresponds to the cluster id.
  4. I think it succices to cat folder/medaka_cl_id*/consensus.fasta > all_consensi.fa but maybe your looking for something more advanced.

@omarkr8
Copy link
Author

omarkr8 commented Dec 8, 2022

Thanks. some follow-up
2) so between final_cluster.tsv, and actual medaka content, some clusters will be dropped, or merged. is the fate of the clusters recorded somewhere? I assume that high abundance clusters are unlikely to be dropped this way.
3) in terms of finding out the size of each cluster, is it correct to parse the headers? for example, ">consensus_cl_id_414_total_supporting_reads_31235_segment0" would mean this cluster has 31235 reads associated with it?
4) thanks, i ended up writing something similar.

@ksahlin
Copy link
Owner

ksahlin commented Dec 8, 2022

  1. Not really, but you can see from the output 'how many passed the abundance threshold'. Also, there is a parameter to set this lower if you want more clusters through. Similarly, output will say how many reverse complements it detected and merged.
  2. yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants