Simulating 16S metabarcoding reads? #196

Open
marieleoz opened this issue Oct 30, 2024 · 3 comments

Comments

@marieleoz

Dear CAMISIM team,

I am starting to use CAMISIM for several of our projects, but I wonder whether it fits all of them. For instance, can CAMISIM simulate amplicon sequencing reads, especially 16S metabarcoding reads? My guess is that I would need a specific 16S reference file to replace tools/assembly_summary_complete_genomes.txt, but

  1. I don't know if you have one available
  2. I don't know how to generate it if you don't
  3. I don't know if this would be the only change needed

Could you help me figure this out please?

Thanks a lot!
Marie

@AlphaSquad
Collaborator

Hi Marie, thank you for your interest in CAMISIM.
In theory, CAMISIM can simulate any data set that a sequencer can produce, but getting a realistic data set might not be straightforward, and unfortunately we do not have a ready-made example for amplicon sequencing/16S metabarcoding. In any case, you will probably need to have the 16S sequences you are interested in available.

You need assembly_summary_complete_genomes.txt if you want to simulate your data set from a given (16S) profile; in that case, the file would need to contain the address of the 16S sequence(s) instead of the address of the full genomes. You should be able to use local/relative addresses here, too. However, in this mode CAMISIM tries to match "genomes" by scientific name. If you already know which "genomes" (i.e. 16S sequences) you want to use, it might be easier to skip the from-profile mode, run de novo instead, and use the distribution_file_paths option to set the genomes and abundances manually (taken from the profile).
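As a rough illustration of setting abundances manually, the sketch below converts profile counts into the text of a distribution file. It assumes such a file holds one tab-separated `genome_id`, `relative_abundance` pair per line; that layout, the genome IDs and the counts are assumptions made for this example and should be checked against the CAMISIM documentation.

```python
import csv
import io


def write_distribution(abundances, handle):
    """Write 'genome_id<TAB>relative_abundance' rows, normalised to sum to 1.

    The two-column layout is an assumption about what the
    distribution_file_paths option expects, not a confirmed specification.
    """
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for genome_id, value in sorted(abundances.items()):
        writer.writerow([genome_id, f"{value / total:.6f}"])


# Hypothetical profile: read counts per 16S sequence ID.
profile = {"seq16S1": 30, "seq16S2": 50, "seq16S3": 20}

buffer = io.StringIO()
write_distribution(profile, buffer)
distribution_text = buffer.getvalue()
print(distribution_text)
```

The genome IDs used here would have to match the IDs given to the input "genomes" in the de novo run.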

If you don't have a 16S profile to start from anyway, then you don't need assembly_summary_complete_genomes.txt; instead, you can use your 16S sequences as input genomes in a de novo run. CAMISIM simply assumes a log-normal distribution of genome abundances, so if you want to include artifacts such as PCR amplification biases, you would have to tune things manually.
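One way to picture "tuning things manually" is to draw log-normal abundances yourself and then reweight them with per-sequence bias factors. This is only a sketch of the idea, not CAMISIM's own code: the `mu` and `sigma` values, the sequence IDs and the bias factor are all placeholders chosen for illustration.

```python
import random


def lognormal_abundances(genome_ids, mu=1.0, sigma=2.0, seed=0):
    """Draw one log-normal value per genome and normalise to relative abundances."""
    rng = random.Random(seed)
    raw = {g: rng.lognormvariate(mu, sigma) for g in genome_ids}
    total = sum(raw.values())
    return {g: v / total for g, v in raw.items()}


def apply_bias(abundances, bias):
    """Multiply each abundance by a bias factor (default 1.0) and renormalise."""
    weighted = {g: a * bias.get(g, 1.0) for g, a in abundances.items()}
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


# Hypothetical sequence IDs; seqA is assumed to amplify twice as efficiently.
abundances = lognormal_abundances(["seqA", "seqB", "seqC"], seed=42)
biased = apply_bias(abundances, {"seqA": 2.0})
print(abundances)
print(biased)
```

The biased values could then be supplied as manual abundances rather than relying on the built-in distribution.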

I hope this already helps a little bit.

@marieleoz
Author

Hi Adrian,

Thanks a lot, I may not fully understand but it does help :)

I would like to simulate reads from a given profile / set of profiles indeed, so I'll try to focus on the from_profile option first. Can you confirm that I should:

  • find at least one representative 16S sequence for each taxon in my profile
  • cut the sequences so they match the specific subregion my primers amplified
  • store each sequence in a separate file (fasta format? what header?)
  • edit the assembly_summary_complete_genomes.txt so that it contains the tab-separated taxid, scientific name and path to the sequence file for each representative sequence
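The steps listed above could be sketched roughly as follows. The primers and sequence are toy data, primer matching is exact (real primers often contain degenerate bases, which this does not handle), and the three-column table simply mirrors the description in this post; the file shipped with CAMISIM may contain more columns, so compare against it before relying on this layout.

```python
import os
import tempfile

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_amplicon(seq, forward_primer, reverse_primer):
    """Return the region between the two primers (primers excluded).

    Returns None if either primer is not found by exact matching.
    """
    seq = seq.upper()
    start = seq.find(forward_primer.upper())
    if start == -1:
        return None
    start += len(forward_primer)
    end = seq.find(reverse_complement(reverse_primer.upper()), start)
    if end == -1:
        return None
    return seq[start:end]


def write_reference_set(records, out_dir, forward_primer, reverse_primer):
    """Write one FASTA file per taxon plus a tab-separated summary table."""
    rows = []
    for taxid, name, seq in records:
        amplicon = extract_amplicon(seq, forward_primer, reverse_primer)
        if amplicon is None:
            continue  # skip taxa whose sequence lacks the primer sites
        path = os.path.join(out_dir, f"taxid{taxid}.fasta")
        with open(path, "w") as fh:
            fh.write(f">taxid{taxid}\n{amplicon}\n")
        rows.append((str(taxid), name, path))
    # Hypothetical file name; columns: taxid, scientific name, path.
    with open(os.path.join(out_dir, "summary.tsv"), "w") as fh:
        for row in rows:
            fh.write("\t".join(row) + "\n")
    return rows


# Toy record: taxid, scientific name, made-up sequence.
records = [(562, "Escherichia coli", "GGACGTCCCCCTCAAGG")]
with tempfile.TemporaryDirectory() as tmp:
    rows = write_reference_set(records, tmp, "ACGT", "TTGA")
    written = sorted(os.listdir(tmp))
print(rows, written)
```

Whether the primer sequences themselves should stay in the amplicon depends on how the real reads were processed; this sketch drops them.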

I appreciate your help!

@AlphaSquad
Collaborator

Hi Marie,
If you were to use the from_profile option, this sounds like the correct way to proceed. The "genomes" - in this case the 16S sequences - need to be in FASTA format, correct. They don't need specific headers, but CAMISIM has had problems with special characters like " and _ or - in the past, so I would try to avoid these if possible.
There are a couple of caveats:

  • If you have taxa with new IDs or entirely new taxa that do not appear in the NCBI database CAMISIM provides, these might cause errors, so you may have to use a newer taxonomy dump
  • The matching via scientific names is not entirely foolproof, so it is possible that you meant for one specific genome to match, but CAMISIM chooses another genome or fails to match. I recommend using the --community-only option first: CAMISIM will then produce all the files needed for a run without performing the simulation, so you can check whether the matching worked as you desired.
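For the advice on FASTA headers, a small clean-up step along these lines might help. It is a sketch under the assumption that keeping only ASCII letters and digits from the first word of each header is acceptable; the accession-like header in the example is made up, and stripping characters can make two different headers collide, so the result should be checked for duplicates.

```python
import re


def sanitize_header(header_line):
    """Reduce a FASTA header line to the letters and digits of its first word."""
    words = header_line[1:].split()
    if not words:
        raise ValueError("empty FASTA header")
    cleaned = re.sub(r"[^A-Za-z0-9]", "", words[0])
    if not cleaned:
        raise ValueError(f"nothing left after cleaning: {header_line!r}")
    return cleaned


def sanitize_fasta(text):
    """Return FASTA text with every header line replaced by its cleaned form."""
    lines = []
    for line in text.splitlines():
        if line.startswith(">"):
            lines.append(">" + sanitize_header(line))
        else:
            lines.append(line)
    return "\n".join(lines) + "\n"


# Made-up header containing the characters mentioned above.
example = '>NR_024570.1 "E. coli" 16S-rRNA\nACGTACGT\n'
cleaned_example = sanitize_fasta(example)
print(cleaned_example)
```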
