CAMISIM: Generate the abundance profiles #19

Open
ndreey opened this issue Mar 12, 2023 · 5 comments

Comments

@ndreey
Owner

ndreey commented Mar 12, 2023

With the chosen genomes, generate the abundance profiles.
The smartest approach seems to be a bash script that generates them for us.
Here is the info from my comment on [#13]:

HOW I UNDERSTAND THE ABUNDANCE CALCULATION
The abundance is calculated based on the total sum of genome sizes.

  • Say I have the genomes G1, G2, G3, ..., G10 and Orchid, and I want Orchid to have an abundance of 50%.
  • G1 to G5 are 1000bp each, G6 to G10 are 1500bp each and Orchid is 12000bp
Genome    Size
G1        1000
G2        1000
...   
G6        1500
G7        1500
...    
Orchid    12000
  • Calculate the total genome size
    • tot = 1000 x 5 + 1500 x 5 + 12000 = 24500bp
  • Calculate the abundance value for each non-host genome.
    • abu = 1 / (number of genomes - 1)
    • There are 11 genomes including Orchid, so 11 - 1 = 10 non-host genomes (G1 to G10).
    • abu for G1 to G10 is 1/10 = 0.1
  • Set the abundance value of Orchid to 0.5
  • Calculate the total abundance value for all genomes.
    • abu_tot = abu_orchid + sum(abu_G1:abu_G10)
    • abu_tot = 0.5 + 10 x 0.1 = 1.5
  • Normalize the abundance values so they sum up to 1.
    • nrm_abu_orchid = abu_orchid / abu_tot = 0.5 / 1.5 = 0.3333
    • nrm_abu_Gn = 0.1 / 1.5 = 0.0667
  • BOOOM, there is your relative abundance. But NOTE: there should not be a heading row in the abundance.tsv file.
  • However, it seems that the abundances don't have to sum up to 1, as can be seen in the example above. But doing it this way they do sum to 1: 0.0667 x 10 + 0.3333 ~ 1
Genome    Abundance
G1        0.0667
G2        0.0667
... 
G10       0.0667
Orchid    0.3333
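
A minimal Python sketch of the steps above, in case it helps before writing the bash script. It assumes every non-host genome gets the same raw weight, the host gets a fixed raw weight, and everything is then divided by the total. The function names `abundance_profile` and `write_abundance_tsv` are made up for this example and are not part of CAMISIM.

```python
import csv
import io


def abundance_profile(genome_ids, host_id, host_weight=0.5):
    """Return {genome_id: normalized abundance}; the values sum to 1."""
    others = [g for g in genome_ids if g != host_id]
    if host_id not in genome_ids or not others:
        raise ValueError("need the host plus at least one other genome")
    # Equal raw weight for every non-host genome: 1 / (number of genomes - 1).
    raw = {g: 1.0 / len(others) for g in others}
    raw[host_id] = host_weight
    total = sum(raw.values())
    # Normalize so the whole profile sums to 1.
    return {g: raw[g] / total for g in genome_ids}


def write_abundance_tsv(profile, handle):
    """Write genome_id<TAB>abundance rows, without a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for genome_id, abundance in profile.items():
        writer.writerow([genome_id, f"{abundance:.4f}"])


# Toy input matching the example: G1..G10 plus the host.
genomes = [f"G{i}" for i in range(1, 11)] + ["Orchid"]
profile = abundance_profile(genomes, host_id="Orchid")

buffer = io.StringIO()  # stands in for an open abundance.tsv file
write_abundance_tsv(profile, buffer)
print(buffer.getvalue())
```

Note that with a raw host weight of 0.5 and the other raw weights summing to 1, the host lands at one third after normalization. Passing `host_weight=1.0` would put the host at exactly 50%.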

Good info on these issues

@ndreey
Owner Author

ndreey commented Mar 14, 2023

CREATING DATAFRAME

  • I want a df with these columns
    genome_id size taxid tax_group group

pseudo code

list_genome_id = []
list_size = []
list_NCBI = []
list_tax_group = []
list_group = []

for fasta in source_genomes/*.fasta:
	
	# Get genome_id
	match genome_id with fasta filename using genome_to_id.txt
	add genome_id to list_genome_id

	# Get size	
	match size with fasta filename using report_genome.txt
	add size to list_size
	
	# Get taxid
	match NCBI_ID with genome_id using metadata.tsv
	add NCBI_ID to list_NCBI
	
	# Get taxonomy group id
	match TAXPATH with NCBI_ID using taxonomic_profile.tsv
	tax_group = first number in TAXPATH, regex ^[0-9]+      # 2|1239|186801   --> 2
	add tax_group to list_tax_group
	
	# Get humanized group name (2759 = Eukaryota)
	if tax_group != 2759:
		group = "not_euk"
	else:
		group = "euk"
	add group to list_group

# Create dataframe (after the loop, once all lists are filled)
df <- data.frame(genome_id = list_genome_id, size = list_size, taxid = list_NCBI, tax_group = list_tax_group, group = list_group)
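
The pseudo code above could be sketched in Python roughly like this. To avoid guessing the exact layout of genome_to_id.txt, report_genome.txt, metadata.tsv and taxonomic_profile.tsv, the sketch assumes they have already been parsed into plain dicts. The names `top_level_taxid`, `build_rows` and all the toy values are hypothetical.

```python
import re

EUKARYOTA_TAXID = 2759


def top_level_taxid(taxpath):
    """Return the first taxid of a '|'-separated TAXPATH, e.g. '2|1239|186801' -> 2."""
    match = re.match(r"\d+", taxpath)
    if match is None:
        raise ValueError(f"no leading taxid in TAXPATH: {taxpath!r}")
    return int(match.group())


def build_rows(fasta_names, name_to_id, name_to_size, id_to_taxid, taxid_to_taxpath):
    """Build one record per genome with the columns wanted for the df."""
    rows = []
    for name in fasta_names:
        genome_id = name_to_id[name]
        taxid = id_to_taxid[genome_id]
        tax_group = top_level_taxid(taxid_to_taxpath[taxid])
        rows.append({
            "genome_id": genome_id,
            "size": name_to_size[name],
            "taxid": taxid,
            "tax_group": tax_group,
            "group": "euk" if tax_group == EUKARYOTA_TAXID else "not_euk",
        })
    return rows


# Toy inputs (made up); in practice these dicts would be filled from the four files.
fasta_names = ["bacterium_a.fasta", "fungus_b.fasta"]
name_to_id = {"bacterium_a.fasta": "Genome1", "fungus_b.fasta": "Genome2"}
name_to_size = {"bacterium_a.fasta": 1000, "fungus_b.fasta": 1500}
id_to_taxid = {"Genome1": 186801, "Genome2": 4751}
taxid_to_taxpath = {186801: "2|1239|186801", 4751: "2759|4751"}  # abbreviated toy paths

rows = build_rows(fasta_names, name_to_id, name_to_size, id_to_taxid, taxid_to_taxpath)
for row in rows:
    print(row)
```

If pandas is available, `pandas.DataFrame(rows)` turns the list of records into the dataframe with the five columns.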

@ndreey
Owner Author

ndreey commented Mar 15, 2023

  • Thaliana genome (GCF_000001735.4_TAIR10.1_genomic) has been replaced with Platanthera_zijinensis_chr

  • Added these to source_genomes/

    • Ceratobasidium_sp_CerAGI
    • Rhizoctonia_solani_Rhisola1
    • Tulasnella_calospora_Tulcal1

@ndreey
Owner Author

ndreey commented Mar 15, 2023

#26

@ndreey
Owner Author

ndreey commented Mar 16, 2023

Going with 70-80% Fungi, 20-30% Bacteria/Archaea and 0.5-2% Plasmids/Circular DNA/Virus #17

@ndreey
Owner Author

ndreey commented Mar 16, 2023

Example

  • Let's generate 10GB of data from the total genome size of all the 100 genomes:
  • Host abundance: 50% --> 5GB of data belongs to the host
  • Endophyte abundance: 1 - Host abundance --> 5GB belongs to the endophytes.

The relative abundances among the endophytes are:

  • Fungi: 67.8% | 3.39GB
    • OMF: 45.2% | 2.26GB (2 x rfungi)
      • Three OMFs --> 45.2%/3 = 15.0667% each | 0.7533GB each
    • rfungi: 22.6% | 1.13GB
      • Decided randomly but gotta sum up to 22.6%
  • Bark: 29.7%
    • Decided randomly but gotta sum up to 29.7%
  • Plasm: 2.5%
    • Decided randomly but gotta sum up to 2.5%
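
A rough Python sketch of this split, assuming every percentage is a share of the 5GB endophyte budget and that "decided randomly" means drawing random weights and rescaling them to the group total. The helper `split_randomly`, the member names and the member counts are made up for illustration.

```python
import random


def split_randomly(total_share, names, rng):
    """Split total_share across names using random weights rescaled to sum to total_share."""
    weights = [rng.random() for _ in names]
    scale = total_share / sum(weights)
    return {name: weight * scale for name, weight in zip(names, weights)}


total_gb = 10.0
host_share = 0.5
endophyte_gb = total_gb * (1.0 - host_share)  # data left for the endophytes

# Group shares of the endophyte budget, taken from the example.
group_shares = {"OMF": 0.452, "rfungi": 0.226, "Bark": 0.297, "Plasm": 0.025}
assert abs(sum(group_shares.values()) - 1.0) < 1e-9

rng = random.Random(42)  # fixed seed so the random split is reproducible

# The three OMFs share their group evenly; rfungi members get random shares.
omf = {name: group_shares["OMF"] / 3 for name in ("omf_1", "omf_2", "omf_3")}
rfungi = split_randomly(group_shares["rfungi"], ["rfungi_1", "rfungi_2", "rfungi_3", "rfungi_4"], rng)

for group, share in group_shares.items():
    print(f"{group}: {share:.1%} of endophytes, {share * endophyte_gb:.3f} GB")
for name, share in {**omf, **rfungi}.items():
    print(f"  {name}: {share:.4%}, {share * endophyte_gb:.4f} GB")
```

The same `split_randomly` call would work for Bark and Plasm with their own member lists.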
