02_variantCalling_ancestrals_c.Rmd

---
title: "Variant calling for ancestral genomes"
author: "Ricardo Ramiro"
date: "20/07/2020"
output: 
  html_document:
    code_folding: show
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo=TRUE, 
                      message=FALSE, 
                      error = TRUE,
                      warning=FALSE, 
                      paged.print=FALSE, 
                      fig.align = "center")

knitr::opts_hooks$set(eval = function(options) {
  if (options$engine == "bash") {
    options$eval <- FALSE
  }
  options
})

```


## Run breseq v0.34.1 to call mutations in B. thetaiotaomicron populations  

```{bash}

# activate conda environment where breseq is installed
conda activate breseq0.34.1

# change permissions on script to run it
chmod u+x breseq_ancestrals_polymorphismMQ_v3ref_c.sh

# run script (this is set up to run multiple samples in parallel)
nohup bash breseq_ancestrals_polymorphismMQ_v3ref_c.sh &

```


## get output.gd files into a single folder
```{bash, eval=FALSE, echo=FALSE}

# go to dir
cd analysis/genomics/ancestrals_breseq

# make new dir to put all gd files
mkdir breseq_gdOutput_ancestrals

# copy and rename gd files

for DIR in TDA*; do
    cd "$DIR"/output
    cp output.gd ../../breseq_gdOutput_ancestrals/$DIR.gd
    cd ../..
done


```

## use gdtools compare to get mutation tables
```{bash}

# activate conda environment with breseq
source activate breseq0.34.1

# reference sequence
REF=analysis/genomics/TDA1000/TDA1000_hybridAssembly/results/assemblies/prokka_annotation/TDA1000_chromosome_plasmid_v3.gff3

# move to folder with gd files
cd breseq_gdOutput_ancestrals

# generate html file comparing all runs of the ancestral
gdtools COMPARE -r $REF --repeat-header 0 -o ../breseq_output_ancestralsMQ.html `ls *MQ.gd`

```

The above html file was then copied in to google sheets of the same name. Two tabs were created:  
- raw: just copy paste  
- edited. In this sheet I did the following:  
  - google sheets considers anything that starts as a plus sign as a formula and gives an error. This is important in column C (mutation). To change this behaviour I replaced, using REGEX, 
  "=\+" by "'+". The apostrophe allows the plus sign to be the first character
  - when the same gene is mutated with both SNPs and deletions, breseq will add Δ in the frequency field for samples that got deletions. I replaced all these Δ by nothing  
  - replaced "\?" by noting
  - I changed the format of the frequencies of the mutations. These are expressed as %, I changed to decimal
  - some positions are written as ######:# (e.g. 314,132:1 or 314,132:2). These lead to problems when making position column as numeric. Thus, I replaced "\:[0-9]" by nothing
  - in the data from day 84 there were a few mutations in overlapping genes, with two gene names per cell, which would have caused problems in the plotting of the data. However, none of these were parallel, so these are mostly irrelevant for our anaylisis, so I didn't do anything about it  
  - change the name of the column "seq id" to "seq_id"  
  - in the gene column, I replaced all spaces, ← and → by nothing 

## get individual mutations that are present in the ancestrals
```{r}

#load libraries
library(tidyverse)
library(readxl)

# read the data
ancestrals<-read_xlsx("analysis/genomics/ancestrals_breseq/downstreamAnalysis/breseq_output_ancestralsMQ.xlsx",sheet=2)

# reorder coluns
ancestrals<-ancestrals[c(1:3,9:11,4:8)]

# reformat data to long format
ancestrals_long<-gather(ancestrals,key=sample,value=frequency,TDA1000_mi_pol_MQ:TDA1002_ne_pol_MQ)

#remove all NAs
ancestrals_long<-ancestrals_long[!is.na(ancestrals_long$frequency),]

# get the number of samples with a mutation per run type

ancestrals_long_s<-ancestrals_long %>% 
  group_by(seq_id,position,mutation,annotation,gene,description) %>%
  summarize(sample_count=n())


# writexl::write_xlsx(ancestrals_long_s,"analysis/genomics/ancestrals_breseq/downstreamAnalysis/ancestral_mutations_polymorphismHighMappingQualMQ.xlsx")


DT::datatable(ancestrals_long_s, extensions = "Buttons", 
          options = list(pageLength = 30,dom = "Blfrtip", buttons = "csv"))
```