lab_kallisto.Rmd

---
title: 'Kallisto '
subtitle: Workshop on RNA-Seq
output:
  bookdown::html_document2:
    highlight: textmate
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
      print: false
    toc_depth: 4
    number_sections: true
    df_print: default
    code_folding: none
    self_contained: false
    keep_md: false
    encoding: 'UTF-8'
    css: "assets/lab.css"
    include:
      after_body: assets/footer-lab.html
---

```{r,child="assets/header-lab.Rmd"}
```


# Kallisto

**Kallisto** is an "alignment-free" RNA-Seq quantification method that runs very fast with a small memory footprint, so that it can be run on most laptops. It is a command-line program that can be downloaded as binary executables for Linux or Mac, or in source code format. For a first insight into the program, read [here](https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/) and for the published article, see [here](http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3519.html). 

Kallisto is geared towards quantification on the transcript (isoform) level, rather than the gene-level (although the latter can also be done by post-processing Kallisto output.) However, read assignment to transcript isoforms cannot (in general) be done unambiguously, so there is an intrinsic "quantification noise" or variability in this process. Kallisto can thus be run either in a single step (which is very fast) or in "bootstrap" mode (which takes longer, but can be done on several processors in parallel) in order to get uncertainty estimates for the expression levels - a kind of error bars for the quantification process. Running with bootstraps is mandatory if you want to perform differential expression analysis of isoforms with Sleuth (see below).

Kallisto is primarily meant for quantification of an existing set of FASTA sequences, that is, it does not perform transcript assembly and it cannot quantify the expression of novel transcripts that are not in the transcript index that you provide to it. With that said, you can of course use contigs from an assembly that you have produced in some other program in your Kallisto index. It would also be possible to use the software for eg: metagenomics or metatranscriptomics quantification.


## Build index

Download the cDNA reference and create a index file to run on (This has already been done!). But if you want to do it yourself you can uncomment the code and run it yourselves. 


```{sh,eval=FALSE,chunk.title=TRUE}
#cd ~/RNAseq/
#mkdir rawData/references/ref_kallisto
   
#cd rawData/references/ref_kallisto

#wget ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz


#kallisto index \
#  --index kallisto_mm10_transcriptome_index.idx \
#  Mus_musculus.GRCm38.cdna.all.fa.gz


#cd ~/RNAseq/

```


It takes less than 10 minutes.


## Pseudoalign reads

Now you can pseudoalign all your reads to the Kallisto index using Kallisto quantification

```{sh,eval=FALSE,chunk.title=TRUE}

cd ~/RNAseq/

for sample_name in $(ls rawData/fastqFiles/*.clean.fq.gz.1M_reads.fq.gz); do

  file=${sample_name#rawData/fastqFiles/}
  sample=${file%.clean.fq.gz.1M_reads.fq.gz}
   
  echo $sample

  mkdir results/05-kallisto/$sample

  kallisto quant \
    --index rawData/references/ref_kallisto/kallisto_mm10_transcriptome_index.idx \
    --output-dir results/05-kallisto/$sample \
    --threads 3 \
    --single \
    --fragment-length 200 \
    --sd 20 \
    $sample_name

done

```


# Quantification

## Get gene quantifications
Since Kallisto uses transcripts estimates and many programs, like DEseq and EdgeR, works better with gene counts there are programs to get gene counts from transcript estimates in our case we will use txImport. TxImport can import estimates from a lot of different sources. You can read more about it in the [txImport vignette](https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html)

For the program to work, it needs a two-column-file that assigns each transcript to a gene. This can found using ensembl database or other public databases. In our case the relationship between the transcripts and gene is found in the sequencenames in the cDNA file that we downloaded. By using awk we can extract the transcript gene dependency and write it to a file. 


```{sh,eval=FALSE,chunk.title=FALSE}
# Example of a sequence name in file 
#  >ENSMUST00000177564.1 cdna chromosome:GRCm38:14:54122226:54122241:1 gene:ENSMUSG00000096176.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd2 description:T cell receptor delta diversity 2 [Source:MGI Symbol;Acc:MGI:4439546]


# Extract all transcriptnames (1st) and genenames (4th) from  sequence names and write to a file. If "gzcat" does not work, you can try replacing it by "gunzip -c"  
gzcat rawData/references/ref_kallisto/Mus_musculus.GRCm38.cdna.all.fa.gz| \
grep '>' |
awk '{FS= " "}BEGIN{ print "TXNAME,GENEID"};{print substr($1,2) "," substr($4,6)};' \
>rawData/references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv 

```


## TxImport

The next steps are carried out in R so make sure you start R in the conda environment. 

```{r ,eval=FALSE,chunk.title=FALSE}
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(DESeq2) # rna-seq
library(edgeR) # rna-seq
library(tximport) # importing kalisto transcript counts to geneLevels
library(readr) # Fast readr of files.
library(rhdf5) # read/convert kalisto output files.  

```


If *tximport* and *rhdf5* are not installed you can install them now. 
```{r ,eval=FALSE,chunk.title=FALSE} 

#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")

#BiocManager::install("tximport")
#BiocManager::install("rhdf5")
```


Load the metadata for the samples. 
```{r  ,eval=FALSE,chunk.title=FALSE}

# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")

# download metadata

download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=F,row.names=1)

mr
```


###  Convert to gene counts 

Read in the Kallisto transcript abundances and convert the TxImport.

TxImport is created to 

```{r  ,eval=FALSE,chunk.title=FALSE}
setwd("~/RNAseq")
files <- paste("results/05-kallisto" ,  
               list.files(path = "results/05-kallisto",
                          pattern = "abundance.tsv", 
                          recursive = TRUE),
               sep = "/")
names(files) <- mr$SampleName


tx2gene <- read_csv("rawData/references/ref_kallisto/tx2gene.mm.GRCm38.cdna.csv")


txi.kallisto.tsv <- tximport(files, 
                             type = "kallisto", 
                             tx2gene = tx2gene, 
                             ignoreAfterBar = TRUE)
```


### Convert to DEseq2 object for analysis

```{r ,eval=FALSE,chunk.title=FALSE}
mr = mr %>% mutate(Day = as.factor(Day))
dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, mr, ~Day)


test = DESeq(dds)


```
and now you can go on with the differential gene expression in a similar way as the one you did with the feature count table.


### Convert to a edgeR object for analysis

```{r,eval=FALSE,chunk.title=FALSE}
cts <- txi.kallisto.tsv$counts
normMat <- txi.kallisto.tsv$length
normMat <- normMat/exp(rowMeans(log(normMat)))
library(edgeR)
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)

# y is now ready for estimate dispersion functions see edgeR User's Guide


```


# Session info

```{r,echo=FALSE}
sessionInfo()
```

***