Skip to content

Hybrid scaffolding of NA12878

adigenova edited this page Dec 16, 2017 · 28 revisions

Table of Contents

FAST-SG inputs

Illumina assembly

We used the DISCOVARdenovo illumina assembly of NA12878 as input, we donwloaded the assembly from:

wget ftp://ftp.broadinstitute.org/pub/crd/Discovar/assemblies/51400.newchem/a.lines.fasta -O NA12878.asm.fa

Then we selected the contigs larger than 2kb using the following commands:

samtools faidx NA12878.asm.fa

sort -rn -k2,2 NA12878.asm.fa.fai | awk '{if($2 >=2000){print $1}}' > NA12878.asm.2kb.lst

samtools faidx NA12878.asm.fa $(cat NA12878.asm.2kb.lst | xargs) > NA12878.asm.2kb.fa

Nanopore reads

We downloaded the ultra-long nanopore reads (rel4) produced by Jain et al for NA12878. The flow cell used were the followings:

Flowcell Kit FASTQ
FAF15665 Ultra FASTQ
FAF13748 Ultra FASTQ
FAF10039 Ultra FASTQ
FAF09968 Ultra FASTQ
FAF09277 Ultra FASTQ
FAF14035 Ultra FASTQ
FAF15694 Ultra FASTQ
FAF09713 Ultra FASTQ
FAF18554 Rapid FASTQ
FAF15630 Ultra FASTQ
FAF09640 Ultra FASTQ
FAF09701 Ultra FASTQ
FAF15586 Ultra FASTQ
FAF05869 Ligation FASTQ

Then, all FASTQ files were concatenated and the reads were rename and transformed to FASTA format using the following command:

zcat *.fastq.gz | awk 'BEGIN{h=1;s=2;i=1}{if(NR==h){print ">ONT_"i; h+=4;i++;} if(NR == s){print $0; s+=4;}}' | fold | gzip > ULTRA-LONG-RENAME-FOLD.fa.gz

Run FAST-SG

FAST-SG configuration file

For scaffolding the DISCOVARdenovo assembly we extracted 20 synthetic mate pair libraries using FAST-SG from the ultra-long Nanopore reads. The synthetic libraries ranged from 2Kb to 180Kb (BAC size). The following configuration file (space separated) was used:

ultra-long-conf.txt:

long ultra_ont_raw ULTRA-LONG-RENAME-FOLD.fa.gz 2000,4000,6000,8000,10000,12000,14000,16000,18000,20000,30000,40000,50000,60000,70000,80000,100000,120000,150000,180000 1

FAST-SG command

We run Fast-SG using the following command :

FAST-SG.pl -k 22 -l ultra-long-conf.txt –r NA12878.asm.2kb.fa -p NA12878-ultra –t 20 > NA12878.fastsg.log

We used a k-mer size of 22 (-k 22), 20 CPUs (-t 20) and default options for long reads. We recommend use k-mer sizes in the range of 15-22 for uncorrected long reads. If the long reads are error corrected (i.e using illumina reads) you can increase the k-mer size.

FAST-SG output

The Fast-SG output correspond to a sam file for each synthetic library, in this case the following SAM files were generated:

ultra_ont_raw.I2000.fast-sg_K22.sam ultra_ont_raw.I4000.fast-sg_K22.sam
ultra_ont_raw.I6000.fast-sg_K22.sam ultra_ont_raw.I8000.fast-sg_K22.sam
ultra_ont_raw.I10000.fast-sg_K22.sam ultra_ont_raw.I12000.fast-sg_K22.sam
ultra_ont_raw.I14000.fast-sg_K22.sam ultra_ont_raw.I16000.fast-sg_K22.sam
ultra_ont_raw.I18000.fast-sg_K22.sam ultra_ont_raw.I20000.fast-sg_K22.sam
ultra_ont_raw.I30000.fast-sg_K22.sam ultra_ont_raw.I40000.fast-sg_K22.sam
ultra_ont_raw.I50000.fast-sg_K22.sam ultra_ont_raw.I60000.fast-sg_K22.sam
ultra_ont_raw.I70000.fast-sg_K22.sam ultra_ont_raw.I80000.fast-sg_K22.sam
ultra_ont_raw.I100000.fast-sg_K22.sam ultra_ont_raw.I120000.fast-sg_K22.sam
ultra_ont_raw.I150000.fast-sg_K22.sam ultra_ont_raw.I180000.fast-sg_K22.sam

Useful information in the log file produced by FAST-SG

The log file ("NA12878.fastsg.log") contains useful information about the insert size estimation of each synthetic library:

Estimating insert-sizes using 1000 long reads
insert-size d=2000 observed average insert-size 2010
insert-size d=4000 observed average insert-size 4166
insert-size d=6000 observed average insert-size 6323
insert-size d=8000 observed average insert-size 8478
insert-size d=10000 observed average insert-size 10636
insert-size d=12000 observed average insert-size 12793
insert-size d=14000 observed average insert-size 14950
insert-size d=16000 observed average insert-size 17106
insert-size d=18000 observed average insert-size 19259
insert-size d=20000 observed average insert-size 21414
insert-size d=30000 observed average insert-size 32169
insert-size d=40000 observed average insert-size 42912
insert-size d=50000 observed average insert-size 53641
insert-size d=60000 observed average insert-size 64340
insert-size d=70000 observed average insert-size 74955
insert-size d=80000 observed average insert-size 85557
insert-size d=100000 observed average insert-size 106753
insert-size d=120000 observed average insert-size 127813
insert-size d=150000 observed average insert-size 159463
insert-size d=180000 observed average insert-size 191339
End of insert-size inference

Boxplot of the synthetic libraries extracted by FAST-SG

The insert sizes were computed using the following commands (get-insert-size.sh):

egrep -v "^@" $1 |awk '{fwd=$1 ;ctg=$3; start=$4; getline; if(ctg != $3){}else{ v=start-$4; if(v < 0){v=v*-1;};print v}}' | head -n 900000 > $1.insert.txt

where $1 is the sam file generated by FAST-SG for each synthetic library.

Finally, we can use the insert size files (*.insert.txt) to compose a boxplot using the following R-code:

files=read.table(file="files-sort.txt")
pdf(file="Insert-Human.pdf")
par(mfrow=c(5,4),mar=c(1, 4, 2, 1))
n<-c("2kb","4kb","6kb","8kb","10kb","12kb","14kb",
"16kb","18kb","20kb","30kb","40kb","50kb","60kb","70kb",
"80kb","100kb","120kb","150kb","180kb")
for(i in 1:length(files$V1)){
b[i]=read.table(file=as.vector(files$V1[i]))
boxplot(b[[i]],outline=F,ylab="Insert size (kb)",las=2,boxwex=0.3,col=color[i])
}
dev.off()

The boxplot is draw using a total of 900,000 insert sizes from read pairs mapped within contigs for each synthetic library (SAM files). The percentage of outliers detected ranged from a minimum of 1.19% and a maximum of 11.56% for 2kb and 180kb respectively.

You should obtain a boxplot like this:

Insert sizes for NA12878

Scaffolding NA12878 using the synthetic libraries

The synthetic mate pair libraries extracted by FAST-SG were used to feed ScaffMatch (short read scaffolder).

Preparing the SAM files for ScaffMatch

First, we need to split the SAM files in forward and reverse using the following command:

awk -f fastSG2scaff.awk -v name=$(basename ultra_ont_raw.I2000.fast-sg_K22.sam) -v k=22 ultra_ont_raw.I2000.fast-sg_K22.sam

where fastSG2scaff.awk is a script provided with the FAST-SG source code. The result of the previous command are two SAM files ultra_ont_raw.I2000.fast-sg_K22.fwd.sam and ultra_ont_raw.I2000.fast-sg_K22.rev.sam. We need to run the previous command for each synthetic libraries created by FAST-SG.

Run ScaffMatch

To run ScaffMatch V0.9 we need to provide the average insert size (-i), the standard deviation (-s) and the orientation for each synthetic library. The standard deviation was set to 10% of the average insert size estimated by FAST-SG for each library (see log file). The ScaffMatch command used was the following:

scaffmatch -m -w UONT-K22-7X -c NA12878.asm.2kb.fa \
-s 201,416,632,847,1063,1279,1495,1710,1925,2141,3216,4291,5364,6434,7495,8555,10675,12781,15946,19133 \
-i 2010,4166,6323,8478,10636,12793,14950,17106,19259,21414,32169,42912,53641,64340,74955,85557,106753,127813,159463,191339 \
-p fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr \
-1 ultra_ont_raw.I2000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I4000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I6000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I8000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I10000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I12000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I14000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I16000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I18000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I20000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I30000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I40000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I50000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I60000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I70000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I80000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I100000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I120000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I150000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I180000.Fast-SG_K22.fwd.sam \
-2 ultra_ont_raw.I2000.Fast-SG_K22.rev.sam,ultra_ont_raw.I4000.Fast-SG_K22.rev.sam,ultra_ont_raw.I6000.Fast-SG_K22.rev.sam,ultra_ont_raw.I8000.Fast-SG_K22.rev.sam,ultra_ont_raw.I10000.Fast-SG_K22.rev.sam,ultra_ont_raw.I12000.Fast-SG_K22.rev.sam,ultra_ont_raw.I14000.Fast-SG_K22.rev.sam,ultra_ont_raw.I16000.Fast-SG_K22.rev.sam,ultra_ont_raw.I18000.Fast-SG_K22.rev.sam,ultra_ont_raw.I20000.Fast-SG_K22.rev.sam,ultra_ont_raw.I30000.Fast-SG_K22.rev.sam,ultra_ont_raw.I40000.Fast-SG_K22.rev.sam,ultra_ont_raw.I50000.Fast-SG_K22.rev.sam,ultra_ont_raw.I60000.Fast-SG_K22.rev.sam,ultra_ont_raw.I70000.Fast-SG_K22.rev.sam,ultra_ont_raw.I80000.Fast-SG_K22.rev.sam,ultra_ont_raw.I100000.Fast-SG_K22.rev.sam,ultra_ont_raw.I120000.Fast-SG_K22.rev.sam,ultra_ont_raw.I150000.Fast-SG_K22.rev.sam,ultra_ont_raw.I180000.Fast-SG_K22.rev.sam

Finally, the scaffold sequences are found in UONT-K22-7X/Scaffolds.fa