-
Notifications
You must be signed in to change notification settings - Fork 3
Hybrid scaffolding of NA12878
We used the DISCOVARdenovo illumina assembly of NA12878 as input, we donwloaded the assembly from:
wget ftp://ftp.broadinstitute.org/pub/crd/Discovar/assemblies/51400.newchem/a.lines.fasta -O NA12878.asm.fa
Then we selected the contigs larger than 2kb using the following commands:
samtools faidx NA12878.asm.fa
sort -rn -k2,2 NA12878.asm.fa.fai | awk '{if($2 >=2000){print $1}}' > NA12878.asm.2kb.lst
samtools faidx NA12878.asm.fa $(cat NA12878.asm.2kb.lst | xargs) > NA12878.asm.2kb.fa
We downloaded the ultra-long nanopore reads (rel4) produced by Jain et al for NA12878. The flow cell used were the followings:
Flowcell | Kit | FASTQ |
---|---|---|
FAF15665 | Ultra | FASTQ |
FAF13748 | Ultra | FASTQ |
FAF10039 | Ultra | FASTQ |
FAF09968 | Ultra | FASTQ |
FAF09277 | Ultra | FASTQ |
FAF14035 | Ultra | FASTQ |
FAF15694 | Ultra | FASTQ |
FAF09713 | Ultra | FASTQ |
FAF18554 | Rapid | FASTQ |
FAF15630 | Ultra | FASTQ |
FAF09640 | Ultra | FASTQ |
FAF09701 | Ultra | FASTQ |
FAF15586 | Ultra | FASTQ |
FAF05869 | Ligation | FASTQ |
Then, all FASTQ files were concatenated and the reads were rename and transformed to FASTA format using the following command:
zcat *.fastq.gz | awk 'BEGIN{h=1;s=2;i=1}{if(NR==h){print ">ONT_"i; h+=4;i++;} if(NR == s){print $0; s+=4;}}' | fold | gzip > ULTRA-LONG-RENAME-FOLD.fa.gz
For scaffolding the DISCOVARdenovo assembly we extracted 20 synthetic mate pair libraries using FAST-SG from the ultra-long Nanopore reads. The synthetic libraries ranged from 2Kb to 180Kb (BAC size). The following configuration file (space separated) was used:
ultra-long-conf.txt:
long ultra_ont_raw ULTRA-LONG-RENAME-FOLD.fa.gz 2000,4000,6000,8000,10000,12000,14000,16000,18000,20000,30000,40000,50000,60000,70000,80000,100000,120000,150000,180000 1
We run Fast-SG using the following command :
FAST-SG.pl -k 22 -l ultra-long-conf.txt –r NA12878.asm.2kb.fa -p NA12878-ultra –t 20 > NA12878.fastsg.log
We used a k-mer size of 22 (-k 22), 20 CPUs (-t 20) and default options for long reads. We recommend use k-mer sizes in the range of 15-22 for uncorrected long reads. If the long reads are error corrected (i.e using illumina reads) you can increase the k-mer size.
The Fast-SG output correspond to a sam file for each synthetic library, in this case the following SAM files were generated:
ultra_ont_raw.I2000.fast-sg_K22.sam ultra_ont_raw.I4000.fast-sg_K22.sam
ultra_ont_raw.I6000.fast-sg_K22.sam ultra_ont_raw.I8000.fast-sg_K22.sam
ultra_ont_raw.I10000.fast-sg_K22.sam ultra_ont_raw.I12000.fast-sg_K22.sam
ultra_ont_raw.I14000.fast-sg_K22.sam ultra_ont_raw.I16000.fast-sg_K22.sam
ultra_ont_raw.I18000.fast-sg_K22.sam ultra_ont_raw.I20000.fast-sg_K22.sam
ultra_ont_raw.I30000.fast-sg_K22.sam ultra_ont_raw.I40000.fast-sg_K22.sam
ultra_ont_raw.I50000.fast-sg_K22.sam ultra_ont_raw.I60000.fast-sg_K22.sam
ultra_ont_raw.I70000.fast-sg_K22.sam ultra_ont_raw.I80000.fast-sg_K22.sam
ultra_ont_raw.I100000.fast-sg_K22.sam ultra_ont_raw.I120000.fast-sg_K22.sam
ultra_ont_raw.I150000.fast-sg_K22.sam ultra_ont_raw.I180000.fast-sg_K22.sam
The log file ("NA12878.fastsg.log") contains useful information about the insert size estimation of each synthetic library:
Estimating insert-sizes using 1000 long reads
insert-size d=2000 observed average insert-size 2010
insert-size d=4000 observed average insert-size 4166
insert-size d=6000 observed average insert-size 6323
insert-size d=8000 observed average insert-size 8478
insert-size d=10000 observed average insert-size 10636
insert-size d=12000 observed average insert-size 12793
insert-size d=14000 observed average insert-size 14950
insert-size d=16000 observed average insert-size 17106
insert-size d=18000 observed average insert-size 19259
insert-size d=20000 observed average insert-size 21414
insert-size d=30000 observed average insert-size 32169
insert-size d=40000 observed average insert-size 42912
insert-size d=50000 observed average insert-size 53641
insert-size d=60000 observed average insert-size 64340
insert-size d=70000 observed average insert-size 74955
insert-size d=80000 observed average insert-size 85557
insert-size d=100000 observed average insert-size 106753
insert-size d=120000 observed average insert-size 127813
insert-size d=150000 observed average insert-size 159463
insert-size d=180000 observed average insert-size 191339
End of insert-size inference
The insert sizes were computed using the following commands (get-insert-size.sh):
egrep -v "^@" $1 |awk '{fwd=$1 ;ctg=$3; start=$4; getline; if(ctg != $3){}else{ v=start-$4; if(v < 0){v=v*-1;};print v}}' | head -n 900000 > $1.insert.txt
where $1 is the sam file generated by FAST-SG for each synthetic library.
Finally, we can use the insert size files (*.insert.txt) to compose a boxplot using the following R-code:
files=read.table(file="files-sort.txt")
pdf(file="Insert-Human.pdf")
par(mfrow=c(5,4),mar=c(1, 4, 2, 1))
n<-c("2kb","4kb","6kb","8kb","10kb","12kb","14kb",
"16kb","18kb","20kb","30kb","40kb","50kb","60kb","70kb",
"80kb","100kb","120kb","150kb","180kb")
for(i in 1:length(files$V1)){
b[i]=read.table(file=as.vector(files$V1[i]))
boxplot(b[[i]],outline=F,ylab="Insert size (kb)",las=2,boxwex=0.3,col=color[i])
}
dev.off()
The boxplot is draw using a total of 900,000 insert sizes from read pairs mapped within contigs for each synthetic library (SAM files). The percentage of outliers detected ranged from a minimum of 1.19% and a maximum of 11.56% for 2kb and 180kb respectively.
You should obtain a boxplot like this:
The synthetic mate pair libraries extracted by FAST-SG were used to feed ScaffMatch (short read scaffolder).
First, we need to split the SAM files in forward and reverse using the following command:
awk -f fastSG2scaff.awk -v name=$(basename ultra_ont_raw.I2000.fast-sg_K22.sam) -v k=22 ultra_ont_raw.I2000.fast-sg_K22.sam
where fastSG2scaff.awk is a script provided with the FAST-SG source code. The result of the previous command are two SAM files ultra_ont_raw.I2000.fast-sg_K22.fwd.sam and ultra_ont_raw.I2000.fast-sg_K22.rev.sam. We need to run the previous command for each synthetic libraries created by FAST-SG.
To run ScaffMatch V0.9 we need to provide the average insert size (-i), the standard deviation (-s) and the orientation for each synthetic library. The standard deviation was set to 10% of the average insert size estimated by FAST-SG for each library (see log file). The ScaffMatch command used was the following:
scaffmatch -m -w UONT-K22-7X -c NA12878.asm.2kb.fa \
-s 201,416,632,847,1063,1279,1495,1710,1925,2141,3216,4291,5364,6434,7495,8555,10675,12781,15946,19133 \
-i 2010,4166,6323,8478,10636,12793,14950,17106,19259,21414,32169,42912,53641,64340,74955,85557,106753,127813,159463,191339 \
-p fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr,fr \
-1 ultra_ont_raw.I2000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I4000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I6000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I8000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I10000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I12000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I14000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I16000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I18000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I20000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I30000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I40000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I50000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I60000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I70000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I80000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I100000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I120000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I150000.Fast-SG_K22.fwd.sam,ultra_ont_raw.I180000.Fast-SG_K22.fwd.sam \
-2 ultra_ont_raw.I2000.Fast-SG_K22.rev.sam,ultra_ont_raw.I4000.Fast-SG_K22.rev.sam,ultra_ont_raw.I6000.Fast-SG_K22.rev.sam,ultra_ont_raw.I8000.Fast-SG_K22.rev.sam,ultra_ont_raw.I10000.Fast-SG_K22.rev.sam,ultra_ont_raw.I12000.Fast-SG_K22.rev.sam,ultra_ont_raw.I14000.Fast-SG_K22.rev.sam,ultra_ont_raw.I16000.Fast-SG_K22.rev.sam,ultra_ont_raw.I18000.Fast-SG_K22.rev.sam,ultra_ont_raw.I20000.Fast-SG_K22.rev.sam,ultra_ont_raw.I30000.Fast-SG_K22.rev.sam,ultra_ont_raw.I40000.Fast-SG_K22.rev.sam,ultra_ont_raw.I50000.Fast-SG_K22.rev.sam,ultra_ont_raw.I60000.Fast-SG_K22.rev.sam,ultra_ont_raw.I70000.Fast-SG_K22.rev.sam,ultra_ont_raw.I80000.Fast-SG_K22.rev.sam,ultra_ont_raw.I100000.Fast-SG_K22.rev.sam,ultra_ont_raw.I120000.Fast-SG_K22.rev.sam,ultra_ont_raw.I150000.Fast-SG_K22.rev.sam,ultra_ont_raw.I180000.Fast-SG_K22.rev.sam
Finally, the scaffold sequences are found in UONT-K22-7X/Scaffolds.fa