The raw data were downloaded from the PacBio FTP site, which provides publicly accessible PacBio Sequel II datasets for many genomes. We used the Arabidopsis thaliana dataset for assembly.
Lists of input files were generated as follows:
ls *.scraps.bam > scraps.fofn
ls *.subreads.bam > subreads.fofn
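A Sequel run normally writes a scraps BAM alongside each subreads BAM, so the two lists should line up. Below is a minimal sketch of a sanity check on them. The function name `check_fofn_pairs` is hypothetical, and the check assumes that paired files differ only in the `.subreads.bam` / `.scraps.bam` suffix.

```shell
# Hypothetical helper: report subreads BAMs whose matching scraps BAM
# is missing from the scraps list. Assumes paired files share a prefix.
check_fofn_pairs() {
    local subreads_fofn="$1" scraps_fofn="$2" bam prefix missing=0
    while IFS= read -r bam; do
        [ -n "$bam" ] || continue                 # skip blank lines
        prefix="${bam%.subreads.bam}"
        if ! grep -qxF "${prefix}.scraps.bam" "$scraps_fofn"; then
            echo "no scraps file listed for ${bam}" >&2
            missing=$((missing + 1))
        fi
    done < "$subreads_fofn"
    [ "$missing" -eq 0 ]                          # non-zero status if any pair is incomplete
}
```

It could be called as `check_fofn_pairs subreads.fofn scraps.fofn` before any downstream step.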
Canu was used to assemble the subreads as-is. First, the reads were converted to FASTA format; then a config file was created with the Canu parameters.
samtools fasta --threads 16 m54113_160914_092411.subreads.bam > m54113_160914_092411.subreads.fasta
samtools fasta --threads 16 m54113_160913_184949.subreads.bam > m54113_160913_184949.subreads.fasta
cat *.subreads.fasta > subreads-raw.fa
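The conversions above can also be driven from `subreads.fofn`, so the file names need not be typed by hand. This is a sketch rather than the commands that were actually run. `fasta_name` and `convert_fofn` are hypothetical helper names, and the default thread count is carried over from the commands above.

```shell
# Hypothetical helper: derive the FASTA name from a BAM name.
fasta_name() {
    printf '%s\n' "${1%.bam}.fasta"
}

# Convert every BAM listed in a file-of-file-names to FASTA.
# $1: fofn path, $2: samtools threads (defaults to 16).
convert_fofn() {
    local fofn="$1" threads="${2:-16}" bam
    while IFS= read -r bam; do
        [ -n "$bam" ] || continue                 # skip blank lines
        samtools fasta --threads "$threads" "$bam" > "$(fasta_name "$bam")" || return 1
    done < "$fofn"
}
```

A possible call is `convert_fofn subreads.fofn 16`, followed by the same `cat` step as above.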
The config file for the raw dataset was as follows:
gridEngineResourceOption="--time=4:00:00 -N 1 --cpus-per-task=THREADS --mem-per-cpu=MEMORY"
gridOptionsCNS="--time=8:00:00 -N 1 -p freefat --cpus-per-task=8 --mem-per-cpu=40GB"
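For reference, the two settings above can be written to a spec file with a here-document. This is only a sketch: the helper name `write_canu_cfg` and the file name in the usage note are placeholders, since the source does not name the file. The quoted `'EOF'` delimiter stops the shell from expanding anything, so the `THREADS` and `MEMORY` placeholders reach Canu verbatim.

```shell
# Hypothetical helper: write the two settings shown above to the
# spec file named in $1.
write_canu_cfg() {
    cat > "$1" <<'EOF'
gridEngineResourceOption="--time=4:00:00 -N 1 --cpus-per-task=THREADS --mem-per-cpu=MEMORY"
gridOptionsCNS="--time=8:00:00 -N 1 -p freefat --cpus-per-task=8 --mem-per-cpu=40GB"
EOF
}
```

For example, `cfg=canu-raw.cfg; write_canu_cfg "$cfg"` would give a file that can be passed to Canu with `-s $cfg`.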
Canu was executed as follows:
source /work/LAS/mhufford-lab/shared_dir/minconda/20181213/etc/profile.d/
conda activate denovo_asm
tdate=$(date '+%Y%m%d')
canu \
-p $aname \
-d "canu-${tdate}" \
-s $cfg \
-pacbio-raw $fq
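The command above relies on three shell variables that are not defined in the text. The sketch below shows one way they might be set and checked before launching. Every value is a placeholder and not taken from the original run. Canu also needs a `genomeSize` setting, either in the spec file or on the command line; `135m` here simply mirrors the assumed genome size used for the assembly stats and is an assumption.

```shell
# Placeholder values; none of these names come from the original run.
aname="athaliana-raw"      # assembly prefix (hypothetical)
cfg="canu-raw.cfg"         # spec file (hypothetical name)
fq="subreads-raw.fa"       # reads produced by the cat step above

# Print the Canu command instead of running it, so it can be inspected.
# genomeSize=135m is an assumption, not a value shown in the source.
canu_dry_run() {
    local tdate
    tdate=$(date '+%Y%m%d')
    printf 'canu -p %s -d canu-%s -s %s genomeSize=135m -pacbio-raw %s\n' \
        "$aname" "$tdate" "$cfg" "$fq"
}

# Refuse to launch if the spec file or the reads are missing or empty.
check_inputs() {
    local f
    for f in "$cfg" "$fq"; do
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
    done
}
```

Running `check_inputs && canu_dry_run` prints the full command line only when both inputs are present.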
The stats were calculated using the Assemblathon Script:
Assumed genome size (Mbp) 135.00
Number of scaffolds 555
Total size of scaffolds 124850704
Total scaffold length as percentage of assumed genome size 92.5%
Longest scaffold 4243121
Shortest scaffold 2357
Number of scaffolds > 1K nt 555 100.0%
Number of scaffolds > 10K nt 528 95.1%
Number of scaffolds > 100K nt 219 39.5%
Number of scaffolds > 1M nt 30 5.4%
Number of scaffolds > 10M nt 0 0.0%
Mean scaffold size 224956
Median scaffold size 65852
N50 scaffold length 699106
L50 scaffold count 46
NG50 scaffold length 637854
LG50 scaffold count 54
N50 scaffold - NG50 scaffold length difference 61252
scaffold %A 31.80
scaffold %C 18.18
scaffold %G 18.20
scaffold %T 31.82
scaffold %N 0.00
scaffold %non-ACGTN 0.00
Number of scaffold non-ACGTN nt 0
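As a reminder of what these metrics mean: with scaffolds sorted longest first, N50 is the length of the scaffold at which the running total reaches half of the assembly size, and L50 is the number of scaffolds needed to get there. NG50 and LG50 use half of the assumed genome size instead. The sketch below is an independent illustration using `sort` and `awk`, not the Assemblathon script itself; `n50_stats` is a hypothetical name.

```shell
# Hypothetical helper: read one sequence length per line on stdin and
# print "<N50> <L50>". If a reference size is given as $1, the result
# is "<NG50> <LG50>" instead.
n50_stats() {
    local ref_size="${1:-0}"
    sort -rn | awk -v ref="$ref_size" '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) exit 1
            # Half of the genome size if supplied, else half of the assembly.
            goal = (ref > 0 ? ref : total) / 2
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= goal) { print len[i], i; exit }
            }
            print "NA", "NA"    # assembly shorter than half the reference
        }'
}
```

Lengths can come from the second column of a `samtools faidx` index, for example `cut -f2 assembly.fasta.fai | n50_stats 135000000`, where `assembly.fasta.fai` is a placeholder file name.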
The job summary stats (MaxRSS, runtime, CPU time) were obtained with the standard Slurm command sacct:
sacct --format JobId,JobName,ReqCPUS,ReqMem,ReqNodes,Elapsed,SystemCPU,CPUTime,MaxRSS,MaxVMSize,State,Start,End -u arnstrm
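With grid options set, Canu submits many Slurm jobs, so the peak memory has to be picked out of many `sacct` rows. The sketch below shows one way to do that. It assumes `sacct` is run with `--parsable2 --noheader` so that fields are pipe-delimited, and that `MaxRSS` values carry a `K`, `M`, or `G` suffix; `peak_maxrss_gb` is a hypothetical helper name.

```shell
# Hypothetical helper: read "JobID|MaxRSS" rows on stdin and print the
# job step with the largest MaxRSS, converted to GiB.
# Rows with an empty MaxRSS (job allocation lines) are skipped.
peak_maxrss_gb() {
    awk -F'|' '
        $2 != "" {
            value = $2 + 0                        # numeric part of e.g. "2097152K"
            unit = substr($2, length($2))         # trailing K, M or G
            if (unit == "K") value = value / 1048576
            else if (unit == "M") value = value / 1024
            else if (unit != "G") value = value / 1073741824   # assume plain bytes
            if (value > peak) { peak = value; job = $1 }
        }
        END { if (job != "") printf "%s %.2f\n", job, peak }'
}
```

A possible call is `sacct --parsable2 --noheader --format JobID,MaxRSS -u arnstrm | peak_maxrss_gb`.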
Canu was also used to assemble a filtered dataset: the reads were first processed with SequelTools and then converted to FASTA.
Filtering using SequelTools:
module purge
ml r-devtools samtools python
SequelTools.sh -c scraps.txt -u subreads.txt -t F -C -Z 5000 -n 16 -o runSequelT-F_len
cd runSequelT-F_len
samtools fasta --threads 16 m54113_160913_184949.subSampledSubs.sam > m54113_160913_184949.subSampledSubs.fasta
samtools fasta --threads 16 m54113_160914_092411.subSampledSubs.sam > m54113_160914_092411.subSampledSubs.fasta
cat *.subSampledSubs.fasta > subSampledSubs5kb.fa
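To see what the filter changed, the read-length profile of the raw and filtered FASTA files can be compared. The following is a sketch of such a summary with `awk`, assuming an uncompressed FASTA whose sequences may span several lines. `fasta_length_summary` is a hypothetical name, and this step is not part of the original workflow.

```shell
# Hypothetical helper: print read count, total bases, shortest and
# longest read for the FASTA file given as $1.
fasta_length_summary() {
    awk '
        # Fold the record collected so far into the running totals.
        function record() {
            if (!seen) return
            n++; total += len
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        /^>/ { record(); seen = 1; len = 0; next }   # header starts a new read
        { len += length($0) }                        # sequence line
        END {
            record()
            # %.0f keeps large base totals from overflowing integer formats.
            printf "reads=%d bases=%.0f min=%d max=%d\n", n, total, min, max
        }
    ' "$1"
}
```

Running it once on `subreads-raw.fa` and once on `subSampledSubs5kb.fa` gives two lines that can be compared directly.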
The config file for the filtered dataset was as follows:
gridEngineResourceOption="--time=4:00:00 -N 1 --cpus-per-task=THREADS --mem-per-cpu=MEMORY"
gridOptionsCNS="--time=8:00:00 -N 1 -p freefat --cpus-per-task=8 --mem-per-cpu=40GB"
Canu was executed as follows:
source /work/LAS/mhufford-lab/shared_dir/minconda/20181213/etc/profile.d/
conda activate denovo_asm
tdate=$(date '+%Y%m%d')
canu \
-p $aname \
-d "canu-${tdate}" \
-s $cfg \
-pacbio-raw $fq
The stats were calculated using the Assemblathon Script:
Assumed genome size (Mbp) 135.00
Number of scaffolds 659
Total size of scaffolds 126349435
Total scaffold length as percentage of assumed genome size 93.6%
Longest scaffold 4243141
Shortest scaffold 6520
Number of scaffolds > 1K nt 659 100.0%
Number of scaffolds > 10K nt 649 98.5%
Number of scaffolds > 100K nt 253 38.4%
Number of scaffolds > 1M nt 24 3.6%
Number of scaffolds > 10M nt 0 0.0%
Mean scaffold size 191729
Median scaffold size 62543
N50 scaffold length 576605
L50 scaffold count 60
NG50 scaffold length 493763
LG50 scaffold count 68
N50 scaffold - NG50 scaffold length difference 82842
scaffold %A 31.80
scaffold %C 18.20
scaffold %G 18.22
scaffold %T 31.77
scaffold %N 0.00
scaffold %non-ACGTN 0.00
Number of scaffold non-ACGTN nt 0
The job summary stats (MaxRSS, runtime, CPU time) were obtained with the standard Slurm command sacct:
sacct --format JobId,JobName,ReqCPUS,ReqMem,ReqNodes,Elapsed,SystemCPU,CPUTime,MaxRSS,MaxVMSize,State,Start,End -u arnstrm