update cxxargs and check & fix usage instructions
tmaklin committed Feb 6, 2020
1 parent a2e99a1 commit b488d1b
Showing 3 changed files with 57 additions and 15 deletions.
61 changes: 47 additions & 14 deletions README.md
@@ -1,15 +1,25 @@
# msweep-assembly

mSWEEP genome assembly plugin code.
mSWEEP binning + assembly plugin code.

# Installation
## Dependencies
To run the binning + assembly pipeline, you will need a program that
does pseudoalignment and another program that estimates an assignment
probability matrix for the reads to the alignment targets.

We recommend using [Themisto](https://github.com/jnalanko/themisto)
(v0.1.1 or newer) for pseudoalignment and
[mSWEEP](https://github.com/PROBIC/mSWEEP) (v1.3.2 or newer)
for estimating the probability matrix.

## Compiling from source
### Requirements
- C++11 compliant compiler.
- cmake

### Compilation
Clone the repository (note the --recursive option in git clone)
Clone the repository (note the *--recursive* option in git clone)
```
git clone --recursive https://github.com/PROBIC/msweep-assembly.git
```
@@ -20,42 +30,62 @@ enter the directory and run
> cmake ..
> make
```
This will compile the read_alignment, assign_reads, and build_sample executables in the build/bin/ directory.

This will compile the read_alignment, assign_reads, build_sample, and telescope executables in the build/bin/ directory.

# Usage
Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with [Themisto]()
## Indexing
Build a [Themisto](https://github.com/jnalanko/themisto) index to
align against.
```
pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.txt --rc --temp-dir tmp --n-threads 16 --mem-megas 8192
pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.txt --rc --temp-dir tmp --n-threads 16 --mem-megas 8192
mkdir themisto_index
mkdir themisto_index/tmp
build_index --k 31 --input-file example.fasta --auto-colors --index-dir themisto_index --temp-dir themisto_index/tmp
```

Convert the pseudoalignment to [kallisto]() format using [telescope]()
Align paired-end reads 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with Themisto
```
pseudoalign --index-dir themisto_index --query-file reads_1.fastq.gz --outfile pseudoalignments_1.txt --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192
pseudoalign --index-dir themisto_index --query-file reads_2.fastq.gz --outfile pseudoalignments_2.txt --rc --temp-dir themisto_index/tmp --n-threads 16 --mem-megas 8192
```

Convert the pseudoalignment to
[kallisto](https://github.com/pachterlab/kallisto) format using
[telescope](https://github.com/tmaklin/telescope) (supplied with the msweep-assembly installation).
```
mkdir outfolder
ntargets=$(sort themisto_index/coloring-names.txt | uniq | wc -l)
telescope --n-refs $ntargets -r pseudoalignments_1.txt,pseudoalignments_2.txt -o outfolder --mode intersection
```
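The `ntargets` line above counts the distinct names in the coloring file. A minimal sketch of that counting step, run on a made-up names file (the file contents and the `demo_` variable names are assumptions for illustration only):

```shell
# Hypothetical stand-in for themisto_index/coloring-names.txt,
# with one reference name repeated.
demo_names=$(mktemp)
printf 'ref1\nref2\nref1\nref3\n' > "$demo_names"

# Same pipeline as above: sort, drop adjacent duplicates, count lines.
demo_ntargets=$(sort "$demo_names" | uniq | wc -l)

# sort -u merges the sort and uniq steps and gives the same count.
demo_ntargets_short=$(sort -u "$demo_names" | wc -l)

echo "$demo_ntargets" "$demo_ntargets_short"
```

Both forms count each distinct name once, so either can be used to set `ntargets`.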

Create a fake kallisto-style run_info.json file
Create a fake kallisto-style run_info.json file using the
Themisto_run_info.sh script in the root directory of this project
```
Themisto_run_info.sh $(wc -l outfolder_1.txt) $ntargets > outfolder/run_info.json
Themisto_run_info.sh $(wc -l < pseudoalignments_1.txt) $ntargets > outfolder/run_info.json
```
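The switch to `wc -l < pseudoalignments_1.txt` matters because `wc -l` given a file argument prints the file name after the count, which would pass an extra argument to the script. A small sketch of the difference, using a made-up three-line file in place of the real pseudoalignment output:

```shell
# Hypothetical stand-in for pseudoalignments_1.txt: one line per read.
demo_reads=$(mktemp)
printf 'read1\nread2\nread3\n' > "$demo_reads"

# With a file argument, wc prints the count followed by the file name.
wc -l "$demo_reads"

# Reading from stdin leaves only the count.
n_reads=$(wc -l < "$demo_reads")
echo "$n_reads"
```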

Determine read assignments to equivalence classes from the kallisto
format files
```
read_alignment -e outfolder/outfolder.ec -s outfolder/read-to-ref.txt -o outfolder --write-ecs --themisto --n-refs $ntargets --gzip-output
read_alignment -e outfolder/pseudoalignments.ec -s outfolder/read-to-ref.txt -o outfolder --write-ecs --themisto --n-refs $ntargets --gzip-output
```

Estimate the relative abundances with mSWEEP
Estimate the relative abundances with mSWEEP. The file
reference_grouping.txt should contain the groups that the sequences in
'example.fasta' are assigned to; see the [mSWEEP](https://github.com/PROBIC/mSWEEP) usage instructions for details.
```
mSWEEP -f outfolder -i reference_grouping.txt -o msweep-out --write-probs --gzip-probs
```

Extract the names of the 3 most abundant reference groups
(Optional) Extract the names of the 3 most abundant reference
groups.
```
grep -v "^[#]" msweep-out_abundances.txt | sort -rgk2 | cut -f1 | head -n3 > most_abundant_groups.txt
```
If you use a more refined method, or already know which reference
groups (as specified in the reference_grouping.txt file) you want to
assemble, put their names in a .txt file with one group name per line
instead.
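The extraction pipeline above can be tried on a small file before running it on real output. The sketch below assumes the abundances file has `#`-prefixed header lines followed by tab-separated name and abundance columns, which is what the `grep`, `sort -k2`, and `cut -f1` steps imply; the file contents and the `top_groups` helper are made up for illustration.

```shell
# Hypothetical helper wrapping the pipeline shown above.
top_groups() {
    # $1: abundances file, $2: number of groups to keep
    # Drop '#' header lines, sort by the 2nd column (numeric, descending),
    # keep the name column, and take the first $2 rows.
    grep -v "^[#]" "$1" | sort -rgk2 | cut -f1 | head -n "$2"
}

# Made-up abundances: a header line, then "group<TAB>abundance" rows.
demo_abundances=$(mktemp)
printf '#header line\nA\t0.10\nB\t0.55\nC\t0.05\nD\t0.30\n' > "$demo_abundances"

top_groups "$demo_abundances" 3 > "$demo_abundances.top"
cat "$demo_abundances.top"
```

The resulting file has one group name per line, the same layout expected for a hand-written list of groups.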

Assign reads to the 3 most abundant reference groups based on the estimated probabilities
```
@@ -66,6 +96,9 @@ Construct the binned samples from the original files

```
while read -r sample; do
build_sample -a outfolder/$sample\"\"_reads.txt.gz -o outfolder/$sample -1 reads_1.fastq.gz -2 reads_2.fastq.gz --gzip-output
build_sample -a outfolder/$sample""_reads.txt.gz -o outfolder/$sample -1 reads_1.fastq.gz -2 reads_2.fastq.gz --gzip-output
done < most_abundant_groups.txt
```
This will create the <group name>_1.fastq.gz and <group
name>_2.fastq.gz files in the outfolder, which you can assemble with
your assembler of choice.
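In the loop above, the empty `""` after `$sample` keeps the shell from reading `$sample_reads` as a single variable name. A dry-run sketch of the same loop using the equivalent `${sample}` form follows; it only prints the commands, and the group names and the `print_build_commands` helper are hypothetical.

```shell
# Dry run: print the build_sample commands instead of executing them.
print_build_commands() {
    # $1: file with one group name per line
    while read -r sample; do
        # ${sample} delimits the variable name, so the trailing "_reads"
        # is not parsed as part of it (same effect as $sample"" above).
        echo build_sample -a "outfolder/${sample}_reads.txt.gz" \
            -o "outfolder/${sample}" \
            -1 reads_1.fastq.gz -2 reads_2.fastq.gz --gzip-output
    done < "$1"
}

# Made-up list of group names, one per line.
demo_groups=$(mktemp)
printf 'groupA\ngroupB\n' > "$demo_groups"
print_build_commands "$demo_groups"
```

Checking the printed commands first is a cheap way to catch a misnamed group before the real run.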
9 changes: 9 additions & 0 deletions Themisto_run_info.sh
@@ -0,0 +1,9 @@
echo "{
"n_targets": $2,
"n_bootstraps": 0,
"n_processed": $1,
"kallisto_version": "0.43.1",
"index_version": 10,
"start_time": "Tue Nov 5 16:19:25 2019",
"call": "/proj/temaklin/kallisto/kallisto pseudo -i /wrk/users/temaklin/reference_msweep_preprint_all_removed -o /wrk/users/temaklin/splits/ERR434699 /wrk/users/temaklin/msweep_reads/reads/ERR434699_1.fastq.gz /wrk/users/temaklin/msweep_reads/reads/ERR434699_2.fastq.gz"
}"
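As displayed in this diff, the inner double quotes in the `echo "{ ... }"` call end and reopen the shell string rather than being printed, so the emitted text has unquoted keys and values. Whether the downstream reader needs strictly valid JSON is not stated here. If it does, one option is a here-document, sketched below with a hypothetical `write_run_info` function that keeps only the two variable fields and `n_bootstraps`:

```shell
# Hypothetical alternative, not the committed script: a here-document
# passes the double quotes through to the output unchanged.
write_run_info() {
    # $1: number of processed reads, $2: number of alignment targets
    cat <<EOF
{
    "n_targets": $2,
    "n_bootstraps": 0,
    "n_processed": $1
}
EOF
}

# Example call with made-up counts.
write_run_info 1000 5
```

The argument order matches the script above: processed reads first, number of targets second.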
2 changes: 1 addition & 1 deletion external/cxxargs
Submodule cxxargs updated 1 files
+2 −2 include/cxxargs.hpp
