Build a proof of concept of an RNA-Seq pipeline intended to show Nextflow scripting and reproducibility capabilities.
- Enter the AWS console under this link.
- Services > EC2 > Instances > Launch Instance
- My AMIs > ContraAMI_0.5 (Select)
- Choose an Instance Type: t2.micro
- Skip to "5. Add tags", and click "Add tag":
  ```
  Key: Name
  Value: <your_name>
  ```
- "Review and Launch"
- "Launch"
- Select existing key pair: CONTRA. Confirm you have access to this key.
- View instances. Look at the Name column for your name. Your instance will be pending, then running in a moment.
- Mark your instance and click Connect. Use `ssh -i "CONTRA.pem" ubuntu@host_name` to log into the instance. Note that Amazon suggests the `root` user; change it to `ubuntu`.
- Open the folder with this project in Visual Studio Code.
- In Visual Studio Code install the sftp extension.
- Update 2 fields in `.vscode/sftp.json`: `host` with your EC2 `host_name` and `privateKeyPath` with the path to CONTRA.pem. A sketch of the whole file follows this list.
- Mark all the files, right-click > Upload. All your files should get transferred to the server under `~/contra-nextflow/`.
- From now on, a file will be uploaded upon each save.
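For orientation, here is a minimal sketch of what `.vscode/sftp.json` might look like once both fields are filled in. Only `host` and `privateKeyPath` are prescribed by this guide; the remaining fields and their values are assumptions based on a typical configuration of the sftp extension (JSON does not allow comments, so placeholder values mark what you must change).

```json
{
    "name": "contra-ec2",
    "host": "<your_ec2_host_name>",
    "protocol": "sftp",
    "port": 22,
    "username": "ubuntu",
    "privateKeyPath": "/path/to/CONTRA.pem",
    "remotePath": "/home/ubuntu/contra-nextflow/",
    "uploadOnSave": true
}
```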
Pull the required Docker image with `make pull`, or build one with `make build`.
Check out the git tag `task1-checkpoint` to set your repository at the starting point. You can use:

```
git checkout -b my-solution task1-checkpoint
```

This will create a branch `my-solution` on which you can commit your steps.
- Create a basic `nextflow.config` based on the documentation that:
  - enables Docker by default, otherwise Nextflow will try to execute all processes in your local environment
  - indicates which container to use (`nextflow/rnatoy:latest`)
  - indicates that reports from execution are created by default in `reports/report.html`

  The file is started for you; a sketch follows this list.
- Create `main.nf` based on the Nextflow basic example that takes both [`data/ggal_gut_1.fa`, `data/ggal_gut_2.fa`] and prints each record to standard output in one process. The file is started for you; a possible shape is also sketched after this list.
- Use `make run` to execute the pipeline.
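If you want to check your config against a reference, here is a minimal sketch of the `nextflow.config` described above; the exact layout is an assumption, but the three settings match the requirements.

```groovy
// Minimal sketch of nextflow.config (assumed layout).
docker.enabled    = true                      // run every process in Docker
process.container = 'nextflow/rnatoy:latest'  // image used by all processes

report {
    enabled = true
    file    = 'reports/report.html'           // execution report location
}
```

And one possible shape of `main.nf`, assuming the pre-DSL2 syntax used throughout this workshop; the process and variable names here are illustrative, not prescribed.

```groovy
// Illustrative main.nf sketch (DSL1-style syntax).
sequences = Channel.fromPath('data/ggal_gut_{1,2}.fa')  // emits both files

process printRecords {
    echo true            // forward the process stdout to your terminal

    input:
    file fasta from sequences

    script:
    """
    cat ${fasta}
    """
}
```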
After running the pipeline, you should see in the terminal loads of DNA lines similar to those below.
```
...
GGCGTAGCCACCAACTGCTTGACGACTTCATTTCCAAAAAGCAGGATTTAATGAGTCTGGAGCACAAGTCTTATGAGGAGCAGCTGAGGGAACTGGGATTGCTTA
GGTTGGCCTCTTTTCCCACATAACTAGCAGTAGGACTAGAGGGGATGGCCTCAGTTTCGCGGCAGGGAAGATTCAGGTTGGGTGTTAGGAAAAGTTTCTCTGAAA
GAGGAGGGTCAGGCACTGGAATGGGCTGCCCAGGGTGGTGGAGTCACCATCCCTGTTGGGGATCAAGAAACATTTCACTGTGGTACTGAGGGATGTGGTTTAGTG
GGGGAGAGTCGGGTTGGGTGTTAGGAAAAGTTTCTCTGAAAGGGATGGTCAGGCACTGGAATGGGCTGCCCAGGGTGGTGGAGTCACCATCCCTGTTGGGGATCA
GGATGGCCTCAGTTTCGCGGCAGGGAAGATTCAGGTTGGGTGTTAGGAAAAGTTTCTCTGAAAGGGATGGTCAGGCACTGGAATGGGCTGCCCAGGGTGGTGGAG
...
```
If you have trouble achieving this effect, check the solution by checking out the starting point for task 2: check out `task2-checkpoint` and execute `make run`. If you achieved this, you can just progress to the next task.
- Start building the RnaSeq pipeline by modifying `main.nf` to have 1 stage called `buildIndex`. For the provided genome `/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa`, build the index with the following bowtie2 command: `bowtie2-build ${genome} genome.index`. A sketch of the process follows the expected output below.
- The result should be published in the `results` folder. See the reference for the `publishDir` directive.
- Run the pipeline. You should see 6 files appear in the `results` folder:
```
genome.index.1.bt2
genome.index.2.bt2
genome.index.3.bt2
genome.index.4.bt2
genome.index.rev.1.bt2
genome.index.rev.2.bt2
```
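If you get stuck, one possible shape for this stage is sketched below (DSL1-style syntax; apart from the `bowtie2-build` command and the published folder, the names are illustrative assumptions).

```groovy
// Illustrative buildIndex sketch.
genome_file = file('/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa')

process buildIndex {
    publishDir 'results'                    // expose the index files

    input:
    file genome from genome_file

    output:
    file 'genome.index*' into genome_index  // consumed by later stages

    """
    bowtie2-build ${genome} genome.index
    """
}
```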
If you have trouble achieving this effect, check the solution by checking out the starting point for the next task: check out `task3-checkpoint` and execute `make run`. If you achieved this, you can just progress to the next task.
In this task you add another stage to your pipeline, called `mapping`. As a result you should have a 2-stage pipeline that first indexes the genome and then maps the read pairs against the indexed genome, obtaining BAM files.
- Create a channel that contains read pairs (i.e. pairs of fastq files) such as (`ggal_gut_1.fq`, `ggal_gut_2.fq`). See the documentation for the Channel factory and `fromFilePairs`.
- Create a `mapping` process. A sketch of the whole stage follows the expected output below.
- Define 2 inputs:
  - accept the genome index from the previous stage
  - accept reads from the read pairs channel with something like: `set pair_id, file(reads) from read_pairs`
- Add the command to be executed: `tophat2 genome.index ${reads}`
- `tophat2` by default creates results in `tophat_out/`. We are interested in `tophat_out/accepted_hits.bam`. Rename this file by using `pair_id` to `$pair_id.bam`.
- The result should be published in the `results` folder.
- Run the pipeline. You should see 2 files appear in the `results/tophat_out` folder:
```
ggal_gut.bam
ggal_liver.bam
```
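A possible shape for the stage, under the same assumptions as before (the file-pair glob and variable names are illustrative):

```groovy
// Illustrative mapping sketch.
read_pairs = Channel.fromFilePairs('data/*_{1,2}.fq')  // (pair_id, [r1, r2])

process mapping {
    publishDir 'results'

    input:
    file index from genome_index                // index files from buildIndex
    set pair_id, file(reads) from read_pairs

    output:
    file "tophat_out/${pair_id}.bam"

    """
    tophat2 genome.index ${reads}
    mv tophat_out/accepted_hits.bam tophat_out/${pair_id}.bam
    """
}
```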
If you have trouble achieving this effect, check the solution by checking out the starting point for the next task: check out `task4-checkpoint` and execute `make run`. If you achieved this, you can just progress to the next task.
In this task you add the last stage to your pipeline, called `makeTranscript`. As a result you should have a 3-stage pipeline that takes the genome and produces transcripts.
- Modify `mapping` to no longer rename the `tophat_out/accepted_hits.bam` output.
- Modify `mapping` to construct a (`pair_id`, `bam_file`) tuple and push it to a `bam_files` channel. Refer to the set operator.
- Create a `makeTranscript` process. A sketch follows the expected output below.
- Construct the input to accept `pair_id` and `bam_file` from `bam_files` in the same way as it was created.
- Run the `cufflinks` tool on each `bam_file` without any additional arguments.
- Rename the resulting `transcripts.gtf` to `transcript_${pair_id}.gtf`.
- The result should be published in the `results` folder.
- Run the pipeline. You should see 2 files appear in the `results` folder:
```
transcript_ggal_gut.gtf
transcript_ggal_liver.gtf
```
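One possible shape, with the changed `mapping` output shown as a comment (illustrative, same assumptions as before):

```groovy
// In mapping, the output might become something like:
//   output:
//   set pair_id, file('tophat_out/accepted_hits.bam') into bam_files

// Illustrative makeTranscript sketch.
process makeTranscript {
    publishDir 'results'

    input:
    set pair_id, file(bam_file) from bam_files  // same shape as it was created

    output:
    file "transcript_${pair_id}.gtf"

    """
    cufflinks ${bam_file}
    mv transcripts.gtf transcript_${pair_id}.gtf
    """
}
```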
If you have trouble achieving this effect, check the solution by checking out the starting point for the next task: check out `task5-checkpoint` and execute `make run`. If you achieved this, you can just progress to the next task.
This task is additional, for eager participants. It's all about refining your outputs and communication with the user.
- Tag the processes that run in parallel to display which read pair they are processing. See the documentation.
- Display a message at the end of the workflow about whether it was successful or not. See the documentation.
- Investigate the report that is being generated on each run. See the Tracing and visualisation section of the documentation to see what other reports can be generated. Try generating them.
- Limit the memory of the `mapping` process to 2 MB with a process selector, and define retries with higher memory so the pipeline passes, using dynamic computing resources. Sketches for these items follow the list.
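Hedged sketches for these items: the directives (`tag`, `workflow.onComplete`, `errorStrategy`, `maxRetries`, dynamic `memory`) are standard Nextflow, while the concrete messages and values are illustrative. First, a fragment for `main.nf`:

```groovy
// Label parallel tasks and report on workflow completion (fragment).
process mapping {
    tag "$pair_id"          // shows which read pair each task is processing
    // ... rest of the process as before ...
}

workflow.onComplete {
    println(workflow.success ? 'Pipeline completed successfully!'
                             : 'Oops, something went wrong.')
}
```

And a possible addition to `nextflow.config` for the memory limit with dynamic retries:

```groovy
// Constrain mapping's memory; on failure, retry with more.
process {
    withName: mapping {
        errorStrategy = 'retry'
        maxRetries    = 3
        // 2 MB on the first attempt, then more on each retry (illustrative values)
        memory        = { task.attempt == 1 ? 2.MB : 2.GB * (task.attempt - 1) }
    }
}
```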
A partial solution can be found by checking out the `final-solution-checkpoint` tag.