forked from mrossello/RNAseqTHS
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathtutorial.txt
executable file
·469 lines (314 loc) · 46 KB
/
tutorial.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
TrackHub Workflow: Create trackhubs for a single species
********************************************************
Note: this guide was set up using TCSH shell. Bash equivalents look like this:
setenv PERL5LIB ${PERL5LIB}:${RNASEQ}/src/BioPerl-1.6.924
=
export PERL5LIB="${PERL5LIB}:${RNASEQ}/src/BioPerl-1.6.924"
set stand = (${RNASEQ}/ensembl-hive/scripts/standaloneJob.pl)
=
stand="${RNASEQ}/ensembl-hive/scripts/standaloneJob.pl"
Part 1: Set up eHive, RNAseqTHS and Ensembl perl API
====================================================
Clone repo from:
https://github.com/mrossello/RNAseqTHS.git
Directory is called 'RNAseqTHS'
for convenience make the path into a variable:
setenv RNASEQ "/homes/mrosello/EN/THP_out/RNAseqTHS"
Next add it to your PERL5LIB variable so that Perl can find the modules in the THP directory.
setenv PERL5LIB "${RNASEQ}:${PERL5LIB}"
You need eHive installed:
https://ensembl-hive.readthedocs.io/en/version-2.5/quickstart/install.html
You also need the the eHive dependencies installed unless you are using Ensembl's common Perl build which is here:
/nfs/software/ensembl/RHEL7-JUL2017-core2/plenv/shims/perl
Set up this perl environment (be aware you may already be doing this on start up) with
. /nfs/software/ensembl/RHEL7-JUL2017-core2/envs/plenv.sh
For alternative method see later ...
For the purpose of this tutorial I have installed eHive in $RNASEQ directory so I have 'ensembl-hive' directory there too:
> ls $RNASEQ/ensembl-hive
you may have Ensembl perl API already downloaded but if not you can put it in RNAseqTHS. Follow these instructions
https://www.ensembl.org/info/docs/api/api_installation.html
when you get to step 3 you can do:
setenv PERL5LIB ${PERL5LIB}:${RNASEQ}/src/BioPerl-1.6.924
setenv PERL5LIB ${PERL5LIB}:${RNASEQ}/src/ensembl/modules
setenv PERL5LIB ${PERL5LIB}:${RNASEQ}/src/ensembl-compara/modules
setenv PERL5LIB ${PERL5LIB}:${RNASEQ}/src/ensembl-variation/modules
setenv PERL5LIB ${PERL5LIB}:${RNASEQ}/src/ensembl-funcgen/modules
if you already have Ensembl perl API elsewhere makesure it is added to your PERL5LIB.
Part 2: Set up eHive Scripts
============================
runnables can be run in stand alonemode using the script here:
${RNASEQ}/ensembl-hive/scripts/standaloneJob.pl
The first line shows the Perl environment:
>head -n 1 ${RNASEQ}/ensembl-hive/scripts/standaloneJob.pl
#!/usr/bin/env perl
If you didn't set up your Perl environment earlier you can change these eHive scripts like this:
#!/usr/bin/env /nfs/software/ensembl/RHEL7-JUL2017-core2/plenv/shims/perl
Add the standalone script as a variable:
set stand = (${RNASEQ}/ensembl-hive/scripts/standaloneJob.pl)
Other scripts we will be using a lot are worth adding as variables:
set seed = (${RNASEQ}/ensembl-hive/scripts/seed_pipeline.pl)
set runloop = (${RNASEQ}/ensembl-hive/scripts/beekeeper.pl -loop -sleep 0.1)
set runloopd = (${RNASEQ}/ensembl-hive/scripts/beekeeper.pl -loop)
set initp = (${RNASEQ}/ensembl-hive/scripts/init_pipeline.pl)
Instead of the above you can also put
${RNASEQ}/ensembl-hive/scripts/
In your PATH.
eHive docs recommends this as an alternative.
$runloop is a faster version of $runloopd in that it loops every 10 seconds as opposed to default value of every minute.
Part 3: Set up config file and Databases
========================================
You will find a config file in the THP directory called config.pl
The library needed to use this file, and its dependencies, are relatively small so I have included the installed version in RNAseqTHS. It is in the 'conf' directory and it should be added to PERL5LIB:
setenv PERL5LIB "${RNASEQ}/conf:${PERL5LIB}"
The installation is good for Ensembl's common Perl build (/nfs/software/ensembl/RHEL7-JUL2017-core2/plenv/shims/perl) which is v5.14.4 at time of writing. If it has been updated or you are using a different you can install the config library yourself from here: https://metacpan.org/pod/Config::File
eHive needs a mysql database to operate.
RNAseqTHS also needs a mysql database to operate.
Ensembl offers its internal users several options for mysql hosts.
For this eHive pipeline I created 2 databases, but you can remove these/use your own. Pipeline set ups do not stick around very long and each of the stages below begin with a fresh reset of the database.
mysql://ensrw:scr1b3hive@mysql-ens-hive-prod-2:4411/mross_hive_work
mysql://ensrw:scr1b3hive@mysql-ens-hive-prod-2:4411/mross_hive_work_2
You can log into these DBs using common ensembl mysql script (run it at comand line):
/nfs/software/ensembl/mysql-cmds/ensembl/bin/mysql-ens-hive-prod-2
then, for mross_hive_work:
> use mross_hive_work
To visualise the pipeline loaded to this database on a web browser go to:
http://guihive.ebi.ac.uk:8080/
in the box paste:
mysql://ensrw:scr1b3hive@mysql-ens-hive-prod-2:4411/mross_hive_work
password (as for all DBs at mysql-ens-hive-prod-2 is scr1b3hive.
Add the location of this DB to a variable. You will need it later:
> setenv EHIVE_URL mysql://ensrw:scr1b3hive@mysql-ens-hive-prod-2:4411/mross_hive_work
The process is identical for mross_hive_work_2. Only the database name has changed.
As mentioned, RNAseqTHS needs a database too.
I have used one of the plants servers.
It is 1 of the Ensembl collection of servers so the username and password are the same as above
--host=mysql-ens-plants-prod-2 --port=4208 --user=ensrw
and created 'TrackHubPipeline' db.
You can access in write mode by runing shell script:
/nfs/software/ensembl/mysql-cmds/ensembl/bin/mysql-ens-plants-prod-2-ensrw
> use TrackHubPipeline;
You can use a different host and DB. You will need to add the connection details to the config.pl file:
DB[name] = TrackHubPipeline
DB[host] = mysql-ens-plants-prod-2
DB[port] = 4208
DB[user] = ensrw
DB[pw] = scr1b3d3
This database needs to be set up with some tables. The tables are outlined in docs_01.txt and docs_02.txt. Simply run all the commands to create the required tables.
Part 4: Test and Get Started
============================
There is a test module called Test.pm which tries to use the config file and database connection to selects from a table called AERUNS (which needs to exist). This will help discover if your environment is set up.
Run it in standalone mode like this:
$stand THP::Test
Step 1: Download details of cram files that have already been submitted to the ENA
----------------------------------------------------------------------------------
**Module: THP::EnaStart**
First choose a piperun id. This is simply a way to identify this series of activities. Most tables in the database have a 'piperun' column to identify each row. For example if we start with a piperun id of 1 then all activities downstream of these earlier events will work on data that has been generated for piperun 1. It is not necessary to pay much attention to this parameter except to know that it can enable some advanced usage if you choose to run fragments of pipeline at different times and with different filters. A more basic usage is to use a piperun of 1 today and if you run this step again in a month's time you can (if you want) upgrade to 2. The data type of piperun is 'TINYINT' so apply a value between 0 and 255.
The first step is to get all the details of cram files that are already submitted to the ENA. Crams in the past will have been submitted to a specific ENA study so you need to add this in the config file. If you have never submitted to ENA before you can create a new 'project' on their website. It is also fine to use a study someone else has already created (it doesn't matter if you have a different account to the study id)
Add the study to the config file. For plants group it is currently:
enastudy = ERP014374
DB table CRAMS is required in TrackHubPipeline
THP::EnaStart module also uses the storage parameter in the config file. It uses this location to download a TSV file which it then subsequently parses in the CRAMS table in the DB. Provide a location:
storage = /nfs/production/panda/ensemblgenomes/development/mrossello/thp
A file will appear that looks like this:
ERP014374_crams_05_Sep_19.txt
You can delete this files after the job is finished.
run in standalone mode:
$stand THP::EnaStart -input_id "{ 'PIPERUN' => 1 }"
Some warnings may be printed out if some of the rows in the TSV can not be read as expected. You can ignore these as most will be old crams that are no longer used in track hubs any longer.
example of warning:
"can not obtain run id from column 'analysis_title' line 19983 (+ 1 incl header) in /nfs/production/panda/ensemblgenomes/development/mrossello/thp/ERP014374_crams_05_Sep_19.txt"
Downstream we will work with md5sums instead of run_ids so these warnings do not prevent anything from working. Files archived with the ENA will remain indefinitely but track hubs are only updated with the latest. Therefore you may want to create a new ENA study and work with it instead, when the original study is getting cluttered with old cram files.
You can check table (first 10 rows for example) afterwards:
> select * from CRAMS limit 10;
See part 3 above for connecting to the mysql database (TrackHubPipeline in this case but you may have named another)
Step 2: Get Details of all Species that ATLAS are aligning to
-------------------------------------------------------------
**Module THP::SpeciesFact**
Expression Atlas look for read data in the ENA and align it via RNASeq-er tool (https://www.ebi.ac.uk/fg/rnaseq/api/) to their listed assemblies. We need a list of these species/assemblies because downstream we will look for available crams for each of the species.
Put the kingdom that you want the species list for in the config file (options: ensembl, plants, metazoa, fungi, protists, wbps):
division = plants
You need table ORGS in your DB (TrackHubPipeline in this case)
run in standalone mode:
$stand THP::SpeciesFact -input_id "{ 'PIPERUN' => 1 }"
you can now check the ORGS table:
> select distinct reference_organism from ORGS where piperun = 1;
This tutorial will focus on one reference organism for clarity, so pick one from the above mysql output. For this tutorial we will use 'theobroma_cacao' (one from the plants division)
Step 3: Get Details of all Crams available for given reference organism
-----------------------------------------------------------------------
**Module THP::FindCrams**
This module works at the reference organism level and so will be multiplied into a fan when used as part of an eHive analysis. For now we will run it in standalone mode and pass it a specific organism name (from the ORGS table).
The module uses parameter AEGET[runsXorg] in the config file.
AEGET[runsXorg] = https://www.ebi.ac.uk/fg/rnaseq/api/json/$quality/getRunsByOrganism/
this is the Atlas/RNASeq-er API endpoint. It simply appends the organism name to it, parses the JSON output and puts it into a DB table called AERUNS. Therefore make sure your DB (TrackHubPipeline in this case) has this table.
This module also uses 'skipped' parameter in the config file. This is to deposit details about any cram that is missed out (one file per missed cram). It could be that there is no cram file available or that the md5 sum file could not be found. A few missing crams does not break the pipeline and often will get picked up next time it is run.
skipped = /nfs/production/panda/ensemblgenomes/development/mrossello/thp/skipped
You may want to clear this folder out every so often.
run in standalone mode:
$stand THP::FindCrams -input_id "{ 'PIPERUN' => 1, 'organism' => 'theobroma_cacao' }"
Check the cram files that have been put into AERUNS. For example the first 10:
> select * from AERUNS where piperun = 1 limit 10;
Collected into the AERUNS table from the Atlas/RNASeq-er API are useful fields such as the location of the cram, the ENA study that the source ENA run was part of, the assembly name that it is aligned to, the ENA sample of the source ENA run.
Check how many cram files have been found for your chosen organism:
> select count(*) from AERUNS where piperun = 1;
If the result is quite large (> 100) you may want to select another organism so that you can progress though this tutorial more quickly.
The md5sum is also available but this is not collected from the API. Instead the module tries to download and parse the md5 file that it expects to be in the same location as the cram file but with an '.md5' extension instead of the '.cram' extension. This is an important column because in THP::FindFinished (below) we will check the CRAMS table for each md5sum and if there is a match then we know this cram file is already submitted to ENA and as a public ftp location.
Step 4: Check if any Crams are already submitted to ENA
-------------------------------------------------------
**Module THP::FindFinished**
The next stage of the pipeline is to submit the crams to the ENA. Some of these crams might have been submitted previously so with this module we check the md5s in AERUNS table with the md5s in the CRAMS table. Matches get switched to 'finished' (a boolean column in AERUNS table). Downstream at the submission stage any cram marled 'finished' will not get submitted.
$stand THP::FindFinished -input_id "{ 'PIPERUN' => 1 }"
This will take a while if the ENA study containing the submitted crams is large. At time of writing it took 2 hours to compare 121566 rows in CRAMS table with 79709 rows in AERUNS, 20748 matches (first activity in nearly a year so backlog expected). You can start a new study any time in the future but it will create a large backlog in the first instance because all crams will be seen as new and they will all need submitting.
Passing the 'PIPERUN' parameter is useless in this context because it is useful to mark any row in AERUNS as 'finished' if it has definitely been submitted to ENA already. So it is actually set to ignore this parameter! You can force it to only flag finished crams from the specific piperun with an additional parameter set to off/0: 'ignore_pipe' => 0
The reason it still takes a PIPERUN parameter is because when it is part of an eHive analysis it can relay the value down branch 1 (if it is not already the end point of the analysis). But when running in standalone mode the parameter can be omitted:
$stand THP::FindFinished
Step 1 to 4 in an eHive analysis work flow
------------------------------------------
**Analysis THP::EnaAeFindFlag_conf**
This is an eHive analysis, meaning that it combines all modules in steps 1 to 4 together into 1 work flow. First it grabs existing already-submitted crams in the ENA, then it finds the available species at Expression Atlas, then comes the first fan, where each organism is used to find the available crams at Expresison Atlas. Finally md5s are compared and already-submitted crams are marked as such.
using the shell variables we set up earlier initiatwe the pipeline:
>$initp THP::EnaAeFindFlag_conf -pipeline_url $EHIVE_URL -hive_force_init 1
then seed it from the beginning:
>$seed -url $EHIVE_URL -logic_name ena_start -input_id '{ "PIPERUN" => 1, 'orgs' => ['theobroma_cacao'] }'
To view, see errors and make alterations to the pipeline use a browser:
http://guihive.ebi.ac.uk:8080/
Use $EHIVE_URL in the connect box.
If you are using the eHive db mentioned above it will look like this:
mysql://ensrw:scr1b3hive@mysql-ens-hive-prod-2:4411/mross_hive_work
Finally run it:
$runloopd -url $EHIVE_URL
Notice how the first job (and the rest of the jobs except THP::FindCrams) take a list parameter for 'orgs'. This is because you can provide multiple organisms for most jobs. Module THP::FindCrams on the other hand is fanned from the THP::SpeciesFact 'factory' and it splits the list up so that each THP::FindCrams gets a single 'organism' string.
For full use of this analysis you will want to run through all the organisms in the ORGS table because then you will have all available crams from Expression Atlas loaded into AERUNS table (for your division) and from there you can make more specific selections for the rest of the pipeline in terms of specific organisms, specific studies, and even specific cram files (may be helpful for debugging).
In conclusion, it is recommended to load all crams into AERUNS at this stage even if you want to narrow down later to specific organisms/studies/runs.
Part 2: Submit the new crams to the ENA
=======================================
**Analysis THP::CramUpSubmit_conf**
This eHive analysis begins with THP::CramUpFact.
THP::CramUpFact can also take a list of organisms using the 'orgs' parameter so we can specify only theobroma_cacao again in this case. Alternatively, most of the factory runnables can take a list of studies (parameter 'CHOOSE_STUDIES') or even a list of individual crams/runs for further narrowing down. Only provide 1 filter type, you can not combine for instance 'orgs' with 'CHOOSE_STUDIES'. If you do the runnables will default to the study list and ignore the 'orgs' list.
THP::CramUpFact finds the cram files (from table AERUNS) corresponding to the organism provided and creates THP::CramUp jobs for each one. THP::CramUpFact takes parameter 'FIND_FINISHED' but this is set to off/0 by default. This does the same job as THP::FindFinished and is useful if you have repopulated ENASUBS table to get the latest submissions in it. This cross checking step will prevent the upload and submission of files that are already submitted to the ENA on a previous occasion.
THP::CramUp (one job per cram file) downloads the cram file from Expression Atlas and checks the md5sum to confirm the integrity of the transfer. It gets both the location and the md5sum from when AERUNS was populated upstream. The file will be removed if all goes well at the end of the runnable but a directory is required to put all the downloads in. This directory should be allowed to increase in size quite significantly for obvious reasons. Ensure to specify the directory to put the cram downloads in using the config.pl file in the variable called 'storage'.
Next, THP::CramUp logs in to ENA's ftp server which is account based. You can easily get an ENA submission account (go to their website) and the same account is used in the next stage of the pipeline. Enter your account details into the config file under variables ENAUSER.ftphost webin.ebi.ac.uk at time of writing and likely to stay the same), ENAUSER.name (will look something like Webin-XXX) and ENAUSER.pw (the password used when creating an account). ENA has a test server for submissions but only one ftp server so in all cases you will be uploading to ENAUSER.ftphost. THP::CramUp uploads the cram file it just downloaded to the ENA ftp server. Your account gets its own subdirectory on the ENA ftp server and the file will go in there.
Next, THP::CramUp uploads the file to ENA ftp server. It is not possible to calculate the md5sum of a file on a remote ftp server but THP::CramUp makes use of a parameter called 'LONGCHECK' which is turned on/1 by default. When the parameter is on, THP::CramUp will download the file back from the ENA ftp server (back into your storage directory, with a slightly different name). It them calculates the md5 on the local file and tries to match it again with the original md collected from Expression Atlas. If there is no match the job will fail. eHive retries failed jobs by default (about 3 times) but the rest of the pipeline can carry on even when some uploads have failed. You can start the pipeline again and give the flag -forgive_failed_jobs to beekeeper.pl. I tend to add it as a shell variable for ease:
> set runloop4give = ($runloop -forgive_failed_jobs)
You can turn 'LONGCHECK' off in the main THP::CramUpSubmit_conf file in sub pipeline_wide_parameters. This will help the pipeline to run more quickly because it removes 1 transfer and 1 md5 calculation and both these events can take a long time on large files. If the file is corrupted during one of the transfers but it still gets submitted (later) ('LONGCHECK' is off) the ENA will send an email with a list of files that it could did not pass their md5 check. You can re upload the files to fix this but currently this will need doing manually, or, next time you run the pipeline from the beginning (re-running THP::FindCrams to reset the flags) it will not find the file in the ENA archive and it will try to upload it again. If successful it will go on to submit the file and the original submission becomes redundant. So you can email ENA and ask them to cancel the submissions that did not pass their md5 test because they will get submitted next time. However, having 'LONGCHECK' set to on/1 will reduce the error emails from the ENA significantly so it is recommended (and is the default).
If successful THP::CramUp removes its cram file from your local storage directory (the one specified in the config file) because it is now uploaded to the ftp server. It is worth periodically checking between pipeline runs if there are any cram files left in local storage because they are only removed when the job is successful. If successful THP::CramUp also sets the 'uploaded' column in AERUNS tables to TRUE/1 for the uploaded cram row. This means that if you run the pipeline again with the same arguments it will attempt the whole process again on any crams that were not successful before but will not repeat the uploaded ones. The uploaded flag and all the other flags in AERUNS table including 'piperun' get reset when THP::FindCrams is run (see above) so once you have populated the DB table AERUNS you don't need to populate it again until you have uploaded, submitted and built the trackhubs for all the organisms or studies of your choice- and you can build these trackhubs across multiple separate runs and stages of the pipeline (make sure the 'piperun' id stays the same).
When all THP::CramUp jobs are done, parameters are sent down branch 1 to a THP::SubmitFact job. This starts a new logical step - to submit the files that have been uploaded. There is a shorter analysis pipeline called THP::CramUpStart_conf which only does download and upload part already discussed. If you want to isolate that part of the pipeline then you can use this instead. Submission to the ENA is mandatory - at the moment the files are just sitting in your private ftp area on the ENA FTP servers. When the files get submitted they get moved out of your private account based ftp area and join a queue for processing. Processing involves checking the md5 sum against the one that it should be (provided at submission time- from AERUNS table) and indexing the cram file to create a '.crai'. At submission time ENA sends back an XML with a 'success' attribute of 'TRUE' or 'FALSE' and if 'TRUE' it also includes an accession for that cram file which looks like 'ERZ123456'. Confusingly, this accession is called an 'analysis' id by ENA (because it is not a run or a sample or a study as per ENAs submission objects schema).
THP::SubmitFact is a simply factory. It finds all the cram files that are 'uploaded' (but not 'submitted') according to the AERUNS table (and according to the existing filter parameters) and for each one it triggers a THP::SubCram job. THP::SubCram submits the file that is already uploaded in the previous step by creating an 'analysis' xml with the details of the file and what study it should be attached to. This is the same study that THP::EnaStart uses to see what is already submitted. It is in the config file under variable 'enastudy'. You do not have to have submitted the study under your own account, you can add to it from any account. The same account user and password as logging in to the ftp server are required to post a submission XML. When creating your ENA account you supplied a 'center name'. This is not necessary for posting the submission (ENA knows it because it's part of your account) BUT I was using a 'broker' ENA submission account because I have submitted genomes on behalf of 3rd parties in the past so I must provide the center name for whom I am submitting for (config.center) or the submission fails. If you have a normal submission account you can leave this blank in the config file). Additionally you must add ENAUSER.posturl to the config file which is where XML is posted to. ENA also has a test server for submissions. This goes in ENAUSER.testurl. It behaves the same way as the production server but upon success of the submission nothing happens - the analysis id accession is fake and the file will stay in the ftp area. However, THP::SubCram will still flag the success in the database table so it is not recommended to do tests in this way. But you can, by switching the variable ENAUSER.test in the config file to 1 (on). At time of writing ENA's production submission server was at 'https://www.ebi.ac.uk/ena/submit/drop-box/submit/' and this is unlikely to change.
THP::SubCram takes parameter 'ACTION' which references a term in the content section of the POST form. This is the prefered way to do test submissions (instead of the ENAUSER.test boolean as mentioned). The 2 options are 'ADD', which is the default, and 'VALIDATE' which you can apply in the THP::CramUpSubmit_conf file if you want to override the default. If 'VALIDATE' is used ENA server will validate the XML only, it will not create an analysis id accession nor will the file be removed from the private ftp area. The advantage of testing like this is that THP::SubCram will not mark the cram as 'submitted' in AERUNS table so if you run again with 'ADD' your test cases won't get ignored. 'ADD' action is the default action which will attempt to submit properly. The server response is in XML format too and THP::SubCram will look for the "success" result and the analysis id accession. There is also submission id accession. If THP::SubCram finds the three values mentioned it marks the cram as 'submitted' in the AERUNS table. It also adds the analysis and submission ids to the table 'ENASUBS'. This table is for reference - it is not used in any way by the pipeline code. More important to the pipeline for tracking are the columns 'uploaded','submitted' and 'finished'. 'submitted' is now switched on by THP::SubCram leaving only 'finished'. 'finished can only be set to TRUE/1 when THP::FindFinished has found the cram file under the ENA study (config.enastudy) that you have added the cram file using THP::SubCram. It will take a few days for it to appear under the ENA study because after the submission it gets queued and processed. When done it will appear under config.enastudy and it will have a public ftp address. To get it from the ENA API output into the CRAMS table so that THP::FindFinished can cross check the md5s you need to repeat part 4, step 1 (THP::EnaStart) and part 4 step 3 (FindFinished.pm) above. Or, when you start the whole process again, this will get done (for instance, eHive analysis module THP::EnaAeFindFlag_conf). You do not need to wait for the files to processed by ENA and recieve a public ftp location before continuing with this workflow. The runnables in the next step can create track hubs with the original ftp locations at Expression Atlas and it will replace these urls with ENA locations as they become available during future runs of the pipeline.
Just as there is a smaller pipeline analysis available for the upload part of THP::CramUpSubmit_conf then there is also a smaller pipeline analysis that just deals with the submission part. It is called THP::SubCramStart_conf. You can use this to isolate the second part of this step in the tutorial.
If THP::SubCram fails (for example the XML response could be a success="false") then the job will fail. The analysis XML that it writes to post to ENA submission server is stored in config.storage directory and is only deleted if the submission does not fail. This is useful because THP::SubCram will also print, as part of its error message, the linux cURL command that is equivalent to the POST form it tried to use to post the XML. You can copy and paste this straight into the command line and get ENA's response message directly, for troubleshooting and if necessary for sending to ENA help desk so that they can investigate the problem. When part of a looping eHive pipeline the individual error messages do not get printed to the screen but you can check the 'last_error' column in guiHive or you can run the pipeline one worker at a time with runWorker.pl which is recommended for debugging. In this case you can make use of the smaller pipeline THP::SubCramStart_conf and pass it a list of offending biorep ids with 'CHOOSE_RUNS' parameter.
to run (specifying species theobroma_cacao in this case):
> $initp THP::CramUpSubmit_conf -pipeline_url $EHIVE_URL -hive_force_init 1
> $seed -url $EHIVE_URL -logic_name cramupsub -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
> $runloopd -url $EHIVE_URL
The upload and download and md5 calculations can take a long time so you can also send the job to the lsf cluster and use guiHive to keep track (instead of screen output). There is also the linux 'screen' function.
> bsub $runloopd -url $EHIVE_URL
Check what changes in the database have happened
------------------------------------------------
successfully uploaded:
> select biorep_id, md5_sum from AERUNS where piperun = 1 and uploaded;
successfully submitted:
> select biorep_id, md5_sum from AERUNS where piperun = 1 and submitted;
> select * from ENASUBS where piperun = 1;
Extra tips
----------
Alternatively, run the 1st half (upload files to ftp site), and then the 2nd half (submit files to ENA) sequentially. This is what THP::CramUpSubmit_conf does anyway. There are more logical ways to organise these pipelines but for version 1 I opted for simple and compartmentalised. They will be quite easy to rearrange for your own purposes after you have familiarity with all the runnables:
> $initp THP::CramUpStart_conf -pipeline_url $EHIVE_URL -hive_force_init 1
> $seed -url $EHIVE_URL -logic_name -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
> bsub $runloopd -url $EHIVE_URL
> $initp THP::SubCramStart_conf -pipeline_url $EHIVE_URL -hive_force_init 1
> $seed -url $EHIVE_URL -logic_name cramsub_start -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
> $runloopd -url $EHIVE_URL
To run the second half and focus on just 2 crams (that may be failing their submission) use runWorker.pl to run single worker at a time. When the first runnable is done, run runWorker.pl again for the next runnable:
> set one_bee = (${RNASEQ}/ensembl-hive/scripts/runWorker.pl)
> $initp THP::SubCramStart_conf -pipeline_url $EHIVE_URL -hive_force_init 1
> $seed -url $EHIVE_URL -logic_name cramsub_start -input_id '{ "PIPERUN" => 1, "CHOOSE_RUNS" => ["SRR851888","SRR851128"] }'
> $one_bee -url $EHIVE_URL
#and again ...
> $one_bee -url $EHIVE_URL
Part 3: Get Metadata from ENA Studies and Samples and Create Track Hubs
=======================================================================
**Analysis THP::GetMetaData_conf**
1. Analysis THP::GetMetaData_conf starts with THP::StudyMetaFact which creates a list of studies based on filter criteria (for example we are specifying 1 organism) to send down branch 2 to THP::GetStudyMet.
2. Each THP::GetStudyMet job gets 1 study id. It uses ENAGET[view] URL in the config file to download the study metadata from ENA in XML format and puts it into DB table STUDY. THP::GetStudyMet is also a factory, so it then creates a list of affiliated samples and sends it down its own branch 2 to THP::GetSampMet.
3. Each THP::GetSampMet job gets 1 sample id. It uses ENAGET[view] URL in the config file to download the sample metadata from ENA in XML format and puts it into DB table SAMPLE. If the sample contains additional annotations in the form of XML ATTRIBUTE tags then it puts these into table ATTRIBUTES.
4. When 3 from above is done THP::GetStudyMet confirms by sending its study id down branch 1 to THP::MetaDone. This job will set the study in the STUDY table to 'has_samples' because its samples have been grabbed. If 'CHOOSE_RUNS' parameter is used then 'has_samples' will not be set to true because an incomplete set of samples are expected.
> $initp THP::GetMetaData_conf -pipeline_url $EHIVE_URL -hive_force_init 1
> $seed -url $EHIVE_URL -logic_name study_fan -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
> $runloopd -url $EHIVE_URL
You can now check what has been entered into the database tables:
select study_id,alias from STUDY where piperun = 1;
select sample_id,alias from SAMPLE where piperun = 1;
select * from ATTRIBUTES where piperun = 1;
You can search for any ATTRIBUTES.tag entries that will make good track dimensions and add these to the config file 'metatags' parameter ('%' separated list). For instance:
metatags = strain%genotype%truseq adapter sequence
This means that later when writing trackhub files the tracks will be assigned their dimension value if their source sample contains any of those tags and the browser will be able to filter tracks based on their dimension value.
The browser requires dimensions to be labelled X, Y, A, B, C and so on. So you will get a line like this in the trackdb file:
"
dimensions dimX=strain dimY=genotype dimA=truseq_adapter_sequence
"
Guidelines suggest no more than 9 sub groups at time of writing.
The '%' separator was changed from a white space so that tags with spaces can be used as dimensions. The spaces are replaced with underscores when printing in the trackdb files. More on these files later.
Parameter 'only_finished' is defaulted to 0 which means that studies and samples containing crams that are not 'finished' in the AERUNS tables will still be used. If they are not finished then it means there will be no ftp location for the cram file at the ENA BUT in this case the original Atlas location can be used. crams are often not finished in AERUNS table because they are awaiting processing from the ENA so do not appear yet in the CRAMS table. When they do appear in some future run of the pipeline then the Atlas ftp location will be swapped out for the ENA ftp location in the track hub files. If 'only_finished' is set to true/1 then only crams that have an ENA ftp location can be made into tracks and studies and samples are filtered accordingly. Parameter 'only_finished' is used in the following step of the pipeline and you may notice several downstream modules
**Analysis THP::WriteHub_conf**
1. THP::WriteHub_conf starts with THP::TrackHubFact which is a factory and creates a fan of THP::TrackHubDir jobs. It is one THP::TrackHubDir per study so if you pass an organism like "theobroma_cacao" it will create 6 jobs because in this case there are 6 studies that have reached 'written' stage in the previous step (and match the organism and the piperun id). THP::TrackHubFact makes use of a boolean parameter: fill_namecheck'. The default is 1/yes. This adds a step which populates table NAME_CHECK in the database. Only studies aligned to assemblies that exist in this table can be made into trackhubs. Otherwise the Ensembl browser can not load the crams because it will not have the assembly. This step uses 'ENSGET[genomes]' in the config file which in turn uses 'ensdivision' also in the config file. The parameter together form a call to the Ensembl REST API to get all the assemblies (by name and GCA accession) that Ensembl has. The only time to overwrite the default 1/yes is if you are running it several times and you know that NAME_CHECK has an up to date list of assembles.
2. THP::TrackHubDir is launched once per study so if running in standalone mode it will need the parameter 'study_id'. If it is part of an analysis then it will be passed this parameter from upstream based on upstream parameters, in this case the 'orgs' list given to THP::TrackHubFact. THP::TrackHubDir writes a trackhub for the study at the path you provide in 'THRACC[path]' parameter in the config file. This is a path to a directory that also hosts an ftp location. The equivalent ftp location is also required in the config file (THRACC[ftp]). The Track Hub Registry offers are a test server and you can turn this on with THRTEST[on] in the config file. If test is on then THRTEST[path] will be used instead (which can be the same as THRACC[path]). To write a trackhub, a directory is created with the name of the study. For example one study in this example is SRP148703 (theobroma_cacao):
SRP148703
├── Criollo_cocoa_genome_V2
│ └── trackDb.txt
├── genomes.txt
└── hub.txt
THP::TrackHubDir makes use of parameter 'remove_old' which defaults to 1/yes. If there is already a directory for the study in question (from a previous run) then THP::TrackHubDir will check what genomes exist already (genomes.txt) and if they are present in the NAME_CHECK table then they will be included in the registration step downstream (so that they persist and are not overwritten). If they are not in NAME_CHECK any longer (they could be from an old version of an assembly) they are removed from the genomes.txt file along with the appropriate trackDb.txt and the folder that contains it. This is unless 'remove_old' is switched off/0. If 'remove_old' is off then the old unknown assemblies are not deleted but downstream they will fail the registration step so it is not recommended. 'remove_old' option is only useful for a future development THP::TrackHubDir also makes use of config file parameters ENAVIEW, EMAIL, AEGET[runsXstudy], AEGET[seqer]. These allow link backs to the source studies and samples in ENA in the metadata of the trackhubs and probably do not need changing. It also uses parameter UMASK in the config file which sets the permissions so that the ftp server can host them and other people can also take over the pipeline and access previously written files.
For each THP::TrackHubDir job that successfully completes, the study is marked as 'written' in the STUDY table that was populated upstream of the pipeline (THP::GetStudyMet).
3. The THP::TrackHubDir jobs funnel into THP::TrackHubReg which takes its parameters from THP::TrackHubFact down branch 1. This module works perfectly well in stand alone mode because it does not get it's input from the completed THP::TrackHubDir jobs, instead it generates a fresh list from the 'orgs' list or whatever other filter is applied at the beginning and if the studies are 'written' according to the STUDY table it will go on to register each one of them with the Track Hub Resgistry (THR). This job is not fanned because the THR server may struggle with a lot of traffic and particularly parallel registrations to the same account. This job therefore takes most time because each study is registered one at a time. This module needs to be able to log in to the track hub registry so it uses THRACC[user] and THRACC[pw] or THRTEST[user] and THRTEST[pw] if THRTEST[on] is on. Credentials are likely to be the same whether test is on or off because the test server is a copy of the production server so the credentials will be the same. This module also uses parameter 'registry_output' in the config file which is a path to dump warnings and errors. You can check this directory if a study is not getting registered as expected. Each run of THP::TrackHubReg creates a new file and if the file is empty it means that there were no errors or warnings. THP::TrackHubReg also takes parameter 'gca_hash' which defaults to yes/1. It used to be possible to register track hubs just with the assembly name because the THR could look up the GCA accession and apply it. This does not work at them moment so THP::TrackHubReg will grab the assembly name from the genomes.txt file created by THP::TrackHubDir and then it will get the GCA accession from the NAME_CHECK table. Finally THP::TrackHubDir makes use of 'delete_first' which defaults to off/0. If turned on it tries to delete all trackDbs associated with the study before registering the study afresh. This option was added in case some residual elements do not get overwritten effectively during registration (if the study has been registered in the past). However at time of writing the THR API option 'DELETE /api/trackdb/$ID' is not functioning. If the delete is unsuccessful the registration will go ahead anyway but unless you need a fresh registration it is recommended to keep it off. For each study completed successfully THP::TrackHubReg marks the study as 'finished' in the table STUDY.
If registration of a trackhub fails the THP::TrackHubReg job itself does not, it continues to iterate through and try to register the rest in the list so you need to check the log file that is generated and stored in directory config.registry_output. The file is dated and has a time stamp so you will find it. Example: 20Sep_12-32-29.txt. You should also check the database (see below).
All the default parameters mentioned above can be overwritten in the THP::WriteHub_conf module (no matter which modules they appear in in the analysis) itself using sub pipeline_wide_parameters.
> $initp THP::WriteHub_conf -pipeline_url $EHIVE_URL -hive_force_init 1
> $seed -url $EHIVE_URL -logic_name hub_start -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
> $runloopd -url $EHIVE_URL
Checks you can make in the DB
-----------------------------
> select * from NAME_CHECK;
This should be populated with all species that are available in ensembl browser after THP::TrackHubFact is run. In stand alone mode:
$stand THP::TrackHubFact -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
> select study_id, written,finished from STUDY where piperun = 1;
This will display the studies that have been written by THP::TrackHubDir (check for a '1' in the written column). The written ones will appear as directories in config.THRACC[path]. It also displays if the trackhub was successfully registered (check for a '1' in the 'finished' column). If not registered you can check the error message and POST string in the log file that appears in config.registry_output
Both runnables are possible in standalone mode:
$stand THP::TrackHubDir -input_id '{ "PIPERUN" => 1, "study_id" => "SRP148703" }'
$stand THP::TrackHubReg -input_id '{ "CHOOSE_STUDIES" => ["SRP018753"] }'
Checks you can make on the Ensembl browser and the THR
------------------------------------------------------
After THP::TrackHubReg is done you should be able to view the trackhubs.
in standalone:
$stand THP::TrackHubReg -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
There is a module called THP::CheckBrowser which uses the DB to generate the urls required for the browser to find and load the trackhub (it needs to know where the hub.txt file is) and to know which assembly that the trackhub refers to (it simply needs the organism name):
$stand THP::CheckBrowser -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
This will generate urls that will load even if the trackhubs have not been registered yet (result of THP::TrackHubDir). For example:
http://plants.ensembl.org/TrackHub?url=ftp://ftp.ensemblgenomes.org/pub/misc_data/Track_Hubs/SRP004925/hub.txt;species=Theobroma_cacao
This is what the THR does: it formulates the link to the ensemble browser. To see the effect of THP::TrackHubReg go to the registry and search by the study id (http://www.trackhubregistry.org/)
Doing all of Part 3 in one analysis pipeline
--------------------------------------------
**THP::GetMetaWriteHub_conf**
THP::GetMetaWriteHub_conf eHive analysis file does all the above steps (from part 3) in order. It starts with populating the metadata tables using THP::StudyMetaFact which creates a fan of THP::GetStudyMet jobs which in turn creates a fan of THP::GetSampMet jobs. When all the THP::GetSampMet jobs are done THP::MetaDone is run (for each THP::StudyMetaFact) and when all the THP::StudyMetaFact jobs are done Trackhub writing begins with THP::TrackHubFact.
THP::TrackHubFact creates a fan of THP::TrackHubDir jobs to get each study written into a trackhub. When all trackhubs are written (all THP::TrackHubDir jobs are done) then THP::TrackHubReg is called to register them with the track hub registry.
THP::CheckBrowser can be used (as above) to look into some results. Remember to check STUDY table for 'finished' studies to ensure the registry was able to register this one and if not check the registry_output directory as specified in the config file.
$initp THP::GetMetaWriteHub_conf -pipeline_url $EHIVE_URL -hive_force_init 1
$seed -url $EHIVE_URL -logic_name study_fan -input_id '{ "PIPERUN" => 1, "orgs" => ["theobroma_cacao"] }'
$runloopd -url $EHIVE_URL
Conclusions
===========
Remember for all the pipelines and most standalones in this tutorial, if you ommit all filter parameters ('orgs', 'CHOOSE_STUDIES', 'CHOOSE_RUNS') then all available data for the given piperun will be acted on.
==Getting counts with THP::Checks runnable==
You can run THP::Checks (use standalone mode) to get a report by providing a file name:
$stand THP::Checks -input_id '{ "PIPERUN" => 1, "report_file" => "./stats.txt" }'
In this case it will write the counts to ./stats.txt and also to standard out
You can also use THP::Checks for making sure that all the assembly names used in RNASeq-er actually exist in Ensembl genomes. This is a comparison between AERUNS.assembly values with NAME_CHECK.assembly_default values. AERUNS is populated in one of the early steps of the workflow (THP::FindCrams fanned by THP::SpeciesFact) and NAME_CHECK is populated before trackhub writing occurs (THP::TrackHubFact). If you want to do this check before writing any trackhubs you can run THP::TrackHubFact in standalone. It needs parameter 'fill_namecheck' switched on but this is the default anyway:
$stand THP::TrackHubFact -input_id '{ "PIPERUN" => 1, "fill_namecheck" => 1 }'
To cross check assembly names use THP::Checks like this:
$stand THP::Checks -input_id '{ "PIPERUN" => 4, "check_ass_name" => 1 }'
For instance I got:
------
assembly with name 'Theobroma_cacao.Criollo_cocoa_genome_V2' is mentioned in RNASEQ-er but is not in Ensembl API
------
When I checked NAME_CHECK.assembly_default for species_production_name = 'theobroma_cacao' I found that Ensembl is using the assembly name 'Criollo_cocoa_genome_V2'. This is very likely to be the same assembly so THP::Checks lets you swap the assembly name in AERUNS so that it matches the assembly name in Ensembl (found in table NAME_CHECK or more directly in API endpoint http://rest.ensembl.org/info/genomes/division/EnsemblPlants?content-type=application/json (for plants))
$stand THP::Checks -input_id '{ "PIPERUN" => 4, "change_ass_name" => ["Theobroma_cacao.Criollo_cocoa_genome_V2","Criollo_cocoa_genome_V2"] }'
You simply pass an array. Every first string is the existing name in AERUNS and every second name is the value to change it to: ["need_change_1","alt_name_1","need_change_2","alt_name_2"]