-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.txt
117 lines (82 loc) · 4.15 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
Author: Daniel Oreper, UNC, Valdar Lab, 2016-06-13
Contents: Description of code for aligning EXIT experiment reads and for postprocessing to
create summary information about the read counts.
License: The LGPL License, Version 3.0. See LICENSE.txt
Third party licenses: See NOTICE.txt
Disclaimer: This is academic code, written merely to run correctly-- not to be reusable!
======================================================
Prerequisites that are not included in this bundle.
=======================================================
BWA 0.7.12-r1039 or greater, installed on path
SAMTOOLS 1.3 or greater, installed on path
JRE 1.8.0 or greater, installed on path
R prerequisites:
R 3.2 or greater
R453Plus1Toolbox
Biostrings
yaml
Rsamtools
ggplot2
plyr
doMC
ggplot2
gtools
epitools
registerDoMC
ShortRead
qrqc
data.table
seqLogo
data.table
Python prerequsites: (uncertain if libs come with python)
Python 2.7.11 or greater
yaml
loadConfig
logging
============================================================
Run Code
=============================================================
To run alignment, using PROJECT_LOCATION/src as working directory,
> python clusterDistribute.py ../config/config.txt
To run alignment on the the cluster (with parameters tuned for UNC kure cluster)
> python clusterDistribute.py ../config/config.txt ../config/configCluster.txt
If cluster params need to be changed see config/configCluster.txt, especially clusterRunCommand property.
When alignment (i.e. clusterDistribute.py) completes, to collate all information, run
> R CMD BATCH '--args ../config/config.txt' postProcessAlignment.R
or
> R CMD BATCH '--args ../config/config.txt ../config/configCluster.txt' postProcessAlignment.R
if on cluster.
After running these scripts, PROJECT_LOCATION/output/postProcess will contain resulting collated information in the form of .csv files.
==============================================================
Debugging logs
=============================================================
If clusterDistribute fails,
PROJECT_LOCATION/output/clusterLogs contains an R log file for each experiment/block of reads.
Check these files if folders PROJECT_LOCATION/output/blocks/BLOCKNAME fail to contain javaNew.bam files.
If postProcessAlignment fails,
PROJECT_LOCATION/src/postProcessAlignment.ROut will contain the error message.
==============================================================
Adding experiments
=============================================================
If additional samples need to be processed, 3 steps must be taken.
1) Add the EXIT_****_1.reads, EXIT_****_3.reads, and EXIT_****.qvals files for the sample to PROJECT_LOCATION/dataset/seq
2) Edit PROJECT_LOCATION/dataset/seq/experimentData.csv to include all experiment metadata
3) Edit PROJECT_LOCATION/config/config.txt and PROJECT_LOCATION/config/configCluster.txt
so that the includedExperiments list contains the new sample ids.
================================================================
Overview of Directory Structure
================================================================
PROJECT_LOCATION/src : R and python source code. Run from this directory
/dataset : input files
/dataset/seq :raw sequencing files,
of the form EXIT_****_1.reads, EXIT_****_3.reads, EXIT_****.qvals, experimentData.csv
where EXIT_****_1.reads is the right end read
EXIT_****_3.reads is the left end read,
EXIT_****.qvals is the file containing q values for both reads
and **** is a numeric experimentId.
experimentData.csv is a tab separated file containing metadata about each experiment id
/java/src : java code implementing banded dynamic programming alignment, given a set of candidate regions
/java/lib : external jars upon which java code depends, as well as runnable jars of compiled source from /java/src; i.e. jalign5.jar and ParsePositionsSAMSE.jar
/config : configuration files, including config.txt, the set of default parameters for alignment, etc.
/output : location to which output files are written.
/output/postprocess : location to which final csv summary files are written.