-
Notifications
You must be signed in to change notification settings - Fork 3
/
CAGEscan-Clustering.readme
173 lines (141 loc) · 8.18 KB
/
CAGEscan-Clustering.readme
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
NAME
CAGEscan-Clustering - Create CAGEscan cluster from BED12, BAM or SAM
pairs
SYNOPSIS
CAGEscan-Clustering.pl -i <input bed12, bam or sam file to be clustered>
DESCRIPTION
Create CAGEscan cluster from paired-end data in BED12, BAM or SAM
format. Input data can be sent to STDIN, in which case the format ("-f")
of the input must be specified, or read from a file "-i", in which case
the format of the data can be specified using the "-f" option, in the
absence of which it is guessed from the suffix of the input file. Valid
suffixes / format are: 'bed', 'sam' 'bam' or gziped version of those
formats).
Importantly if the data is sent in BAM or SAM format, it must be sorted
by name.
An optional BED6 formatted cluster file ("-c") can be used to guide the
construction of CAGEscan clusters. Note that only paired-end data whose
TSS overlap with at least one cluster will be grouped.
In the absence of such an optional BED6 formatted cluster file,
"guiding" clusters are created from the paired-end data themselves using
a simple "single-linkage clustering" approach using the distance "-d"
(ie clustering together all the data based on their TSS being distant of
"-d" basepair). The result of the TSS single-linkage-based clustering
can be stored (BED6) if a file path ("-s") is provided or is otherwise
discarded.
Output data will be BED12 formatted CAGEscan cluster.
* the name of each cluster will correspond to the name of the guiding
cluster (BED6 formatted cluster name), or, if clusters are created
from the paired-end data, will correspond to:
"Lx_chr_strand_start_end" with the chr,strand,start and end of the
TSS single-linkage derived clusters.
* The score of each cluster will correspond to the number of input
paired-end tags composing the resulting CAGEscan cluster.
* The hallmark of the TSS/guiding cluster on the CAGEscan cluster will
be encoded by a 3'UTR region with :
* The *cdsStart* will correspond to either the start of CAGEscan
cluster or end of the guiding cluster (when on the plus strand),
or to the start of the guiding cluster or end of the CAGEScan
cluster (when on the minus strand).
* Similarly, *cdsEnd* will correspond to either the start of the
guiding cluster or end of the CAGEscan cluster (when on the plus
strand), or to the start of the CAGEscan cluster or end of the
guiding cluster (when on the minus strand).
Note about using guiding cluster
(instead of de-novo clustering from the paired-end TSS)
Whenever the boundaries of the guiding cluster extend beyond those of
the intersected paired-end input data, the reported CAGEscan cluster
boundaries will be those of the paired-end data, not that of the guiding
cluster. As it can be problematic to integrate multiple run with various
input paired-end data, keep in mind that the cluster name is that of the
guide and shall be used.
Note about dealing with very large input
The rate limiting step of this script is the preparation of the
paired-end input data that need to be transformed into a BED6 with
*start* and *end* corresponding to the 0-based pe_tss position, name
corresponding to the semicolon delimited concatenated full BED12 line
and (more importantly and time-intensive) sorted by TSS position.
This step can be `precomputed` and skipped providing the Boolean flag
"-p" which will informs the script that the provided input file/stdin
already correspond to the precomputed data content, format and sorting.
Also since CAGEscan cluster are by definition, assemblies of reads
located on the same chromosome and strand, I recommend for large files
to split up the input paired-end reads by chromosome and strand and
precompute the sorted BED6 internal temporary file (with *start* and
*end* corresponding to the 0-based pe_tss position, *name* corresponding
to the semicolon delimited concatenated full BED12 line and sorted by
TSS position).
OPTIONS
-h Show this message
-v Verbose level (1-9)
BASIC OPTIONS
-i,--input
Input paired-end data file. Can be a BAM, SAM, BED or gzip
compressed BED12 file. It is recommended to use sensible suffix
for the detection of the format of the file, although this can
also be specified using the "-f" option. Note that this
parameter is not mandatory, as input data can also be read from
STDIN.
-q,--min_map_quality
The minumum "mapping quality value" for including the input data
into a CAGEscan cluster. If the input is SAM or BAM this value
corresponds to the sum of each pair member MapQ. If the input is
BED12 , this value is extracted from the score column. This
value is inclusive (aka "bigger or equal than"). [Default :
undefined, aka no filtering]
-f,--format
Format of the input (paired-end) data. Valid format are: "bed",
"sam", "bam", "bed.gz", "sam.gz", or "bam.gz". "-f" is mandatory
if the input comes from STDIN but optionnal if the input ("-i")
is a file ending with one of the valid formats.
-c,--cluster_file
Path to a BED6 formatted cluster file directing the construction
of CAGEscan clusters. Note that only paired-end data whose TSS
overlap with at least one cluster will be grouped. If none
provided, the input data itself will be used.
-d,--denovo_cluster_distance
The distance for paired-end data TSS single linkage clustering.
If no BED6 formatted cluster file directing the construction of
CAGEscan clusters were provided, the paired-end data themselves
using a simple "single-linkage clustering" approach using this
distance to group neighboring TSS.
-s,--stored_denovo_cluster_file
The path of the BED6 file where (TSS) cluster file directing the
construction of CAGEscan clusters is to be stored (otherwise
stored in a temporary file).
-x,--stored_uprocessed_bam_file
The path where non properly paired or pairs not fulfilling the
min_map_quality criteria of BAM or SAM input can be stored.
-p,--presorted_bed6_tss_input_file
Boolean flag skipping the transformation of the input paired-end
data into a BED6 with *start* and *end* correspond to the
0-based pe_tss position name correspond to the semicolon
delimited concatenated full BED12 line sort by TSS position.
Importantly, it of course means that the input file ("-i") must
conform to the above mentionned specifications
OTHER OPTIONS (path to binaries and temp files)
--tmp_bed6_tss_input_file
Path to the location where a temporary BED12 formatted file
holding the input data might be created. This is in particular
the case when "guiding" clusters are created from the paired-end
data themselves. [default
'/tmp/CAGEscan-Clustering.bed12_formatted_input.PID.tmp']
--tmp_bed6_cluster_file
Path to the location where a temporary BED6 formatted file
holding the "guiding" clusters can be created. [default
'/tmp/CAGEscan-Clustering.bed6_formatted_induced_cluster.PID.tmp
']
--bin_samtools
Path to "samtools" binary ("samtools view -Sb ..." used only
when input is in SAM format).
--bin_bamtobed
Path to "pairedBamToBed12" (re-enginering of BEDTools
"bamToBed") binary which transform properly paired BAM
alignments into a single BED entry (used only when input is in
BAM or SAM format).
--bin_mergebed
Path to BedTools::mergeBed binary (used only when "guiding"
clusters must be generated from single-linkage clustering of the
the paired-end data themselves).
--bin_intersectbed
Path to BedTools "intersectBed" binary (always used).