-
Notifications
You must be signed in to change notification settings - Fork 0
/
filtering.Rmd
138 lines (104 loc) · 3.54 KB
/
filtering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
title: "Comparative Genomics - Project - Blastp filtering"
author: "Melissa SAICHI & Sandrine FARTEK"
output: html_document
---
# DATA
These data correspond to the result of blastp command. There are 4 500 699 alignment and 8 variables : </br>
- qseqid : query sequence id </br>
- sseqid : subject sequence id </br>
- pident : percentage of identical matches </br>
- qlen : query sequence length </br>
- length : alignment length </br>
- mismatch : number of mismatches </br>
- gapopen : number of gap openings </br>
- evalue : expect value * </br>
- bitscore : bit score * </br>
- qstart : </br>
- qend : </br>
* Description of e-value and bitscore here : http://www.metagenomics.wiki/tools/blast/evalue
```{r, eval=FALSE}
dat = read.table("blastp_result.txt", header = F)
colnames(dat) = c("qseqid", "sseqid", "pident", "qlen", "length", "mismatch", "gapopen", "evalue", "bitscore", "qstart", "qend")
save(dat, file = "blastp_result.RData")
```
```{r}
load("blastp_result.RData")
colnames(dat)
# Number of unique protein sequences
length(unique(c(dat$qseqid, dat$sseqid)))
```
# ANALYSIS TO CHOSE THRESHOLDS
```{r}
# Remove alignment between identical proteins
tmp = which(dat$qseqid == dat$sseqid)
length(tmp)
dat = dat[-tmp,]
# Create a new column for coverage
dat = cbind(dat, qcov = (dat$qend-dat$qstart)/dat$qlen)
```
```{r}
# Histogram identity percentages
hist(dat$pident, main = "Histogram of identity percentage")
print("Identity")
quantile(dat$pident, probs = c(0.25, 0.5, 0.75))
# Histogram expect value
hist(-log10(dat$evalue), main = "Histogram of the -log10 of the evalue")
print("E-value")
quantile(-log10(dat$evalue), probs = c(0.25, 0.5, 0.75))
# Histogram expect bitscore
hist(dat$bitscore, main = "Histogram of the bitscore")
print("Bitscore")
quantile(dat$bitscore, probs = c(0.25, 0.5, 0.75))
# Histogram Coverage
hist(dat$qcov, main = "Histogram of coverage")
print("Coverage")
quantile(dat$qcov, probs = c(0.25, 0.5, 0.75))
```
# FILTER DATA
Before going to further study, we need to filter the result of the blastp all-vs-all. We use multiple criterion to do so :
- alignment with 100% identity : they correspond to alignment of identical sequences
- identity < 60% (5 821 015 alignments)
- e-value > 10^-10 (159 599 alignments)
- coverage < 40% (120 986 alignments)
- bitscore :
Remaining 252 243 alignments corresponding to 25 649 proteins (39 021 in filtered fasta file).
```{r}
# Remove alignments with % identity < 40%
tmp = which(dat$pident < 60)
length(tmp)
dat = dat[-tmp,]
# Remove alignments with evalue > 0.01
tmp = which(dat$evalue > 10^-10)
length(tmp)
dat = dat[-tmp,]
# Remove alignments with coverage < 30%
tmp = which(dat$qcov < 0.4)
length(tmp)
dat = dat[-tmp,]
# Number of unique protein sequences left
length(unique(c(dat$qseqid, dat$sseqid)))
```
# STATISTICS ON FILTERED DATA
```{r}
# Histogram identity percentages
hist(dat$pident, main = "Histogram of identity percentage")
print("Identity")
quantile(dat$pident, probs = c(0.25, 0.5, 0.75))
# Histogram expect value
hist(-log10(dat$evalue), main = "Histogram of the -log10 of the evalue")
print("E-value")
quantile(-log10(dat$evalue), probs = c(0.25, 0.5, 0.75))
# Histogram expect bitscore
hist(dat$bitscore, main = "Histogram of the bitscore")
print("Bitscore")
quantile(dat$bitscore, probs = c(0.25, 0.5, 0.75))
# Histogram Coverage
hist(dat$qcov, main = "Histogram of coverage")
print("Coverage")
quantile(dat$qcov, probs = c(0.25, 0.5, 0.75))
```
# SAVE DATA
```{r}
write.table(dat[, -12], "blastp_result_filtered.txt", col.names = T, row.names = F, quote = F)
```