-
Notifications
You must be signed in to change notification settings - Fork 1
/
GeneticExample.Rmd
101 lines (76 loc) · 5.38 KB
/
GeneticExample.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
---
title: "Biol 801/Evrn 420/720, Example of unit testing using fake genetic data"
date: "March 2018"
output:
pdf_document:
highlight: tango
fontsize: 11pt
geometry: margin=1in
documentclass: article
---
```{r general_setup, echo=F}
source("mtime.R") #A function for cache management
```
# The problem
In `GeneticData.txt` is some made up genetic data. Find all microsatellites.
For each one, find the start position and number of repeats. For our purposes, microsatellites
are 2-6 bases long and involve at least 30 repeats. Carry out good code design and unit testing
during the process.
# Top-most design
Starting from the top and not thinking at all yet about how we will accomplish it, we need a
function (call it, say, `find_microsats`, that takes the sequence (in some form yet to be determined),
as well as the size limits of the repeating sequence of bases to look for (2 to 6), as well as the
minimum number of repeats needed for it to be considered a microsatellite (30 above). So the function
first line `find_microsats<-function(seq,lenmin,lenmax,repmin)` seems to make sense. We need to decide
on a data structure for `seq`, so let's use a single R `character` variable. These can be large. See
below for a few points on how to manipulate these
That function could call another one (call it `getseqs`), which will have an argument, `len`, for the length
of the prospective microsatellite, and will find all possibilities. So, for instance, if you call it
with `len=2` we should get AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT (this is already
giving us some ideas for unit testing). `find_microsats` will call `getseqs` for all the lengths to be considered
to get a list of what sequences to look for.
We will also need a function (call it `find_repeat`) that takes a specific prospective microsatellite
(sequence of length 2-6) and looks for it in seq. So the function first line `find_repeat<-function(ms,seq,repmin)`
makes sense.
Note that we have divided up the task into parts: one part that finds all the base sequences of lengths 2-6,
and then another part that finds whether there are repeats of one of these in our given sequence and how many and where.
We still do not know how to accomplish these tasks.
We can
read the data in with `scan("GeneticData.txt",what=character())`, and then check you get a `character` of
length 36564 (use `nchar` to check the length) that begins with "TAAATGTACAAATGTTCTTATTGTGGCTTGTTAGGGACCAAGGTACATCG".
Check out the function `substr`, which comes in handy. There are many tools for manipulating regular expressions
(and, indeed, for doing all sorts of genetic analyses), we are assuming, for the purpose of this exercise,
we do not want to use those.
# Next-level design
So far we know we will have a function `find_microsats<-function(seq,lenmin,lenmax,repmin)` that will call
`getseqs<-function(len)` and also `find_repeat<-function(ms,seq,repmin)` to do its job, but we do not yet know
how `find_repeat` is supposed to work. We just know finding whether a given prospective microsatellite actually
occurs seems easier than finding all microsatellites, so we decided to accomplish the overall task by getting a
list of all sequences of length 2-6 and then looking for them each in turn. We still need to figure out
how to look for one of these (that's `find_repeat`).
How about if `find_repeat` calls a function, call it `find_fixed` that finds the locations of all exact matches
of a given sequence? Perhaps `find_repeat` can then use that information to accomplish its goals. So we imagine
a function `find_fixed<-function(ms,seq)` that returns a vector with the start location of all instances
of `ms` in `seq`. That should be straightforward, and it also seems clear that can be used to accomplish the larger
goals of `find_repeat`.
# Function specs, then unit tests, then functions themselves, then run the tests.
Having performed this top down design (above), we were then ready to write function specs. This was done (see the `.R` files). Having spec'd the functions, we were then ready to write the unit tests. This was done (see `GeneticExample_tests.Rmd`). Having written the unit tests, we were then ready to write the code. This was done. After some debugging, all tests passed.
# Methods
Easy at this point, we just run this code:
```{r analyze_data, echo=T,cache=T, cache.extra=list(mtime("GeneticData.txt"),mtime("find_fixed.R"),mtime("find_microsats.R"),mtime("find_repeat.R"),mtime("getseqs.R"))}
dat<-scan("GeneticData.txt",what=character())
source("getseqs.R")
source("find_fixed.R")
source("find_repeat.R")
source("find_microsats.R")
res<-find_microsats(seq=dat,lenmin=2,lenmax=6,repmin=30)
```
# Results
The microsats are as follows:
```{r echo=F}
res
```
Note that each microsatellite appears together with its shifts.
# Discussion
We found `em. The use of unit testing was an up-front investment, but it made the process of writing the code and being relatively sure it works quite quick and easy.
Note: I have nominally written this document using introductory material, Methods, Results, and Discussion, but it does not quite fit because this is a demo of a coding design technique, not a scientific paper. When you do your homework, all the unit testing aspects should be parallel to what I have done here, but since your homework will also be presenting your proposal for a final project, it will fit better in the style of a scientific paper.