forked from hcorrada/IntroDataSciBaltimore
-
Notifications
You must be signed in to change notification settings - Fork 0
/
baltimore.Rmd
262 lines (189 loc) · 9.56 KB
/
baltimore.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
A quick analysis of Baltimore crime
========================================================
```{r setup, echo=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
I'm going to do a very simple analysis of Baltimore crime to show off R. We'll use data downloaded from Baltimore City's awesome open data site (this was downloaded a couple of years ago so if you download now, you will get different results).
### Getting data
* Arrest data: https://data.baltimorecity.gov/Crime/BPD-Arrests/3i3v-ibrt
* CCTV data: https://data.baltimorecity.gov/Crime/CCTV-Locations/hdyb-27ak
Let's load the data:
```{r}
arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
cctv_tab=read.csv("CCTV_Locations.csv", stringsAsFactors=FALSE)
# these columns are mislabeled, so fix them
tmp=arrest_tab$sex
arrest_tab$sex=arrest_tab$race
arrest_tab$race=tmp
```
### Exploring data
```{r}
# dimension of table (data.frame)
dim(arrest_tab)
# what are the columns
names(arrest_tab)
# what is the average arrest age?
mean(arrest_tab$age)
# the range of arrest ages
range(arrest_tab$age)
# how many arrests per sex
table(arrest_tab$sex)
# what are the most common offenses
head(sort(table(arrest_tab$incidentOffense),decreasing=TRUE))
# what are the offenses that only happen once
tab <- table(arrest_tab$incidentOffense)
tab[tab == 1]
# range of arrests after removing those w/ age==0
range(arrest_tab$age[arrest_tab$age>0])
```
Offenses by sex
```{r}
tab <- table(arrest_tab$incidentOffense, arrest_tab$sex)
```
Let's see a table of arrests by sex and race
```{r}
table(sex=arrest_tab$sex,race=arrest_tab$race)
```
A histogram of age
```{r}
hist(arrest_tab$age,nc=100)
with(arrest_tab,hist(age[sex=="M"],nc=100)) # males only
with(arrest_tab,hist(age[sex=="F"],nc=100)) # females only
```
### Are males and females arrested at different ages on average?
Let's take a look at how age depends on sex. Let's plot age as a function of sex first (notice how we indicate that sex is a `factor`).
```{r}
plot(arrest_tab$age~factor(arrest_tab$sex))
```
One of the neat things about R is that statistical model building and testing is built-in. The model we use is $y_i=\beta_0+\beta_1 x_i$ where $y_i$ is age of sample (example) $i$ and $x_i$ is an indicator variable $x_i \in \{0,1\}$ with $x_i=1$ if the $i$-th record (example) is male. You can check that $\beta_1$ is the difference in mean age between females and males.
We use the formula syntax to build a linear regression model.
```{r}
# let's ignore those records with missing sex
fit=lm(age~factor(sex),data=arrest_tab,subset=arrest_tab$sex %in% c("M","F"))
summary(fit)
```
We see that $\beta_1 \approx -0.2$ meaning that the arrest age for males is about 2.5 months younger. So there is very little difference in the average age (which is what the linear model is testing) but we see that the probability of observing this difference from a sample of this size **when there is no difference in average age** is small $p \approx 0.01$. Since we have a very large number of examples, or records, this testing framework will declare very small differences as *statistically significant*. We'll return to this theme later in class.
### Geographic distribution of arrests.
First we need to extract latitude and longitude from location, we'll use some string functions to do this
```{r}
tmp=gsub("\\)","",gsub("\\(","",arrest_tab$Location))
tmp=strsplit(tmp,split=",")
arrest_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
arrest_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
```
Now let's plot
```{r}
plot(arrest_tab$lon, arrest_tab$lat, xlab="Longitude", ylab="Latitude", main="Arrests in Baltimore")
```
We can also use density estimates to make this nicer:
```{r}
smoothScatter(arrest_tab$lat, arrest_tab$lon, xlab="Latitude", ylab="Longitude", main="Arrests in Baltimore")
```
Let's make this fancier using the `ggplot2` graphics systems and the `maps` package containing map data.
```{r}
library(maps)
library(ggplot2)
balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
print(plt)
```
Now let's add CCTV cameras.
```{r}
tmp=gsub("\\)","",gsub("\\(","",cctv_tab$Location))
tmp=strsplit(tmp,split=",")
cctv_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
cctv_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
plt=plt+geom_point(data=cctv_tab,aes(x=lon,y=lat),color="red")
print(plt)
```
### A challenge
Is there any relationship between the number of CCTV cameras and the number of arrests? Divide the city into a grid and plot the number of CCTV cameras vs. the number of arrests.
```{r}
# step 1: divide city intro grid for arrest data
# step 1a: find the range of latitude and longitude
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)
# step 1b: discretize latitude into 50 bins
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)
# now longitude
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)
# step 1c: make a factor indicating geographic grid location
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))
# step 2: do the same for the cctv data
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))
arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
m=match(names(cctvTab),names(arrestTab))
plot(arrestTab[m]~factor(cctvTab),xlab="Number of CCTV cameras", ylab="Number of Arrests")
```
### Extra analyses
As part of Project 1 you will add to this analysis. Please use the following template:
#### Mihai Sirbu
What question are you asking?:
I am trying to answer: at what time are most people arrested?
For this prelimary analysis, I plan on making a plot where
hour is the x-axis and the number of arrest is the y-axis. This
will produced an "Arrest Timeseries.
What is the code you use to answer it?:
```{r}
time <- strptime(arrest_tab$arrestTime, "%H:%M")
arrest_tab$hours <- as.numeric(format(time, "%H"))
hours_df <- as.data.frame(table(arrest_tab$hours))
names(hours_df) <- c("hour","count")
g <- ggplot(hours_df, aes(hour, count, group=1))+geom_line(color="blue")+geom_point(color="blue")
g <- g+labs(title = "Arrest Timeseries", x="Time of Day",y="Num of Arrests")
g <- g+scale_x_discrete(breaks=seq(0,23,2))
g <- g + theme(plot.title=element_text(size=16,face="bold"),axis.title.x=element_text(size=16,face="bold"),axis.title.y=element_text(size=16,face="bold"))
g
```
What did you observe?
I had originally thought that there would be very little arrests until 8 pm at which point there would be a giant spike from 8 pm to 5 am. But that was not the case. Instead, the two biggest hours of arrest were 6 pm followed by 10 am (!!). At this point, I'm not entirely sure why that might be. I would be surprised, however, if all offenses followed this exact same pattern.
#### Aaron Dugatkin
What question are you asking?: I am trying to find out how cameras affect the sorts of crimes in their area, both in reducing certain types of crime, or leading to finding more of other types of crime.
What is the code you use to answer it?:
```{r aarondugatkin}
# modified code from above, to create factors, but remove NA
# added by HCB to restore original arrest table
arrest_tab_original = arrest_tab
#
arrest_tab = arrest_tab[!is.na(arrest_tab$lat) & !is.na(arrest_tab$lon),]
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))
arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
#count crimes in areas with and without camera
arrestOnCamera = gridFactor %in% names(cctvTab)
count_crime_tab <- table(arrest_tab$incidentOffense, arrestOnCamera)
#merge the two tables, and calculate the difference in crime frequency in the two situations
crime_tab <- data.frame(count_crime_tab[,1], count_crime_tab[,2])
colnames(crime_tab)[1] <- "noCamCrimes"
colnames(crime_tab)[2] <- "camCrimes"
crime_tab$names <- rownames(crime_tab)
crime_tab$campct <- crime_tab$camCrimes/sum(crime_tab$camCrimes)*100
crime_tab$nocampct <- crime_tab$noCamCrimes/sum(crime_tab$noCamCrimes)*100
crime_tab$pctchange <- crime_tab$campct - crime_tab$nocampct
#display the change in crime frequency with crime name in descending order, with the most increased (caught) crimes first
crime_tab <- crime_tab[with(crime_tab, order(-pctchange)), ]
options(scipen=999)
subset(crime_tab, select=c("pctchange"))
# added by HCB to restore original arrest table
arrest_tab = arrest_tab_original
```
What did you observe? The results were interesting. We see a large increase in charges of narcotics, which may be due to camera surveillance. We also see a decrease in assault, which may be due to the perpetrators of such crimes realizing the dangers of committing such crimes in front of a camera. However, the vast majority of crimes do not even see a 1% change between the two situations, so it would appear as though, overall, cameras do not have a major affect on criminal activity.