---
title: 'Document Classification Case Study: Kiva Loans'
author: "Dr. Stephen W. Thomas, Queen's University"
---
```{r}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, fig.align='center')
```
```{r}
library(tidyverse)
library(scales)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(MLmetrics)
library(topicmodels)
library(tidytext)
library(knitr)
```
```{r, echo=FALSE, out.width = "150px"}
knitr::include_graphics("kivaloans.jpg")
```
# Introduction
[Kiva Microfunds](https://www.kiva.org/) is a non-profit that allows people to lend money to low-income entrepreneurs and students around the world. Started in 2005, Kiva has crowd-funded millions of loans with a repayment rate of 98% to 99%.
Kiva includes both traditional demographic information, such as gender and location, as well as personal stories on each borrower because Kiva wants lenders to connect with the borrowers on a human level. An example:
*Evelyn is 40 years old and married with 3 kids. She is in the Karura Hope women group and her life has been changed by the first KIVA loan she received last year which she is completing this quarter. Before she received the loan, she used to sell 9 litres of milk daily to local residents. After receiving the loan she bought iron sheets, five cement packets, one lorry of sand, some ballast and animal feed for her cows and improved her cow shed. Today she sells a daily average of 40 litres of milk to the Kiamba Dairy cooperative society, which is affiliated to the Kenya Cooperative Creameries at a cost of USD 0.28 per litre. Her daily farming has really grown. Evelyn intends to buy another dairy cow and a tank of water for home consumption and for her cows. She intends to repay in monthly installments.*
Despite her uplifting story, and her previous successful loan, Evelyn defaulted on her next loan of 900 USD.
In this case study, we will explore past Kiva loans and build a prediction model (in particular, a decision tree classifier) to predict which future borrowers will pay back loans, and which will default. A key question we will explore is: does adding text (i.e., the personal stories) to the prediction model increase or decrease the model's prediction power?
This case study will provide lots of data, tables, and graphs, but is intentionally light on commentary, analysis, and decision making. That's your job!
## Case Discussion Questions
At the end of this case study, we will have a group discussion around the following questions:
1. Does text data help in predicting which loan seeker will default?
2. Which words are most biased towards defaulting? Is this expected/intuitive?
3. According to the decision tree prediction models, what variables best predict a default?
4. As a decision maker, would you recommend the use of textual data in your prediction models?
5. As a lender, what other information would you like to have?
# Kiva Background
The key concepts in the Kiva world are:
- **Loan**. A loan is the most important concept at Kiva. Most other concepts are in some way related to a loan.
- **Borrower**. A borrower is someone who has requested a loan. Borrowers are often
referred to as *businesses* or *entrepreneurs* in order to emphasize the entrepreneurial spirit of
these individuals who work to make a difference in their lives.
- **Lender**. A lender is a user registered on the Kiva website for the purposes of lending money and
participating in the community. Some lenders have public profiles, known as lender pages, on
the Kiva website, where they can share details about their activities and mission. Most lenders,
however, refrain from displaying their public information and are referred to as “anonymous.”
- **Partner.** A partner, or Kiva field partner, is usually a microfinance institution with which Kiva works to
find and fund loans. Every loan at Kiva is offered by a partner to a borrower, and the partner
works with Kiva to get funding for that loan from lenders.
# Loading the Data
```{r}
df_full <- read_csv("kiva.csv")
```
```{r}
str(df_full)
df_full$id = 1:nrow(df_full)
df_full$status = as.factor(df_full$status)
df_full$sector = as.factor(df_full$sector)
df_full$country = as.factor(df_full$country)
df_full$gender = as.factor(df_full$gender)
df_full$nonpayment = as.factor(df_full$nonpayment)
```
Let's look at a sample of our data.
```{r}
head(df_full, n=20)
summary(df_full)
```
# Sample
Optional. To make some of the models and tools run faster below, let's start with a small sample. Once we're happy with the rest of the script, we can change the size of the sample and re-run everything below.
```{r}
set.seed(12345)
df = df_full %>%
sample_frac(.20)
```
# Data Cleaning
```{r}
# Remove HTML Tags
df = df %>%
mutate(en = gsub("<.*?>", "", en))
# Convert into tidytext format
str(df)
text_df <- df %>%
select(id, status, en) %>%
unnest_tokens(word, en)
## Remove stopwords
custom_stop_words = data.frame(word=c("loan", "business"))
text_df <- text_df %>%
anti_join(stop_words, by=c("word"="word")) %>%
anti_join(custom_stop_words, by=c("word"="word")) %>%
arrange(id)
# Stem words? (If so, stemming should be applied to the tokenized words in
# text_df, e.g. with SnowballC::wordStem(), rather than to the raw en text.)
#library(SnowballC)
#text_df = text_df %>%
#  mutate(word = wordStem(word))
```
# Feature Engineering
## Latent Dirichlet Allocation
Let's use a topic modeling technique called Latent Dirichlet Allocation (LDA) to extract the topics from each document.
First, we need to build a document term matrix (DTM).
```{r}
# Count each word in each document.
word_counts = text_df %>%
group_by(id, word) %>%
summarize(count = n())
# Create a document term matrix
dtm = word_counts %>% cast_dtm(id, word, count)
# Remove sparse terms from the document term matrix.
library(tm)
dtm2.nosparse <- removeSparseTerms(dtm, 0.9995)
rowTotals <- apply(dtm2.nosparse, 1, sum) # Find the sum of words in each Document
dtm.new <- dtm2.nosparse[rowTotals> 0, ]
```
Run the LDA model.
```{r}
num_topics = 12
# Run the model
library(topicmodels)
lda <- LDA(dtm.new, k = num_topics, control = list(seed = 1234))
# Name each topic, using the top words from each topic
t = terms(lda, k=4)
topic_names = apply(t, 2, function(x) paste(x, collapse = "_"))
lda_beta <- tidy(lda, matrix = "beta")
lda_gamma <- tidy(lda, matrix = "gamma")
lda_gamma$document = as.integer(lda_gamma$document)
tn = data.frame(id = 1:num_topics, topic_name = as.character(topic_names))
tn$topic_name = sprintf("%02d: %s", 1:num_topics, tn$topic_name)
```
Add the resulting document topic probabilities to the `df` dataframe. These are the features we'll use later
for the classification.
```{r}
lda_gamma_new = lda_gamma %>% spread(topic, gamma)
df_new = df %>% left_join(lda_gamma_new, by=c("id" = "document"))
library(data.table)
setnames(df_new, old=sprintf("%d", c(1:12)), new=sprintf("topic %d: %s", c(1:12), topic_names))
head(df_new)
```
## TODO: What other features to add?
Cluster memberships, top words/phrases, length, etc?
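As a sketch of one easy candidate, story length can be computed directly from the raw text. (Toy data below, purely for illustration; in this document the same idea would be `nchar(df$en)`, and the result is not wired into the models that follow.)

```{r}
library(dplyr)
# Sketch of one candidate feature: story length in characters.
# Toy data for illustration; in this document it would be nchar(df$en).
toy <- data.frame(id = 1:3,
                  en = c("short", "a longer story", "the longest story of all"))
toy %>% mutate(en_length = nchar(en))
```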
# Data Description
The data in this case study was collected from [http://build.kiva.org](http://build.kiva.org/), Kiva's website that provides snapshots of Kiva loan data. In the full dataset, about 98% of loans are paid and 2% defaulted. In this case study, we look at only a sample of the data, where the split between paid and defaulted is closer to 50%-50%.
Let's look at our sample to understand the size, shape, values, and patterns in the variables. The sample includes `r ncol(df)` variables, named: `r colnames(df)`. Each variable is explored in turn. The `en` variable is the text variable, i.e., the personal story of the loan seeker, and will be our main source of investigation. There are `r nrow(df)` records/rows/loans in our sample.
## Variable: status
The `status` variable indicates whether a loan was `paid` or `defaulted`. As previously described, the data has a fairly even split between these two options.
```{r fig.height=4}
qplot(status, data=df, geom="bar", fill=status, xlab="status") + theme(legend.position = "none")
```
## Variable: sector
The figure below shows the number of loans in each sector, coloured by the loan's `status`.
```{r fig.height=4}
qplot(sector, data=df, geom="bar", fill=status, xlab="sector")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
## Variable: country
Below is the same for the `country` variable.
```{r fig.height=2.5}
qplot(country, data=df, geom="bar", fill=status)
```
## Variable: gender
```{r fig.height=2.5}
qplot(gender, data=df, geom="bar", fill=status)
```
## Variable: nonpayment
The nonpayment variable captures who is liable if a loan defaults: the lender, or the partner.
```{r fig.height=4}
qplot(nonpayment, data=df, geom="bar", fill=status)
```
## Variable: loan_amount
Unlike the other variables thus far, the `loan_amount` variable is numeric. Below is a density plot, which shows the most popular loan amounts.
```{r fig.height=2.5}
df %>%
ggplot(aes(loan_amount)) +
geom_density(fill = "blue", alpha=0.6)
```
Below are separate density curves for `status=defaulted` and `status=paid`.
```{r fig.height=2.5}
df %>%
ggplot(aes(loan_amount, colour=status, fill=status)) +
geom_density(alpha=0.1)
```
Below is a filled density plot:
```{r fig.height=2.5}
df %>%
ggplot(aes(loan_amount, colour=status, fill=status)) +
geom_density(alpha=0.1, position="fill")
```
```{r}
df$loan_amount.cut = cut(df$loan_amount, breaks=c(0, 300, 600, 900, 1500))
addmargins(table(df$loan_amount.cut, df$status, dnn=c("loan_amount.cut", "status")))
```
## Variable: en
The `en` variable is raw English text, and there are many ways to examine it.
### Length
The figure below is a density plot of the length of each story (in characters).
```{r fig.height=2.5}
df %>%
mutate(en_length = nchar(en)) %>%
ggplot(aes(en_length, colour=status, fill=status)) +
geom_density(alpha=0.1) +
labs(x = "Number of characters in `en`")
```
### Top Words
The table below shows the top (i.e., most frequently occurring) words.
```{r}
text_df %>% group_by(word) %>%
summarize(count=n()) %>%
mutate(freq = count / sum(count)) %>%
arrange(desc(count)) %>%
top_n(20)
```
### Most Biased Words
The plots below show which words are most biased towards being `paid` or `defaulted`, using the log odds ratio metric.
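Concretely, for each word $w$ the metric computed below is the base-2 log ratio of the word's count in `paid` stories to its count in `defaulted` stories:

$$\mathrm{log\ ratio}(w) = \log_2 \frac{n_{\mathrm{paid}}(w)}{n_{\mathrm{defaulted}}(w)}$$

Positive values mean the word appears relatively more often in `paid` stories; negative values, in `defaulted` stories. (Words that never appear in one of the classes yield infinite ratios; the `total > 500` frequency filter below mitigates, but does not strictly prevent, this.)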
```{r}
status_words_count = text_df %>% group_by(status, word) %>%
summarize(count=n()) %>%
arrange(desc(count))
log_ratios = status_words_count %>%
spread(status, count) %>%
select(-`<NA>`) %>%
mutate(defaulted = ifelse(is.na(defaulted), 0, defaulted)) %>%
mutate(paid = ifelse(is.na(paid), 0, paid)) %>%
mutate(total=defaulted+paid) %>%
mutate(log_ratio = log2(paid/defaulted))
```
```{r, fig.height=4}
log_ratios %>%
filter(total > 500) %>%
group_by(log_ratio < 0) %>%
top_n(15, abs(log_ratio)) %>%
ungroup() %>%
mutate(word = reorder(word, log_ratio)) %>%
ggplot(aes(word, log_ratio, fill = log_ratio < 0)) +
geom_col() +
coord_flip() +
ylab("log odds ratio") +
scale_fill_discrete(name = "", labels = c("paid", "default"))
```
Below are tabular versions of the same data above, starting with the words that are biased towards `paid`:
```{r}
log_ratios %>%
filter(total > 500) %>%
arrange(desc(log_ratio)) %>%
top_n(17)
```
And those that are biased towards `default`:
```{r rows.print=20}
log_ratios %>%
filter(total > 500) %>%
arrange((log_ratio)) %>%
top_n(-20)
```
Let's use the TF-IDF metric to see which words are the most "important" (this exploratory chunk is marked `eval=FALSE` and is not run by default):
```{r, eval=FALSE}
book_words <- text_df %>%
select(-status) %>%
count(id, word, sort = TRUE) %>%
ungroup()
book_words %>% arrange(desc(n))
total_words <- book_words %>%
group_by(id) %>%
summarize(total = sum(n))
total_words %>% arrange(desc(total))
book_words <- left_join(book_words, total_words)
freq_by_rank <- book_words %>%
group_by(id) %>%
mutate(rank = row_number(),
`term frequency` = n/total)
book_words_tfidf <- book_words %>%
bind_tf_idf(word, id, n)
book_words_tfidf %>%
arrange(desc(tf_idf))
```
## LDA Topics
Next, we used a technique called Latent Dirichlet Allocation (LDA) to automatically extract high-level topics from the documents. We told LDA to extract `12` topics; LDA also tells us which topics appear in which documents. For example, a document might have 50% of its words come from topic 1, 25% from topic 5, and the remaining 25% from topic 12.
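This per-document mixture is what LDA's "gamma" matrix stores. A minimal self-contained sketch, using the `AssociatedPress` data bundled with `topicmodels` purely for illustration:

```{r}
library(topicmodels)
library(tidytext)
library(dplyr)
# Fit a tiny LDA on a few bundled documents and inspect the per-document
# topic proportions (gamma). Each document's gamma values sum to 1.
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:20, ], k = 3, control = list(seed = 1234))
toy_gamma <- tidy(toy_lda, matrix = "gamma")
toy_gamma %>%
  group_by(document) %>%
  summarize(total = sum(gamma))   # approximately 1 for every document
```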
### LDA Top Terms Per Topic
This figure shows the top terms (words) in each of the 12 discovered topics.
```{r,fig.width=10,fig.height=8}
ap_top_terms <- lda_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>%
left_join(tn, by=c("topic" = "id")) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic_name, scales = "free", ncol=3) +
coord_flip()
```
### Documents per LDA Topic
The figure below shows the number of documents that contain each topic (i.e., documents where the topic's proportion is at least 5%), coloured by `status`.
```{r}
topic_totals = lda_gamma %>%
left_join(df, by=c("document" = "id")) %>%
select(c(-en)) %>%
filter(gamma >= 0.05) %>%
group_by(topic, status) %>%
summarize(count=n()) %>%
spread(status, count) %>%
mutate(total = defaulted + paid) %>%
left_join(tn, by=c("topic" = "id")) %>%
select(topic, topic_name, everything())
```
```{r fig.height=5}
tmp_gathered = topic_totals %>%
select(topic, topic_name, defaulted, paid) %>%
gather(Status, Value, defaulted, paid)
ggplot(tmp_gathered, aes(x=topic_name, y=Value, fill=Status)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 65, hjust = 1))
```
# Building a Classifier Model
Now that we have explored the data, it's time to dive deeper. Which variable(s) are the biggest predictors of `status`? This is where classifier models shine: they can tell us how the variables relate to the outcome, and which matter most.
A decision tree is a popular classifier model in analytics. Here, the decision tree is automatically created by a machine learning algorithm as it learns simple decision rules from the data. These automatically-learned rules can then be used to both understand the variables and to predict future data. A big advantage of decision trees over other classifier models is that they are relatively simple for humans to understand and interpret.
A decision tree consists of nodes. Each node splits the data according to a rule, and a rule is based on a variable in the data. For example, a rule might be "loan amount greater than 1000." In this case, the node splits the data by the loan amount: borrowers who satisfy the rule (i.e., have amounts greater than 1000) follow the left path out of the node, and the rest follow the right path. In this way, paths from the root node down to leaf nodes are created, describing the fate of certain types of borrowers.
A decision tree path always starts with the root node (node number 1), which contains the most important splitting rule; each subsequent node contains the next most important rule. After the tree is automatically created by the machine learning algorithm, one can use it to classify an individual by simply following a path: start at the root node and apply each rule in turn until you reach a leaf.
When creating a decision tree from data, the analyst can limit the number of nodes the machine learning algorithm creates. More nodes usually lead to a more accurate model, at the cost of one that is more complicated and harder to interpret; fewer nodes usually lead to a less accurate model that is easier to understand and interpret.
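In `rpart`, this trade-off is controlled mainly through the complexity parameter `cp` (and limits such as `maxdepth`): a larger `cp` prunes more aggressively, giving a smaller tree. A minimal self-contained sketch on synthetic data (variable names are illustrative):

```{r}
library(rpart)
# Synthetic data: a binary outcome loosely related to a numeric predictor.
set.seed(1)
toy <- data.frame(amount = runif(500, 100, 1500))
toy$status <- factor(ifelse(toy$amount + rnorm(500, sd = 300) > 800,
                            "paid", "defaulted"))
# A larger cp prunes aggressively (fewer nodes);
# a smaller cp allows more splits (more nodes).
simple_tree  <- rpart(status ~ amount, data = toy, method = "class",
                      control = rpart.control(cp = 0.05))
complex_tree <- rpart(status ~ amount, data = toy, method = "class",
                      control = rpart.control(cp = 0.001))
nrow(simple_tree$frame)    # number of nodes in the simple tree
nrow(complex_tree$frame)   # at least as many nodes in the complex tree
```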
## A Prediction Model without the Text
First, as a baseline, we trained a decision tree classifier model without using any of the text or topics. Below is a graphical depiction of the model after it has been trained:
```{r fig.height=3}
set.seed(123)
# Exclude the text (en), id, and topic columns from prediction. We select the
# desired columns explicitly because the - sign doesn't work in rpart formulas.
df_notext = subset(df_new, select=c(status, sector, country, gender, loan_amount, nonpayment))
# Split the data into training and testing.
train_notext <- sample_frac(df_notext, 0.8)
test_notext <- setdiff(df_notext, train_notext)
# Let's train the model.
form = as.formula(status ~ .)
tree <- rpart(form, train_notext, method="class")
rpart.plot(tree, extra=2)
```
To measure the prediction performance, we used some never-before-seen data (called _testing data_). We gave the testing data to the classifier, asked it to make a prediction (i.e., whether the borrower will pay or not), and then compared it to the true answer.
The following table summarizes the predictions of the classifier.
```{r}
predicted = predict(tree, test_notext, type="class")
actual = test_notext$status
preds = data.frame((table(predicted, actual))) %>%
spread(actual, Freq) %>%
mutate(total = defaulted + paid) %>%
select(predicted, total, everything())
preds
```
Below is the accuracy and other metrics of the classifier on the testing data.
```{r}
print(sprintf("Accuracy: %.3f", Accuracy(y_true=actual, y_pred=predicted)))
print(sprintf("Precision: %.3f", Precision(y_true=actual, y_pred=predicted)))
print(sprintf("Recall: %.3f", Recall(y_true=actual, y_pred=predicted)))
print(sprintf("F1 Score: %.3f", F1_Score(y_true=actual, y_pred=predicted)))
print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual, y_pred=predicted)))
print(sprintf("Specificity: %.3f", Specificity(y_true=actual, y_pred=predicted)))
```
## A Prediction Model with the Text
We then built the same kind of decision tree classifier model as before, except now, we included the LDA topics, which were built from the text. (Note: there are many _other_ textual features we could include in this model: individual words, clusters, etc. However, we kept it simple for now.) Below is the result.
```{r fig.height=3}
set.seed(123)
# Exclude id and the raw text (en) from prediction, keeping the LDA topic
# features. The - sign doesn't work in rpart formulas, so we drop the
# columns here instead.
df_text = subset(df_new, select=c(-id, -en))
# Split the data into training and testing.
train_text <- sample_frac(df_text, 0.8)
test_text <- setdiff(df_text, train_text)
# Let's create the model.
form = as.formula(status ~ .)
tree <- rpart(form, train_text, method="class")
rpart.plot(tree, extra=2)
```
Below is a summary of its predictions:
```{r}
predicted = predict(tree, test_text, type="class")
actual = test_text$status
preds.text = data.frame((table(predicted, actual))) %>%
spread(actual, Freq) %>%
mutate(total = defaulted + paid) %>%
select(predicted, total, everything())
preds.text
```
Metrics:
```{r}
print(sprintf("Accuracy: %.3f", Accuracy(y_true=actual, y_pred=predicted)))
print(sprintf("Precision: %.3f", Precision(y_true=actual, y_pred=predicted)))
print(sprintf("Recall: %.3f", Recall(y_true=actual, y_pred=predicted)))
print(sprintf("F1 Score: %.3f", F1_Score(y_true=actual, y_pred=predicted)))
print(sprintf("Sensitivity: %.3f", Sensitivity(y_true=actual, y_pred=predicted)))
print(sprintf("Specificity: %.3f", Specificity(y_true=actual, y_pred=predicted)))
```
# Appendix: Further Reading
- [Kiva.org](https://www.kiva.org/). Kiva's homepage.
- [Build.Kiva](http://build.kiva.org/). Kiva data dumps and data description.