---
title: "Resources for CZI Proposals for NumFOCUS projects"
author: "Breck Baldwin"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, include = TRUE)
```
# Point of all this
I have spent a ton of time doing metrics for open source scientific software as part of my job at Columbia. Stan has been the focus since Columbia contributes to much of its ecosystem, but I also track other software as a counterpoint for grant proposals and progress reports. I'd like to see these metrics used more broadly if it helps people write better proposals. I am currently working on a proposal for the Chan Zuckerberg Initiative (CZI) EOSS-4 call ([https://chanzuckerberg.com/wp-content/uploads/2021/03/EOSS-4-Combined-RFA-Packet.pdf](https://chanzuckerberg.com/wp-content/uploads/2021/03/EOSS-4-Combined-RFA-Packet.pdf)) on behalf of the Stan organization, and it looks like many other NumFOCUS projects are applying as well. This document is for them.
## Some context
In January of 2021 NASA had an open source infrastructure call for proposals, and the Stan org joined up with PyMC and ArviZ since we all play in the Bayesian space--why not do it together? I didn't want to get the funds at the expense of PyMC and ArviZ since we are all on the same team. Towards the end of the NASA effort it became clear that around 10 proposals were coming out of NumFOCUS--isn't AstroPy on my team too? It made me uncomfortable.
The NASA proposal was a solid month's work, and I had the privilege of being able to hit it pretty much full time thanks to the generosity of my boss, Andrew Gelman. I was using Columbia resources to support the Stan org at NumFOCUS, and Columbia only indirectly benefits from Stan org funding. I had a think about our competitive advantage due to my having the cycles to focus exclusively on getting a good pitch, and then I had a thunk: resources beget more resources at the expense of those without resources, and here I am competing with organizations that I have no interest in 'defeating' in the funding game.
In response to my 'thunk' I decided to create this repo with my metrics to hopefully help other NumFOCUS projects applying to CZI. Sorry, they are in R--that is the language I am trying to learn--but it is all pretty simple stuff; the hassle is sorting out queries, API access, etc., which takes more time than I am willing to admit. The code also represents what 'worked'; many things were tried--looking at you, Google Scholar, with your lack of an API or any way to retrieve results without getting locked out. I also tried many high-level access packages that inevitably lacked some feature or information I wanted, so all the code is pretty low-level `GET` interactions, but that level of control is worth it and has become my starting place.
So in short, I am not accepting the zero-sum basis of scientific funding and I don't want to compete with my NumFOCUS neighbors. Maybe CZI will allocate more funds because of all the compelling proposals and excellent metrics justifying the projects. How about a threshold system of funding, meaning that all proposals above an evaluation threshold are funded--or, if funds are limited, the above-threshold awards are selected randomly? Totally ordering the relative merit of proposals or 'going by gut' does not appeal. How about allocating 50% of awards as I suggest and seeing how well they do over time vs the more standard process?
## Navigation and communication
This file is `index.Rmd` at [https://github.com/breckbaldwin/CZI_NumFOCUS_materials](https://github.com/breckbaldwin/CZI_NumFOCUS_materials) and is rendered as `index.html` with the github pages address being [https://breckbaldwin.github.io/CZI_NumFOCUS_materials/](https://breckbaldwin.github.io/CZI_NumFOCUS_materials/).
I am trying out GitHub's discussions feature at [https://github.com/breckbaldwin/CZI_NumFOCUS_materials/discussions](https://github.com/breckbaldwin/CZI_NumFOCUS_materials/discussions), so I assume that is the place for questions/comments.
For now I can be reached at [email protected] or on NumFOCUS slack, @breck.
# The scripts
Stan and Bayesian software as a whole are growing rapidly, so most of the scripts focus on growth over time as a proxy for relevance to science. Your package may have different qualities. There are probably inertial effects at play that bake in growth: software gets downloaded more as automated systems like continuous integration grow in popularity, and those systems apparently drive downloads on their own. For CRAN downloads I add regression lines against baseline packages like ggplot2 to show relative growth and account for this.
Research citations have inertial effects too, and scopus.com undercounts considerably relative to what scholar.google.com reports.
**WARNING** This is a quick effort, not a well designed open source project. Documentation is minimal and mistakes have probably been made, but I have tried to keep variables understandable and comments helpful. I welcome bug fixes and/or extensions and I hope it helps your proposal.
Onto the scripts:
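One general note: several of the chunks below cache API responses in a local Redis server via the `redux` package so that re-knitting does not re-query the APIs; set `USE_CACHE = FALSE` in each chunk if you don't want to set one up. A minimal sketch, assuming a default Redis on localhost, for checking availability before enabling the cache:
```{r eval = FALSE}
library(redux)
# redis_available() returns TRUE only if a Redis server answers on the
# default host/port; fall back to uncached GETs otherwise.
USE_CACHE <- redux::redis_available()
if (!USE_CACHE) {
  message("No local Redis server found; API results will not be cached.")
}
```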
## Subject distribution for Scopus.com
Scopus.com is Elsevier's academic search engine over the research literature, covering both what Elsevier publishes and other sources. It offers a decent API for searching, with subject classification by category.
This code uses a subscription to scopus.com, which I have through my Columbia University affiliation. Information about getting credentials is at [https://dev.elsevier.com/sc_apis.html](https://dev.elsevier.com/sc_apis.html). There appears to be free access available, but I don't know whether the free tier gives you all the features I am using below. Getting credentials requires an interaction with a human, I believe--just submit a request and, as I recall, you hear back within a day.
I will be losing these credentials soon but am happy to run queries for other projects for the remaining month. I have been giving projects a wide and a long version of all the data broken down by Scopus category--look at the `write.csv` calls below.
Email [email protected] with likely search strings that might be mentioned in the citations of a research publication. Hopefully your project has a unique name that functions as a rigid designator, i.e., use of the name is unique to your project--something 'Stan' entirely fails at; see [https://statmodeling.stat.columbia.edu/2019/04/29/we-shouldntve-called-it-stan-i-shouldve-listened-to-bob-and-hadley/](https://statmodeling.stat.columbia.edu/2019/04/29/we-shouldntve-called-it-stan-i-shouldve-listened-to-bob-and-hadley/).
A good rigid designator is 'matplotlib' because it is unlikely to refer to anything else. Note that articles that mention 'matplotlib' in passing are counted as much as those that are all about the library; no distinction between the body and references section is made.
After having done a few of these for others, I find that having 'py' someplace in the package name really helps with search because it makes the name unique.
The subject categories are filtered to those relevant to biomedicine, which is CZI specific. The complete list is at [https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/](https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/); see the `categories()` function below for possible values.
The query being run below is 'jupyter'--see how that 'py' in the middle makes it a good rigid designator.
```{r}
rm(list = ls())
library(tidyr)
library(ggplot2)
library(dplyr)
library(lubridate)
library(httr)
library(jsonlite)
library(stringr)
library(ggrepel)
credentials <- read_json('Scopus_credentials.json')
API_KEY <- credentials$API_KEY
INSTITUTION_TOKEN <- credentials$INSTITUTION_TOKEN
# Format of Scopus_credentials.json
# {
# "API_KEY":"XXXXXXXXXXXXXXXXXXXXXXXXXxx",
# "INSTITUTION_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXX"
# }
BASE_URL = 'https://api.elsevier.com/content/search/scopus'
USE_CACHE = TRUE #will have to setup a redis server
REPORT_PROGRESS = FALSE
if (USE_CACHE) {
library(redux)
redis <- redux::hiredis()
get_results <- function(url) {
cache <- redis$GET(url) # check redis first
if (!is.null(cache)) {
result <- unserialize(cache)
if (result$status_code == 200) {
if (REPORT_PROGRESS) {
cat(paste("\nhitting redis for", url))
}
return(result)
}
}
random_wait <- abs(rnorm(1, 1, 1))
if (REPORT_PROGRESS) {
cat(paste("\ncache miss, querying:", url, "\n"))
cat(paste(
"\nWaiting",
random_wait,
"seconds to be nice to webserver\n"
))
}
Sys.sleep(random_wait)
result <- GET(
url,
add_headers('X-ELS-APIKey' = API_KEY,
'X-ELS-Insttoken' = INSTITUTION_TOKEN)
)
redis$SET(url, serialize(result, NULL))
return(result)
}
}
year_start <- 2012
year_end <- 2020 # want complete years or graph looks odd
stan_eco_q <-
'(brms+AND+burkner)+OR+(gelman+AND+hoffman+AND+stan)+OR+mc-stan.org+OR+rstanarm+OR+pystan+OR+(rstan+AND+NOT+mit)'
pymc_arviz_stan_eco_q <-
paste('pymc*', 'arviz', stan_eco_q, sep = '+OR+')
matplotlib_q <- 'matplotlib'
jupyter_q <- 'jupyter'
query <- jupyter_q # the 'jupyter' example discussed in the text
years <- year_start:year_end
scopus.df <- data.frame(years)
package = query
total_count <- 0
for (year in year_start:year_end) {
url <-
paste(
BASE_URL,
'?query=',
package,
"+AND+PUBYEAR+=+",
year,
'&facets=subjarea(count=101)',
sep = ''
)
if (USE_CACHE) {
result <- get_results(url)
}
else {
if (REPORT_PROGRESS) {
      cat(paste("hitting scopus with:", url, "\n"))
}
result <- GET(
url,
add_headers('X-ELS-APIKey' = API_KEY,
'X-ELS-Insttoken' = INSTITUTION_TOKEN)
)
}
if (result$status_code != 200) {
print(sprintf("got non 200 status from query: %d", result$status_code))
stop()
}
  json_txt <- rawToChar(result$content) # content is already raw bytes
data <- jsonlite::fromJSON(json_txt)
total_count <-
as.numeric(data$`search-results`$`opensearch:totalResults`) + total_count
facet_count <- length(data$`search-results`$facet$category$name)
  for (j in seq_len(facet_count)) { # 'while (j < facet_count)' skipped the last facet
    name <- str_replace(data$`search-results`$facet$category$label[j],
                        " \\(all\\)", "")
    hitCount <-
      as.numeric(data$`search-results`$facet$category$hitCount[j])
    if (!name %in% colnames(scopus.df)) { # first sighting: add a zeroed column
      scopus.df[name] <- rep(0, year_end - year_start + 1)
    }
    scopus.df[name][scopus.df$years == year, ] <- hitCount
  }
}
column_names <- colnames(scopus.df)
column_sums <- colSums(scopus.df)
df_long <- gather(scopus.df,
key = 'topic',
value = 'yr_count',
column_names[2]:column_names[length(column_names)])
write.csv(
  scopus.df,
  file = paste("scopus_data/", query, ".csv", sep = ""),
  row.names = FALSE
) # fails if the query contains characters that are illegal in file names
write.csv(
  df_long,
  file = paste("scopus_data/", query, "_long.csv", sep = ""),
  row.names = FALSE
) # fails if the query contains characters that are illegal in file names
```
Two files named after the query are created with the results, `scopus_data/jupyter.csv` and `scopus_data/jupyter_long.csv`. Complex queries containing characters that are illegal in file names (e.g. `*`, `(` or `/`) will break the `write.csv` calls.
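A minimal sketch of a workaround--`safe_filename` is a hypothetical helper, not part of the chunk above--that strips illegal characters before naming the files:
```{r eval = FALSE}
# Hypothetical helper: collapse anything outside [A-Za-z0-9_-] to '_'
# so even a complex boolean query yields a legal file name.
safe_filename <- function(query) {
  gsub("[^A-Za-z0-9_-]+", "_", query)
}
safe_filename("pymc*+OR+arviz+OR+(rstan+AND+NOT+mit)")
# [1] "pymc_OR_arviz_OR_rstan_AND_NOT_mit_"
```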
**Note that the same article can be counted in more than one subject category**
Processing continues below where I filter for biomedicine categories:
```{r}
# continues with values from the previous chunk
# got the raw csv data, now let's graph it
# add total count to data frame for each category
df_long$total <- rep(0, nrow(df_long))
for (t in column_names[2:length(column_names)]) {
df_long[df_long$topic == t, ]$total <- column_sums[[t]]
}
# assign label to last year for display '<topic> <total>'
# can use to scatter the labels to points other than max(years)
df_long_label <- df_long %>%
mutate(label = if_else(years == max(years),
paste(as.character(topic), total), NA_character_))
# category mapping to description at
# https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/
categories <- function() {
url <- 'https://api.elsevier.com/content/subject/scopus'
result <- GET(url,
add_headers('X-ELS-APIKey' = API_KEY,
'X-ELS-Insttoken' = INSTITUTION_TOKEN))
  json_txt <- rawToChar(result$content) # content is already raw bytes
data <- jsonlite::fromJSON(json_txt)
catsDf = data$`subject-classifications`$`subject-classification`
return(catsDf)
}
# pulled from categories returned by below, uncomment to run
# unique(categories()$description)
# [1] "Multidisciplinary" "Agricultural and Biological Sciences"
# [3] "Arts and Humanities" "Biochemistry, Genetics and Molecular Biology"
# [5] "Business, Management and Accounting" "Chemical Engineering"
# [7] "Chemistry" "Computer Science"
# [9] "Decision Sciences" "Earth and Planetary Sciences"
# [11] "Economics, Econometrics and Finance" "Energy"
# [13] "Engineering" "Environmental Science"
# [15] "Immunology and Microbiology" "Materials Science"
# [17] "Mathematics" "Medicine"
# [19] "Neuroscience" "Nursing"
# [21] "Pharmacology, Toxicology and Pharmaceutics" "Physics and Astronomy"
# [23] "Psychology" "Social Sciences"
# [25] "Veterinary" "Dentistry"
# [27] "Health Professions"
medicine_categories = paste(
"Health Professions",
"Pharmacology, Toxicology and Pharmaceutics",
"Psychology",
"Biochemistry, Genetics and Molecular Biology",
"Immunology and Microbiology",
"Nursing",
"Medicine",
"Neuroscience",
"Veterinary",
"Agricultural and Biological Sciences",
sep = "|"
)
# filter for medicine categories
df_long_label_filtered <-
df_long_label[str_detect(df_long_label$topic, medicine_categories), ]
#plot df_long_label to see all categories
plot2 <- ggplot(data = df_long_label_filtered, aes(
x = years,
y = yr_count,
group = topic,
color = topic
)) +
geom_line() +
geom_point() +
geom_label_repel(aes(label = label),
max.overlaps = 17, # adjust to allow for all labels
na.rm = TRUE) +
scale_color_discrete(guide = FALSE) + #removes guide on right
labs(y = "PyMC, Stan, ArviZ articles by topic",
caption = "Fig 1: Annual subject counts with totals in Scopus.com") +
theme(plot.caption=element_text(size=12, hjust=0, margin=margin(15,0,0,0)))
print(plot2)
```
Annual count of Scopus subject categories for the query "jupyter" with total counts across biomedicine subject categories. **Note that the same article can be counted in more than one subject category**.
## PyPi downloads
The download data for [PyPi](https://pypi.org/) is stored in Google BigQuery's public dataset `bigquery-public-data.pypi.file_downloads` and can be queried from the [BigQuery console](https://console.cloud.google.com/bigquery?project=indigo-epigram-312023). The query for 'keras' is below:
```
#standardSQL
SELECT
COUNT(*) AS num_downloads,
DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
file.project = 'keras'
-- Only query the last x months of history
AND DATE(timestamp)
BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 120 MONTH), MONTH)
AND CURRENT_DATE()
GROUP BY `month`
ORDER BY `month` DESC
```
You will have to set up a Google Cloud account; as of now you get $300 in credit, so this will be free, but they want a credit card anyway. I can also run queries for projects--it takes about 5 minutes.
Place the resulting .csv file in the `data/PyPi` folder; I have accumulated some current examples below and they are in the repo. Name yours with the package's display name right after 'data/PyPi/' and before the first '-', as shown below.
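To see the naming convention concretely, here is the extraction the chunk below performs on one of the example file names:
```{r eval = FALSE}
library(stringr)
# The display name is whatever sits between 'data/PyPi/' and the first '-'.
str_match("data/PyPi/ArviZ-results-20210502-112557.csv",
          "data/PyPi/([^-]+)-results.*")[, 2]
# [1] "ArviZ"
```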
```{r}
rm(list = ls())
library(ggplot2)
library(ggrepel)
library(tidyverse)
library(stringr)
# this list controls what is displayed
packagePyPiData = c('data/PyPi/ArviZ-results-20210502-112557.csv',
# 'data/PyPi/Keras-results-20210502-132857.csv',
'data/PyPi/PyMC3-results-20210502-112819.csv',
# 'data/PyPi/PyStan-results-20210502-132744.csv',
'data/PyPi/PyTorch-results-20210502-130326.csv',
'data/PyPi/TensorFlow-results-20210502-131806.csv'
# 'data/PyPi/NumPy-results-20210503-200308.csv',
# 'data/PyPi/ggplot-results-20210503-201446.csv'
)
# note format for display name extraction: "data/PyPi<display name>-resul...csv"
packagesPyPi = str_match(packagePyPiData, "data/PyPi/([^-]+)-results.*")[,2]
pkgPyPiDf = data.frame()
longest = 0
for ( i in 1:length(packagePyPiData)) { # iterate from one list
df = read.csv(packagePyPiData[i])
if (longest < nrow(df)) {
longest = nrow(df)
pkgPyPiDf = data.frame(month = as.Date(df$month))
}
}
for (i in 1:length(packagesPyPi)) { # iterate from co-indexed list
df = read.csv(packagePyPiData[i])
pkgPyPiDf[[packagesPyPi[i]]] = c(df$num_downloads, rep(NA, longest - nrow(df)))
}
pkgLongDf = gather(pkgPyPiDf, key = "package", value = "downloads", packagesPyPi)
label_month = as.Date("2018-08-01")
pkgLongDf = pkgLongDf %>% mutate(label = if_else(month == label_month,
package,
NA_character_))
pyPiPlot = ggplot(data = pkgLongDf, aes(x = month, y = downloads,
color = package, group = package)) +
geom_line(na.rm = TRUE) +
scale_x_date(limits = as.Date(c(min(pkgLongDf$month), "2021-04-01")),
breaks = seq.Date(from = as.Date("2016-01-01"),
to = as.Date("2021-01-01"),
by = "1 year")) +
scale_color_discrete(guide = FALSE) +
scale_y_continuous(breaks=c(0, 100, 1000, 10000, 100000, 1e+06, 1e+07, 1e+08),
trans = scales::log_trans()) +
  geom_label_repel(aes(label = label), na.rm = TRUE) + # map label inside aes()
labs(y = "Package downloads log scale",
caption = "Fig 3: PyPi.org monthly downloads of packages") +
theme(plot.caption=element_text(size=12, hjust=0, margin=margin(15,0,0,0)))
print(pyPiPlot)
```
Monthly counts of PyPi downloads.
## Pull request aging via the GitHub API
CZI asks for PR aging information. The code below hits the GitHub API and does some counting, given a public repo path. It is very easy to get a personal access token--see [https://github.com/settings/tokens](https://github.com/settings/tokens)--and without one you will hit the API's rate limit after just a few pages.
Format of 'github_credentials.json'
```
{
"PERSONAL_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXXxx",
"USER":"XXXXXXXXXXXXXXXXXXXXXX"
}
```
Make that file a sibling of the R script or this page and you should be good.
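Before starting a long paging run it may be worth confirming the token is being picked up; a minimal sketch against GitHub's rate limit endpoint (authenticated requests get a core limit of 5000/hour versus 60 anonymous):
```{r eval = FALSE}
library(httr)
library(jsonlite)
credentials <- read_json('github_credentials.json')
result <- GET("https://api.github.com/rate_limit",
              config = authenticate(user = credentials$USER,
                                    password = credentials$PERSONAL_TOKEN))
# 'limit' should read 5000 if the token was accepted, 60 if not.
content(result)$resources$core
```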
```{r eval = TRUE, echo = TRUE}
rm(list = ls()) #cleanup
library(httr) # web access
library(jsonlite) #json processing
library(stringr) #regex
library(lubridate) #date
USE_CACHE = TRUE #will have to setup a redis server
REPORT_PROGRESS = FALSE
credentials <- read_json('github_credentials.json')
PERSONAL_TOKEN <- credentials$PERSONAL_TOKEN
USER <- credentials$USER
# Format of github_credentials.json
# {
# "PERSONAL_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXXxx",
# "USER":"XXXXXXXXXXXXXXXXXXXXXX"
# }
# get your token at:
# https://github.com/settings/tokens
# you want to check at least the 'public_repo' scope.
if (USE_CACHE) {
library(redux)
redis <- redux::hiredis()
get_results <- function(url) {
cache <- redis$GET(url) # check redis first
if (!is.null(cache)) {
result <- unserialize(cache)
if (result$status_code == 200) {
if (REPORT_PROGRESS) {
cat(paste("\nhitting redis for", url))
}
return(result)
}
}
random_wait <- abs(rnorm(1, 1, 1))
if (REPORT_PROGRESS) {
cat(paste("\ncache miss, querying:", url, "\n"))
cat(paste(
"\nWaiting",
random_wait,
"seconds to be nice to webserver\n"
))
}
Sys.sleep(random_wait)
result <- GET(url, config = authenticate(user = USER,
password = PERSONAL_TOKEN))
redis$SET(url, serialize(result, NULL))
return(result)
}
}
# repo paths below: look them up at github.com,
# e.g. https://github.com/stan-dev/stanc3 -> 'stan-dev/stanc3'
# packages = c('stan-dev/cmdstan', 'stan-dev/stan', 'stan-dev/rstanarm',
#              'stan-dev/cmdstanpy')
packages = c('stan-dev/rstan', 'stan-dev/cmdstanr', 'stan-dev/math',
             'stan-dev/cmdstanpy')
packageDataDf = data.frame()
for (i in 1:length(packages)) {
page = 1L
while(TRUE) {
url <- paste('https://api.github.com/repos/', packages[i],
'/pulls?state=all&page=', as.character(page), sep='')
if (USE_CACHE) {
result <- get_results(url)
}
else {
result <- GET(url, config = authenticate(user = USER,
password = PERSONAL_TOKEN))
}
if (result$status_code == 200) {
      jsonTxt <- rawToChar(result$content) # content is already raw bytes
newDataDf <- jsonlite::fromJSON(jsonTxt)
n <- nrow(newDataDf)
newDataLongDf <- data.frame(package = rep(packages[i], n),
created = as.Date(newDataDf$created_at),
closed = as.Date(newDataDf$closed_at))
packageDataDf <- rbind(packageDataDf, newDataLongDf)
      if (is.null(result$headers$link) ||
          !str_detect(result$headers$link, 'next')) { # no 'next' link: last page
        break
      }
page <- page + 1
# print(sprintf("doing page %d", page))
}
else {
print(paste("Error", result))
stop()
}
}
}
packageDataDf$age <- packageDataDf$closed - packageDataDf$created
for (i in 1:length(packages)) {
print(sprintf("package %s has %d closed pull requests, mean age to closure of %.0f days",
packages[i],
nrow(packageDataDf[packageDataDf$package == packages[i] &
!is.na(packageDataDf$closed),]),
mean(packageDataDf[packageDataDf$package == packages[i] &
!is.na(packageDataDf$closed),]$age)))
print(sprintf("package %s has %d open pull requests, mean age of %.0f days from %s",
packages[i],
nrow(packageDataDf[packageDataDf$package == packages[i] &
is.na(packageDataDf$closed),]),
mean(today(tzone = "UTC") -
packageDataDf[packageDataDf$package == packages[i] &
is.na(packageDataDf$closed),]$created),
today(tzone = "UTC")))
}
print(sprintf("Across all packages mean aging to closure is %.0f days",
mean(packageDataDf$age, na.rm = TRUE)))
print(sprintf("All packages mean open PR length is %.0f days from %s",
mean(today(tzone = "UTC") -
packageDataDf[is.na(packageDataDf$closed),]$created),
today(tzone = "UTC")))
```
Below is the super ugly code I used to get the aging graph going; I am just saving it for later.
```{r}
rm(list = ls())
library(httr) # web access
library(jsonlite) #json processing
library(stringr) #regex
library(lubridate) #date
USE_CACHE = TRUE #will have to setup a redis server
REPORT_PROGRESS = TRUE
credentials <- read_json('github_credentials.json')
PERSONAL_TOKEN <- credentials$PERSONAL_TOKEN
USER <- credentials$USER
# Format of github_credentials.json
# {
# "PERSONAL_TOKEN":"XXXXXXXXXXXXXXXXXXXXXXXXxx",
# "USER":"XXXXXXXXXXXXXXXXXXXXXX"
# }
# get your token at:
# https://github.com/settings/tokens
# you want to check at least the 'public_repo' scope.
if (USE_CACHE) {
library(redux)
redis <- redux::hiredis()
get_results <- function(url) {
cache <- redis$GET(url) # check redis first
if (!is.null(cache)) {
result <- unserialize(cache)
if (result$status_code == 200) {
if (REPORT_PROGRESS) {
cat(paste("\nhitting redis for", url))
}
return(result)
}
}
random_wait <- abs(rnorm(1, 1, 1))
if (REPORT_PROGRESS) {
cat(paste("\ncache miss, querying:", url, "\n"))
cat(paste(
"\nWaiting",
random_wait,
"seconds to be nice to webserver\n"
))
}
Sys.sleep(random_wait)
result <- GET(url, config = authenticate(user = USER,
password = PERSONAL_TOKEN))
redis$SET(url, serialize(result, NULL))
return(result)
}
}
# repo paths below: look them up at github.com,
# e.g. https://github.com/stan-dev/stanc3 -> 'stan-dev/stanc3'
# packages = c('stan-dev/cmdstan', 'stan-dev/stan', 'stan-dev/rstanarm',
#              'stan-dev/cmdstanpy')
packages = c('stan-dev/rstan', 'stan-dev/math',
             'stan-dev/stan', 'stan-dev/stanc3', 'stan-dev/pystan',
             'stan-dev/rstanarm', 'arviz-devs/arviz', 'pymc-devs/pymc3')
# packages = c('pymc-devs/pymc3', 'stan-dev/rstanarm', 'stan-dev/rstan',
#              'arviz-devs/arviz')
# packages = c('stan-dev/rstanarm')
packageDataDf = data.frame()
for (i in 1:length(packages)) {
page = 1L
while(TRUE) {
url <- paste('https://api.github.com/repos/', packages[i],
'/pulls?state=all&page=', as.character(page), sep='')
if (USE_CACHE) {
result <- get_results(url)
}
else {
result <- GET(url, config = authenticate(user = USER,
password = PERSONAL_TOKEN))
}
if (result$status_code == 200) {
      jsonTxt <- rawToChar(result$content) # content is already raw bytes
newDataDf <- jsonlite::fromJSON(jsonTxt)
n <- nrow(newDataDf)
newDataLongDf <- data.frame(package = rep(packages[i], n),
created = as.Date(newDataDf$created_at),
closed = as.Date(newDataDf$closed_at))
packageDataDf <- rbind(packageDataDf, newDataLongDf)
      if (is.null(result$headers$link) ||
          !str_detect(result$headers$link, 'next')) { # no 'next' link: last page
        break
      }
page <- page + 1
# print(sprintf("doing page %d", page))
}
else {
print(paste("Error", result))
stop()
}
}
}
packageDataDf$age <- packageDataDf$closed - packageDataDf$created
packageDataDf$org = str_extract(packageDataDf$package, "([^/]+)") #get org level
library(tidyverse)
library(ggrepel)
packageDataDf$floor_date_created = floor_date(packageDataDf$created, 'halfyear')
packageDataDf$count = 1
packageDataDf = packageDataDf %>% mutate(orgPrStat = if_else(is.na(closed),
paste(org,"open",
sep = '_'),
paste(org,"closed",
sep = '_')))
yearlyPackageDf = packageDataDf %>%
group_by(floor_date_created, orgPrStat) %>%
summarize(PR_count = sum(count))
orgLabels = c()
orgs = unique(yearlyPackageDf$orgPrStat)
for (i in 1:length(orgs)) {
orgVal = orgs[i]
if (endsWith(orgVal, "closed")) {
meanVal = mean(packageDataDf[packageDataDf$orgPrStat == orgVal,]$closed -
packageDataDf[packageDataDf$orgPrStat == orgVal,]$created)
orgLabels[i] = sprintf("%s: mean days to closure = %.0f", orgVal, meanVal)
}
else {
meanVal = mean(today(tzone = "UTC") -
packageDataDf[packageDataDf$orgPrStat == orgVal,]$created)
orgLabels[i] = sprintf("%s: mean days open = %.0f", orgVal, meanVal)
}
}
agingPlot = ggplot(data = yearlyPackageDf, aes(
x = floor_date_created,
y = PR_count,
group = orgPrStat,
color = orgPrStat
)) +
scale_x_date(limits = as.Date(c(min(yearlyPackageDf$floor_date_created),
"2021-01-01"))) +
geom_line() +
scale_color_discrete(name = "Organization PR aging",
breaks = orgs,
labels = orgLabels) +
labs(x = "Semi-annual PR creation counts",
y = "Pull request count",
caption = "Fig 5: Pull Request (PR) aging, closed and open"
) +
theme(plot.caption=element_text(size=12, hjust=0, margin=margin(15,0,0,0)))
print(agingPlot)
```
## CRAN downloads
Plot downloads from RStudio's CRAN mirror for the specified packages. Baseline packages are included to chart relative growth.
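If you just want a number before building the full plot, `cranlogs` can answer directly; a quick sanity-check sketch:
```{r eval = FALSE}
library(cranlogs)
# Total downloads of one package over the last month from the
# RStudio CRAN mirror.
sum(cran_downloads(packages = "rstan", when = "last-month")$count)
```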
```{r}
library(cranlogs)
library(ggplot2)
library(dplyr)
library(lubridate)
library(httr)
library(jsonlite)
library(stringr)
library(R.cache)
library(ggrepel)
#https://www.ubuntupit.com/best-r-machine-learning-packages/
# packages <- c('rstan','lme4','Rcpp','randomForest','coda','glmnet','caret',
#               'mlr3','e1071','Rpart','KernLab','mlr','arules','mboost')
packages <- c('ggplot2','lme4','rstan','rstanarm','brms') # baselines + Stan ecosystem
# packages <- c('rstan','rstanarm','brms')
dls <- cran_downloads(
packages = packages,
from ="2016-01-01",
to = "2021-04-30"
)
# map to month from day data
mls <- dls %>% mutate(month = floor_date(date, "month")) %>%
group_by(month,package) %>%
summarize(monthly_downloads=sum(count))
# mls data check, don't trust the above
mls_val = (mls %>% filter(package=='rstan') %>%
filter(month=='2018-02-01'))$monthly_downloads
dls_val = sum(cran_downloads(packages = c('rstan'),
from ="2018-02-01", to = "2018-02-28")$count)
if (mls_val != dls_val) {
stop(sprintf(paste("Problems with data, expect computed monthly total",
"mls_val=%d and more simply computed monthly total dls_val=%d to be equal"),
mls_val, dls_val))
}
label_month <- max(mls$month)
mls_label <- mls %>%
mutate(label=if_else(month == label_month,
str_replace(package,
'ggplot2',
'BASELINE ggplot2'),
NA_character_))
plot1 <- ggplot(data=mls_label,
aes(x=month, y=monthly_downloads, color=package,
group=package)) +
geom_line()
# alternative log-scale view via transformed aesthetics, unused below:
# b_plot1 <- ggplot(data=mls_label,
#                   aes(x=as.numeric(month), y=log(monthly_downloads),
#                       color=package, group=package)) +
#   geom_line()
log_plot1 <- plot1 + scale_y_continuous(breaks=c(0,100,1000,10000,100000,1000000),
trans = scales::log_trans())
log_plot1_display <- log_plot1 +
geom_smooth(method='lm',formula=y~x, fullrange=TRUE, se=FALSE) +
geom_label_repel(aes(label = label), na.rm = TRUE) +
scale_color_discrete(guide = FALSE)
log_plot1_2024_scale <- log_plot1 +
xlim(as.Date('2016-01-01'),as.Date('2024-06-30'))
log_plot1_2024_slopes_display <- log_plot1_2024_scale +
geom_smooth(method='lm',formula=y~x, fullrange=TRUE, se=FALSE) +
geom_label_repel(aes(label = label), na.rm = TRUE) +
scale_color_discrete(guide = FALSE)
print(log_plot1_2024_slopes_display)
```
Regression lines are shown to express relative growth rates for Stan ecosystem components compared to the ggplot2 and lme4 baselines; the x axis extends to 2024 so the fitted lines extrapolate forward.
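If you want those growth rates as numbers rather than eyeballing the lines, the slopes behind the `geom_smooth` fits can be pulled out per package; a minimal sketch, assuming `mls` from the chunk above is still in scope:
```{r eval = FALSE}
library(dplyr)
# Slope of log(monthly downloads) per day for each package, i.e. the
# exponential growth rate each regression line in the plot represents.
mls %>%
  group_by(package) %>%
  summarize(log_slope_per_day =
              coef(lm(log(monthly_downloads) ~ as.numeric(month)))[2])
```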
## Page rank analysis
Entirely taken from [https://blog.revolutionanalytics.com/2014/12/a-reproducible-r-example-finding-the-most-popular-packages-using-the-pagerank-algorithm.html](https://blog.revolutionanalytics.com/2014/12/a-reproducible-r-example-finding-the-most-popular-packages-using-the-pagerank-algorithm.html).
A version that worked over the dependencies in PyPi might be useful.
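One possible starting point is PyPi's per-package JSON metadata endpoint, which lists declared dependencies; a minimal sketch (a real page-rank version would still need to crawl the whole index, which this does not do):
```{r eval = FALSE}
library(httr)
library(jsonlite)
# https://pypi.org/pypi/<name>/json returns package metadata;
# info$requires_dist holds the declared dependencies.
pypi_deps <- function(package) {
  result <- GET(paste0("https://pypi.org/pypi/", package, "/json"))
  stopifnot(result$status_code == 200)
  fromJSON(rawToChar(result$content))$info$requires_dist
}
pypi_deps("arviz")
```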
The miniCRAN chunk below takes a long time to run, so it is not evaluated here and no output is shown.
```{r eval=FALSE}
library(miniCRAN)
library(igraph)
library(magrittr)
# taken entirely from: https://blog.revolutionanalytics.com/2014/12/a-reproducible-r-example-finding-the-most-popular-packages-using-the-pagerank-algorithm.html
# change the snapshot date below to a recent date (e.g. yesterday) before running
MRAN <- "http://mran.revolutionanalytics.com/snapshot/2021-05-06/"
pdb <- MRAN %>%
contrib.url(type = "source") %>%
available.packages(type="source", filters = NULL)
g <- pdb[, "Package"] %>%
makeDepGraph(availPkgs = pdb, suggests=FALSE, enhances=FALSE, includeBasePkgs = FALSE)
pr <- g %>%
page.rank(directed = FALSE) %>%
use_series("vector") %>%
sort(decreasing = TRUE) %>%
as.matrix %>%
set_colnames("page.rank")
set.seed(42)
pr %>%
head(100) %>%
rownames %>%
makeDepGraph(pdb) %>%
plot(main="Top packages by page rank", cex=0.5)
print(sprintf("Rstan is %dth highest page rank score for R package dependencies out of %d packages", which(row.names(pr) == 'rstan'), nrow(pr)))
```