This repository has been archived by the owner on Jan 30, 2024. It is now read-only.
---
title: "Syllabus"
author:
  name: "Max Held"
  affiliation: "Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)"
date: "Summer Term 2020"
bibliography: library.bib
---
```{r setup, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(printr)
```
```{r readme, child="README.md"}
```
<div class="jumbotron" style="color:white; background: linear-gradient( rgba(0, 0, 0, 0.7), rgba(0, 0, 0, 0.7) ), url(img/keyboard-keys-2.jpg) no-repeat center center fixed; -webkit-background-size: cover; -moz-background-size: cover; -o-background-size: cover; background-size: cover;">
<h2>Software Carpentry: Hacking Skills for Data Science</h2>
<p>... because learning from hackers is learning to win?</p>
<p> <span class="label label-default">
#DataScience
</span>
<span class="label label-primary">
#rstats
</span>
<span class="label label-info">
Git(Hub)
</span>
<span class="label label-success">
#ReproducibleResearch
</span>
</p>
<p><small><sub>
Image Credit: Red Alt [CC BY 2.0](https://creativecommons.org/licenses/by/2.0/) [hjl](https://www.flickr.com/photos/hjl/8205547070/in/photolist-dv6zgu-nffY2e)
</sub></small></p>
</div>
---
<div class="embed-responsive embed-responsive-16by9">
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/dU1xS07N-FA?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</div>
> *[Coding – ] it’s the next best thing we have to a superpower.*
> -- [Drew Houston](@drewhouston) via [code.org](https://code.org)
> *So we were very worried that what if the astronaut, during mid-course, would select pre-launch, for example?*
> *Never would happen, they said.*
> *Never would happen.*
> *It happened.*
> -- [Margaret Hamilton](https://www.metaltoad.com/blog/history-computer-girls-part-2-margaret)
> *Computers ... a bicycle for the mind*
> -- [Steve Jobs](https://www.brainpickings.org/2011/12/21/steve-jobs-bicycle-for-the-mind-1990/)
> *To me programming is more than an important practical art.*
> *It is also a gigantic undertaking in the foundations of knowledge.*
> -- [Grace Murray Hopper](https://en.wikipedia.org/wiki/Grace_Hopper)
> *Think of free speech, not free beer.*
> -- [Richard Stallman](https://stallman.org/)
> *Open source isn't like free sunshine; it's like a free puppy.*
> -- [Sarah Novotny](https://sarahnovotny.com/)
> *Most learning is not the result of instruction.*
> *It is rather the result of unhampered participation in a meaningful setting.*
> -- Ivan @Illich-1971
## Prerequisites {.alert .alert-success}
*Everyone* is welcome to this seminar.
This is *not* a "proper" computer science class, and participants do *not* need any background in CS, statistics or math.
You should just be curious and ready to:
- learn to use specialised command-line software and open-source tools for collaboration,
- read and write technical documents in simple, readable English, and
- collaborate intensively using (perhaps unfamiliar) web-based tools.
No worries, we'll bring everyone up to speed in very little time.
You do *not* need to have completed a prior version of this class, or any other class.
If you *have* some prior training, you will start the class at a different level.
## Time and Place (Summer Term 2020) {.alert .alert-warning}
This will be an **all-remote**, largely **asynchronous** seminar, held via [Gitter chat](https://gitter.im/soztag/fossos), [GitHub](http://github.com/soztag/fossos) and occasional [Zoom](http://zoom.us) video conferences.
### Preparatory Meeting
**Thursday, April 30th, 2020 15:00-17:00** (in a video conference, see below).
### Asynchronous Collaboration
Throughout the semester, students can work through the material at their own pace and schedule.
Support from the instructor and fellow students is available on the [Gitter chat](https://gitter.im/soztag/fossos).
Depending on needs, short video conferences will be held for selected topics.
### Digital Venues
We're going to use a few digital tools to work asynchronously.
- Static information will be at https://datascience.phil.fau.de/fossos/, the **class website**.
You can find all the resources (~ readings) and software links on https://datascience.phil.fau.de/fossos/stack.html.
- Pretty much all *individual* activity (i.e. to be done by one or a few students) is tracked as issues on our class repository **issue tracker** at https://github.com/soztag/fossos/issues.
If you have a question, have an idea to work on, or are looking for inspiration for a task, this is your place.
Issues are organised using labels and assignees.
Milestones are currently not in use.
- A (currently relevant) subset of these issues is also listed on our **Kanban board** at https://github.com/soztag/fossos/projects/2.
This board gives you an overview of what everyone is busy with at any given point.
You can move your "own" issues around the board as appropriate, and you can also add issues that you want to see addressed.
- There is a [Gitter chat](https://gitter.im/soztag/fossos) that students can use throughout the semester to get support from the instructor and fellow students.
If you have your own repo for your own project (advanced students), you can open your own Gitter chat and invite the instructor or fellow students for support.
These venues are also linked from the top bar of the class website, so you can always easily find them.
## Language requirements
Depending on who will be attending the class, instruction may occur in English or German.
In any event, all of the readings and other course material are in English, and participants are expected to be proficient in reading and writing English technical documents.
## A Multi-Semester Series {.alert .alert-info}
It is obviously impossible (for most students) to cover all of the material in this course in *one* semester.
This course (with a slightly different name) will therefore be taught *every semester*, in a non-consecutive series.
Students can join the class every semester, and take the class for however many semesters they wish (if they still have new things to learn).
Do not be confused by the name this class takes in some semester (say, "Advanced R ...") -- you can still join as a beginner.
Depending on the listing (see below) students can also take this class for credit *multiple times*.
By implication, the group of students in the class in any *given* semester will be *heterogeneous*, working at different levels.
For example, some students may already have taken a course in the series previously, while others are just starting out.
Because the previous experiences and learning speed of students vary greatly anyway, this is not a significant (additional) hindrance.
Tasks, expectations and material covered will accordingly differ for each student, depending on the background.
## Credits and Listings
You can generally take this class as an undergraduate (Bachelor) lower-division seminar (**Proseminar**) worth 5 ECTS points, or an upper-division seminar (**Hauptseminar**) worth 7.5 ECTS points.
The workload will be adjusted accordingly.
Depending on your major, you may also take the class to fulfill requirements for a *Master's* program.
Please be in touch to discuss the details.
This class was/is listed as:
- 2018/2019 Winter Term: "Open Source Werkzeuge für die wissenschaftliche Datenverarbeitung" (the original *FOSSOS*), crosslisted in the following modules:
- Bachelor Sociology
- Sociological Methods (Module `SOZ M`, [Soziologische Methodenlehre](https://www.soziologie.phil.fau.de/institut/arbeitsbereiche/methoden-der-empirischen-sozialforschung/))
- Labor and Organisation (Module `Soz Qf4`, [Arbeit und Organisation](https://www.soziologie.phil.fau.de/institut/arbeitsbereiche/arbeit-und-organisation/))
- Bachelor Digital Humanities and Social Sciences ("BA Zweitfach")
- Elective (Wahlpflichtbereich FPO 2018)
- Elective (Wahlpflichtbereich FPO 2016)
- 2019 Summer Term: "Advanced R and Open Social Data Science"
- Bachelor Sociology
- Sociological Methods (Module `SOZ M1`, `SOZ M2` [Soziologische Methodenlehre](https://www.soziologie.phil.fau.de/institut/arbeitsbereiche/methoden-der-empirischen-sozialforschung/))
- Bachelor Digital Humanities and Social Sciences ("BA Zweitfach")
- Elective (Wahlpflichtbereich FPO 2018)
- Elective (Wahlpflichtbereich FPO 2016)
- "Soft Skills" (Schlüsselqualifikationen)
- 2019/2020 Winter Term: "Open Source Software for the Humanities and Social Sciences", crosslisted in:
- Bachelor Sociology
- Sociological Methods (Module `SOZ M`, [Soziologische Methodenlehre](https://www.soziologie.phil.fau.de/institut/arbeitsbereiche/methoden-der-empirischen-sozialforschung/))
- "Soft Skills" (Schlüsselqualifikationen)
- Bachelor Digital Humanities and Social Sciences ("BA Zweitfach")
- Elective (Wahlpflichtbereich FPO 2018)
- Elective (Wahlpflichtbereich FPO 2016)
- 2020 Summer Term: "Software Carpentry -- Hacking Skills for Data Science"
## Related Classes
[Daniel Lemmer](https://www.pol.phil.fau.eu/person/daniel-lemmer/) is (again) offering an [introduction to R](https://univis.uni-erlangen.de/form?__s=2&dsc=anew/lecture_view&lvs=phil/dsp/isoz/zentr/einfhr_3&anonymous=1&founds=phil/dsp/ipowi/zentr/argent,/spanie,/wahlpa,///isoz/zentr/einfhr_3&sem=2019w&__e=183) (in German) as a seminar in the winter term 2019/2020.
Daniel's introduction to R is a great complement to *FOSSOS*, though it is *not* a prerequisite (and the same holds vice-versa).
His introduction is focused on running common statistical analyses in R.
*FOSSOS* is focused on open source tooling *around* R, R as a data science glue language and more advanced R.
If you have *not* taken (and will not take) Daniel's (or another) introduction to R, you will probably spend your time in *FOSSOS* learning the broader open source tooling (["Software Carpentry"](https://datascience.phil.fau.de/fossos/stack.html#software_carpentry)) around R, which is still plenty of exciting material to keep you busy for a semester.
Daniel is *also* kindly hosting an open working group to learn statistics from first principles.
Contact [Daniel Lemmer](https://www.pol.phil.fau.eu/person/daniel-lemmer/) if you'd like to attend.
## Course Description
Digitisation has created both new challenges and yet unrealised potentials for empirical social sciences.
Larger, and often streamed datasets require more programmatic and dynamic statistical analyses.
Existing commercial programs with graphical user interfaces (GUIs) are expensive, and analyses can easily become opaque, sometimes contributing to a crisis of reproducibility in the social sciences and beyond [e.g., @MairThouShaltBe2016] or even propagating outright bugs [e.g., @ReinhartGrowthTimeDebt2010].
Happily, the open source community has already pioneered a set of technologies and conventions for their software development efforts that have proven useful in solving these problems in many academic fields.
Additionally, open source software offers new ways to analyse and visualize data, as well as to present interactive results.
Together, these tools promise a radically open and participatory approach to science, and productive yet skeptical use of emerging data streams.
Unfortunately, learning these tools takes more time than is usually available until any given project deadline.
The goal of this series of seminars is therefore to train participants in a coherent set of leading tools and best practices, including:
- Software Carpentry
    - Open source issue trackers to manage projects and their learning.
    - Using leading community resources and services to troubleshoot issues.
    - Writing text in a lightweight markup language (markdown).
    - The world of UNIX-style command-line interface (CLI) programs ...
    - ... and package managers, such as Homebrew or APT.
    - Establishing an efficient plain-text workflow using editors and an Integrated Development Environment (IDE), including Atom and RStudio.
    - Source control management (SCM) and massively collaborative development using Git and GitHub.
    - Separating content and presentation using plain-text formats for technical and scientific writing, including LaTeX, Pandoc Markdown and RMarkdown, and rendering results in a variety of formats (Word, HTML, PDF).
- Introductory R
    - Introduction to "base" R.
    - Literate programming in R.
- Intermediate R
    - Importing, transforming and modeling data using tools from the R tidyverse ecosystem.
    - Visualising data using ggplot2.
- Interactive R
    - Interactive visualisations using leading JavaScript libraries (via plotly, htmlwidgets).
    - Web dashboards using flexdashboard.
    - Interactive webapps using shiny.
- Advanced R
    - Types, functional programming, object oriented programming (only S3), metaprogramming and techniques, all following Hadley Wickham's [Advanced R](https://adv-r.hadley.nz).
- Cloud Computing
    - Offloading computationally intensive, or regularly automated tasks to cloud services.
    - Using containerisation (docker).
    - Applying continuous integration and deployment (CI/CD) tools such as Travis CI.
- Reproducible Research
    - Improving code quality by applying assertions using checkmate.
    - Storing datasets in public repositories such as the Harvard dataverse.
    - Releasing, publishing and indexing finished research using GitHub releases and zenodo.
    - Other tools and practices for open and reproducible science.
    - Strengthening reproducibility and portability by using dependency management (packrat) and containerisation (docker).
- Package Development
    - Including documentation (roxygen2), defensive programming (checkmate), testing (testthat) and more best practices, all following Hadley Wickham's [R packages](http://r-pkgs.had.co.nz).
Towards the end of each of the seminars, participants will be able to use (parts of) this toolchain to work on their own projects, or to contribute to existing free and open source software.
```{r venn, fig.cap="The [Data Science Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) by Drew Conway (2010)", out.width='100%'}
knitr::include_graphics(path = "img/Data_Science_VD.png")
```
This course will *not* focus on math and statistics knowledge or substantive domain expertise, though both are essential for solid data science work.
Rather, the emphasis is on what Drew Conway loosely called *hacking skills* in his [Data Science Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram), that is, simply getting these tools to work together, to learn how to troubleshoot them, and -- aspirationally -- to absorb some best practices of open source development.
While the course is *not* a proper computer science class, it should also be valuable to students with coding experience or a CS background who may be interested in the tooling and practices covered.
We will not cover the scaling and efficiency issues of proper “Big Data”, but confine ourselves to in-memory problems.
We also limit ourselves to the R ecosystem, though some tools and problems will be similar for other scripting languages such as Python.
An introduction to data science and open source may well open up new job opportunities, or serve as a first stepping stone to a career in tech, but that is arguably not the only reason why social scientists should be excited about it.
Instead, to learn the way of open source is perhaps to update the ideals of the scientific process for the modern day:
radical openness and rigorous reproducibility, maximal inclusivity and promised meritocracy, generous sharing and personal attribution.
Open source may also be a worthwhile exercise in participant observation for social scientists:
here is a real, if surely flawed, utopia that coordinates individuals on a massive scale and is *neither* market nor state.
Less loftily, but not least, the seminar also promises a starter dose of gratification from having built something that actually works, and is of some immediate use to our fellow human -- a good feeling sometimes hard to come by in the social sciences.
## Philosophy
This course is a little different from most seminars.
Teaching R (and the broader ecosystem) at FAU sociology (as at most other smaller, non-tech-focused institutions) faces a couple of important constraints:
- Participants will have vastly different levels of previous experience, and will learn at different speeds.
- Given the relatively small number of interested students and complicated timetables, strictly consecutive seminars are difficult to organize.
Too few students would ever meet the requirements (and schedule) to attend the advanced seminars.
- There is already plenty of high quality teaching material out there, and there is little point in re-inventing (an inferior) wheel.
To meet these constraints, this course will be held as a **non-consecutive multi-semester series of seminars**, and will, for the most part, operate on a **flipped classroom model**.
## Flipped Classroom
Because students will learn at different speeds, and from different starting points -- among other reasons -- teacher-centered teaching will be minimal in this class.
Instead, students will study the assigned material outside of class, including online documents, videos and interactive learning applications.
As they encounter problems, or develop their own (small) projects, students will track such work on the issue tracker used in class.
In class, students will work on their own problems or projects, in small groups and assisted by the instructor as necessary.
This class does *not* offer a one-size-fits-all set of pre-defined materials and assignments necessary for successful participation.
What the class offers is:
- A carefully curated list of external learning resources, organised in a (somewhat) linear syllabus.
- A social setting (the class) and electronic fora (GitHub repo) to stay organised and motivated, and to help one another.
- Guidance and assistance by the instructor for each *individual* student.
## Expectations
Happily, there are a *lot* of great resources for learning data science tools out there, many of them free, some of them even open source themselves.
We will be reusing a lot of these resources, and I (the instructor) do not have to reinvent an (inferior) wheel.
There is no *one* curriculum that's quite right for us, so I have cobbled together material from different sources.
All resources are listed, in roughly advisable chronological order, along with the [stack](/stack).
<div class="alert alert-warning" role="alert">
<b>Resources</b> listed in the <a href="/stack">stack</a> are <em>mandatory reading</em>.
<b>Additional Resources</b> listed in the <a href="/stack">stack</a> are <em>recommended or optional reading</em>.
</div>
The good news is that there are no academic papers or books for this class and everything students need is available online.
There is, however, still a lot of material to work through (to the tune of hours per week), though it is written in a hopefully more accessible style than many academic documents.
The listed resources are guaranteed to cover everything you need to use the software, often including tutorials, videos and exercises.
Students are not limited to the listed resources; they can also choose their own material, as long as it covers roughly the same ground.
In fact, students are encouraged to share good additional resources with the rest of the class.
There is a lot of duplicate content between the alternative resources listed.
Students should browse *each* of the resources, and then work in-depth through whichever they find most suitable.
<div class="alert alert-warning" role="alert">
First-time students of <em>FOSSOS</em> are expected to work through (not just read) all the material listed in the <a href="https://datascience.phil.fau.de/fossos/stack.html#introduction">Introduction</a> and <a href="https://datascience.phil.fau.de/fossos/stack.html#software_carpentry">Software Carpentry</a> sections.
Repeat participants in the seminar who have mastered this material can advance to any of the other sections according to their interests and should prepare accordingly.
</div>
Whenever you run into a problem, or have a question, raise an issue on our [GitHub issue tracker](https://github.com/soztag/fossos/issues).
Please also make sure that:
- the issue does not *already exist* (always *search first!*)
- the issue is properly *labelled* (so we can all navigate through the issues)
- the issue is *answerable*, *actionable* and *closable*.
Good issues are framed in such a way that they *can* be closed.
## Schedule
Because students will learn at different speeds, and from different starting points, there is not *a* schedule for the class.
The [stack](/stack) lists the tools (and resources) in the rough order in which they should be studied.
Students can work through this material at their own pace.
Likewise, some students may wish to cover a lot of breadth (at shallow depth), while others want to dig in on a particular topic.
This is all fine, but students should ensure that they learn *something* at a *useful* level to solve real-world problems, as will also be required for the assessment.
If in doubt, ask the instructor for guidance.
Every student should first become competent in the practices and tools covered in ["Software Carpentry"](https://datascience.phil.fau.de/fossos/stack.html#software_carpentry); these are required for all later topics.
As a loose guide, *every* student should cover at least *one* top-level heading ("Interactive R", "Intermediate R", etc.) per semester.
There are often several heavily overlapping resources recommended for a tool; students should study whichever best suits their taste.
It's a good idea to browse through all of the resources to make sure you don't miss anything.
## Assessment
Assessments are an unfortunate, tedious and arguably needless part of teaching -- but here we are, so we are going to make the best of it.
Instead of some *make-believe* work or hobby project, assignments in this class are, for the most part, designed to be *actually useful* to other people.
This can be motivating, but it also means that other people are relying upon our work:
it has to be delivered on time and at the expected quality.
You can work on pretty much anything you like -- improving this very class (and its repo), some existing project that you like or even your own new (or existing) project.
The only conditions are:
1. The work needs to be related to the tools and practices covered in class.
2. The work needs to be on GitHub or otherwise transparent.
3. The instructor needs to be able to assess the quality of the work, and advise you in your work.
This unfortunately rules out any projects not using the technologies covered in this class.
We will begin with relatively easy, small tasks to serve other students in class, then address smaller issues with resources for the broader community, and eventually fix "real" bugs or enhance the functionality of open source data science software.
All tasks, big and small, are listed and tracked on the [class github repository issue tracker](https://github.com/soztag/fossos).
Students should assign themselves to tasks they will be working on, and report / link to any progress on these tasks in the issue thread.
### Pass/Fail
**All students**, including those who **just want a "Sitzschein" (pass/fail option)** must contribute to a number of issues labelled as [`pass/fail`](https://github.com/soztag/fossos/labels/pass%2Ffail).
These are issues that are smaller in scale and scope.
There is no straightforward minimum metric (say, number of closed issues) to pass the class.
Instead, students should display substantial contributions across a range of helpful activities, as recorded in the issue tracker.
Before working on these issues, students should *assign themselves*, to avoid us doing duplicate work.
### Graded
Students who want to receive a grade for the class also have to complete a couple of issues tagged with `graded-x`.
The numbers next to the labels roughly indicate the **estimated workload and difficulty** of a task (also known as "story points" in agile development).
Estimates are frequently wrong, and these points can be adjusted in consultation with the instructor, if some task turns out to be much harder or easier than expected.
These story points correspond to ECTS credit points; if you are taking this as a "Proseminar", you will need to have owned and closed issues worth 5 story points.
If you are taking this as a "Hauptseminar", you will need to have owned and closed 7.5 story points worth of issues.
You will be graded based on how well you have adhered to the best practices and tooling covered in class, as well as (if applicable) the guidelines and standards of the external project (some other repo) or platform (Stack Overflow).
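
As a rough illustration of the story-point arithmetic described above, the following sketch tallies the points of closed issues against the two thresholds. The issue numbers, labels and the assumption that the `x` in `graded-x` can be read directly as a number are made up for this example; the actual labels on the issue tracker may differ.

```r
# Hypothetical issues owned by one student (numbers and labels are invented).
issues <- data.frame(
  number = c(12, 27, 31),
  label = c("graded-1", "graded-2", "graded-2"),
  closed = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Assumption: the part after "graded-" is the story-point value.
issues$points <- as.numeric(sub("^graded-", "", issues$label))

# Points required per seminar type, as stated in this section.
required <- c(Proseminar = 5, Hauptseminar = 7.5)

# Only issues that have been closed count towards the total.
earned <- sum(issues$points[issues$closed])

# Named logical vector: has the threshold been reached for each seminar type?
earned >= required
```

In this invented case, the student would still need to close further graded issues for either seminar type.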
There are **different *kinds* of graded issues**:
#### Reproducible Example
Labels:
- `community.rstudio`, `stack-overflow` or `bug report`,
- and `reprex`, and `question` respectively.
Though it may also benefit yourself, a well-formulated question or bug report with a reproducible example can also serve the community.
This is what we're aiming for here.
A well-formulated question, in the context of open source development, is often a reproducible example, or *reprex* for short.
This means that you should provide a code snippet (or, if not applicable, a very precise description of steps) that will *allow any other user to reproduce the behavior in question, with no additional resources*.
Producing this can be harder than it sounds, and just narrowing down a problem like that may often help you solve it.
Make sure to read and adhere to all the resources listed under [community and help](https://www.maxheld.de/fossos/stack.html#community__help).
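
To make the idea of a reprex more concrete, here is a minimal sketch of what such a snippet might look like. The data and the question are invented for illustration; the point is only that the snippet creates everything it needs itself, so anyone can paste it into a fresh R session and see the same behaviour.

```r
# Everything needed to reproduce the behaviour is defined in the snippet:
# no local files, no packages beyond base R. The data are made up.
scores <- data.frame(
  group = c("a", "a", "b"),
  value = c(1, NA, 3)
)

# Observed behaviour: the mean is NA because one value is missing.
observed <- mean(scores$value)
print(observed)

# Behaviour the asker was hoping for: missing values are ignored.
expected <- mean(scores$value, na.rm = TRUE)
print(expected)
```

A snippet like this can also be passed to `reprex::reprex()`, which runs it in a clean session and formats the code together with its output for pasting into a question or bug report.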
The three target platforms can be listed roughly in ascending order of precision of the question:
1. http://community.rstudio.com:
Open to *relatively* open/vague questions, though you are absolutely expected to do your own research.
2. http://stackoverflow.com:
Questions should be very precise and reproducible, and be *definitively answerable*.
Not good for opinionated stuff.
Consider the resources listed under [community and help](/stack.html#help).
3. Bug report:
*If* you're absolutely sure that you have run into a bug, then it can be a good idea to raise it on the repository in question.
For most things, you should raise it on S-O or community.rstudio first, to be sure that it really *is* a bug.
Here, as with all things open source, we must ensure that other people's time is well-spent engaging our question (or bug report).
To ensure that, please follow this procedure:
```{r reprex, fig.cap="Sequence Chart for a Reprex"}
DiagrammeR::mermaid(diagram = "reprex.mmd", height = 1200, width = 800)
```
#### Answer on S-O or community.rstudio
Labels:
- `community.rstudio`, `stack-overflow`,
- `reprex`, and `answer` respectively.
Same process as for the above.
#### External Contribution
Labels: `external documentation`, `external software`.
These are improvements to *external* repos (typically also on GitHub), either other software (typically R repositories) or documentation and learning resources (typically those covered in class).
The actual work (forking, raising a pull request, etc.) consequently occurs in the external target repository, and this activity is merely *tracked* in a placeholder issue in the class repository.
Simply link to any relevant issues, commits or pull requests on the target repo in a placeholder issue.
This sounds quite challenging, but it can be quite doable, especially if you're starting by improving the documentation.
To start contributing to open source, you might also find these resources helpful:
- code.likeagirl.io:
[How to find a newcomer-friendly open source project](https://code.likeagirl.io/the-new-developers-guide-to-open-source-228ca257dd68)
- Look for open issues on projects that you like, labelled as "needs help", "good first issue" or similar.
(Some maintainers will especially highlight starter issues.)
For contributions to external documentation or software, it is very important that we do not burden the respective maintainers with sub-par work.
To ensure that we deliver high quality work, you **must follow the following procedure**:
```{r external, fig.cap="Sequence Chart for an External Contribution"}
DiagrammeR::mermaid(diagram = "external.mmd", height = 1800, width = 800)
```
Grading criteria are listed for each of the issues.
Generally, a good grade will require following the practices and standards appropriate for the type of contribution in question, and students will need to demonstrate adequate command of the toolchain covered in class.
For an excellent grade, students will need to go (a bit) beyond the covered material, and work on an especially pressing or complicated problem.
#### Own Project
As an alternative to this (graded) assessment, if students already have some prior knowledge and a ready project they wish to work on, this can also be arranged.
Students should contact the instructor, and also track their progress on their *own* project in a placeholder issue on the fossos issue tracker.
### Grading Rubric
The graded tasks (see above) will be graded using the rubric below.
The grading rubric is taken from the [University of British Columbia Master of Data Science program](http://ubc-mds.github.io) ([CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/us/)).
```{r grading-rubric, echo=FALSE}
accuracy <- c(
Poor = c(
"Code fails to run, doesn't have clear output, or performs the wrong task."
),
Unsatisfactory = c(
"Code performs only some of the correct tasks, the output is not easily understandable and the methods used to achieve the result are inefficient if performance is a concern."
),
Satisfactory = c(
"Code performs most of the correct tasks, the output is understandable, however the methods used to achieve the result are inefficient if performance is a concern."
),
Good = c(
"Code performs the correct tasks, the output is reasonably easy to understand, however the methods used to achieve the result are not the most efficient if performance is a concern."
),
Excellent = glue::glue(
"Code runs correctly without crashing, the output is very clear, and the intended or suitably correct methods are employed to achieve the correct result.",
"Student has chosen the most efficient algorithm reasonable if performance is a concern.",
.sep = " "
)
)
mechanics <- c(
Poor = glue::glue(
"Evaluator was unable to run/open/read assignment submission despite best efforts.",
"This may be because the student forgot to include certain files in the submission or tailored the software to only work on their local machine e.g. the code only works when run from a certain directory on the student's machine, contains paths to files only on the student's machine, etc., or they did not submit their assignment correctly or completely, or it was unclear where the relevant parts of the assignment are included in the submission.",
.sep = " "
),
Unsatisfactory = c(
"Evaluator had to spend some time to get the raw submission to work correctly"
),
Satisfactory = c(
"Evaluator had to make an obvious, small, quick fix to get things working or the wrong file format was submitted"
),
Good = c(
"The submission is self-contained and works flawlessly; it just works in anybody's hands."
),
Excellent = glue::glue(
"The student included all the files in the submission.",
"Any necessary libraries to install are either included or are installed by a script, or it is made obvious that the evaluator must install them.",
"Student used the asked for file format.",
"All assignment instructions were followed.",
"All files were put in a repository, in a reasonable place, with reasonable names; any source files .tex, .Rmd are rendered to a readable output format e.g. .pdf, all figures are included, there is a README file indicating where to find the different aspects of the assignment, etc.",
.sep = " "
)
)
code_quality <- c(
Poor = glue::glue(
"Code is difficult to read and understand due to many issues that affect readability.",
"Code is also poorly organized.",
.sep = " "
),
Unsatisfactory = c(
"Code is generally easy to read and understand with few non-reoccurring issues and at most two reoccurring issues that affect readability."
),
Satisfactory = c(
"Code is generally easy to read and understand with few non-reoccurring issues and at most one reoccurring issue that affects readability."
),
Good = c(
"Code is easy to read and understand with only 1-2 minor and non-reoccurring issues that affect readability."
),
Excellent = glue::glue(
"Code is exceptionally easy to read and understand.",
"For example, variable names are clear, an appropriate amount of whitespace is used to maximize visibility, tabs and spaces are not mixed for indentation, sufficient comments are given.",
"Any coding sections of the assignment that were not completed have documentation explaining what a coded solution would look like.",
"Overall, the code is extremely well organized and documented.",
.sep = " "
)
)
robustness <- c(
Poor = c(
"Multiple issues with code repetition exist, and several tests are absent and/or are of poor efficacy"
),
Unsatisfactory = c(
"Some form of reoccurring code repetition exists, or test efficacy is poor."
),
Satisfactory = c(
"Some form of reoccurring code repetition exists, or test efficacy is poor."
),
Good = c(
"Code repetition is mostly minimized and effective tests are present for most functions."
),
Excellent = glue::glue(
"Code repetition is minimized via the use of loops/mapping functions, functions or classes or scripts/files as needed without becoming overly complicated.",
"Functions are short, concise, and cohesive without losing clarity; code can be easily modified.",
"Tests are present to ensure functions work as expected.",
"Exceptions are caught and thrown if necessary, once students have learned about exceptions.",
.sep = " "
)
)
rubric <- dplyr::bind_rows(
`Accuracy 25%` = accuracy,
`Code Quality 25%` = code_quality,
`Mechanics 25%` = mechanics,
`Robustness 25%` = robustness,
.id = "Dimension"
)
rubric
```
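To illustrate how four equally weighted dimensions combine, here is a minimal arithmetic sketch.
The numeric 1 to 5 scale and the example ratings are assumptions made purely for illustration; this syllabus does not define how rubric levels convert to points or to a final grade.

```r
# Equal weights, taken from the rubric's dimension labels (25% each).
weights <- c(
  accuracy = 0.25, code_quality = 0.25, mechanics = 0.25, robustness = 0.25
)

# ASSUMPTION: rubric levels mapped to points 1 (Poor) through 5 (Excellent).
rubric_levels <- c("Poor", "Unsatisfactory", "Satisfactory", "Good", "Excellent")

# Hypothetical ratings for one submission.
ratings <- c(
  accuracy = "Good", code_quality = "Excellent",
  mechanics = "Good", robustness = "Satisfactory"
)

# Convert each rating to its position on the assumed scale: 4, 5, 4, 3.
points <- match(ratings, rubric_levels)

# Weighted sum: 0.25 * (4 + 5 + 4 + 3) = 4
overall <- sum(weights * points)
overall
```

Because the weights are equal, the result is simply the mean of the four per-dimension points.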
## Technical Requirements {#reqs}
Unfortunately, FAU has no computer lab facilities suitable for teaching this class and participants will have to **bring their own computers**.
This has the advantage that students will learn to set up their own development environments, but adds some unwelcome complexity (different OSes, etc.).
The class will assist students in installing software on their devices, but **students are responsible for maintaining their computers**.
In particular, student laptops must:
- have a reasonably current *desktop* operating system (macOS >= 10.13, Microsoft Windows >= Vista, Linux),
- have a current version of a web browser installed,
- *not* be virus-infested or in some other borked-up state,
- *not* be a mobile device (iOS or Android won't work, unless you can SSH into a Linux box or something),
- and have ready access to one of the WiFi networks at FAU: `FAU-STUD`, `eduroam` or `FAU.fm`.
(If you need help setting up your WiFi, consult the RRZE Website.)
Emphatically, none of this requires a new, powerful or expensive device, let alone software.
You can get a used laptop with, or ready for, Ubuntu Linux on eBay for well under €100 (if you buy a used computer, make sure that the hardware has good Linux support).
With some [tweaking](https://leanpub.com/universities/courses/jhu/cbds-chromebook), you can even use an inexpensive (`x86`) Google Chromebook (which runs on Linux).
For more information, see [stack](/stack.html#moving_to_linux).
If you are facing financial difficulties in obtaining a laptop for the class, please contact the instructor.
We'll figure something out for you.
### Operating System Maintenance {.alert .alert-warning}
It is *your* responsibility to maintain your own computer and operating system (OS), as well as to figure out how to install the below software on your machine (though we will all help one another within reason).
### In-Browser Development
For a ready-made development environment, you can use the RStudio IDE (integrated development environment) *inside your web browser*.
RStudio is best for R development, but has decent support for other languages and includes access to a terminal and version control.
Using RStudio in the browser means that all the software you're using won't ever *really* be installed on your system, but only exist in a virtual image or online service.
If you want to do serious development work or are facing edge cases, you may require a "real" installation on your client (see instructions in [stack](/stack.html#moving_to_linux)).
However, in-browser development is a great way to have a standardized environment ready quickly.
You can run the RStudio IDE in your web browser in two ways:
#### `rocker/verse` Docker Image (Recommended)
Docker is an open-source industry standard to define, provision and share computing environments, known as *containers*.
Containers allow you to run the same computing environment on any computer that has Docker installed.
Containers are similar to virtual machines (a computer inside a computer), but slimmer and generally neater.
A lot of the software you need to run in this class is included in the `rocker/verse` image published by the [Rocker Project](https://rocker-project.org).
For a list of things you *still* need to install "locally", consult the [stack](/stack.html).
For installation instructions, see [here](/stack.html#docker).
Unfortunately, Docker has some [system requirements](https://docs.docker.com/docker-for-windows/install/) that many Windows versions do not meet.
#### Cloud Alternative (Not Recommended)
As a backup plan to using Docker on your own operating system, you may use [RStudio Cloud](https://rstudio.cloud), a data science Software-as-a-Service (SaaS).
RStudio Cloud furnishes you with a ready RStudio session in a Docker image similar to `rocker/verse` with all necessary system dependencies.
RStudio Cloud is still in *alpha* and may not always be reliable.
Once out of alpha, it may also be a paid service, for which you may have to pay yourself.
Full disclosure: the instructor has worked for RStudio PBC.
You are strongly encouraged to invest the time and effort to set up and maintain a development environment on your own computer.
Otherwise:
<a class="btn btn-primary" href="https://rstudio.cloud" role="button">Sign up to RStudio Cloud</a>
<div class="alert alert-warning" role="alert">
It's best to sign up with your GitHub account, but this <em>does not</em> give your RStudio Cloud instance read or write privileges to your repos.
Remember to also configure <a href="https://maurolepore.github.io/cloudgithub/">RStudio Cloud with your git credentials</a>.
</div>
You should also study the [RStudio Cloud guide](https://rstudio.cloud/learn/guide).
### Linux
<a class="btn btn-info" href="linux.html" role="button">Learn More</a>
If you want to install the programs used in this class on your system, rather than use them through a (Docker) container, you may find it easier to do that on Unix-compatible operating systems, including macOS and Linux.
Getting Windows to play nicely with open source software can be harder, and some convenient system utilities (such as a package manager) are often missing.
It *is* technically possible to use most, if not all, of the tools above on Windows, but they may behave slightly differently, and supporting them may be more involved.
If you are using a Windows machine, you may consider the following alternatives to get a more Unix-compatible operating system, roughly ranked from easiest to most involved:
1. Replace your existing operating system with, say, [Ubuntu](https://tutorials.ubuntu.com/tutorial/tutorial-install-ubuntu-desktop#0), a frequently used Linux distribution.
Before you do this, make sure that your hardware has good Linux support.
This would also delete all of your data and applications, and you might have to choose and use new replacement applications.
2. Same as 1, but with a [dual boot setup](https://opensource.com/article/18/5/dual-boot-linux).
This way, you can retain both your old operating system, and a new Linux install.
However, you always have to restart to switch between the two systems.
3. Same as 2, but in a [virtual machine](https://itsfoss.com/install-linux-in-virtualbox/) which can run alongside and *inside* your Windows install.
([Here](https://www.lifewire.com/install-ubuntu-linux-windows-10-steps-2202108) are alternative instructions).
Apparently, if your computer and Windows 10 version support it, there is also now a fancier/more efficient way to do this via [Hyper-V](https://www.windowscentral.com/how-run-linux-distros-windows-10-using-hyper-v).
Running Linux in a virtual machine carries some performance penalty.
4. [Install the Windows Subsystem for Linux (WSL)](https://docs.microsoft.com/en-us/windows/wsl/install-win10).
This solution is available only for recent versions of Windows 10.
It seems pretty elegant, but has some limitations (no GUIs) and may be quite involved.
5. Buy an x86 Chromebook and use [crouton](https://github.com/dnschneid/crouton) or (better, but still in beta?) [crostini](https://www.zdnet.com/article/how-to-add-linux-to-your-chromebook/) to run Linux on your Chromebook.
6. Same as 3, but with the virtual machine (VM) running on a rented cloud host.
You can access everything through a browser, but there is a (small) fee, depending on your setup.
There is no guarantee that any of these alternatives or links will work for you; you will have to research them on your own.
## Contributors
A big **Thank You** to all contributors (in alphabetical order by username):
```{r contribs, echo=FALSE, results='asis', message=FALSE}
# wrappers are necessary to shut up chatty function
# function apparently chats via cat, which cannot be disabled in chunk options
invisible(capture.output(
contribs <- usethis::use_tidy_thanks()
))
# link each username to its GitHub profile; collapse avoids a trailing separator
cat(paste0("[", contribs, "](https://www.github.com/", contribs, ")", collapse = ", "))
```
## References