diff --git a/assignments/Project_1.Rmd b/assignments/Project_1.Rmd new file mode 100644 index 0000000..f50c574 --- /dev/null +++ b/assignments/Project_1.Rmd @@ -0,0 +1,64 @@ +--- +title: "Project 1" +output: html_document +--- + +```{r setup, include=FALSE} +library(tidyverse) +knitr::opts_chunk$set(echo = TRUE) +``` + +This is the dataset you will be working with: +```{r message = FALSE} +olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv') + +triathlon <- olympics %>% + filter(!is.na(height)) %>% # only keep athletes with known height + filter(sport == "Triathlon") %>% # keep only triathletes + mutate( + medalist = case_when( # add column to track medalist vs not + is.na(medal) ~ "non-medalist", + !is.na(medal) ~ "medalist" # any medals (Gold, Silver, Bronze) count + ) + ) +``` + +`triathlon` is a subset of `olympics` and contains only the data for triathletes. More information about the original `olympics` dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md and https://www.sports-reference.com/olympics.html. + +For this project, use `triathlon` to answer the following questions about athletes competing in this sport: + +1. In how many events total did male and female triathletes compete for each country? +2. Are there height differences among triathletes between sexes or over time? +3. Are there height differences among triathletes that have medaled or not, again also considering athlete sex? + +You should make one plot per question. + +**Hints:** + +- We recommend you use a bar plot for question 1, a boxplot for question 2, and a sina plot overlaid on top of violins for question 3. However, you are free to use any of the plots we have discussed in class so far. +- For question 2, you will have to convert `year` into a factor. 
+- For question 3, consider why a boxplot or simple violin plot is not a good idea and mention this in the approach section. +- For all questions, you can use either faceting or color coding or both. Pick whichever you prefer. +- Adjust `fig.width` and `fig.height` in the chunk headers to customize figure sizing and figure aspect ratios. + +You can delete these instructions from your project. Please also delete text such as *Your approach here* or `# Q1: Your R code here`. + +**Introduction:** *Your introduction here.* + +**Approach:** *Your approach here.* + +**Analysis:** + +```{r fig.width = 5, fig.heigth = 5} +# Q1: Your R code here +``` + +```{r fig.width = 5, fig.heigth = 5} +# Q2: Your R code here +``` + +```{r fig.width = 5, fig.heigth = 5} +# Q3: Your R code here +``` + +**Discussion:** *Your discussion of results here.* diff --git a/assignments/Project_1.html b/assignments/Project_1.html new file mode 100644 index 0000000..96c0471 --- /dev/null +++ b/assignments/Project_1.html @@ -0,0 +1,456 @@ + + + + + + + + + + + + + +Project 1 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

This is the dataset you will be working with:

+
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
+
+triathlon <- olympics %>% 
+  filter(!is.na(height)) %>%             # only keep athletes with known height
+  filter(sport == "Triathlon") %>%       # keep only triathletes
+  mutate(
+    medalist = case_when(                # add column to track medalist vs not
+      is.na(medal) ~ "non-medalist",
+      !is.na(medal) ~ "medalist"         # any medals (Gold, Silver, Bronze) count
+    )
+  )
+
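      The recoding step above can be checked on a small example. The sketch below uses a hypothetical four-row table (not the Olympic records) and assumes, as in the code above, that `medal` is `NA` when no medal was won. The `toy`, `toy_labeled`, and `medalist_alt` names are made up for illustration.

      ```r
      library(dplyr)

      # Hypothetical toy rows standing in for `triathlon`;
      # `medal` is NA when the athlete did not win a medal
      toy <- tibble(
        name  = c("A", "B", "C", "D"),
        medal = c("Gold", NA, "Bronze", NA)
      )

      toy_labeled <- toy %>%
        mutate(
          # same two-branch recoding as in the template
          medalist = case_when(
            is.na(medal) ~ "non-medalist",
            !is.na(medal) ~ "medalist"
          ),
          # an equivalent two-way split written with if_else()
          medalist_alt = if_else(is.na(medal), "non-medalist", "medalist")
        )

      # tally rows per category to confirm the recoding
      count(toy_labeled, medalist)
      ```

      Because the two conditions are complementary, `case_when()` and `if_else()` give the same labels here; `case_when()` mainly pays off once there are more than two branches.
      
      
      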

triathlon is a subset of olympics and +contains only the data for triathletes. More information about the +original olympics dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md +and https://www.sports-reference.com/olympics.html.

+

For this project, use triathlon to answer the following +questions about athletes competing in this sport:

+
    +
  1. In how many events total did male and female triathletes compete for +each country?
  2. +
  3. Are there height differences among triathletes between sexes or over +time?
  4. +
  5. Are there height differences among triathletes that have medaled or +not, again also considering athlete sex?
  6. +
+

You should make one plot per question.

+

Hints:

+ +
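      The hint about converting `year` into a factor can be illustrated with a minimal sketch. The data frame below is a hypothetical toy example, not the `triathlon` data, and the object names `toy`, `p_numeric`, and `p_factor` are assumptions made for illustration.

      ```r
      library(ggplot2)

      # Hypothetical toy data: three Games with two heights each
      toy <- data.frame(
        year   = c(2000, 2000, 2004, 2004, 2008, 2008),
        height = c(170, 182, 168, 179, 172, 185)
      )

      # With a numeric `year`, ggplot2 treats x as continuous and
      # typically pools all rows into a single box
      p_numeric <- ggplot(toy, aes(x = year, y = height)) +
        geom_boxplot()

      # With factor(year), each Games becomes its own discrete category,
      # so one box is drawn per year
      p_factor <- ggplot(toy, aes(x = factor(year), y = height)) +
        geom_boxplot()

      # The factor levels are the distinct years, in sorted order
      levels(factor(toy$year))
      ```

      Converting inside `aes()` leaves the underlying column numeric, which is convenient if the year is still needed as a number elsewhere in the analysis.
      
      
      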

You can delete these instructions from your project. Please also +delete text such as Your approach here or +# Q1: Your R code here.

+

Introduction: Your introduction here.

+

Approach: Your approach here.

+

Analysis:

+
# Q1: Your R code here
+
# Q2: Your R code here
+
# Q3: Your R code here
+
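      Figure size can also be set once for all chunks instead of in each chunk header. The sketch below assumes it is placed in the setup chunk. The option names are `fig.width` and `fig.height`, and knitr does not complain about a chunk option it does not recognize, so a misspelled name such as `fig.heigth` leaves the figure height at its default.

      ```r
      library(knitr)

      # Set the default figure size (in inches) for all later chunks;
      # the option names must be spelled `fig.width` and `fig.height`
      opts_chunk$set(fig.width = 5, fig.height = 5)

      # Inspect the defaults now in effect
      opts_chunk$get(c("fig.width", "fig.height"))
      ```

      Individual chunks can still override these defaults in their own headers when a plot needs a different aspect ratio.
      
      
      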

Discussion: Your discussion of results +here.

+ + + + +
+ + + + + + + + + + + + + + + diff --git a/assignments/Project_1_example.html b/assignments/Project_1_example.html new file mode 100644 index 0000000..b6261dd --- /dev/null +++ b/assignments/Project_1_example.html @@ -0,0 +1,576 @@ + + + + + + + + + + + + + +Project 1 Example Solution + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

Claus O. Wilke, EID

+

This is the dataset you will be working with:

+
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
+
+NCbirths
+
## # A tibble: 1,409 × 10
+##    Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
+##     <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
+##  1      1     1     32    40     38     0         3147.     0      0       0
+##  2      1     2     32    37     34     0         3289.     0      0       0
+##  3      1     1     27    39     12     0         3912.     0      0       0
+##  4      1     1     27    39     15     0         3856.     0      0       0
+##  5      1     1     25    39     32     0         3430.     0      0       0
+##  6      1     1     28    43     32     0         3317.     0      0       0
+##  7      1     2     25    39     75     0         4054.     0      0       0
+##  8      1     2     15    42     25     0         3204.     0      0       1
+##  9      1     2     21    39     28     0         3402      0      0       0
+## 10      1     2     27    40     37     0         3515.     0      0       1
+## # ℹ 1,399 more rows
+

Questions:

+
    +
  1. Is there a relationship between whether a mother smokes or not +and her baby’s weight at birth?

  2. +
  3. How many mothers are smokers or non-smokers?

  4. +
  5. What are the age distributions of mothers of twins or +triplets?

  6. +
+

Introduction: We are working with the +NCbirths dataset, which contains 1409 birth records from +North Carolina in 2001. In this dataset, each row corresponds to one +birth, and there are ten columns providing information about the birth, +the mother, and the baby. Information about the birth includes whether +it is a single, twin, or triplet birth, the number of completed weeks of +gestation, and whether the birth is premature. Information about the +baby includes the sex, the weight at birth, and whether the birth weight +should be considered low. Information about the mother includes her age, +the weight gained during pregnancy, whether she is a smoker, and whether +she is married.

+

      To answer the three questions, we will work with five variables: the +baby’s birthweight (column BirthWeightGm), whether the baby +was born prematurely (column Premie), whether it was a +singleton, twin, or triplet birth (column Plural), whether +the mother is a smoker or not (column Smoke), and the +mother’s age (column MomAge). The birthweight is provided +as a numeric value, in grams. The premature birth status is encoded as +0/1, where 0 means regular and 1 means premature (36 weeks or earlier). +The number of births is encoded as 1/2/3, representing singleton, twin, +and triplet births, respectively. The smoking status is encoded as 0/1, where +0 means the mother is not a smoker and 1 means she is a smoker. The +mother’s age is provided in years.
      

+

Approach: To show the distributions of birthweights +versus the mothers’ smoking status we will be using violin plots +(geom_violin()). We also separate out regular and premature +births, because babies born prematurely have much lower birthweight and +therefore must be considered separately. Violins make it easy to compare +multiple distributions side-by-side.

+

To show the number of mothers that are smokers or non-smokers we will +use a simple bar plot (geom_bar()). Finally, to show the +distribution of mothers’ ages we will make a strip chart. The number of +twin and triplet births in the dataset is not that large, so a strip +chart is a good option here.

+

Analysis:

+

Question 1: Is there a relationship between whether a mother smokes +or not and her baby’s weight at birth?

+

To answer this question, we plot the birthweight distributions as +violins, separated by both smoking status and by whether the birth was +regular or premature.

+
# The columns `Premie` and `Smoke` are numerical but contain
+# categorical data, so we convert to factors to ensure ggplot
+# treats them correctly
+ggplot(NCbirths, aes(factor(Premie), BirthWeightGm)) +
+  geom_violin(aes(fill = factor(Smoke))) +
+  scale_x_discrete(
+    name = NULL, # remove axis title entirely
+    labels = c("regular birth", "premature birth")
+  ) +
+  scale_y_continuous(
+    name = "Birth weight (gm)"
+  ) +
+  scale_fill_manual(
+    name = "Mother",
+    labels = c("non-smoker", "smoker"),
+    # explicitly assign colors to specific data values
+    values = c(`0` = "#56B4E9", `1` = "#E69F00")
+  ) + 
+  theme_bw(12)
+

+

There is a clear difference between birthweight for regular and +premature births, and for regular births the birthweight also seems to +be lower when the mother smokes.

+

Question 2: How many mothers are smokers or non-smokers?

+

To answer this question, we make a simple bar plot of the number of +mothers by smoking status.

+
# again, convert `Smoke` into factor so it's categorical
+ggplot(NCbirths, aes(y = factor(Smoke))) +
+  geom_bar() +
+  scale_y_discrete(
+    name = NULL,
+    labels = c("non-smoker", "smoker")
+  ) +
+  scale_x_continuous(
+    # ensure there's no gap between the beginning of the bar
+    # and the edge of the plot panel
+    expand = expansion(mult = c(0, 0.1))
+  ) +
+  theme_bw(12)
+

+

The vast majority of mothers in the dataset are non-smokers (almost +1250). Fewer than 250 are smokers.

+

Question 3. What are the age distributions of mothers of twins or +triplets?

+

To answer this question, we first remove singleton births from the +dataset and then show age distributions as a strip chart.

+
NCbirths %>%
+  filter(Plural > 1) %>% # remove singlet births
+  ggplot(aes(x = factor(Plural), y = MomAge)) +
+  geom_point(
+    # jitter horizontally so points don't overlap
+    position = position_jitter(
+      width = 0.2,
+      height = 0
+    ),
+    # it's nice to make points a little bigger and give them some color
+    size = 2,
+    color = "#1E4A7F"
+  ) +
+  scale_x_discrete(
+    name = NULL,
+    labels = c("twins", "triplets")
+  ) +
+  scale_y_continuous(
+    name = "age of mother (years)"
+  ) +
+  theme_bw(12)
+

+

Mothers of twins span the entire childbearing range, from 15 years to +approximately 40 years old. By contrast, mothers of triplets tend to be +in their thirties.

+

      Discussion: The smoking status of the mother appears +to have a small effect on the average birth weight for regular births. +We can see this by comparing the two left-most violins in the first +plot, where we see that they are slightly vertically shifted relative to +each other but otherwise have a comparable shape. However, a much bigger +effect comes from whether the baby is born prematurely or not. Premature +births have on average a much lower birthweight than regular births, and +the variance is also bigger (the two right-most violins are taller than +the two left-most violins). Interestingly, smoking status does not seem +to affect the distribution of birthweights for premature births much. We +can see this from the fact that the two right-most violins look +approximately the same. We would have to run a multivariate statistical +analysis to determine whether any of these observed patterns are +statistically significant.
      

+

There are many more births to non-smoking mothers than to smoking +mothers in the dataset. This is important because it means we have more +complete data for non-smoking mothers. Some of the differences we saw in +the first graph, such as the slightly lower variance in birthweight for +premature births to smoking mothers—as compared to premature births to +non-smoking mothers—may simply be due to a smaller data set.

+

      When comparing age distributions of mothers of twins or of triplets +we see an unexpected difference. It appears that mothers of all ages, +from teenage moms to moms in their early forties, can have twins. +By contrast, only mothers in their thirties appear to have triplets. We +can think of a possible explanation. Twin births happen due to natural +causes and therefore can occur in mothers of all ages. Triplet births, +however, are extremely unlikely to occur naturally, and most commonly +are caused by fertility treatments that cause multiple eggs to mature at +once. It is unlikely that women in their late teens or twenties will +undergo fertility treatment, whereas women in their thirties do so +frequently. We also note, however, that there are only four triplet +births in the dataset, so the lack of younger mothers could be due to +random chance. We would have to perform further analysis or run +statistical tests to develop a clearer picture of what mechanisms may have +caused the observed patterns in the data.
      

+ + + + +
+ + + + + + + + + + + + + + + diff --git a/assignments/Project_1_instructions.html b/assignments/Project_1_instructions.html new file mode 100644 index 0000000..5ee58a8 --- /dev/null +++ b/assignments/Project_1_instructions.html @@ -0,0 +1,475 @@ + + + + + + + + + + + + + +Project 1 Instructions + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

Please use the project template R Markdown document to complete your +project. The knitted R Markdown document (as a PDF) and the raw +R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on +Thurs., Feb 15, 2024. These two documents will be +graded jointly, so they must be consistent (as in, don’t change the R +Markdown file without also updating the knitted document!).

+

All results presented must have corresponding code, and the +code should be visible in the final generated pdf for ease of grading. +Any answers/results given without the corresponding R code that +generated the result will be considered absent. All code +reported in your final project document should work properly. Please do +not include any extraneous code or code which produces error messages. +(Code which produces warnings is acceptable, as long as you understand +what the warnings mean and explain this.)

+

For this project, you will be using an Olympic Games dataset, which +is a compilation of records for athletes that have competed in the +Olympics from Athens 1896 to Rio 2016.

+

      Each record contains information including the name of the athlete +(name), their sex, their age, +their height, their weight, their +team, their nationality (noc), the +games at which they competed, the year, the +Olympic season, the city where the Olympics +took place, the sport, the name of the event +(event), the decade during which the Olympics took place +(decade), whether or not the athlete won a gold medal +(gold), whether or not the athlete won any medal +(medalist), and whether the athlete won “Gold”, “Silver”, +“Bronze” or received “no medal” (medal). More information +about the dataset can be found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
      

+

We will provide you with specific questions to answer and specific +instructions on how to answer the questions. The project should be +structured as follows:

+ +

We encourage you to be concise. A paragraph should typically not be +longer than 5 sentences.

+

You are not required to perform any statistical +tests in this project, but you may do so if you find it helpful to +answer your question.

+
+

Instructions

+

In the Introduction section, write a brief introduction to the +dataset, the questions, and what parts of the dataset are necessary to +answer the questions. You may repeat some of the information about the +dataset provided above, paraphrasing on your own terms. Imagine that +your project is a standalone document and the grader has no prior +knowledge of the dataset. You do not need to describe variables that are +never used in your analysis.

+

In the Approach section, describe what types of plots you are going +to make to address your questions. For each plot, provide a clear +explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) +is best for providing the information you are asking about. (You can +draw on the materials provided +here for guidance.) All plots should be of different types, +and all should use either color mapping or faceting or +both.

+

In the Analysis section, provide the code that generates your plots. +Use scale functions to provide nice axis labels and guides. You are +welcome to use theme functions to customize the appearance of your plot, +but you are not required to do so for this project. All plots +must be made with ggplot2. Do not use base R plotting +functions.

+

In the Discussion section, interpret the results of your analysis. +Identify any trends revealed (or not revealed) by the plots. Speculate +about why the data looks the way it does.

+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/assignments/Project_1_rubric.pdf b/assignments/Project_1_rubric.pdf new file mode 100644 index 0000000..1729159 Binary files /dev/null and b/assignments/Project_1_rubric.pdf differ diff --git a/docs/assignments/Project_1.Rmd b/docs/assignments/Project_1.Rmd new file mode 100644 index 0000000..f50c574 --- /dev/null +++ b/docs/assignments/Project_1.Rmd @@ -0,0 +1,64 @@ +--- +title: "Project 1" +output: html_document +--- + +```{r setup, include=FALSE} +library(tidyverse) +knitr::opts_chunk$set(echo = TRUE) +``` + +This is the dataset you will be working with: +```{r message = FALSE} +olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv') + +triathlon <- olympics %>% + filter(!is.na(height)) %>% # only keep athletes with known height + filter(sport == "Triathlon") %>% # keep only triathletes + mutate( + medalist = case_when( # add column to track medalist vs not + is.na(medal) ~ "non-medalist", + !is.na(medal) ~ "medalist" # any medals (Gold, Silver, Bronze) count + ) + ) +``` + +`triathlon` is a subset of `olympics` and contains only the data for triathletes. More information about the original `olympics` dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md and https://www.sports-reference.com/olympics.html. + +For this project, use `triathlon` to answer the following questions about athletes competing in this sport: + +1. In how many events total did male and female triathletes compete for each country? +2. Are there height differences among triathletes between sexes or over time? +3. Are there height differences among triathletes that have medaled or not, again also considering athlete sex? + +You should make one plot per question. + +**Hints:** + +- We recommend you use a bar plot for question 1, a boxplot for question 2, and a sina plot overlaid on top of violins for question 3. 
However, you are free to use any of the plots we have discussed in class so far. +- For question 2, you will have to convert `year` into a factor. +- For question 3, consider why a boxplot or simple violin plot is not a good idea and mention this in the approach section. +- For all questions, you can use either faceting or color coding or both. Pick whichever you prefer. +- Adjust `fig.width` and `fig.height` in the chunk headers to customize figure sizing and figure aspect ratios. + +You can delete these instructions from your project. Please also delete text such as *Your approach here* or `# Q1: Your R code here`. + +**Introduction:** *Your introduction here.* + +**Approach:** *Your approach here.* + +**Analysis:** + +```{r fig.width = 5, fig.heigth = 5} +# Q1: Your R code here +``` + +```{r fig.width = 5, fig.heigth = 5} +# Q2: Your R code here +``` + +```{r fig.width = 5, fig.heigth = 5} +# Q3: Your R code here +``` + +**Discussion:** *Your discussion of results here.* diff --git a/docs/assignments/Project_1.html b/docs/assignments/Project_1.html new file mode 100644 index 0000000..96c0471 --- /dev/null +++ b/docs/assignments/Project_1.html @@ -0,0 +1,456 @@ + + + + + + + + + + + + + +Project 1 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

This is the dataset you will be working with:

+
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
+
+triathlon <- olympics %>% 
+  filter(!is.na(height)) %>%             # only keep athletes with known height
+  filter(sport == "Triathlon") %>%       # keep only triathletes
+  mutate(
+    medalist = case_when(                # add column to track medalist vs not
+      is.na(medal) ~ "non-medalist",
+      !is.na(medal) ~ "medalist"         # any medals (Gold, Silver, Bronze) count
+    )
+  )
+

triathlon is a subset of olympics and +contains only the data for triathletes. More information about the +original olympics dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md +and https://www.sports-reference.com/olympics.html.

+

For this project, use triathlon to answer the following +questions about athletes competing in this sport:

+
    +
  1. In how many events total did male and female triathletes compete for +each country?
  2. +
  3. Are there height differences among triathletes between sexes or over +time?
  4. +
  5. Are there height differences among triathletes that have medaled or +not, again also considering athlete sex?
  6. +
+

You should make one plot per question.

+

Hints:

+ +

You can delete these instructions from your project. Please also +delete text such as Your approach here or +# Q1: Your R code here.

+

Introduction: Your introduction here.

+

Approach: Your approach here.

+

Analysis:

+
# Q1: Your R code here
+
# Q2: Your R code here
+
# Q3: Your R code here
+

Discussion: Your discussion of results +here.

+ + + + +
+ + + + + + + + + + + + + + + diff --git a/docs/assignments/Project_1_example.html b/docs/assignments/Project_1_example.html new file mode 100644 index 0000000..b6261dd --- /dev/null +++ b/docs/assignments/Project_1_example.html @@ -0,0 +1,576 @@ + + + + + + + + + + + + + +Project 1 Example Solution + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

Claus O. Wilke, EID

+

This is the dataset you will be working with:

+
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
+
+NCbirths
+
## # A tibble: 1,409 × 10
+##    Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
+##     <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
+##  1      1     1     32    40     38     0         3147.     0      0       0
+##  2      1     2     32    37     34     0         3289.     0      0       0
+##  3      1     1     27    39     12     0         3912.     0      0       0
+##  4      1     1     27    39     15     0         3856.     0      0       0
+##  5      1     1     25    39     32     0         3430.     0      0       0
+##  6      1     1     28    43     32     0         3317.     0      0       0
+##  7      1     2     25    39     75     0         4054.     0      0       0
+##  8      1     2     15    42     25     0         3204.     0      0       1
+##  9      1     2     21    39     28     0         3402      0      0       0
+## 10      1     2     27    40     37     0         3515.     0      0       1
+## # ℹ 1,399 more rows
+

Questions:

+
    +
  1. Is there a relationship between whether a mother smokes or not +and her baby’s weight at birth?

  2. +
  3. How many mothers are smokers or non-smokers?

  4. +
  5. What are the age distributions of mothers of twins or +triplets?

  6. +
+

Introduction: We are working with the +NCbirths dataset, which contains 1409 birth records from +North Carolina in 2001. In this dataset, each row corresponds to one +birth, and there are ten columns providing information about the birth, +the mother, and the baby. Information about the birth includes whether +it is a single, twin, or triplet birth, the number of completed weeks of +gestation, and whether the birth is premature. Information about the +baby includes the sex, the weight at birth, and whether the birth weight +should be considered low. Information about the mother includes her age, +the weight gained during pregnancy, whether she is a smoker, and whether +she is married.

+

      To answer the three questions, we will work with five variables: the +baby’s birthweight (column BirthWeightGm), whether the baby +was born prematurely (column Premie), whether it was a +singleton, twin, or triplet birth (column Plural), whether +the mother is a smoker or not (column Smoke), and the +mother’s age (column MomAge). The birthweight is provided +as a numeric value, in grams. The premature birth status is encoded as +0/1, where 0 means regular and 1 means premature (36 weeks or earlier). +The number of births is encoded as 1/2/3, representing singleton, twin, +and triplet births, respectively. The smoking status is encoded as 0/1, where +0 means the mother is not a smoker and 1 means she is a smoker. The +mother’s age is provided in years.
      

+

Approach: To show the distributions of birthweights +versus the mothers’ smoking status we will be using violin plots +(geom_violin()). We also separate out regular and premature +births, because babies born prematurely have much lower birthweight and +therefore must be considered separately. Violins make it easy to compare +multiple distributions side-by-side.

+

To show the number of mothers that are smokers or non-smokers we will +use a simple bar plot (geom_bar()). Finally, to show the +distribution of mothers’ ages we will make a strip chart. The number of +twin and triplet births in the dataset is not that large, so a strip +chart is a good option here.

+

Analysis:

+

Question 1: Is there a relationship between whether a mother smokes +or not and her baby’s weight at birth?

+

To answer this question, we plot the birthweight distributions as +violins, separated by both smoking status and by whether the birth was +regular or premature.

+
# The columns `Premie` and `Smoke` are numerical but contain
+# categorical data, so we convert to factors to ensure ggplot
+# treats them correctly
+ggplot(NCbirths, aes(factor(Premie), BirthWeightGm)) +
+  geom_violin(aes(fill = factor(Smoke))) +
+  scale_x_discrete(
+    name = NULL, # remove axis title entirely
+    labels = c("regular birth", "premature birth")
+  ) +
+  scale_y_continuous(
+    name = "Birth weight (gm)"
+  ) +
+  scale_fill_manual(
+    name = "Mother",
+    labels = c("non-smoker", "smoker"),
+    # explicitly assign colors to specific data values
+    values = c(`0` = "#56B4E9", `1` = "#E69F00")
+  ) + 
+  theme_bw(12)
+

+

There is a clear difference between birthweight for regular and +premature births, and for regular births the birthweight also seems to +be lower when the mother smokes.

+

Question 2: How many mothers are smokers or non-smokers?

+

To answer this question, we make a simple bar plot of the number of +mothers by smoking status.

+
# again, convert `Smoke` into factor so it's categorical
+ggplot(NCbirths, aes(y = factor(Smoke))) +
+  geom_bar() +
+  scale_y_discrete(
+    name = NULL,
+    labels = c("non-smoker", "smoker")
+  ) +
+  scale_x_continuous(
+    # ensure there's no gap between the beginning of the bar
+    # and the edge of the plot panel
+    expand = expansion(mult = c(0, 0.1))
+  ) +
+  theme_bw(12)
+

+

The vast majority of mothers in the dataset are non-smokers (almost +1250). Fewer than 250 are smokers.

+

Question 3. What are the age distributions of mothers of twins or +triplets?

+

To answer this question, we first remove singleton births from the +dataset and then show age distributions as a strip chart.

+
NCbirths %>%
+  filter(Plural > 1) %>% # remove singlet births
+  ggplot(aes(x = factor(Plural), y = MomAge)) +
+  geom_point(
+    # jitter horizontally so points don't overlap
+    position = position_jitter(
+      width = 0.2,
+      height = 0
+    ),
+    # it's nice to make points a little bigger and give them some color
+    size = 2,
+    color = "#1E4A7F"
+  ) +
+  scale_x_discrete(
+    name = NULL,
+    labels = c("twins", "triplets")
+  ) +
+  scale_y_continuous(
+    name = "age of mother (years)"
+  ) +
+  theme_bw(12)
+

+

Mothers of twins span the entire childbearing range, from 15 years to +approximately 40 years old. By contrast, mothers of triplets tend to be +in their thirties.

+

      Discussion: The smoking status of the mother appears +to have a small effect on the average birth weight for regular births. +We can see this by comparing the two left-most violins in the first +plot, where we see that they are slightly vertically shifted relative to +each other but otherwise have a comparable shape. However, a much bigger +effect comes from whether the baby is born prematurely or not. Premature +births have on average a much lower birthweight than regular births, and +the variance is also bigger (the two right-most violins are taller than +the two left-most violins). Interestingly, smoking status does not seem +to affect the distribution of birthweights for premature births much. We +can see this from the fact that the two right-most violins look +approximately the same. We would have to run a multivariate statistical +analysis to determine whether any of these observed patterns are +statistically significant.
      

+

There are many more births to non-smoking mothers than to smoking +mothers in the dataset. This is important because it means we have more +complete data for non-smoking mothers. Some of the differences we saw in +the first graph, such as the slightly lower variance in birthweight for +premature births to smoking mothers—as compared to premature births to +non-smoking mothers—may simply be due to a smaller data set.

+

      When comparing age distributions of mothers of twins or of triplets +we see an unexpected difference. It appears that mothers of all ages, +from teenage moms to moms in their early forties, can have twins. +By contrast, only mothers in their thirties appear to have triplets. We +can think of a possible explanation. Twin births happen due to natural +causes and therefore can occur in mothers of all ages. Triplet births, +however, are extremely unlikely to occur naturally, and most commonly +are caused by fertility treatments that cause multiple eggs to mature at +once. It is unlikely that women in their late teens or twenties will +undergo fertility treatment, whereas women in their thirties do so +frequently. We also note, however, that there are only four triplet +births in the dataset, so the lack of younger mothers could be due to +random chance. We would have to perform further analysis or run +statistical tests to develop a clearer picture of what mechanisms may have +caused the observed patterns in the data.
      

diff --git a/docs/assignments/Project_1_instructions.html b/docs/assignments/Project_1_instructions.html
new file mode 100644
index 0000000..5ee58a8
--- /dev/null
+++ b/docs/assignments/Project_1_instructions.html
@@ -0,0 +1,475 @@
+Project 1 Instructions

Please use the project template R Markdown document to complete your +project. The knitted R Markdown document (as a PDF) and the raw +R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on +Thurs., Feb 15, 2024. These two documents will be +graded jointly, so they must be consistent (as in, don’t change the R +Markdown file without also updating the knitted document!).

+

All results presented must have corresponding code, and the +code should be visible in the final generated pdf for ease of grading. +Any answers/results given without the corresponding R code that +generated the result will be considered absent. All code +reported in your final project document should work properly. Please do +not include any extraneous code or code which produces error messages. +(Code which produces warnings is acceptable, as long as you understand +what the warnings mean and explain this.)

+

For this project, you will be using an Olympic Games dataset, which +is a compilation of records for athletes that have competed in the +Olympics from Athens 1896 to Rio 2016.

+

Each record contains information including the name of the athlete +(name), their sex, their age, +their height, their weight, their +team, their nationality (noc), the +games at which they played, the year, the +Olympic season, the city where the Olympics +took place, the sport, the name of the event +(event), the decade during which the Olympics took place +(decade), whether or not the athlete won a gold medal +(gold), whether or not the athlete won any medal +(medalist), and whether the athlete won “Gold”, “Silver”, +“Bronze” or received “no medal” (medal). More information +about the dataset can be found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

+

We will provide you with specific questions to answer and specific +instructions on how to answer the questions. The project should be +structured as follows:

+ +

We encourage you to be concise. A paragraph should typically not be +longer than 5 sentences.

+

You are not required to perform any statistical +tests in this project, but you may do so if you find it helpful to +answer your question.

+
+

Instructions

+

In the Introduction section, write a brief introduction to the +dataset, the questions, and what parts of the dataset are necessary to +answer the questions. You may repeat some of the information about the +dataset provided above, paraphrasing on your own terms. Imagine that +your project is a standalone document and the grader has no prior +knowledge of the dataset. You do not need to describe variables that are +never used in your analysis.

+

In the Approach section, describe what types of plots you are going +to make to address your questions. For each plot, provide a clear +explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) +is best for providing the information you are asking about. (You can +draw on the materials provided +here for guidance.) All plots should be of different types, +and all should use either color mapping or faceting or +both.
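To make the difference between color mapping and faceting concrete, here is a minimal sketch on invented toy data. The data frame `toy` and its columns `group`, `period`, and `value` are hypothetical and unrelated to the Olympics dataset; the plot type is chosen only for illustration.

```r
library(ggplot2)

# Toy data invented for illustration only
set.seed(1)
toy <- data.frame(
  group  = rep(c("A", "B"), each = 20),
  period = rep(c("early", "late"), times = 20),
  value  = rnorm(40, mean = 170, sd = 8)
)

# Color mapping: the second variable is shown via the fill color
p_color <- ggplot(toy, aes(x = group, y = value, fill = period)) +
  geom_boxplot()

# Faceting: the second variable splits the plot into separate panels
p_facet <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(period))

p_color
p_facet
```

Color mapping keeps all groups in one panel and makes direct comparison easier, whereas faceting gives each subgroup its own panel and tends to scale better when there are many categories.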

+

In the Analysis section, provide the code that generates your plots. +Use scale functions to provide nice axis labels and guides. You are +welcome to use theme functions to customize the appearance of your plot, +but you are not required to do so for this project. All plots +must be made with ggplot2. Do not use base R plotting +functions.
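As an illustration of using scale functions for axis labels and guides, the sketch below sets the axis titles and legend title on a small invented data frame. The data and label text are assumptions made up for this example, not part of the assignment.

```r
library(ggplot2)

# Toy data invented for illustration only
toy <- data.frame(
  sex    = c("F", "F", "M", "M"),
  year   = factor(c(2000, 2004, 2000, 2004)),
  height = c(168, 170, 180, 182)
)

p <- ggplot(toy, aes(x = year, y = height, fill = sex)) +
  geom_col(position = "dodge") +
  # scale functions control the axis titles and the legend (guide)
  scale_x_discrete(name = "Year") +
  scale_y_continuous(name = "Height (cm)") +
  scale_fill_discrete(name = "Sex", labels = c("female", "male"))

p
```

The same titles could also be set with `labs()`, but scale functions additionally allow control over breaks, limits, and legend labels in one place.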

+

In the Discussion section, interpret the results of your analysis. +Identify any trends revealed (or not revealed) by the plots. Speculate +about why the data looks the way it does.

+
diff --git a/docs/assignments/Project_1_rubric.pdf b/docs/assignments/Project_1_rubric.pdf
new file mode 100644
index 0000000..1729159
Binary files /dev/null and b/docs/assignments/Project_1_rubric.pdf differ
diff --git a/docs/schedule.html b/docs/schedule.html
index 76551b3..337ddb8 100644
--- a/docs/schedule.html
+++ b/docs/schedule.html
@@ -2646,6 +2646,16 @@

Homework 7 (due Apr 11, 2024)

Projects

All projects are due by 11:00pm on the day they are due. Projects need to be submitted on Canvas. Please carefully read the submission instructions for each project.

Project 1 (due Feb 15, 2023)

+

+Materials: +

+

Project 2 (due Mar 21, 2023)

Project 3 (due Apr 18, 2023)

Reuse

diff --git a/docs/search.json b/docs/search.json index ed33f1f..a88864e 100644 --- a/docs/search.json +++ b/docs/search.json @@ -6,21 +6,21 @@ "description": "Data Visualization in R", "author": [], "contents": "\nThis is the home page for SDS 375, Data Visualization in R. All course materials will be posted on this site.\nInstructor: Claus O. Wilke\nMeeting times: TTH 3:30pm to 5:00pm\nVenue: UTC 4.110\nSyllabus: click here\nUpcoming lectures and assignments: click here\nComputing requirements\nFor students enrolled in this course, you only need a working web browser to access the edupod server, located at: https://edupod.cns.utexas.edu/\nIf you are using the edupod server, stop reading here. Everything is pre-installed and no further action is needed.\nTo run any of the materials locally on your own machine, you will need the following:\nA recent version of R, download from here.\nA recent version of RStudio, download from here.\nThe following R packages:\nbroom, cluster, colorspace, cowplot, distill, gapminder, GGally, gganimate, ggiraph, ggdendro, ggdist, ggforce, ggplot2movies, ggrepel, ggridges, ggthemes, gifski, glue, knitr, learnr, naniar, margins, MASS, Matrix, nycflights13, palmerpenguins, patchwork, rmarkdown, rnaturalearth, rnaturalearthhires, scales, sf, shinyjs, sp, tidyverse, transformr, umap, xaringan\nYou can install all required R packages at once by running the following code in the R command line:\n\n\n# first run this command:\ninstall.packages(\n c(\n \"broom\", \"cluster\", \"colorspace\", \"cowplot\", \"distill\", \"gapminder\", \n \"GGally\", \"gganimate\", \"ggiraph\", \"ggdendro\", \"ggdist\", \"ggforce\",\n \"ggplot2movies\", \"ggrepel\", \"ggridges\", \"ggthemes\", \"gifski\", \"glue\",\n \"knitr\", \"learnr\", \"naniar\", \"margins\", \"MASS\", \"Matrix\",\n \"nycflights13\", \"palmerpenguins\", \"patchwork\", \"rmarkdown\", \"rnaturalearth\",\n \"scales\", \"sf\", \"shinyjs\", \"sp\", \"tidyverse\", \"transformr\", \"umap\",\n 
\"xaringan\"\n )\n)\n\n# then run this command:\ninstall.packages(\n \"rnaturalearthhires\", repos = \"https://packages.ropensci.org\", type = \"source\"\n)\n\n\nReuse\nText and figures are licensed under Creative Commons Attribution CC BY 4.0. Any computer code (R, HTML, CSS, etc.) in slides and worksheets, including in slide and worksheet sources, is also licensed under MIT. Note that figures in slides may be pulled in from external sources and may be licensed under different terms. For such images, image credits are available in the slide notes, accessible via pressing the letter ‘p’.\n\n\n\n", - "last_modified": "2024-01-29T15:00:27-06:00" + "last_modified": "2024-02-01T14:50:44-06:00" }, { "path": "LICENSE.html", "author": [], "contents": "\nMIT License\nCopyright (c) 2021 Claus O. Wilke\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the “Software”), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\nTHE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n\n\n", - "last_modified": "2024-01-29T15:00:27-06:00" + "last_modified": "2024-02-01T14:50:45-06:00" }, { "path": "schedule.html", "title": "SDS 375 Schedule Spring 2023", "description": "", "author": [], - "contents": "\n\nContents\nLectures\nHomeworks\nProjects\nReuse\n\nLectures\n1. Jan 16, 2024—Introduction\n\nMaterials:\n\nSlides\nWorksheet (Solutions are available here)\n2. Jan 18, 2024—Aesthetic mappings\n\nMaterials:\n\nSlides\nWorksheet\n3. Jan 23, 2023—Telling a story, Visualizing amounts\n\nMaterials:\n\nSlides: Telling a story\nSlides: Visualizing amounts\nWorksheet\n4. Jan 25, 2023—Coordinate systems and axes\n\nMaterials:\n\nSlides\nWorksheet\n5. Jan 30, 2024—Visualizing distributions 1\n\nMaterials:\n\nSlides\nWorksheet\n6. Feb 1, 2024—Visualizing distributions 2\n\nMaterials:\n\nSlides\nWorksheet\nHomeworks\nAll homeworks are due by 11:00pm on the day they are due. Homeworks need to be submitted as pdf files on Canvas.\nHomework 1 (due Jan 25, 2024)\n\nMaterials:\n\nR Markdown template\nHTML\nHomework 2 (due Feb 1, 2024)\n\nMaterials:\n\nR Markdown template\nHTML\nHomework 3 (due Feb 8, 2024)\n\nMaterials:\n\nR Markdown template\nHTML\nHomework 4 (due Feb 29, 2024)\nHomework 5 (due Mar 7, 2024)\nHomework 6 (due Apr 4, 2024)\nHomework 7 (due Apr 11, 2024)\nProjects\nAll projects are due by 11:00pm on the day they are due. Projects need to be submitted on Canvas. Please carefully read the submission instructions for each project.\nProject 1 (due Feb 15, 2023)\nProject 2 (due Mar 21, 2023)\nProject 3 (due Apr 18, 2023)\nReuse\nText and figures are licensed under Creative Commons Attribution CC BY 4.0. Any computer code (R, HTML, CSS, etc.) 
in slides and worksheets, including in slide and worksheet sources, is also licensed under MIT. Note that figures in slides may be pulled in from external sources and may be licensed under different terms. For such images, image credits are available in the slide notes, accessible via pressing the letter ‘p’.\n\n\n\n", - "last_modified": "2024-01-29T15:00:28-06:00" + "contents": "\n\nContents\nLectures\nHomeworks\nProjects\nReuse\n\nLectures\n1. Jan 16, 2024—Introduction\n\nMaterials:\n\nSlides\nWorksheet (Solutions are available here)\n2. Jan 18, 2024—Aesthetic mappings\n\nMaterials:\n\nSlides\nWorksheet\n3. Jan 23, 2023—Telling a story, Visualizing amounts\n\nMaterials:\n\nSlides: Telling a story\nSlides: Visualizing amounts\nWorksheet\n4. Jan 25, 2023—Coordinate systems and axes\n\nMaterials:\n\nSlides\nWorksheet\n5. Jan 30, 2024—Visualizing distributions 1\n\nMaterials:\n\nSlides\nWorksheet\n6. Feb 1, 2024—Visualizing distributions 2\n\nMaterials:\n\nSlides\nWorksheet\nHomeworks\nAll homeworks are due by 11:00pm on the day they are due. Homeworks need to be submitted as pdf files on Canvas.\nHomework 1 (due Jan 25, 2024)\n\nMaterials:\n\nR Markdown template\nHTML\nHomework 2 (due Feb 1, 2024)\n\nMaterials:\n\nR Markdown template\nHTML\nHomework 3 (due Feb 8, 2024)\n\nMaterials:\n\nR Markdown template\nHTML\nHomework 4 (due Feb 29, 2024)\nHomework 5 (due Mar 7, 2024)\nHomework 6 (due Apr 4, 2024)\nHomework 7 (due Apr 11, 2024)\nProjects\nAll projects are due by 11:00pm on the day they are due. Projects need to be submitted on Canvas. Please carefully read the submission instructions for each project.\nProject 1 (due Feb 15, 2023)\n\nMaterials:\n\nInstructions\nProject Template (Rmd)\nProject Template (HTML)\nGrading rubric\nExample project\nProject 2 (due Mar 21, 2023)\nProject 3 (due Apr 18, 2023)\nReuse\nText and figures are licensed under Creative Commons Attribution CC BY 4.0. Any computer code (R, HTML, CSS, etc.) 
in slides and worksheets, including in slide and worksheet sources, is also licensed under MIT. Note that figures in slides may be pulled in from external sources and may be licensed under different terms. For such images, image credits are available in the slide notes, accessible via pressing the letter ‘p’.\n\n\n\n", + "last_modified": "2024-02-01T14:50:45-06:00" }, { "path": "syllabus.html", @@ -28,7 +28,7 @@ "description": "", "author": [], "contents": "\n\nContents\nCourse title and instructor\nPurpose and contents of the class\nPrerequisites\nTextbook\nTopics covered\nComputing requirements\nCourse site\nAssignments and grading\nLate assignment policy\nOffice hours\nEmail policy\nSpecial accommodations\nAcademic dishonesty\nSharing of Course Materials is Prohibited\nClass Recordings\nReuse\n\nCourse title and instructor\nTitle: SDS 375 Data Visualization in RSemester: Spring 2024Unique: 56690, TTH 3:30pm–5:00pm, UTC 4.110\nInstructor: Claus O. WilkeEmail: wilke@austin.utexas.eduOffice Hours: Mon. 9am - 10am (open Zoom), Thurs. 10am - 11am (open Zoom), or by appointment\nTeaching Assistant: Alexis HillEmail: alexis.hill@utexas.eduOffice Hours: Wed. 2pm - 3PM (open Zoom), Thurs. 11am - 12pm (open Zoom), or by appointment\nPurpose and contents of the class\nIn this class, students will learn how to visualize data sets and how to reason about and communicate with data visualizations. A substantial component of this class will be dedicated to learning how to program in R. In addition, students will learn how to compile analyses and visualizations into reports, how to make the reports reproducible, and how to post reports on a website or blog.\nPrerequisites\nThe class requires no prior knowledge of programming. 
However, students are expected to have successfully completed an introductory statistics class taught with R, such as SDS 320E, and they are expected to have some basic familiarity with the statistical language R.\nTextbook\nThis class draws heavily from materials presented in the following book:\nClaus O. Wilke. Fundamentals of Data Visualization. O’Reilly Media, 2019.\nAdditionally, we will also make use of the following books:\nHadley Wickham, Danielle Navarro, and Thomas Lin Pedersen. ggplot2: Elegant Graphics for Data Analysis, 3rd ed. Springer, to appear.\nKieran Healy. Data Visualization: A Practical Introduction. Princeton University Press, 2018.\nAll these books are freely available online and you do not need to purchase a physical copy of either book to succeed in this class.\nTopics covered\n\nClass\nTopic\nCoding concepts covered\n1.\nIntroduction, reproducible\nworkflows\nRStudio setup online, R Markdown\n2.\nAesthetic mappings\nggplot2 quickstart\n3.\nTelling a story\n\n4.\nVisualizing amounts\ngeom_col(), geom_point(),\nposition adjustments\n5.\nCoordinate systems and\naxes\ncoords and position scales\n6.\nVisualizing distributions\n1\nstats, geom_density(),\ngeom_histogram()\n7.\nVisualizing distributions\n2\nviolin plots, sina plots, ridgeline plots\n8.\nColor scales\ncolor and fill scales\n9.\nData wrangling 1\nmutate(), filter(), arrange()\n10.\nData wrangling 2\ngroup_by(), summarize(), count()\n11.\nVisualizing proportions\nbar charts, pie charts\n12.\nGetting to know your data\nhandling missing data, is.na(), case_when()\n13.\nGetting things into the\nright order\nfct_reorder(), fct_lump()\n14.\nFigure design\nggplot themes\n15.\nColor spaces, color vision\ndeficiency\ncolorspace package\n16.\nFunctions and functional\nprogramming\nmap(), nest(), purrr package\n17.\nVisualizing trends\ngeom_smooth()\n18.\nWorking with models\nlm, cor.test, broom package\n19.\nVisualizing uncertainty\nfrequency framing, error bars, ggdist 
package\n20.\nDimension reduction 1\nPCA\n21.\nDimension reduction 2\nkernel PCA, t-SNE, UMAP\n22.\nClustering 1\nk-means clustering\n23.\nClustering 2\nhierarchical clustering\n24.\nVisualizing geospatial\ndata\ngeom_sf(), coord_sf()\n25.\nRedundant coding, text\nannotations\nggrepel package\n26.\nInteractive plots\nggiraph package\n27.\nOver-plotting\njittering, 2d histograms,\ncontour plots\n28.\nCompound figures\npatchwork package\n\nComputing requirements\nProgramming needs to be learned by doing, and a significant portion of the in-class time will be dedicated to working through simple problems. All programming exercises will be available through a web-based system, so the only system requirement for student computers is a modern web browser.\nCourse site\nAll materials and assignments will be posted on the course webpage at:\nhttps://wilkelab.org/SDS375\nAssignment deadlines are shown on the schedule at: https://wilkelab.org/SDS375/schedule.html\nAssignments will be submitted and grades will be posted on Canvas at:\nhttps://utexas.instructure.com\nParticipation via presence in class and in online discussions will also be tracked on Canvas.\nR compute sessions are available at:\nhttps://edupod.cns.utexas.edu\nNote that edupods will be unavailable due to maintenance approximately two hours per month, usually on a Thursday afternoon between 4pm and 6pm. Specific maintenance times are published in advance here:\nhttps://wikis.utexas.edu/display/RCTFusers\nAssignments and grading\nThe graded components of this class will be homeworks, projects, peer-grading, and participation. Each week either a homework, a project, or a peer-grading is due. Homeworks will be relatively short visualization problems to be solved by the student, usually involving some small amount of programming to achieve a specified goal. They are graded by the TA. Projects are larger and more involved data analysis problems that involve both programming and writing. 
They are peer-graded by the students. Students will have at least one week to complete each homework and two weeks to complete each project. The submission deadlines for homeworks and projects will be Thursdays at 11pm.\nThere will be seven homeworks and three projects. Both homeworks and projects need to be submitted electronically on Canvas. Homeworks are worth 20 points and projects are worth 100 points. The lowest-scoring homework will be dropped, so that a maximum of 120 points can be obtained from the homeworks.\nProjects are peer-graded, which involves evaluating three projects by other students according to a detailed grading rubric that will be provided. The final grade for each project is the mean of the peer-graded projects. The peer-grading itself will be graded by the TA, who will also oversee and spot-check the assigned peer grades. Experience has shown that peer-grading is often the most instructive component of this class, so don’t take this lightly.\nParticipation is assessed in two ways. First, students will receive 2 points for every lecture they attend. This is tracked via simple quizzes on Canvas. Second, each week students can receive up to 4 points for making substantive contributions to the Canvas online discussion (2 points per contribution). Total participation points are capped at 52 (13 weeks of class times 4 points), so students can compensate for lack of in-person attendance by participating in discussions and vice versa. You do not have to get full points in both in-person attendance and online discussions. No participation is assessed in the first week of class.\n\nAssignment type\nNumber\nPoints per assignment\nTotal points\nHomework\n6 (+1)\n20\n120\nProject\n3\n100\n300\nPeer grading\n3\n16\n48\nParticipation\n26 (+26)\n2\n52\n\nThus, in summary, each project (+ peer grading) contributes 22% to the final grade, the totality of all homeworks contributes another 23% to the final grade, and participation contributes 10%. 
There are no traditional exams in this class and there is no final.\nThe class will use +/- grading, and the exact grade boundaries will be determined at the end of the semester. However, the following minimum grades will be guaranteed:\n\nPoints achieved\nMinimum guaranteed grade\n468 (90%)\nA-\n416 (80%)\nB-\n364 (70%)\nC-\n260 (50%)\nD-\n\nLate assignment policy\nHomeworks that are submitted past the posted deadline will not be graded and will receive 0 points.\nProject submissions will have a 1-day grace period. Projects submitted during the grace period will have 25 points deducted from the obtained grade. After the grace period, students who have not submitted their project will receive 0 points.\nPeer grades need to be submitted by the posted deadline. Late submissions will result in 0 points for the peer-grading effort.\nIn case of illness or other unforeseen circumstances out of your control, please reach out to Claus Wilke as soon as possible. We will consider your request on a case-by-case basis. If you need a deadline extension for valid reasons, please reach out before the official submission deadline and state how much of an extension you would need. Whether deadline extensions are possible depends on the severity of your situation as well as whether the solutions to the assignment have already been published.\nOffice hours\nBoth the graduate TA and myself will be available at posted times or by appointment. Office hours will be over Zoom. The most effective way to request an appointment for office hours outside of posted times is to suggest several times that work for you. I would suggest to write an email such as the following:\nDear Dr. Wilke,\n\nI would like to request a meeting with you outside of \nregular office hours this week. I am available Thurs.\nbetween 1pm and 2:30pm or Fri. 
before 11am or after 4pm.\n\nThanks a lot,\n John Doe\nNote that we will not usually make appointments before 9am or after 5pm.\nEmail policy\nWhen emailing about this course, please put “SDS375” into the subject line. Emails to the instructor or TA should be restricted to organizational issues, such as requests for appointments, questions about course organization, etc. For all other issues, post in the discussions on Canvas, ask a question during open Zoom, or make an appointment for a one-on-one session.\nSpecifically, we will not discuss technical issues related to assignments over email. Technical issues are questions concerning how to approach a particular problem, whether a particular solution is correct, or how to use the statistical software R. These questions should be posted as issues on GitHub. Also, we will not discuss grading-related matters over email. If you have a concern about grading, schedule a one-on-one Zoom meeting.\nSpecial accommodations\nStudents with disabilities. Students with disabilities may request appropriate accommodations from the Division of Diversity and Community Engagement, Services for Students with Disabilities, 512-471-6259, https://diversity.utexas.edu/disability/\nReligious holy days. Students who must miss a class or an assignment to observe a religious holy day will be given an opportunity to complete the missed work within a reasonable time after the absence. According to UT Austin policy, such students must notify me of the pending absence at least fourteen days prior to the date of observance of a religious holy day.\nAcademic dishonesty\nThis course is built upon the idea that student interaction is important and a powerful way to learn. We encourage you to communicate with other students, in particular through the discussion forums on Canvas. However, there are times when you need to demonstrate your own ability to work and solve problems. 
In particular, your homeworks and projects are independent work, unless explicitly stated otherwise. You are allowed to confer with fellow students about general approaches to solve the problems in the assignments, but you have to do the assignments on your own and describe your work in your own words. Students who violate these expectations can expect to receive a failing grade on the assignment and will be reported to Student Judicial Services. These types of violations are reported to professional schools, should you ever decide to apply one day. Don’t do it—it’s not worth the consequences.\nSharing of Course Materials is Prohibited\nAny materials in this class that are not posted publicly may not be shared online or with anyone outside of the class unless you have my explicit, written permission. This includes but is not limited to lecture hand-outs, videos, assessments (quizzes, exams, papers, projects, homework assignments), in-class materials, review sheets, and additional problem sets. Unauthorized sharing of materials promotes cheating. It is a violation of the University’s Student Honor Code and an act of academic dishonesty. We are well aware of the sites used for sharing materials, and any materials found online that are associated with you, or any suspected unauthorized sharing of materials, will be reported to Student Conduct and Academic Integrity in the Office of the Dean of Students. These reports can result in sanctions, including failure in the course.\nAny materials posted on the public class website (https://wilkelab.org/SDS375/) are considered public and can be shared under the Creative Commons Attribution CC BY 4.0 license.\nClass Recordings\nIf any class recordings are provided they are reserved only for students in this class for educational purposes and are protected under FERPA. The recordings should not be shared outside the class in any form. 
Violation of this restriction by a student could lead to Student Misconduct proceedings.\nReuse\nText and figures are licensed under Creative Commons Attribution CC BY 4.0. Any computer code (R, HTML, CSS, etc.) in slides and worksheets, including in slide and worksheet sources, is also licensed under MIT. Note that figures in slides may be pulled in from external sources and may be licensed under different terms. For such images, image credits are available in the slide notes, accessible via pressing the letter ‘p’.\n\n\n\n", - "last_modified": "2024-01-29T15:00:28-06:00" + "last_modified": "2024-02-01T14:50:45-06:00" } ], "collections": [] diff --git a/schedule.Rmd b/schedule.Rmd index b0d2330..c8199e5 100644 --- a/schedule.Rmd +++ b/schedule.Rmd @@ -96,6 +96,14 @@ All projects are due by 11:00pm on the day they are due. Projects need to be sub ### Project 1 (due Feb 15, 2023) +

Materials:

+
+- [Instructions](assignments/Project_1_instructions.html)
+- [Project Template (Rmd)](assignments/Project_1.Rmd)
+- [Project Template (HTML)](assignments/Project_1.html)
+- [Grading rubric](assignments/Project_1_rubric.pdf)
+- [Example project](assignments/Project_1_example.html)
+
 ### Project 2 (due Mar 21, 2023)
 
 ### Project 3 (due Apr 18, 2023)