diff --git a/assignments/Project_1.Rmd b/assignments/Project_1.Rmd new file mode 100644 index 0000000..f50c574 --- /dev/null +++ b/assignments/Project_1.Rmd @@ -0,0 +1,64 @@ +--- +title: "Project 1" +output: html_document +--- + +```{r setup, include=FALSE} +library(tidyverse) +knitr::opts_chunk$set(echo = TRUE) +``` + +This is the dataset you will be working with: +```{r message = FALSE} +olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv') + +triathlon <- olympics %>% + filter(!is.na(height)) %>% # only keep athletes with known height + filter(sport == "Triathlon") %>% # keep only triathletes + mutate( + medalist = case_when( # add column to track medalist vs not + is.na(medal) ~ "non-medalist", + !is.na(medal) ~ "medalist" # any medals (Gold, Silver, Bronze) count + ) + ) +``` + +`triathlon` is a subset of `olympics` and contains only the data for triathletes. More information about the original `olympics` dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md and https://www.sports-reference.com/olympics.html. + +For this project, use `triathlon` to answer the following questions about athletes competing in this sport: + +1. In how many events total did male and female triathletes compete for each country? +2. Are there height differences among triathletes between sexes or over time? +3. Are there height differences among triathletes that have medaled or not, again also considering athlete sex? + +You should make one plot per question. + +**Hints:** + +- We recommend you use a bar plot for question 1, a boxplot for question 2, and a sina plot overlaid on top of violins for question 3. However, you are free to use any of the plots we have discussed in class so far. +- For question 2, you will have to convert `year` into a factor. 
+- For question 3, consider why a boxplot or simple violin plot is not a good idea and mention this in the approach section. +- For all questions, you can use either faceting or color coding or both. Pick whichever you prefer. +- Adjust `fig.width` and `fig.height` in the chunk headers to customize figure sizing and figure aspect ratios. + +You can delete these instructions from your project. Please also delete text such as *Your approach here* or `# Q1: Your R code here`. + +**Introduction:** *Your introduction here.* + +**Approach:** *Your approach here.* + +**Analysis:** + +```{r fig.width = 5, fig.height = 5} +# Q1: Your R code here +``` + +```{r fig.width = 5, fig.height = 5} +# Q2: Your R code here +``` + +```{r fig.width = 5, fig.height = 5} +# Q3: Your R code here +``` + +**Discussion:** *Your discussion of results here.* diff --git a/assignments/Project_1.html b/assignments/Project_1.html new file mode 100644 index 0000000..96c0471 --- /dev/null +++ b/assignments/Project_1.html @@ -0,0 +1,456 @@ + + + + +
+ + + + + + + + +This is the dataset you will be working with:
+olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
+
+triathlon <- olympics %>%
+ filter(!is.na(height)) %>% # only keep athletes with known height
+ filter(sport == "Triathlon") %>% # keep only triathletes
+ mutate(
+ medalist = case_when( # add column to track medalist vs not
+ is.na(medal) ~ "non-medalist",
+ !is.na(medal) ~ "medalist" # any medals (Gold, Silver, Bronze) count
+ )
+ )
+triathlon
is a subset of olympics
and
+contains only the data for triathletes. More information about the
+original olympics
dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md
+and https://www.sports-reference.com/olympics.html.
For this project, use triathlon
to answer the following
+questions about athletes competing in this sport:
You should make one plot per question.
+Hints:
+year
into a
+factor.fig.width
and fig.height
in the
+chunk headers to customize figure sizing and figure aspect ratios.You can delete these instructions from your project. Please also
+delete text such as Your approach here or
+# Q1: Your R code here
.
Introduction: Your introduction here.
+Approach: Your approach here.
+Analysis:
+# Q1: Your R code here
+# Q2: Your R code here
+# Q3: Your R code here
+Discussion: Your discussion of results +here.
+ + + + +Claus O. Wilke, EID
+This is the dataset you will be working with:
+NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
+
+NCbirths
+## # A tibble: 1,409 × 10
+## Plural Sex MomAge Weeks Gained Smoke BirthWeightGm Low Premie Marital
+## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 1 1 32 40 38 0 3147. 0 0 0
+## 2 1 2 32 37 34 0 3289. 0 0 0
+## 3 1 1 27 39 12 0 3912. 0 0 0
+## 4 1 1 27 39 15 0 3856. 0 0 0
+## 5 1 1 25 39 32 0 3430. 0 0 0
+## 6 1 1 28 43 32 0 3317. 0 0 0
+## 7 1 2 25 39 75 0 4054. 0 0 0
+## 8 1 2 15 42 25 0 3204. 0 0 1
+## 9 1 2 21 39 28 0 3402 0 0 0
+## 10 1 2 27 40 37 0 3515. 0 0 1
+## # ℹ 1,399 more rows
+Questions:
+Is there a relationship between whether a mother smokes or not +and her baby’s weight at birth?
How many mothers are smokers or non-smokers?
What are the age distributions of mothers of twins or +triplets?
Introduction: We are working with the
+NCbirths
dataset, which contains 1409 birth records from
+North Carolina in 2001. In this dataset, each row corresponds to one
+birth, and there are ten columns providing information about the birth,
+the mother, and the baby. Information about the birth includes whether
+it is a single, twin, or triplet birth, the number of completed weeks of
+gestation, and whether the birth is premature. Information about the
+baby includes the sex, the weight at birth, and whether the birth weight
+should be considered low. Information about the mother includes her age,
+the weight gained during pregnancy, whether she is a smoker, and whether
+she is married.
To answer the three questions, we will work with five variables, the
+baby’s birthweight (column BirthWeightGm
), whether the baby
+was born prematurely (column Premie
), whether it was a
+singleton, twin, or triplet birth (column Plural
), whether
+the mother is a smoker or not (column Smoke
), and the
+mother’s age (column MomAge
). The birthweight is provided
+as a numeric value, in grams. The premature birth status is encoded as
+0/1, where 0 means regular and 1 means premature (36 weeks or sooner).
+The number of births is encoded as 1/2/3 representing singleton, twins,
+and triplets, respectively. The smoking status is encoded as 0/1, where
+0 means the mother is not a smoker and 1 means she is a smoker. The
+mother’s age is provided in years.
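The numeric encodings described above can be turned into labeled factors before plotting. The following is a minimal sketch on a few made-up rows. The `toy` data frame and the `*_f` column names are hypothetical, while the level-to-label mapping follows the description in the text.

```r
# Hypothetical rows using the 0/1 and 1/2/3 encodings described above
toy <- data.frame(
  Smoke  = c(0, 1, 0, 0),
  Premie = c(0, 0, 1, 0),
  Plural = c(1, 2, 1, 3)
)

# Attach human-readable labels to each numeric code
toy$Smoke_f <- factor(toy$Smoke, levels = c(0, 1),
                      labels = c("non-smoker", "smoker"))
toy$Premie_f <- factor(toy$Premie, levels = c(0, 1),
                       labels = c("regular", "premature"))
toy$Plural_f <- factor(toy$Plural, levels = 1:3,
                       labels = c("singleton", "twins", "triplets"))

# Tabulate one of the recoded columns
table(toy$Smoke_f)
```

Recoding up front is an alternative to calling `factor()` inside `aes()` and relabeling through the scale functions. Either approach yields categorical axes.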
Approach: To show the distributions of birthweights
+versus the mothers’ smoking status we will be using violin plots
+(geom_violin()
). We also separate out regular and premature
+births, because babies born prematurely have much lower birthweight and
+therefore must be considered separately. Violins make it easy to compare
+multiple distributions side-by-side.
To show the number of mothers that are smokers or non-smokers we will
+use a simple bar plot (geom_bar()
). Finally, to show the
+distribution of mothers’ ages we will make a strip chart. The number of
+twin and triplet births in the dataset is not that large, so a strip
+chart is a good option here.
Analysis:
+Question 1: Is there a relationship between whether a mother smokes +or not and her baby’s weight at birth?
+To answer this question, we plot the birthweight distributions as +violins, separated by both smoking status and by whether the birth was +regular or premature.
+# The columns `Premie` and `Smoke` are numerical but contain
+# categorical data, so we convert to factors to ensure ggplot
+# treats them correctly
+ggplot(NCbirths, aes(factor(Premie), BirthWeightGm)) +
+ geom_violin(aes(fill = factor(Smoke))) +
+ scale_x_discrete(
+ name = NULL, # remove axis title entirely
+ labels = c("regular birth", "premature birth")
+ ) +
+ scale_y_continuous(
+ name = "Birth weight (gm)"
+ ) +
+ scale_fill_manual(
+ name = "Mother",
+ labels = c("non-smoker", "smoker"),
+ # explicitly assign colors to specific data values
+ values = c(`0` = "#56B4E9", `1` = "#E69F00")
+ ) +
+ theme_bw(12)
+
+There is a clear difference between birthweight for regular and +premature births, and for regular births the birthweight also seems to +be lower when the mother smokes.
+Question 2: How many mothers are smokers or non-smokers?
+To answer this question, we make a simple bar plot of the number of +mothers by smoking status.
+# again, convert `Smoke` into factor so it's categorical
+ggplot(NCbirths, aes(y = factor(Smoke))) +
+ geom_bar() +
+ scale_y_discrete(
+ name = NULL,
+ labels = c("non-smoker", "smoker")
+ ) +
+ scale_x_continuous(
+ # ensure there's no gap between the beginning of the bar
+ # and the edge of the plot panel
+ expand = expansion(mult = c(0, 0.1))
+ ) +
+ theme_bw(12)
+
+The vast majority of mothers in the dataset are non-smokers (almost +1250). Fewer than 250 are smokers.
+Question 3. What are the age distributions of mothers of twins or +triplets?
+To answer this question, we first remove singleton births from the +dataset and then show age distributions as a strip chart.
+NCbirths %>%
+ filter(Plural > 1) %>% # remove singleton births
+ ggplot(aes(x = factor(Plural), y = MomAge)) +
+ geom_point(
+ # jitter horizontally so points don't overlap
+ position = position_jitter(
+ width = 0.2,
+ height = 0
+ ),
+ # it's nice to make points a little bigger and give them some color
+ size = 2,
+ color = "#1E4A7F"
+ ) +
+ scale_x_discrete(
+ name = NULL,
+ labels = c("twins", "triplets")
+ ) +
+ scale_y_continuous(
+ name = "age of mother (years)"
+ ) +
+ theme_bw(12)
+
+Mothers of twins span the entire childbearing range, from 15 years to +approximately 40 years old. By contrast, mothers of triplets tend to be +in their thirties.
+Discussion: The smoking status of the mother appears +to have a small effect on the average birth weight for regular births. +We can see this by comparing the two left-most violins in the first +plot, where we see that they are slightly vertically shifted relative to +each other but have otherwise a comparable shape. However, a much bigger +effect comes from whether the baby is born prematurely or not. Premature +births have on average a much lower birthweight than regular births, and +the variance is also bigger (the two right-most violins are taller than +the two left-most violins). Interestingly, smoking status does not seem +to affect the distribution of birthweights for premature births much. We +can see this from the fact that the two right-most violins look +approximately the same. We would have to run a multivariate statistical +analysis to determine whether any of these observed patterns are +statistically significant.
+There are many more births to non-smoking mothers than to smoking +mothers in the dataset. This is important because it means we have more +complete data for non-smoking mothers. Some of the differences we saw in +the first graph, such as the slightly lower variance in birthweight for +premature births to smoking mothers—as compared to premature births to +non-smoking mothers—may simply be due to a smaller data set.
+When comparing age distributions of mothers of twins or of triplets +we see an unexpected difference. It appears that mothers of all ages, +from teenage moms to moms in their early fourties, all can have twins. +By contrast, only mothers in their thirties appear to have triplets. We +can think of a possible explanation. Twin births happen due to natural +causes and therefore can occur in mothers of all ages. Triplet births, +however, are extremely unlikely to occur naturally, and most commonly +are caused by fertility treatments that cause multiple eggs to mature at +once. It is unlikely that women in their late teens or twenties will +undergo fertility treatment, whereas women in their thirties do so +frequently. We also note, however, that there are only four triplet +births in the dataset, so the lack of younger mothers could be due to +random chance. We would have to perform further analysis or run +statistical tests develop a clearer picture of what mechanisms may have +caused the observed patterns in the data.
+ + + + +Please use the project template R Markdown document to complete your +project. The knitted R Markdown document (as a PDF) and the raw +R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on +Thurs., Feb 15, 2024. These two documents will be +graded jointly, so they must be consistent (as in, don’t change the R +Markdown file without also updating the knitted document!).
+All results presented must have corresponding code, and the +code should be visible in the final generated pdf for ease of grading. +Any answers/results given without the corresponding R code that +generated the result will be considered absent. All code +reported in your final project document should work properly. Please do +not include any extraneous code or code which produces error messages. +(Code which produces warnings is acceptable, as long as you understand +what the warnings mean and explain this.)
+For this project, you will be using an Olympic Games dataset, which +is a compilation of records for athletes that have competed in the +Olympics from Athens 1896 to Rio 2016.
+Each record contains information including the name of the athlete
+(name
), their sex
, their age
,
+their height
, their weight
, their
+team
, their nationality (noc
), the
+games
at which they played, the year
, the
+olympic season
, the city
where the olympics
+took place, the sport
, the name of the event
+(event
), the decade during which the Olympics took place
+(decade
), whether or not the athlete won a gold medal
+(gold
), whether or not the athlete won any medal
+(medalist
) and if the athlete won “Gold”, “Silver”,
+“Bronze” or received “no medal” (medal
). More information
+about the dataset can be found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
We will provide you with specific questions to answer and specific +instructions on how to answer the questions. The project should be +structured as follows:
+We encourage you to be concise. A paragraph should typically not be +longer than 5 sentences.
+You are not required to perform any statistical +tests in this project, but you may do so if you find it helpful to +answer your question.
+In the Introduction section, write a brief introduction to the +dataset, the questions, and what parts of the dataset are necessary to +answer the questions. You may repeat some of the information about the +dataset provided above, paraphrasing on your own terms. Imagine that +your project is a standalone document and the grader has no prior +knowledge of the dataset. You do not need to describe variables that are +never used in your analysis.
+In the Approach section, describe what types of plots you are going +to make to address your questions. For each plot, provide a clear +explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) +is best for providing the information you are asking about. (You can +draw on the materials provided +here for guidance.) All plots should be of different types, +and all should use either color mapping or faceting or +both.
+In the Analysis section, provide the code that generates your plots. +Use scale functions to provide nice axis labels and guides. You are +welcome to use theme functions to customize the appearance of your plot, +but you are not required to do so for this project. All plots +must be made with ggplot2. Do not use base R plotting +functions.
+In the Discussion section, interpret the results of your analysis. +Identify any trends revealed (or not revealed) by the plots. Speculate +about why the data looks the way it does.
+This is the dataset you will be working with:
+olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
+
+triathlon <- olympics %>%
+ filter(!is.na(height)) %>% # only keep athletes with known height
+ filter(sport == "Triathlon") %>% # keep only triathletes
+ mutate(
+ medalist = case_when( # add column to track medalist vs not
+ is.na(medal) ~ "non-medalist",
+ !is.na(medal) ~ "medalist" # any medals (Gold, Silver, Bronze) count
+ )
+ )
+triathlon
is a subset of olympics
and
+contains only the data for triathletes. More information about the
+original olympics
dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md
+and https://www.sports-reference.com/olympics.html.
For this project, use triathlon
to answer the following
+questions about athletes competing in this sport:
You should make one plot per question.
+Hints:
+year
into a
+factor.fig.width
and fig.height
in the
+chunk headers to customize figure sizing and figure aspect ratios.You can delete these instructions from your project. Please also
+delete text such as Your approach here or
+# Q1: Your R code here
.
Introduction: Your introduction here.
+Approach: Your approach here.
+Analysis:
+# Q1: Your R code here
+# Q2: Your R code here
+# Q3: Your R code here
+Discussion: Your discussion of results +here.
+ + + + +Claus O. Wilke, EID
+This is the dataset you will be working with:
+NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
+
+NCbirths
+## # A tibble: 1,409 × 10
+## Plural Sex MomAge Weeks Gained Smoke BirthWeightGm Low Premie Marital
+## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 1 1 32 40 38 0 3147. 0 0 0
+## 2 1 2 32 37 34 0 3289. 0 0 0
+## 3 1 1 27 39 12 0 3912. 0 0 0
+## 4 1 1 27 39 15 0 3856. 0 0 0
+## 5 1 1 25 39 32 0 3430. 0 0 0
+## 6 1 1 28 43 32 0 3317. 0 0 0
+## 7 1 2 25 39 75 0 4054. 0 0 0
+## 8 1 2 15 42 25 0 3204. 0 0 1
+## 9 1 2 21 39 28 0 3402 0 0 0
+## 10 1 2 27 40 37 0 3515. 0 0 1
+## # ℹ 1,399 more rows
+Questions:
+Is there a relationship between whether a mother smokes or not +and her baby’s weight at birth?
How many mothers are smokers or non-smokers?
What are the age distributions of mothers of twins or +triplets?
Introduction: We are working with the
+NCbirths
dataset, which contains 1409 birth records from
+North Carolina in 2001. In this dataset, each row corresponds to one
+birth, and there are ten columns providing information about the birth,
+the mother, and the baby. Information about the birth includes whether
+it is a single, twin, or triplet birth, the number of completed weeks of
+gestation, and whether the birth is premature. Information about the
+baby includes the sex, the weight at birth, and whether the birth weight
+should be considered low. Information about the mother includes her age,
+the weight gained during pregnancy, whether she is a smoker, and whether
+she is married.
To answer the three questions, we will work with five variables, the
+baby’s birthweight (column BirthWeightGm
), whether the baby
+was born prematurely (column Premie
), whether it was a
+singleton, twin, or triplet birth (column Plural
), whether
+the mother is a smoker or not (column Smoke
), and the
+mother’s age (column MomAge
). The birthweight is provided
+as a numeric value, in grams. The premature birth status is encoded as
+0/1, where 0 means regular and 1 means premature (36 weeks or sooner).
+The number of births is encoded as 1/2/3 representing singleton, twins,
+and triplets, respectively. The smoking status is encoded as 0/1, where
+0 means the mother is not a smoker and 1 means she is a smoker. The
+mother’s age is provided in years.
Approach: To show the distributions of birthweights
+versus the mothers’ smoking status we will be using violin plots
+(geom_violin()
). We also separate out regular and premature
+births, because babies born prematurely have much lower birthweight and
+therefore must be considered separately. Violins make it easy to compare
+multiple distributions side-by-side.
To show the number of mothers that are smokers or non-smokers we will
+use a simple bar plot (geom_bar()
). Finally, to show the
+distribution of mothers’ ages we will make a strip chart. The number of
+twin and triplet births in the dataset is not that large, so a strip
+chart is a good option here.
Analysis:
+Question 1: Is there a relationship between whether a mother smokes +or not and her baby’s weight at birth?
+To answer this question, we plot the birthweight distributions as +violins, separated by both smoking status and by whether the birth was +regular or premature.
+# The columns `Premie` and `Smoke` are numerical but contain
+# categorical data, so we convert to factors to ensure ggplot
+# treats them correctly
+ggplot(NCbirths, aes(factor(Premie), BirthWeightGm)) +
+ geom_violin(aes(fill = factor(Smoke))) +
+ scale_x_discrete(
+ name = NULL, # remove axis title entirely
+ labels = c("regular birth", "premature birth")
+ ) +
+ scale_y_continuous(
+ name = "Birth weight (gm)"
+ ) +
+ scale_fill_manual(
+ name = "Mother",
+ labels = c("non-smoker", "smoker"),
+ # explicitly assign colors to specific data values
+ values = c(`0` = "#56B4E9", `1` = "#E69F00")
+ ) +
+ theme_bw(12)
+
+There is a clear difference between birthweight for regular and +premature births, and for regular births the birthweight also seems to +be lower when the mother smokes.
+Question 2: How many mothers are smokers or non-smokers?
+To answer this question, we make a simple bar plot of the number of +mothers by smoking status.
+# again, convert `Smoke` into factor so it's categorical
+ggplot(NCbirths, aes(y = factor(Smoke))) +
+ geom_bar() +
+ scale_y_discrete(
+ name = NULL,
+ labels = c("non-smoker", "smoker")
+ ) +
+ scale_x_continuous(
+ # ensure there's no gap between the beginning of the bar
+ # and the edge of the plot panel
+ expand = expansion(mult = c(0, 0.1))
+ ) +
+ theme_bw(12)
+
+The vast majority of mothers in the dataset are non-smokers (almost +1250). Fewer than 250 are smokers.
+Question 3. What are the age distributions of mothers of twins or +triplets?
+To answer this question, we first remove singleton births from the +dataset and then show age distributions as a strip chart.
+NCbirths %>%
+ filter(Plural > 1) %>% # remove singlet births
+ ggplot(aes(x = factor(Plural), y = MomAge)) +
+ geom_point(
+ # jitter horizontally so points don't overlap
+ position = position_jitter(
+ width = 0.2,
+ height = 0
+ ),
+ # it's nice to make points a little bigger and give them some color
+ size = 2,
+ color = "#1E4A7F"
+ ) +
+ scale_x_discrete(
+ name = NULL,
+ labels = c("twins", "triplets")
+ ) +
+ scale_y_continuous(
+ name = "age of mother (years)"
+ ) +
+ theme_bw(12)
+
+Mothers of twins span the entire childbearing range, from 15 years to +approximately 40 years old. By contrast, mothers of triplets tend to be +in their thirties.
+Discussion: The smoking status of the mother appears +to have a small effect on the average birth weight for regular births. +We can see this by comparing the two left-most violins in the first +plot, where we see that they are slightly vertically shifted relative to +each other but have otherwise a comparable shape. However, a much bigger +effect comes from whether the baby is born prematurely or not. Premature +births have on average a much lower birthweight than regular births, and +the variance is also bigger (the two right-most violins are taller than +the two left-most violins). Interestingly, smoking status does not seem +to affect the distribution of birthweights for premature births much. We +can see this from the fact that the two right-most violins look +approximately the same. We would have to run a multivariate statistical +analysis to determine whether any of these observed patterns are +statistically significant.
+There are many more births to non-smoking mothers than to smoking +mothers in the dataset. This is important because it means we have more +complete data for non-smoking mothers. Some of the differences we saw in +the first graph, such as the slightly lower variance in birthweight for +premature births to smoking mothers—as compared to premature births to +non-smoking mothers—may simply be due to a smaller data set.
+When comparing age distributions of mothers of twins or of triplets +we see an unexpected difference. It appears that mothers of all ages, +from teenage moms to moms in their early fourties, all can have twins. +By contrast, only mothers in their thirties appear to have triplets. We +can think of a possible explanation. Twin births happen due to natural +causes and therefore can occur in mothers of all ages. Triplet births, +however, are extremely unlikely to occur naturally, and most commonly +are caused by fertility treatments that cause multiple eggs to mature at +once. It is unlikely that women in their late teens or twenties will +undergo fertility treatment, whereas women in their thirties do so +frequently. We also note, however, that there are only four triplet +births in the dataset, so the lack of younger mothers could be due to +random chance. We would have to perform further analysis or run +statistical tests develop a clearer picture of what mechanisms may have +caused the observed patterns in the data.
+ + + + +Please use the project template R Markdown document to complete your +project. The knitted R Markdown document (as a PDF) and the raw +R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on +Thurs., Feb 15, 2024. These two documents will be +graded jointly, so they must be consistent (as in, don’t change the R +Markdown file without also updating the knitted document!).
+All results presented must have corresponding code, and the +code should be visible in the final generated pdf for ease of grading. +Any answers/results given without the corresponding R code that +generated the result will be considered absent. All code +reported in your final project document should work properly. Please do +not include any extraneous code or code which produces error messages. +(Code which produces warnings is acceptable, as long as you understand +what the warnings mean and explain this.)
+For this project, you will be using an Olympic Games dataset, which +is a compilation of records for athletes that have competed in the +Olympics from Athens 1896 to Rio 2016.
+Each record contains information including the name of the athlete
+(name
), their sex
, their age
,
+their height
, their weight
, their
+team
, their nationality (noc
), the
+games
at which they played, the year
, the
+olympic season
, the city
where the olympics
+took place, the sport
, the name of the event
+(event
), the decade during which the Olympics took place
+(decade
), whether or not the athlete won a gold medal
+(gold
), whether or not the athlete won any medal
+(medalist
) and if the athlete won “Gold”, “Silver”,
+“Bronze” or received “no medal” (medal
). More information
+about the dataset can be found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
We will provide you with specific questions to answer and specific +instructions on how to answer the questions. The project should be +structured as follows:
+We encourage you to be concise. A paragraph should typically not be +longer than 5 sentences.
+You are not required to perform any statistical +tests in this project, but you may do so if you find it helpful to +answer your question.
+In the Introduction section, write a brief introduction to the +dataset, the questions, and what parts of the dataset are necessary to +answer the questions. You may repeat some of the information about the +dataset provided above, paraphrasing on your own terms. Imagine that +your project is a standalone document and the grader has no prior +knowledge of the dataset. You do not need to describe variables that are +never used in your analysis.
+In the Approach section, describe what types of plots you are going +to make to address your questions. For each plot, provide a clear +explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) +is best for providing the information you are asking about. (You can +draw on the materials provided +here for guidance.) All plots should be of different types, +and all should use either color mapping or faceting or +both.
+In the Analysis section, provide the code that generates your plots. +Use scale functions to provide nice axis labels and guides. You are +welcome to use theme functions to customize the appearance of your plot, +but you are not required to do so for this project. All plots +must be made with ggplot2. Do not use base R plotting +functions.
+In the Discussion section, interpret the results of your analysis. +Identify any trends revealed (or not revealed) by the plots. Speculate +about why the data looks the way it does.
+All projects are due by 11:00pm on the day they are due. Projects need to be submitted on Canvas. Please carefully read the submission instructions for each project.
+Materials: +
+Materials:
+ +- [Instructions](assignments/Project_1_instructions.html) +- [Project Template (Rmd)](assignments/Project_1.Rmd) +- [Project Template (HTML)](assignments/Project_1.html) +- [Grading rubric](assignments/Project_1_rubric.pdf) +- [Example project](assignments/Project_1_example.html) + ### Project 2 (due Mar 21, 2023) ### Project 3 (due Apr 18, 2023)