From 2065ac76cf2c7b34da8ca4b481a309eae849db6a Mon Sep 17 00:00:00 2001
From: ytakemon
Date: Wed, 2 Oct 2024 08:55:24 -0700
Subject: [PATCH 1/5] fixes #285

---
 episodes/03-basics-factors-dataframes.Rmd | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/episodes/03-basics-factors-dataframes.Rmd b/episodes/03-basics-factors-dataframes.Rmd
index dff912a6..6a709b19 100644
--- a/episodes/03-basics-factors-dataframes.Rmd
+++ b/episodes/03-basics-factors-dataframes.Rmd
@@ -228,12 +228,13 @@ str(subset)
 
 Ok, thats a lot up unpack! Some things to notice.
 
-- the object type `data.frame` is displayed in the first row along with its
+- The object type `data.frame` is displayed in the first row along with its
   dimensions, in this case 801 observations (rows) and 4 variables (columns)
-- Each variable (column) has a name (e.g. `sample_id`). This is followed
-  by the object mode (e.g. chr, int, etc.). Notice that before each
+- Each variable (column) has a name (e.g. `sample_id`). Notice that before each
   variable name there is a `$` - this will be important later.
-
+- Each variable name is followed by the data type it contains (e.g. chr, int, etc.).
+  The `int` type denotes an integer, a type of numerical data that can only
+  store whole numbers (i.e. no decimal points).
 
 ::::::::::::::::::::::::::::::::::::::: challenge
 

From f11772a87baa315c5b846918d94c3a568efa930c Mon Sep 17 00:00:00 2001
From: ytakemon
Date: Wed, 2 Oct 2024 09:02:06 -0700
Subject: [PATCH 2/5] fixes #284

---
 episodes/03-basics-factors-dataframes.Rmd | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/episodes/03-basics-factors-dataframes.Rmd b/episodes/03-basics-factors-dataframes.Rmd
index 6a709b19..3d5512d8 100644
--- a/episodes/03-basics-factors-dataframes.Rmd
+++ b/episodes/03-basics-factors-dataframes.Rmd
@@ -298,10 +298,19 @@ head(alt_alleles)
 ```
 
 There are 801 alleles (one for each row). To simplify, lets look at just the
-single-nucleotide alleles (SNPs). We can use some of the vector indexing skills
-from the last episode.
+single-nucleotide alleles (SNPs).
+
+Let's review some of the vector indexing skills from the last episode that can help:
 
 ```{r, purl=FALSE}
+# This will find all alleles matching the single nucleotide "A" and provide a TRUE/FALSE vector
+alt_alleles == "A"
+
+# Then, we use that vector as an index to pull out all the elements that match.
+alt_alleles[alt_alleles == "A"]
+
+# If we repeat this for each nucleotide A, T, G, and C, and connect them using `c()`,
+# we can index all the single nucleotide changes.
 snps <- c(alt_alleles[alt_alleles == "A"],
   alt_alleles[alt_alleles=="T"],
   alt_alleles[alt_alleles=="G"],

From 4a9e7bf54156794054546d80786817717d49cb5e Mon Sep 17 00:00:00 2001
From: ytakemon
Date: Wed, 2 Oct 2024 09:07:21 -0700
Subject: [PATCH 3/5] fixes #283

---
 episodes/03-basics-factors-dataframes.Rmd | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/episodes/03-basics-factors-dataframes.Rmd b/episodes/03-basics-factors-dataframes.Rmd
index 3d5512d8..9ae712c0 100644
--- a/episodes/03-basics-factors-dataframes.Rmd
+++ b/episodes/03-basics-factors-dataframes.Rmd
@@ -328,7 +328,13 @@ plot(snps)
 ```
 
 Whoops! Though the `plot()` function will do its best to give us a quick plot,
-it is unable to do so here. One way to fix this it to tell R to treat the SNPs
+it is unable to do so here. Let's use `str()` to see why this might be:
+
+```{r, purl=FALSE}
+str(snps)
+```
+
+R may not know how to plot a character vector! One way to fix this is to tell R to treat the SNPs
 as categories (i.e. a factor vector); we will create a new object to avoid
 confusion using the `factor()` function:
 
@@ -359,9 +365,12 @@ We can see how many items in our vector fall into each category:
 
 ```{r, purl=FALSE}
 summary(factor_snps)
+
+# Compare the character vector
+summary(snps)
 ```
 
-As you can imagine, this is already useful when you want to generate a tally.
+As you can imagine, factors are already useful when you want to generate a tally.
 
 ::::::::::::::::::::::::::::::::::::::::: callout
 

From ff44a81b7bccced4462f8aa74cc16f590e91b2b5 Mon Sep 17 00:00:00 2001
From: ytakemon
Date: Wed, 2 Oct 2024 09:32:55 -0700
Subject: [PATCH 4/5] fixes 291

---
 episodes/03-basics-factors-dataframes.Rmd | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/episodes/03-basics-factors-dataframes.Rmd b/episodes/03-basics-factors-dataframes.Rmd
index 9ae712c0..83ec274e 100644
--- a/episodes/03-basics-factors-dataframes.Rmd
+++ b/episodes/03-basics-factors-dataframes.Rmd
@@ -185,7 +185,15 @@ you have the `variants` object, listed as 801 obs. (observations/rows)
 of 29 variables (columns). Double-clicking on the name of the object will open a
 view of the data in a new tab.
 
-![RStudio data frame view]("fig/rstudio_dataframeview.png")
+![RStudio data frame view]("episodes/fig/rstudio_dataframeview.png")
+
+We can also quickly query the dimensions of the data frame using `dim()`. You'll see that the first number, `801`, is the number of rows, and the second, `29`, is the number of columns.
+
+```{r, purl=FALSE}
+## get the dimensions of a data frame
+
+dim(variants)
+```
 
 ## Summarizing, subsetting, and determining the structure of a data frame.
 
@@ -209,11 +217,15 @@ other variables (e.g. `sample_id`) are treated as characters data (more on this
 in a bit).
 
 There is a lot to work with, so we will subset the first three columns into a
-new data frame using the `data.frame()` function.
+new data frame using the `data.frame()` function. To subset/index a two dimensional
+variable, we need to define them on the appropriate side of the brackets. The left
+hand side of the comma indicates the rows you want to subset, and the right is the
+column position (e.g. ["row index", "column index"]).
 
 ```{r, purl=FALSE}
 ## put the first three columns of variants into a new data frame called subset
-
+## Notice that we are wrapping the numbers in a c() function, to indicate a vector
+## in the right hand side of the comma.
 subset <- data.frame(variants[, c(1:3, 6)])
 ```
 

From 4752cd1596f8df4ea24682c03b8de31dfb787074 Mon Sep 17 00:00:00 2001
From: ytakemon
Date: Wed, 2 Oct 2024 09:35:51 -0700
Subject: [PATCH 5/5] fixes 292

---
 episodes/03-basics-factors-dataframes.Rmd | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/episodes/03-basics-factors-dataframes.Rmd b/episodes/03-basics-factors-dataframes.Rmd
index 83ec274e..20544a80 100644
--- a/episodes/03-basics-factors-dataframes.Rmd
+++ b/episodes/03-basics-factors-dataframes.Rmd
@@ -216,14 +216,15 @@ these columns, as well as mean, median, and interquartile ranges. Many of the
 other variables (e.g. `sample_id`) are treated as characters data (more on this
 in a bit).
 
-There is a lot to work with, so we will subset the first three columns into a
-new data frame using the `data.frame()` function. To subset/index a two dimensional
-variable, we need to define them on the appropriate side of the brackets. The left
-hand side of the comma indicates the rows you want to subset, and the right is the
-column position (e.g. ["row index", "column index"]).
+There is a lot to work with, so we will subset the columns into a new data frame using
+the `data.frame()` function. To subset/index a two-dimensional object, we need to
+place each index on the appropriate side of the comma inside the brackets. The left-hand
+side of the comma indicates the rows you want to subset, and the right-hand side the
+column positions (e.g. ["row index", "column index"]).
+
+Let's put columns 1, 2, 3, and 6 into a new data frame called `subset`:
 
 ```{r, purl=FALSE}
-## put the first three columns of variants into a new data frame called subset
 ## Notice that we are wrapping the numbers in a c() function, to indicate a vector
 ## in the right hand side of the comma.
 subset <- data.frame(variants[, c(1:3, 6)])