diff --git a/episodes/03-basics-factors-dataframes.Rmd b/episodes/03-basics-factors-dataframes.Rmd
index dff912a6..20544a80 100644
--- a/episodes/03-basics-factors-dataframes.Rmd
+++ b/episodes/03-basics-factors-dataframes.Rmd
@@ -185,7 +185,15 @@ you have the `variants` object, listed as 801 obs. (observations/rows)
 of 29 variables (columns). Double-clicking on the name of the object will open a
 view of the data in a new tab.

-![RStudio data frame view]("fig/rstudio_dataframeview.png")
+![RStudio data frame view]("episodes/fig/rstudio_dataframeview.png")
+
+We can also quickly query the dimensions of the data frame using `dim()`. The first number, `801`, is the number of rows, and the second, `29`, is the number of columns.
+
+```{r, purl=FALSE}
+## get the dimensions (rows, then columns) of a data frame
+
+dim(variants)
+```

 ## Summarizing, subsetting, and determining the structure of a data frame.

@@ -208,12 +216,17 @@ these columns, as well as mean, median, and interquartile ranges.
 Many of the other variables (e.g. `sample_id`) are treated as characters data (more on
 this in a bit).

-There is a lot to work with, so we will subset the first three columns into a
-new data frame using the `data.frame()` function.
+There is a lot to work with, so we will subset the columns into a new data frame using
+the `data.frame()` function. To subset (index) a two-dimensional object, we give the
+row and column indices inside square brackets, separated by a comma. The index to the
+left of the comma selects the rows, and the index to the right selects the columns
+(e.g. `[row index, column index]`).

-```{r, purl=FALSE}
-## put the first three columns of variants into a new data frame called subset
+Let's put columns 1, 2, 3, and 6 into a new data frame called `subset`:

+```{r, purl=FALSE}
+## Notice that we wrap the column numbers in c() to pass them as a vector
+## on the right-hand side of the comma.
 subset <- data.frame(variants[, c(1:3, 6)])
 ```

@@ -228,12 +241,13 @@ str(subset)

 Ok, thats a lot up unpack! Some things to notice.

-- the object type `data.frame` is displayed in the first row along with its
+- The object type `data.frame` is displayed in the first row along with its
   dimensions, in this case 801 observations (rows) and 4 variables (columns)
-- Each variable (column) has a name (e.g. `sample_id`). This is followed
-  by the object mode (e.g. chr, int, etc.). Notice that before each
+- Each variable (column) has a name (e.g. `sample_id`). Notice that before each
   variable name there is a `$` - this will be important later.
-
+- Each variable name is followed by the data type it contains (e.g. chr, int, etc.).
+  The `int` type indicates an integer, a type of numeric data that can only
+  store whole numbers (i.e. no decimal points).

 ::::::::::::::::::::::::::::::::::::::: challenge

@@ -297,10 +311,19 @@ head(alt_alleles)
 ```

 There are 801 alleles (one for each row). To simplify, lets look at just the
-single-nucleotide alleles (SNPs). We can use some of the vector indexing skills
-from the last episode.
+single-nucleotide alleles (SNPs).
+
+Let's review some of the vector indexing skills from the last episode that can help:

 ```{r, purl=FALSE}
+# Test each allele against the single nucleotide "A"; this returns a TRUE/FALSE vector
+alt_alleles == "A"
+
+# Then, we use that logical vector as an index to pull out all the elements that match.
+alt_alleles[alt_alleles == "A"]
+
+# If we repeat this for each nucleotide A, T, G, and C, and combine the results using `c()`,
+# we get all the single-nucleotide changes.
 snps <- c(alt_alleles[alt_alleles == "A"],
   alt_alleles[alt_alleles=="T"],
   alt_alleles[alt_alleles=="G"],
@@ -318,7 +341,13 @@ plot(snps)
 ```

 Whoops! Though the `plot()` function will do its best to give us a quick plot,
-it is unable to do so here. One way to fix this it to tell R to treat the SNPs
+it is unable to do so here.
+Let's use `str()` to see why this might be:
+
+```{r, purl=FALSE}
+str(snps)
+```
+
+R may not know how to plot a character vector! One way to fix this is to tell R to treat the SNPs
 as categories (i.e. a factor vector); we will create a new object to avoid
 confusion using the `factor()` function:

@@ -349,9 +378,12 @@ We can see how many items in our vector fall into each category:

 ```{r, purl=FALSE}
 summary(factor_snps)
+
+# Compare with the summary of the original character vector
+summary(snps)
 ```

-As you can imagine, this is already useful when you want to generate a tally.
+As you can imagine, factors are already useful when you want to generate a tally.

 ::::::::::::::::::::::::::::::::::::::::: callout