layout | title | permalink |
---|---|---|
default | PE4PS | /pe4ps/ |
A gentle introduction to data programming in R for social science audiences.
Jesse Lecy & Jamison Crawford
Attribution · NonCommercial · ShareAlike
Source Code
This textbook is being developed by adapting lecture notes and resources from a graduate-level introductory course in data science offered at the Watts College of Public Service at Arizona State University.
Comments and suggestions are welcome!
CONTENTS:
- TOC {:toc}
You will need three tools to manage your data science projects: a data programming language (R), a project management interface (R Studio), and a way to create data-driven documents (R Markdown).
- What is R? [ video ]
- Packages
- Installing R and R Studio
- Tour of R Studio
- Automation & Flexibility
- The Importance of Reproducibility
- Formats
- Gallery
- R Markdown Formats overview
- Headers and Chunks
- Knitting
- Customization
These are some useful resources and guides for learning how to program if you are new to R or data programming.
- Help files
- Error messages
- Discussion boards
- Vocabulary and verbs
- Learning to Learn R
Functions, variables, and operators are the core components of any functional programming language. These first chapters are foundational for everything moving forward.
- Mathematical Operators
- Objects
- Assignment
- Input-Output Devices
- Arguments
- Values
- Returns
- Logical operators
- equal
- not equal
- greater than or less than
- opposite of
- Unique values
- Duplicates
- Missing values (NA)
- Maximum
- Minimum
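As a rough illustration of the operators and functions listed above, the sketch below uses a small made-up vector. The name `age` and its values are assumptions chosen for the example, not data from the course.

```r
# Hypothetical data: ages of six respondents, one of them missing
age <- c( 23, 35, 35, 41, NA, 58 )

# Logical operators return one TRUE / FALSE / NA value per element
age == 35        # equal
age != 35        # not equal
age > 40         # greater than
!( age > 40 )    # opposite of

# Unique values, duplicates, and missing values
unique( age )
duplicated( age )   # TRUE marks the second occurrence of 35
is.na( age )

# max() and min() return NA when a value is missing,
# unless the missing values are removed first
max( age )
max( age, na.rm = TRUE )
min( age, na.rm = TRUE )
```

Note that comparisons involving `NA` return `NA` rather than `FALSE`, which is why missing values need explicit handling.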
Vectors are the building blocks of analysis in R. Vectors come in a variety of flavors - we cover the four most salient data types here: numbers, characters, categories, and logical or boolean.
- Vector Types
- Numeric (v)
- Character (s)
- Factor (ordered vs unordered) (f)
- Logical (true/false) (L)
- Checking vector types
- data class
- data mode
- Casting
- explicit casting
- implicit casting (coercion)
- Information loss
- Care with factors
- Linear transformations
- vectorized functions
- recycling rules
- Recoding values
- find and replace
- recoding factors
- Floors and ceilings
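The vector topics above can be sketched with a few lines of base R. The object names follow the letters used in the list (`v`, `s`, `f`, `L`); the values themselves are invented for illustration.

```r
# Hypothetical vectors, one of each type
v <- c( 10, 20, 30, 40 )                      # numeric
s <- c( "a", "b", "c", "d" )                  # character
f <- factor( c( "10", "20", "20", "40" ) )    # factor
L <- v > 15                                   # logical

# Checking vector types
class( v ); class( s ); class( f ); class( L )

# Explicit casting
as.character( v )
as.numeric( L )      # TRUE becomes 1, FALSE becomes 0

# Care with factors: as.numeric() returns the internal level codes,
# so convert to character first to recover the labels as numbers
codes  <- as.numeric( f )
labels <- as.numeric( as.character( f ) )

# Vectorized arithmetic and recycling
doubled  <- v * 2
recycled <- v + c( 1, 2 )    # the shorter vector is reused: 1, 2, 1, 2

# Recoding values: find and replace
s[ s == "a" ] <- "z"
```

The factor example is the one most likely to cause silent errors in practice, since `as.numeric( f )` runs without any warning.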
Vectors typically represent individual variables in the social science context. A dataset contains IDs for individuals, and multiple measures from each individual. Typically data is organized so that columns represent distinct variables and rows represent individuals in the dataset. This spreadsheet representation of data is operationalized as data frames in R. Here you learn how to construct and manipulate data frames.
- Creating data frames from vectors
- rows and columns
- the `$` operator
- Checking and changing class types
- Filter rows and select columns
- the `[]` operator
- `dplyr::filter` and `dplyr::select`
- Reorder rows or columns
- `sort()` versus `order()`
- `dplyr::arrange`
- Building data objects: `data.frame()` vs `cbind()` and `rbind()`
- Variable transformations in df's
- assignment inside a df: `dat$x_squared <- x * x`
- `dplyr::mutate` vs `dplyr::transmute()`
- Matrix
- Lists
- Conversions:
- matrix to df
- list to df
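A minimal base R sketch of the data frame operations listed above. The variables `id`, `group`, and `income` are hypothetical, and the dplyr equivalents are left out so the example runs without extra packages.

```r
# Hypothetical vectors describing four individuals
id     <- c( 1, 2, 3, 4 )
group  <- c( "treat", "control", "treat", "control" )
income <- c( 52000, 31000, 47000, 64000 )

# Creating a data frame from vectors
dat <- data.frame( id, group, income, stringsAsFactors = FALSE )

dim( dat )       # number of rows and columns
dat$income       # the $ operator extracts one column as a vector

# Filter rows and select columns with the [] operator
treated <- dat[ dat$group == "treat" , ]
subset  <- dat[ , c( "id", "income" ) ]

# Reorder rows: order() returns row positions, sort() returns sorted values
sorted <- dat[ order( dat$income ) , ]

# Variable transformation by assignment inside the data frame
dat$income_k <- dat$income / 1000

# Conversions: matrix to data frame, list to data frame
m       <- as.matrix( dat[ , c( "id", "income" ) ] )
from_m  <- as.data.frame( m )
from_l  <- as.data.frame( list( x = 1:3, y = c( "a", "b", "c" ) ) )
```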
Data import and export [ input / output ].
- Working directories
- paths: windows v linux
- current working directory: `getwd()`
- change working directory: `setwd()`
- check files in directory: `dir()`
- create new folder: `dir.create("name")`
- Unzip files: `unzip("filename")`
- Delete files tutorial
- Core R datasets
- Datasets in packages
- Packages that are data
- Read options
- Copy and paste from Excel
- Using rdata format
- Read from csv or tsv
- Read text files
- Import from Excel
- Import from common format (foreign package)
- Import from the web (RCurl)
- Import from GitHub
- Import from DropBox
- [ tutorial ]
- Write options
- CSV
- R Data Sets (RDS)
- CSV vs RDS
- Tables
- RData Format
- SPSS or Stata
- Copy to Clipboard
- Copy to Excel
- [ tutorial ]
- What is an API?
- Examples
- Census
- Socrata
- [ Demo with DataUSA API ]
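The sketch below walks through a small export and import round trip in base R, touching the directory functions and the CSV vs RDS comparison from the list above. The folder name `pe4ps_demo` and the toy data frame are assumptions made for the example, and everything is written to a temporary directory so no existing files are affected.

```r
# Hypothetical folder inside R's temporary directory
folder <- file.path( tempdir(), "pe4ps_demo" )
dir.create( folder, showWarnings = FALSE )

# Toy data with a factor column
dat <- data.frame( id = 1:3, group = factor( c( "a", "b", "a" ) ) )

csv_path <- file.path( folder, "dat.csv" )
rds_path <- file.path( folder, "dat.rds" )

# Write options: CSV and RDS
write.csv( dat, csv_path, row.names = FALSE )
saveRDS( dat, rds_path )

dir( folder )    # check files in the directory

# Read options
from_csv <- read.csv( csv_path )
from_rds <- readRDS( rds_path )

# CSV vs RDS: RDS keeps R classes such as factors intact, while a CSV
# stores plain text, so the column class depends on how it is read back
class( from_csv$group )
class( from_rds$group )
```

Building paths with `file.path()` sidesteps the Windows versus Linux separator differences noted in the list.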
Data wrangling is the process of preparing data for analysis, which includes reading data into R from a variety of formats, cleaning data, tidying datasets, creating subsets and filters, transforming variables, grouping data, and joining multiple datasets.
The goal of data wrangling is to create a rodeo dataset (clean and well-structured) that is ready for the big show (modeling and visualization)!
- Subset operator `[]`
- by position
- by name
- by logical vector
- with recycling
- Selector vectors
- Subset by row: `dat[ row_selector , ]` or `dplyr::filter( dat, row_selector )`
- Subset by column: `dat[ , column_selector ]` or `dplyr::select( dat, column_selector )`
- Reorder
- with index
- order / match
- Pipe operator
- Window vs summary functions
- dplyr cheat sheet
- `merge()` and `match()`
- joins in dplyr: inner (`inner_join()`), outer (`full_join()`), right (`right_join()`), left (`left_join()`)
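To make the subsetting and joining items above concrete, here is a small base R sketch. The two tables `people` and `visits` are hypothetical, and `merge()` is used in place of the dplyr join functions so that the example needs no extra packages.

```r
# Hypothetical tables that share an id column
people <- data.frame( id = c( 1, 2, 3, 4 ), age  = c( 25, 40, 33, 58 ) )
visits <- data.frame( id = c( 2, 3, 3, 5 ), cost = c( 100, 50, 75, 20 ) )

# Subset by row with a logical selector vector
row_selector <- people$age > 30
older <- people[ row_selector , ]

# Subset by column, by name (drop = FALSE keeps a data frame)
ages <- people[ , "age", drop = FALSE ]

# Joins with merge()
inner <- merge( people, visits, by = "id" )                 # ids found in both tables
left  <- merge( people, visits, by = "id", all.x = TRUE )   # keep every row of people
outer <- merge( people, visits, by = "id", all = TRUE )     # keep every id from either table

# match() returns the position of each visit id in people$id (NA if absent)
pos <- match( visits$id, people$id )
```

Comparing `nrow()` across the three merged tables is a quick check on which records each type of join keeps or drops.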
The first step in the data science process is to get to know your data through descriptive analysis and exploratory analysis that searches for useful patterns or trends. We accomplish this through summary statistics in this section and through visualization in the next.
- Counting things: `sum( logical statement )`
- Counting missing data: `sum( is.na(x) )`
- Categorical data: `table( f1, f2 )`, `prop.table()` and `margin.table()`
- Numeric data: min, max, mean, median, summary, quantile
- all vectors at once: `summary( data.frame )`
- `table( f1, f2 )`
- `ftable( row.vars=c("f1","f2"), col.vars="f3" )`
- Function over groups: `tapply( v1, f1 )` or `dplyr::group_by()` + `summarise()`
- Functions over levels of numeric data: `tapply( v1, cut(v2) )`
- `tapply( v1, INDEX=list(f1,f2) )` or `dplyr::group_by()` + `summarise()`
- `aggregate( dat, by=list(f1), FUN )`
- https://cran.r-project.org/web/packages/DescTools/vignettes/DescToolsCompanion.pdf
- v1, v2 using `cor()` or visually with `pairs()`
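The descriptive functions listed above can be tried on a toy dataset. The sketch below reuses the placeholder names from the list (`v1`, `v2`, `f1`, `f2`) with invented values, and the `breaks` passed to `cut()` are an arbitrary choice for the example.

```r
# Hypothetical data: two numeric vectors and two factors
v1 <- c( 10, 20, NA, 40, 50, 60 )
v2 <- c( 1, 2, 3, 4, 5, 6 )
f1 <- factor( c( "a", "a", "b", "b", "a", "b" ) )
f2 <- factor( c( "x", "y", "x", "y", "x", "y" ) )

# Counting things and counting missing data
n_large   <- sum( v1 > 30, na.rm = TRUE )
n_missing <- sum( is.na( v1 ) )

# Categorical data
t <- table( f1, f2 )
prop.table( t )          # cell proportions
margin.table( t, 1 )     # row totals

# Numeric data
summary( v1 )
quantile( v1, na.rm = TRUE )

# Function over groups
group_means <- tapply( v1, f1, mean, na.rm = TRUE )

# Function over levels of numeric data: bin v2, then summarize v1 within bins
bin_means <- tapply( v1, cut( v2, breaks = c( 0, 3, 6 ) ), mean, na.rm = TRUE )

# Relationship between v1 and v2
r <- cor( v1, v2, use = "complete.obs" )
```

Most of these functions return `NA` when the input contains missing values, so the `na.rm` and `use` arguments matter as soon as real data is involved.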
As you become proficient with descriptive analysis you will want to find ways to be more efficient. Unless you learn how to scale data exploration and modeling you will not be able to quickly identify patterns in your data. The most efficient way to scale your analysis is to understand the dimensionality or internal problem space in your data, and use apply functions in R to replicate analysis over many groups at once.
- Logical statements
- define group criteria
- TRUE signifies membership
- Group constructors
- from categorical variables
- from numeric variables
- from strings
- from missing values
- Compound logical statements: AND and OR
- Casting logical vectors
- Combining factors and numeric data for analysis
- Faceting in plots
- Mathematical operators with logical vectors
- counts of members: sum( L1 )
- proportions of members: mean( L1 )
- Conditional proportions
- subset then tabulate
- logical statement in numerator and denominator
- Group structure
- generalizing logical statements
- Group dimensionality
- how many unique groups are in the data?
- combinatorics of attributes
- total groups from f1 and f2 = `nlevels(f1) * nlevels(f2)`
- Groups as problem spaces
- complexity theory
- search
- dimension reduction
- Contingency tables
- counts of members: `f1 * f2`
- Statistics by group
- function applied over a group: `v1 ~ f1 * f2`
- `apply()` functions
- dplyr `group_by()` and `summarize()` functions
- clustering
- unsupervised learning approaches
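The group logic outlined above (logical statements defining membership, counts and proportions, group dimensionality, statistics by group) might look like the following in base R. The factors `f1` and `f2`, the numeric vector `v1`, and all values are hypothetical.

```r
# Hypothetical data: two categorical attributes and one numeric outcome
f1 <- factor( c( "urban", "rural", "urban", "urban", "rural", "urban" ) )
f2 <- factor( c( "low", "low", "high", "high", "high", "low" ) )
v1 <- c( 12, 15, 30, 28, 25, 10 )

# A logical statement defines the group: TRUE signifies membership
L1 <- f1 == "urban"
n_urban <- sum( L1 )     # count of members
p_urban <- mean( L1 )    # proportion of members

# Compound logical statement with AND
L2 <- f1 == "urban" & f2 == "high"

# Conditional proportion, computed two ways:
# logical statements in the numerator and denominator
p_high_given_urban <- sum( L2 ) / sum( L1 )
# subset first, then tabulate
p_check <- mean( f2[ L1 ] == "high" )

# Group dimensionality: total possible groups from f1 and f2
n_groups <- nlevels( f1 ) * nlevels( f2 )

# Contingency table of counts, and a statistic applied over each group
counts <- table( f1, f2 )
means  <- tapply( v1, list( f1, f2 ), mean )
```

The number of cells in `counts` equals `n_groups`, which shows how quickly the problem space grows as more grouping variables are added.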
For a great overview with examples of R code:
Wilke, C. O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O'Reilly Media. FREE EBOOK
- Ground, figure, narrative (context, subject, action)
- Tufte’s rules
- Visual tragedies
- `plot()` function
- Arguments:
- plot point types
- colors
- size
- axis labels
- plot title
- Defining a canvas: xlim, ylim
- Adding data
- Type (point, line, both)
- Symbols
- Color
- Size
- Adding grids
- Adding axes
- Adding titles / axes labels
- Adding data labels: text()
- Margins
- select by name:
- color theory
- value
- shade, tint, tone
- hue, saturation
- transparency
- color values
- color functions
- Custom fonts
- Math symbols
- Multiple plots (core graphics)
- Custom graph layouts
- Grammar of graphics concept
- ggplot overview
- What makes documents dynamic?
- Widgets
- input objects
- Widgets Gallery
- Render functions
- Reactive functions
- [ tutorial ]
- Principles of good dashboard design
- Layouts
- Sidebars
- Value boxes
- [ demo RMD ]
- CSS: cascading style sheets