layout: default
title: PE4PS
permalink: /pe4ps/


DP4SS: Data Programming for the Social Sciences


A gentle introduction to data programming
in R for social science audiences.



Jesse Lecy
&
Jamison Crawford





CC License


Attribution · NonCommercial · ShareAlike


GitHub
Source Code



This textbook is being developed by adapting lecture notes and resources from a graduate-level introductory course in data science that is offered at the Watts College of Public Service at Arizona State University.

Comments and suggestions are welcome: open a new issue.




CONTENTS:


  • TOC {:toc}




Your Data Science Toolkit

You will need three tools to manage your data science projects: a data programming language (R), a project management interface (R Studio), and a way to create data-driven documents (R Markdown).

  • Installing R and R Studio
  • Tour of R Studio


Getting Started

These are some useful resources and guides for learning how to program if you are new to R or data programming.

Starting to Code

  • Help files
  • Error messages
  • Discussion boards
  • Vocabulary and verbs
  • Learning to Learn R


Using R

Functions, variables, and operators are the core components of any functional programming language. These first chapters are foundational for everything moving forward.

  • Mathematical Operators
  • Objects
  • Assignment
  • Input-Output Devices
  • Arguments
  • Values
  • Returns
  • Logical operators
    • equal
    • not equal
    • greater than or less than
    • opposite of
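
As a minimal sketch of the topics listed above (operators, assignment, function arguments, return values, and logical operators), the following base R lines can be run one at a time in the console. The object names `total`, `heights`, and `avg` are illustrative and do not come from the course notes.

```r
# Mathematical operators follow the usual order of operations
total <- 3 + 4 * 2              # multiplication happens before addition

# Assignment stores a value in a named object
heights <- c(160, 172, 181)

# A function takes arguments and returns a value that can be stored
avg <- mean(x = heights)

# Logical operators return TRUE or FALSE for every element
is_172      <- heights == 172   # equal
not_172     <- heights != 172   # not equal
tall        <- heights > 170    # greater than
not_tall    <- !(heights > 170) # opposite of

print(total)
print(avg)
print(tall)
```

Typing the name of an object, or wrapping it in `print()`, sends its value to the console, which is the simplest output device.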

Special Operators

  • Unique values
  • Duplicates
  • Missing values (NA)
  • Maximum
  • Minimum
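
The items above map onto a handful of base R functions. This is a small sketch on a made-up vector `x` (hypothetical data) showing how missing values interact with the other operations.

```r
x <- c(4, 7, 7, NA, 2, 4)       # illustrative data with one missing value

unique_vals <- unique(x)        # distinct values, in order of first appearance
dup_flags   <- duplicated(x)    # TRUE for each repeat of an earlier value
n_missing   <- sum(is.na(x))    # count of missing values

# max() and min() return NA if any value is missing,
# unless missing values are removed first
x_max_raw <- max(x)
x_max     <- max(x, na.rm = TRUE)
x_min     <- min(x, na.rm = TRUE)
```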


One-Dimensional Datasets

Vectors are the building blocks of analysis in R. Vectors come in a variety of flavors; we cover the four most salient data types here: numeric, character, factor (categorical), and logical (boolean).

  • Vector Types
    • Numeric (v)
    • Character (s)
    • Factor (ordered vs unordered) (f)
    • Logical (true/false) (L)
  • Checking vector types
    • data class
    • data mode
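
A short sketch of the four vector types and the two ways of checking them. The vector names follow the letters used in the list above (`v`, `s`, `f`, `L`); `f_ord` is an extra illustrative name for the ordered factor.

```r
v <- c(1.5, 2, 3)                          # numeric
s <- c("a", "b", "c")                      # character
f <- factor(c("low", "high", "low"))       # unordered factor
f_ord <- factor(c("low", "high", "low"),
                levels = c("low", "high"),
                ordered = TRUE)            # ordered factor
L <- c(TRUE, FALSE, TRUE)                  # logical

# class() reports how R treats the object
class_v   <- class(v)
class_f   <- class(f)
class_ord <- class(f_ord)

# mode() reports how the values are stored; factors are stored as numbers
mode_f <- mode(f)

# Ordered factors support comparisons between levels
low_below_high <- f_ord[1] < f_ord[2]
```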

Converting Data Type

  • Casting
    • explicit casting
    • implicit casting (coercion)
  • Information loss
  • Care with factors
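
The sketch below illustrates explicit casting, implicit coercion, information loss, and the usual factor pitfall. All values are invented for illustration.

```r
# Explicit casting: text that is not a number becomes NA (information loss)
s <- c("10", "20", "thirty")
n <- suppressWarnings(as.numeric(s))

# Truncation is another form of information loss
whole <- as.integer(3.9)

# Implicit casting (coercion): mixing types in c() converts everything
# to the most flexible type, here character
mixed <- c(1, "a", TRUE)

# Care with factors: as.numeric() returns the internal level codes,
# not the labels, so convert to character first
f     <- factor(c("10", "20", "10"))
wrong <- as.numeric(f)
right <- as.numeric(as.character(f))
```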

Variable Transformations

  • Linear transformations
    • vectorized functions
    • recycling rules
  • Recoding values
    • find and replace
    • recoding factors
  • Floors and ceilings
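
A brief sketch of the transformations listed above, using made-up vectors. "Floors and ceilings" could refer either to rounding or to bounding values within a range, so both readings are shown; that interpretation is an assumption.

```r
# Linear transformation: vectorized arithmetic applies to every element
temp_f <- c(32, 50, 212)
temp_c <- (temp_f - 32) * 5 / 9

# Recycling: the shorter vector is repeated to match the longer one
recycled <- c(1, 2, 3, 4) + c(10, 20)

# Recoding values: find and replace with a logical selector
party <- c("dem", "rep", "ind", "dem")
party[party == "ind"] <- "independent"

# Recoding factors: assigning the same label to two levels merges them
edu <- factor(c("hs", "ba", "ma", "ba"))
levels(edu)[levels(edu) %in% c("ba", "ma")] <- "college"

# Rounding down and up
down <- floor(2.7)
up   <- ceiling(2.2)

# Bounding values: nothing below 10, nothing above 100
income <- c(5, 40, 250)
capped <- pmin(pmax(income, 10), 100)
```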


Two-Dimensional Datasets

Vectors typically represent individual variables in the social science context. A dataset contains IDs for individuals, and multiple measures from each individual. Typically data is organized so that columns represent distinct variables and rows represent individuals in the dataset. This spreadsheet representation of data is operationalized as data frames in R. Here you learn how to construct and manipulate data frames.

Dataframes

  • Creating data frames from vectors
    • rows and columns
  • the $ operator
  • Checking and changing class types
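
As a small illustration of the points above, this sketch builds a data frame from three hypothetical vectors, pulls out a column with `$`, and fixes a column that was stored with the wrong class.

```r
id   <- c(1, 2, 3)
name <- c("Ana", "Ben", "Cy")
age  <- c("34", "29", "41")      # numbers stored as text by mistake

# Each vector becomes a column; each position becomes a row
dat <- data.frame(id, name, age, stringsAsFactors = FALSE)
dims <- dim(dat)                 # rows, then columns

# The $ operator extracts one column as a vector
age_class_before <- class(dat$age)

# Change the class by overwriting the column
dat$age <- as.numeric(dat$age)
age_class_after <- class(dat$age)
```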

Dataframe Subsets

  • Filter rows and select columns
    • the [] operator
    • dplyr::filter and dplyr::select
  • Reorder rows or columns
    • sort() versus order()
    • dplyr::arrange
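
The sketch below uses base R only, on an invented data frame, to show row and column subsets and the difference between `sort()` and `order()`. The dplyr equivalents named in the comments are listed for orientation and are not run here.

```r
dat <- data.frame(
  name = c("Ana", "Ben", "Cy", "Dee"),
  age  = c(34, 29, 41, 29),
  city = c("Tempe", "Mesa", "Tempe", "Mesa"),
  stringsAsFactors = FALSE
)

# Filter rows: the selector goes before the comma
older <- dat[dat$age > 30, ]              # like dplyr::filter(dat, age > 30)

# Select columns: the selector goes after the comma
name_age <- dat[, c("name", "age")]       # like dplyr::select(dat, name, age)

# sort() returns the sorted values themselves
sorted_ages <- sort(dat$age)

# order() returns the row positions that would sort the vector,
# which is what you need to reorder a whole data frame
age_positions <- order(dat$age)
by_age <- dat[age_positions, ]            # like dplyr::arrange(dat, age)
```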

Dataframe Constructors

  • Building data objects:
    • data.frame() vs cbind() and rbind()
  • Variable transformations in data frames
    • assignment inside a df: dat$x_squared <- dat$x * dat$x
    • dplyr::mutate() vs dplyr::transmute()
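
One reason to prefer `data.frame()` over `cbind()` when columns have different types is sketched below with hypothetical vectors `x` and `y`: `cbind()` on plain vectors builds a matrix, which can hold only one data type.

```r
x <- c(1, 2, 3)
y <- c("a", "b", "c")

# data.frame() keeps each column's own type
df <- data.frame(x, y, stringsAsFactors = FALSE)

# cbind() on vectors returns a matrix, so the numbers become text
m <- cbind(x, y)

# rbind() stacks rows when column names match
more <- data.frame(x = 4, y = "d", stringsAsFactors = FALSE)
df2  <- rbind(df, more)

# Assignment inside a data frame creates a new column
df2$x_squared <- df2$x * df2$x
```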

Matrices and Lists

  • Matrix
  • Lists
  • Conversions:
    • matrix to df
    • list to df
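
A minimal sketch of the two conversions listed above, with made-up objects. It assumes the list elements all have the same length, which is what allows a list to become a data frame.

```r
# A matrix holds one data type in rows and columns
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

# A list can hold elements of different types and lengths
person <- list(name = "Ana", scores = c(90, 85))
second_score <- person$scores[2]

# Matrix to data frame: each matrix column becomes a variable
m_df <- as.data.frame(m)

# List to data frame: works when every element has the same length
people <- list(name = c("Ana", "Ben"), age = c(34, 29))
people_df <- as.data.frame(people, stringsAsFactors = FALSE)
```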


Data IO

Data import and export [ input / output ].

Navigation

  • Working directories
    • paths: windows v linux
    • current working directory: getwd()
    • change working directory: setwd()
    • check files in directory: dir()
    • create new folder: dir.create("name")
  • Unzip files: unzip("filename")
  • Delete files tutorial
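
The navigation functions above can be tried safely inside R's temporary directory. This is a sketch only: the folder name `pe4ps_demo` and the file `note.txt` are hypothetical, and unzipping is left out.

```r
start_dir <- getwd()                 # remember the current working directory
base <- tempdir()                    # a scratch location that R cleans up

setwd(base)                          # change working directory
dir.create("pe4ps_demo")             # create a new folder

# file.path() builds paths with separators that work on Windows and Linux
note <- file.path("pe4ps_demo", "note.txt")
writeLines("hello", note)

files <- dir("pe4ps_demo")           # check files in the folder

removed <- file.remove(note)         # delete a single file
unlink("pe4ps_demo", recursive = TRUE)  # delete the folder itself

setwd(start_dir)                     # return to where we started
```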

Built-In Datasets

  • Core R datasets
  • Datasets in packages
  • Packages that are data
  • Read options
  • Copy and paste from Excel
  • Using rdata format
  • Read from csv or tsv
  • Read text files
  • Import from Excel
  • Import from common format (foreign package)
  • Import from the web (RCurl)
  • Import from GitHub
  • Import from DropBox
  • [ tutorial ]
  • Write options
    • CSV
    • R Data Sets (RDS)
    • CSV vs RDS
    • Tables
    • RData Format
    • SPSS or Stata
  • Copy to Clipboard
  • Copy to Excel
  • [ tutorial ]
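
To make the CSV versus RDS comparison concrete, here is a small round trip written to temporary files. The data frame is invented; the point of the sketch is that CSV stores plain text and so drops R-specific information such as factor levels, while RDS preserves the object as it was.

```r
# Core R datasets such as mtcars are available without importing anything
n_cars <- nrow(mtcars)

dat <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(dat, csv_path, row.names = FALSE)   # plain text, opens in Excel
saveRDS(dat, rds_path)                        # R-specific binary format

from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)

class_csv <- class(from_csv$group)   # the factor came back as text
class_rds <- class(from_rds$group)   # the factor was preserved
```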




Data Wrangling (dplyr)

Data wrangling is the process of preparing data for analysis, which includes reading data into R from a variety of formats, cleaning data, tidying datasets, creating subsets and filters, transforming variables, grouping data, and joining multiple datasets.

The goal of data wrangling is to create a rodeo dataset (clean and well-structured) that is ready for the big show (modeling and visualization)!

  • Subset operator []
    • by position
    • by name
    • by logical vector
    • with recycling
  • Selector vectors
  • Subset by row
    • dat[ row_selector , ]
    • dplyr::filter( dat, row_selector )
  • Subset by column
    • dat[ , column_selector ]
    • dplyr::select( dat, column_selector )
  • Reorder
    • with index
    • order / match
  • Pipe operator
  • Window vs summary functions
  • dplyr cheat sheet
  • merge() and match()
  • Joins in dplyr: inner_join(), left_join(), right_join(), full_join()
  • inner, outer, right, left
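
The sketch below walks through the four ways of using the subset operator and the base R versions of the join types, on hypothetical objects. The dplyr functions and the pipe operator are not used here so that the example runs without extra packages.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_position <- x[c(1, 3)]
by_name     <- x[c("a", "c")]
by_logical  <- x[x > 15]
recycled    <- x[c(TRUE, FALSE)]   # selector is recycled: keeps 1st and 3rd

people <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"),
                     stringsAsFactors = FALSE)
visits <- data.frame(id = c(2, 3, 4), visits = c(5, 1, 7))

inner <- merge(people, visits, by = "id")                # ids in both
left  <- merge(people, visits, by = "id", all.x = TRUE)  # all of people
right <- merge(people, visits, by = "id", all.y = TRUE)  # all of visits
outer <- merge(people, visits, by = "id", all = TRUE)    # ids in either

# match() returns the position of each visits id within people,
# or NA when there is no match
positions <- match(visits$id, people$id)
```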


Explore and Describe

The first step in the data science process is to get to know your data through descriptive and exploratory analysis that searches for useful patterns or trends. We accomplish this with summary statistics in this section and with visualization in the next.

Summarizing Vectors

  • Counting things:
    • sum( logical statement )
  • Counting missing data:
    • sum( is.na(x) )
  • Categorical data:
    • table( f1, f2 )
    • prop.table() and margin.table()
  • Numeric data: min, max, mean, median, summary, quantile
    • all vectors at once: summary( data.frame )
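
A short sketch of the summaries listed above, using invented vectors `age`, `sex`, and `employed`.

```r
age      <- c(23, 35, NA, 41, 35, 52)
sex      <- c("f", "m", "f", "f", "m", "m")
employed <- c("yes", "yes", "no", "yes", "no", "yes")

# Counting things: TRUE counts as 1, FALSE as 0
n_over_30 <- sum(age > 30, na.rm = TRUE)

# Counting missing data
n_missing <- sum(is.na(age))

# Categorical data: a two-way table of counts
tab <- table(sex, employed)
row_props  <- prop.table(tab, margin = 1)    # proportions within each row
row_totals <- margin.table(tab, margin = 1)  # totals for each row

# Numeric data
age_summary   <- summary(age)
age_quartiles <- quantile(age, probs = c(0.25, 0.75), na.rm = TRUE)

# All vectors at once
all_summary <- summary(data.frame(age, sex, employed))
```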

Summarizing Groups of Vectors

  • table( f1, f2 )
  • ftable( tab, row.vars=c("f1","f2"), col.vars="f3" )
  • Function over groups: tapply( v1, f1, FUN ) or dplyr::group_by() + summarise()
  • Functions over levels of numeric data: tapply( v1, cut(v2, breaks), FUN )
  • Function over two grouping factors: tapply( v1, INDEX=list(f1,f2), FUN ) or dplyr::group_by() + summarise()
  • aggregate( dat, by=list(f1), FUN )
  • https://cran.r-project.org/web/packages/DescTools/vignettes/DescToolsCompanion.pdf
  • Relationship between v1 and v2: cor(), or visually with pairs()
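
The group summaries above are sketched here in base R on a made-up data frame. The column names and the cut points for the age bands are assumptions chosen for illustration.

```r
dat <- data.frame(
  income = c(30, 50, 40, 80, 60, 90),
  sex    = c("f", "m", "f", "m", "f", "m"),
  region = c("east", "east", "west", "west", "east", "west"),
  age    = c(25, 37, 44, 58, 31, 62),
  stringsAsFactors = FALSE
)

# A function over one grouping variable
by_sex <- tapply(dat$income, dat$sex, mean)

# A function over two grouping variables returns a table of group means
by_both <- tapply(dat$income, INDEX = list(dat$sex, dat$region), FUN = mean)

# A function over levels of numeric data: cut() turns age into bands
age_band <- cut(dat$age, breaks = c(0, 40, 100))
by_age   <- tapply(dat$income, age_band, mean)

# aggregate() returns the same group means as a data frame
agg <- aggregate(dat$income, by = list(sex = dat$sex), FUN = mean)

# ftable() flattens a multi-way table of counts
flat <- ftable(xtabs(~ sex + region, data = dat),
               row.vars = "sex", col.vars = "region")

# Relationship between two numeric variables
r <- cor(dat$income, dat$age)
```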


Efficient Analysis With Groups

As you become proficient with descriptive analysis you will want to find ways to be more efficient. Unless you learn how to scale data exploration and modeling you will not be able to quickly identify patterns in your data. The most efficient way to scale your analysis is to understand the dimensionality or internal problem space in your data, and use apply functions in R to replicate analysis over many groups at once.

  • Logical statements
    • define group criteria
    • TRUE signifies membership
  • Group constructors
    • from categorical variables
    • from numeric variables
    • from strings
    • from missing values
  • Compound logical statements: AND and OR
  • Casting logical vectors
  • Combining factors and numeric data for analysis
  • Faceting in plots
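
A logical statement defines a group: each TRUE marks a member. The sketch below builds groups from each kind of variable listed above, using an invented data frame; the pattern `"^Dr\\."` is just one illustrative way to define a group from strings.

```r
dat <- data.frame(
  age  = c(17, 25, 68, NA, 40),
  city = c("Tempe", "Mesa", "Tempe", "Phoenix", "Mesa"),
  name = c("Dr. Lee", "Ana", "Dr. Roy", "Ben", "Cy"),
  stringsAsFactors = FALSE
)

in_tempe    <- dat$city == "Tempe"          # from a categorical variable
adult       <- dat$age >= 18                # from a numeric variable
doctor      <- grepl("^Dr\\.", dat$name)    # from strings
age_missing <- is.na(dat$age)               # from missing values

# Compound logical statements
tempe_doctor  <- in_tempe & doctor                          # AND
tempe_or_mesa <- dat$city == "Tempe" | dat$city == "Mesa"   # OR

# Casting a logical vector gives 1 for members and 0 for non-members
tempe_dummy <- as.numeric(in_tempe)
```

Note that `adult` is NA where age is missing, because R cannot tell whether that person belongs to the group.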

Counting Group Members

  • Mathematical operators with logical vectors
    • counts of members: sum( L1 )
    • proportions of members: mean( L1 )
  • Conditional proportions
    • subset then tabulate
    • logical statement in numerator and denominator
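
Because TRUE counts as 1 and FALSE as 0, `sum()` counts group members and `mean()` gives their proportion. The sketch below, on two hypothetical logical vectors, computes a conditional proportion both ways named above.

```r
female   <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
employed <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

n_female <- sum(female)    # count of members
p_female <- mean(female)   # proportion of members

# Conditional proportion: what share of women are employed?

# Option 1: subset first, then tabulate
employed_women <- employed[female]
p_subset <- mean(employed_women)
tab <- prop.table(table(employed_women))

# Option 2: logical statements in the numerator and the denominator
p_ratio <- sum(female & employed) / sum(female)
```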

The Mathematics of Groups

  • Group structure
    • generalizing logical statements
  • Group dimensionality
    • how many unique groups are in the data?
    • combinatorics of attributes
    • total groups from f1 and f2 = nlevels(f1) * nlevels(f2)
  • Groups as problem spaces
    • complexity theory
    • search
    • dimension reduction
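
As a rough illustration of group dimensionality, this sketch compares the number of groups that are possible with the number that actually occur. The factors `f1` and `f2` are made up.

```r
f1 <- factor(c("f", "m", "f", "m"))
f2 <- factor(c("east", "west", "south", "east"))

# Combinatorics of attributes: every level of f1 paired with every level of f2
possible_groups <- nlevels(f1) * nlevels(f2)

# How many unique groups are in the data?
observed_groups <- nrow(unique(data.frame(f1, f2)))
```

The gap between the two numbers grows quickly as more grouping variables are added, which is why the size of the problem space matters.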

Analysis with Groups

  • Contingency tables
    • counts of members: f1 * f2
  • Statistics by group
    • function applied over a group: v1 ~ f1 * f2
    • apply() functions
    • dplyr group_by() and summarize() functions

Latent Groups

  • clustering
  • unsupervised learning approaches


Visualization

For a great overview with examples of R code:

Wilke, C. O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O'Reilly Media. FREE EBOOK

  • Ground, figure, narrative (context, subject, action)
  • Tufte’s rules
  • Visual tragedies
  • plot() function
  • Arguments:
    • plot point types
    • colors
    • size
    • axis labels
    • plot title
  • Defining a canvas: xlim, ylim
  • Adding data
  • Type (point, line, both)
  • Symbols
  • Color
  • Size
  • Adding grids
  • Adding axes
  • Adding titles / axes labels
  • Adding data labels: text()
  • Margins
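
The layered approach listed above (define a canvas, then add data, grids, axes, and labels) can be sketched with base graphics. The data and label text are invented, and the plot is written to a temporary PDF only so the example runs without a screen; in an interactive session the `pdf()` and `dev.off()` lines can be dropped.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 3, 6, 8)

out <- tempfile(fileext = ".pdf")
pdf(out)

par(mar = c(5, 4, 4, 2))             # margins: bottom, left, top, right

# Define an empty canvas with explicit limits, labels, and a title
plot(x, y, type = "n", axes = FALSE,
     xlim = c(0, 6), ylim = c(0, 10),
     xlab = "x value", ylab = "y value", main = "Building a plot in layers")

grid()                               # add a grid
axis(side = 1)                       # add the x axis
axis(side = 2)                       # add the y axis

# Add data: type "b" draws both points and lines
points(x, y, type = "b", pch = 19, col = "maroon", cex = 1.5)

# Add data labels just above each point
text(x, y, labels = y, pos = 3)

dev.off()
```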

Colors in R

Advanced Plot Features

  • Grammar of graphics concept
  • ggplot overview

Animations



Dynamic Documents

  • What makes documents dynamic?
  • Widgets
  • Render functions
  • Reactive functions
  • [ tutorial ]
  • Principles of good dashboard design
  • Layouts
  • Sidebars
  • Value boxes
  • [ demo RMD ]

Customizing Styles

  • CSS: cascading style sheets


