Data

We have both CSV and RDS (R data set) files available in the DATA section of this GitHub repository. They can be loaded directly into R as follows.

Read CSV version:

dat <- read.csv( "https://github.com/Nonprofit-Open-Data-Collective/machine_learning_mission_codes/blob/master/DATA/MISSION.csv?raw=true", stringsAsFactors=FALSE )

Read RDS version:

dat <- readRDS( gzcon( url( "https://github.com/Nonprofit-Open-Data-Collective/machine_learning_mission_codes/blob/master/DATA/MISSION.rds?raw=true" )))



Overview of the Training Dataset

[need to add]: how the sample was created, what it represents, and why a test sample is useful for benchmarking and replication.

Garbage in, garbage out discussion:

  • quality of program / mission descriptions
  • quality of activity codes

See the Taxonomy section for activity codes.

IRS versus human coding...(validity and reliability of taxonomies)

Why Use a Common Replication Dataset?

The goal of this project is to create a training dataset that can serve as a reference point for the performance of program activity classification algorithms that rely on the types of text readily available on websites, grant applications, tax forms, or annual reports.

The creation of a reference dataset allows for innovation and progress since the relative performance of algorithms can be compared when they are applied to the same dataset. Performance metrics are difficult to interpret if they are drawn from different underlying data sources.

The field of social network analysis provides some examples of this approach by benchmarking the performance of clustering algorithms using a small set of canonical datasets.

Agarwal, G., & Kempe, D. (2008). Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B, 66(3), 409-418.
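To make the benchmarking argument concrete, here is a minimal sketch in R that scores two hypothetical classifiers against the same held-out labels. The label vector, both prediction vectors, and the `accuracy()` helper are invented for illustration and are not drawn from the MISSION dataset.

```r
# Toy held-out labels for one binary mission code (hypothetical values)
truth  <- c(1, 0, 1, 1, 0, 0, 1, 0)

# Predictions from two hypothetical algorithms on the SAME eight cases
pred_a <- c(1, 0, 1, 0, 0, 0, 1, 1)
pred_b <- c(1, 0, 1, 1, 0, 1, 1, 0)

# Share of cases where the prediction matches the label
accuracy <- function( pred, truth ) mean( pred == truth )

scores <- c( A = accuracy( pred_a, truth ), B = accuracy( pred_b, truth ) )
print( scores )
```

Because both algorithms are scored on identical cases, the gap between the two numbers reflects the algorithms rather than differences in the underlying data.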

Raw Data Sources

We have built a training dataset using data from two primary sources:

The IRS E-File database contains machine-readable text fields on nonprofit names, mission statements, and program service accomplishments.

The IRS 1023-EZ files contain mission taxonomy codes for the traditional National Taxonomy of Exempt Entities (NTEE), as well as eight binary mission codes related to nonprofit purpose such as religious activities, scientific activities, recreational activities, or welfare activities.

See the taxonomy section of this site for more information.
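As a rough illustration of what binary mission codes look like as training labels, the sketch below builds a toy data frame. The column names and values are hypothetical (check `names(dat)` in the real file), and it assumes that a nonprofit can carry more than one code at a time.

```r
# Hypothetical column names and values, for illustration only
toy <- data.frame(
  mission    = c( "Youth soccer league",
                  "Food pantry and church outreach",
                  "Marine biology research" ),
  religious  = c( 0, 1, 0 ),
  scientific = c( 0, 0, 1 ),
  recreation = c( 1, 0, 0 ),
  welfare    = c( 0, 1, 0 ),
  stringsAsFactors = FALSE
)

code_cols <- c( "religious", "scientific", "recreation", "welfare" )

# Share of organizations carrying each code
prevalence <- colMeans( toy[ , code_cols ] )

# Number of codes attached to each organization
n_codes <- rowSums( toy[ , code_cols ] )

print( prevalence )
print( n_codes )
```

Under that assumption each code is its own yes/no prediction target, so a classifier would be evaluated separately per code rather than on a single multi-class label.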

Available Mission and Activity Text

Text-based data describing nonprofit activities.

  • Nonprofit name: Form 990 and 990-EZ, header
  • Nonprofit missions on IRS forms: Form 990, Part I, Line 1; Form 990-EZ, Part III, Line 0
  • Program service accomplishments: Form 990, Part III; Form 990-EZ, Part III
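One common way to use the three text fields listed above is to combine them into a single normalized string per organization before building features. The sketch below shows that idea with base R only; the field names, the example values, and the `clean_text()` helper are hypothetical and do not describe the project's actual preprocessing.

```r
# Hypothetical field names and values, for illustration only
org <- data.frame(
  name    = "Friends of the Riverside Library, Inc.",
  mission = "To support literacy & lifelong learning.",
  program = "Funded 12 after-school reading programs.",
  stringsAsFactors = FALSE
)

# Lowercase, replace anything that is not a letter, digit, or space
# with a space, then collapse repeated whitespace
clean_text <- function( x ) {
  x <- tolower( x )
  x <- gsub( "[^a-z0-9 ]", " ", x )
  x <- gsub( "\\s+", " ", x )
  trimws( x )
}

org$text <- clean_text( paste( org$name, org$mission, org$program ) )
print( org$text )
```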

Raw Mission Data

The nonprofit mission data comes from the new IRS e-file data available on AWS as XML files.

library( xmltools )
library( purrr )
library( xml2 )
library( dplyr )

# source build functions
source( "https://raw.githubusercontent.com/Nonprofit-Open-Data-Collective/irs-990-efiler-database/master/BUILD_SCRIPTS/build_efile_database_functions.R" )

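# build an index of the available e-filed returns (buildIndex() comes from the sourced script)
dat <- buildIndex()

# count returns by form type and tax year
table( dat$FormType, dat$TaxYear )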
         2009    2010    2011    2012    2013    2014    2015    2016    2017
990    33,360 123,107 159,539 179,674 198,738 218,614 232,975 214,585  25,921
990EZ  15,500  63,253  82,066  93,769 104,538 116,461 124,507 121,530  28,767
990PF   2,352  25,275  34,597  39,936  45,897  53,443  58,724  60,305  20,608
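Once an index like this exists, a typical next step is to narrow it to the returns of interest. The sketch below uses a small stand-in data frame so that it runs offline. Only the `FormType` and `TaxYear` columns appear in the output above; the `URL` column and its values are hypothetical.

```r
# Toy stand-in for the index returned by buildIndex()
index <- data.frame(
  FormType = c( "990", "990EZ", "990PF", "990", "990EZ" ),
  TaxYear  = c( 2015, 2015, 2015, 2016, 2016 ),
  URL      = paste0( "https://example.org/return_", 1:5, ".xml" ),  # hypothetical
  stringsAsFactors = FALSE
)

# Keep Form 990 and 990-EZ filings (the forms listed above as carrying
# mission text) for a single tax year
keep <- index$FormType %in% c( "990", "990EZ" ) & index$TaxYear == 2015
subset_2015 <- index[ keep, ]

table( subset_2015$FormType, subset_2015$TaxYear )
```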

XML Tools in R

If you want to work with the data directly you will need to use some XML tools.

Quick Guide to Working with XML in R
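As a minimal sketch of what working with the XML directly involves, the example below parses a small inline document with the xml2 package and pulls out two text fields. The XML is a simplified stand-in, not a real return, and the element names (`ActivityOrMissionDesc`, `BusinessNameLine1Txt`) and namespace are assumptions that may differ across schema versions.

```r
library( xml2 )

# Simplified stand-in for an e-filed return; element names are assumed
xml_txt <- '<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <Filer>
      <BusinessName>
        <BusinessNameLine1Txt>EXAMPLE ORG</BusinessNameLine1Txt>
      </BusinessName>
    </Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990>
      <ActivityOrMissionDesc>Provide meals to seniors.</ActivityOrMissionDesc>
    </IRS990>
  </ReturnData>
</Return>'

doc <- read_xml( xml_txt )

# Drop the default namespace (modifies doc in place) so plain XPath works
xml_ns_strip( doc )

org_name <- xml_text( xml_find_first( doc, "//Filer//BusinessNameLine1Txt" ) )
mission  <- xml_text( xml_find_first( doc, "//IRS990/ActivityOrMissionDesc" ) )

print( org_name )
print( mission )
```

Stripping the namespace keeps the XPath expressions short. The alternative is to keep the namespace and prefix every element in the query.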

Build Custom Databases

You can build custom datasets from the IRS XML fields. Some sample scripts are available here:

Nonprofit Open Data Collective

Many of the tables are also available in CSV format on our data.world group: https://data.world/activity/npdata