Adding a Dataset to Data Commons

This document summarizes the steps involved in adding a dataset to Data Commons (DC). As a prerequisite, please ensure that the DC core team (support@datacommons.org) has approved the addition of the dataset.

Background

The following documents provide background on the data model, format, and workflow:

Summary of data model (DC inherits schema from schema.org)
How statistics is represented in DC
MCF Format
Life of a dataset

Designing location and schema mapping

Data Commons is a single graph that reconciles references to the same entities and concepts across datasets. This linking happens at the time of importing datasets.

The first step is to identify how the locations/places/entities and variables/properties in the dataset will be mapped to Data Commons.

For locations/places, we can use the following (in order of preference): global identifiers (like FIPS), geo info (lat/lng, geo boundary), and qualified names. The approach we use depends on how the locations appear in the dataset.
For variables, we either need to find existing schema in Data Commons (from existing statistical variables here), or add new StatisticalVariable nodes along with core schema (new Class, Property, Enumeration nodes) as necessary.
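
The preference order for locations can be sketched as a small helper that inspects the columns of a cleaned table and picks the first applicable approach. This is only an illustration: the strategy labels and the column names (`fips`, `lat`, `lng`, `geo_boundary`, `place_name`) are hypothetical and are not names required by Data Commons.

```python
from typing import Iterable

# Approaches listed in order of preference. Each approach applies if the table
# contains every column in at least one of its column sets.
# All column names here are hypothetical examples.
STRATEGIES = [
    ("global_id", [{"fips"}]),
    ("geo_info", [{"lat", "lng"}, {"geo_boundary"}]),
    ("qualified_name", [{"place_name"}]),
]


def pick_place_strategy(columns: Iterable[str]) -> str:
    """Return the most preferred mapping approach the columns support."""
    available = {c.strip().lower() for c in columns}
    for strategy, column_sets in STRATEGIES:
        if any(required <= available for required in column_sets):
            return strategy
    raise ValueError("no usable location columns found")


if __name__ == "__main__":
    print(pick_place_strategy(["FIPS", "lat", "lng", "value"]))
    print(pick_place_strategy(["place_name", "value"]))
```

A table that carries both a global identifier and coordinates resolves by the identifier, since it comes first in the preference order.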

This process typically happens in collaboration with the DC core team, and we recommend that you put together a short import document.

Links:

Suggested import document template
Example1
Example2

Preparing artifacts

Once the entity and schema mappings have been finalized, you prepare the artifacts. These include:

StatisticalVariable MCF nodes (if any) checked into the schema repo. These nodes may be written by hand when there are only a handful. Otherwise, they can be generated via scripts.
Template MCF and corresponding cleaned tabular files (typically CSV). Like StatisticalVariable MCF nodes, the Template MCF nodes can be hand-written or script-generated depending on the number of nodes.
Data cleaning code (along with a README) checked into the data repo.
Validation results for the artifacts (from running the dc-import tool).
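
When there are too many StatisticalVariable nodes to write by hand, a short script can emit them. The following is a minimal sketch, not the project's own tooling: the function names and the sample variables are hypothetical, and the node layout (`Node:`, `typeOf`, `populationType`, `measuredProperty`, `statType`, and the `dcid:`/`dcs:` prefixes) is assumed to match what the MCF Format document describes, so please verify it there before checking anything in.

```python
from typing import Dict, List, Optional


def stat_var_node(
    dcid: str,
    population_type: str,
    measured_property: str,
    constraints: Optional[Dict[str, str]] = None,
    stat_type: str = "measuredValue",
) -> str:
    """Render one StatisticalVariable node as MCF text (assumed layout)."""
    lines = [
        f"Node: dcid:{dcid}",
        "typeOf: dcs:StatisticalVariable",
        f"populationType: dcs:{population_type}",
        f"measuredProperty: dcs:{measured_property}",
        f"statType: dcs:{stat_type}",
    ]
    # Constraint properties are sorted so the output is stable across runs.
    for prop, value in sorted((constraints or {}).items()):
        lines.append(f"{prop}: dcs:{value}")
    return "\n".join(lines)


def generate_mcf(specs: List[dict]) -> str:
    """Join nodes with a blank line between them and end with a newline."""
    return "\n\n".join(stat_var_node(**spec) for spec in specs) + "\n"


# Hypothetical sample variables, for illustration only.
SPECS = [
    {"dcid": "Count_Person_Female", "population_type": "Person",
     "measured_property": "count", "constraints": {"gender": "Female"}},
    {"dcid": "Count_Person_Male", "population_type": "Person",
     "measured_property": "count", "constraints": {"gender": "Male"}},
]

if __name__ == "__main__":
    print(generate_mcf(SPECS), end="")
```

Keeping the variable definitions in a plain list like `SPECS` makes the generated nodes easy to review alongside the script in a Pull Request.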

When all the artifacts are ready, please have them reviewed by the DC core team via GitHub Pull Requests. More details on this step are in the Life of a Dataset document.
