Note: this document assumes familiarity with how statistics are represented in Data Commons and with the MCF format.
This tutorial walks through the process of structuring and inserting data into the Data Commons graph.
As a prerequisite, you should understand the dataset, and have an idea of how to map location entities in your dataset to Data Commons entities and measures in your dataset to Data Commons statistical variables.
Begin by making sure you've completed the import document template (as per the README.md) and worked with the DC team to review your design. Once your schema mapping/statistical variable (SV) categories have been reviewed and finalized, proceed to the next section on how to define your SVs.
If you are adding new types of data to the knowledge graph, you might need to define new statistical variables. You can browse all existing variables in the Statistical Variable Explorer.
Statistical variable DCIDs should be human-readable, and each DCID should encapsulate the meaning of its variable's triples. The naming rules are summarized in this doc.
You'll need approval from the Data Commons team on any new variables (see the "Defining location and schema mapping" section here).
When the variables are finalized, they get checked into the schema repo.
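As a rough illustration of what a new variable definition looks like, the sketch below renders a StatisticalVariable node as MCF text. The helper name `stat_var_mcf` and the example variable are hypothetical, and the DCID is passed in by the caller rather than derived, because the authoritative naming rules live in the naming doc referenced above. Treat the property set shown here as an assumption to confirm with the Data Commons team.

```python
def stat_var_mcf(dcid, population_type, measured_property,
                 stat_type="measuredValue", constraints=None):
    """Render one StatisticalVariable node as MCF text.

    `constraints` maps constraint properties to their values,
    for example {"gender": "Female"}. All values are written as
    dcs: references, which is an assumption of this sketch.
    """
    lines = [
        f"Node: dcid:{dcid}",
        "typeOf: dcs:StatisticalVariable",
        f"populationType: dcs:{population_type}",
        f"measuredProperty: dcs:{measured_property}",
        f"statType: dcs:{stat_type}",
    ]
    # Sort constraints so the output is stable across runs.
    for prop, value in sorted((constraints or {}).items()):
        lines.append(f"{prop}: dcs:{value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative variable only; check the Statistical Variable
    # Explorer before defining anything new.
    print(stat_var_mcf("Count_Person_Female", "Person", "count",
                       constraints={"gender": "Female"}))
```

Keeping the definitions in a small script like this makes it easier to regenerate the MCF if the reviewed design changes.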
Template MCF is essentially a mapping file that instructs how to convert the data in a CSV into graph nodes for ingestion into Data Commons. For additional information, read Template MCF.
The raw CSV will often need pre-processing before it can be imported. A simple example cleaning script is here.
There are no restrictions on your approach for this step; the only requirement is that a property value in the TMCF maps to a single CSV column (as illustrated in the examples in MCF format).
The general guidelines are:
- A property in the Template MCF node should have a constant value (like typeOf), a reference to another node (like E:Dataset->E1), or a reference to a CSV column for its value (like C:Dataset->col_name).
- Dates must be in ISO 8601 format: "YYYY-MM-DD", "YYYY-MM", etc.
- References to existing nodes in the graph must be dcids.
- The cleaning script is reproducible and easy to run. Python or Golang is recommended.
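The guidelines above can be sketched as a small, reproducible cleaning script. Everything specific here is an assumption made for illustration: the raw column names (State, Date, Population), the US-style input date format, the cleaned column names, and the tiny place lookup table. A real script would follow the mapping agreed on in your import document.

```python
import csv
from datetime import datetime

# Hypothetical lookup from source place names to Data Commons DCIDs.
PLACE_TO_DCID = {"California": "geoId/06", "Texas": "geoId/48"}

# Hypothetical cleaned schema: one CSV column per TMCF property value.
CLEAN_COLUMNS = ["geo_id", "date", "Count_Person"]


def clean_rows(raw_file, out_file):
    """Copy rows from raw_file to out_file in cleaned form.

    Returns the number of rows skipped because the place name
    could not be mapped to a DCID.
    """
    reader = csv.DictReader(raw_file)
    writer = csv.DictWriter(out_file, fieldnames=CLEAN_COLUMNS,
                            lineterminator="\n")
    writer.writeheader()
    skipped = 0
    for row in reader:
        dcid = PLACE_TO_DCID.get(row["State"].strip())
        if dcid is None:
            skipped += 1
            continue
        # Convert MM/DD/YYYY (assumed source format) to ISO 8601.
        date = datetime.strptime(row["Date"].strip(), "%m/%d/%Y")
        # Drop thousands separators so the value is a plain number.
        population = row["Population"].replace(",", "").strip()
        writer.writerow({
            "geo_id": dcid,
            "date": date.strftime("%Y-%m-%d"),
            "Count_Person": population,
        })
    return skipped


if __name__ == "__main__":
    # File names are placeholders.
    with open("raw.csv", newline="") as src, \
            open("cleaned.csv", "w", newline="") as dst:
        print("rows skipped:", clean_rows(src, dst))
```

Reporting skipped rows instead of silently dropping them makes it easier to spot places that still need a DCID mapping.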
There are a couple of ways to map the statistical variables with TMCF:
- Each StatisticalVariable has its own column for its observed value. So, there are as many TMCF StatVarObservation nodes as variables. For an example, see this TMCF and the corresponding CSV.
- The StatisticalVariable DCIDs are included in CSV values, such that there is a single TMCF StatVarObservation node that points to the variable column. For an example, see this TMCF and the corresponding CSV.
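To make the two layouts concrete, here is a minimal sketch that generates the TMCF text for each one. The function names, the table name Dataset, and the column names (geo_id, date, variable, value) are assumptions chosen for illustration, and the first layout assumes each value column is named after its variable's DCID. Compare the output with the linked examples before relying on it.

```python
def tmcf_per_column(stat_vars, table="Dataset",
                    place_col="geo_id", date_col="date"):
    """Layout 1: one StatVarObservation node per variable column."""
    nodes = []
    for i, sv in enumerate(stat_vars):
        nodes.append("\n".join([
            f"Node: E:{table}->E{i}",
            "typeOf: dcs:StatVarObservation",
            # The variable is a constant reference in this layout.
            f"variableMeasured: dcs:{sv}",
            f"observationAbout: C:{table}->{place_col}",
            f"observationDate: C:{table}->{date_col}",
            # Assumes the value column shares the variable's DCID.
            f"value: C:{table}->{sv}",
        ]))
    return "\n\n".join(nodes) + "\n"


def tmcf_variable_column(table="Dataset", variable_col="variable",
                         value_col="value", place_col="geo_id",
                         date_col="date"):
    """Layout 2: a single node whose variable comes from a CSV column."""
    return "\n".join([
        f"Node: E:{table}->E0",
        "typeOf: dcs:StatVarObservation",
        f"variableMeasured: C:{table}->{variable_col}",
        f"observationAbout: C:{table}->{place_col}",
        f"observationDate: C:{table}->{date_col}",
        f"value: C:{table}->{value_col}",
    ]) + "\n"


if __name__ == "__main__":
    print(tmcf_per_column(["Count_Person", "Count_Person_Female"]))
    print(tmcf_variable_column())
```

The first layout keeps the CSV wide and the TMCF long; the second keeps the TMCF to a single node at the cost of one CSV row per observation.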
TIP: To represent DC strings and repeated values in a CSV field, refer to these CSV Formatting Tips.
Use the dc-import tool to validate the artifacts. When you run it, it will generate report.json and summary_report.html with counters representing warnings/errors and summary statistics.
Create a Pull Request (PR) with the Template MCF file together with the cleaned CSV, its preprocessing script, and the README (template) to https://github.com/datacommonsorg/data under the appropriate scripts/<provenance>/<dataset> subdirectory. If you wrote a script to automate the generation of the TMCF, please also include that.
In the PR, please also include the validation results (report.json and summary_report.html).
If you introduced new statistical variables, please create a Pull Request for them in the schema repo.
After your PR is reviewed and approved, the Data Commons team will work with you to manifest (i.e., generate for inclusion in our model) the artifacts you've created in our internal system. Only after the artifacts are manifested will they be added to the Data Commons graph and become accessible.
In some cases, a dataset is so highly unstructured that it makes sense to skip the Template MCF / CSV approach and directly generate the instance MCF. For example, data from biological sources frequently needs to be directly formatted as MCF.
In this case, the cleaning script should do more of the heavy lifting and generate the instance MCF itself. An example of such a script is here.
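A minimal sketch of that direct approach, assuming the script has already parsed the source into Python dictionaries. The record shape (dcid, type, refs, text), the function name, and the example entities and properties are all hypothetical. The sketch does not handle escaping of quotes inside string values, which a real script would need to address.

```python
def to_instance_mcf(records):
    """Render parsed records as instance MCF text.

    Each record is a dict with:
      dcid: the node's DCID
      type: the node's type, written as a dcs: reference
      refs: optional {property: dcid} links to other nodes
      text: optional {property: string} literal values
    """
    blocks = []
    for rec in records:
        lines = [
            f"Node: dcid:{rec['dcid']}",
            f"typeOf: dcs:{rec['type']}",
        ]
        # References to other graph nodes are written as DCIDs.
        for prop, target in rec.get("refs", {}).items():
            lines.append(f"{prop}: dcid:{target}")
        # String literals are quoted; escaping is not handled here.
        for prop, value in rec.get("text", {}).items():
            lines.append(f'{prop}: "{value}"')
        blocks.append("\n".join(lines))
    # Nodes are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Placeholder entities for illustration only.
    example = [{
        "dcid": "bio/hypotheticalGene1",
        "type": "Gene",
        "refs": {"ofSpecies": "bio/hypotheticalSpecies"},
        "text": {"name": "Hypothetical gene 1"},
    }]
    print(to_instance_mcf(example))
```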