Extract or create metadata on publications and link to datasets. Results get exported to RCPublications.
We are currently prioritizing research publications that describe use cases for datasets, since that is what our intended recommendations need to focus on. For example, a use case for NOAA datasets is coastal flooding and what the results mean for municipalities, states, etc. The publications we add should lean toward how data are used for policy and decision making.
Clone https://github.com/NYU-CI/RCDatasets.
Metadata, primarily linkages between datasets and publications, will come from our partners and clients. We want to capture dataset and publication metadata, including linkages to the datasets that we are enumerating in datasets.json.
- In this repo, in /metadata, create a subfolder for the drop you are working with, and give it a name that reflects what's in it. For example, 20190913_usda_excel is named with the date USDA sent it, the data provider (usda), and the format.
- As you sift through the linkages, make additions to datasets.json if you come across datasets that are in a publication but not listed yet. When adding an entry to datasets.json, create a new branch from https://github.com/NYU-CI/RCDatasets. It may be helpful to give the branch the same name as your subfolder in /metadata.
At a minimum, each record in the datasets.json file must have these required fields:
- provider -- name of the data provider
- title -- name of the dataset
- id -- a unique sequential identifier
For the names, use what the data provider shows on their web page and try to be as concise as possible.
When adding records:
- add to the bottom of the file
- increment the id number manually
- make sure not to introduce multiple names for the same provider
- make sure to remove any special characters or characters that will raise encoding errors
- all values should be string values (e.g., if you see any dictionaries, those should be removed)
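The "add to the bottom" and "increment the id" steps can be sketched in Python. This is only an illustration, not part of the repo: it assumes datasets.json holds a top-level list of records and that every id has the shape dataset-NNN with three digits, as in this README's example entry. The helper names next_dataset_id and append_record are hypothetical.

```python
import re

# Assumed id shape ("dataset-058"), based on the example entry in this README.
ID_PATTERN = re.compile(r"^dataset-(\d+)$")


def next_dataset_id(records):
    """Return the id that follows the highest existing numeric id."""
    numbers = []
    for record in records:
        match = ID_PATTERN.match(record.get("id", ""))
        if match:
            numbers.append(int(match.group(1)))
    return "dataset-{:03d}".format(max(numbers, default=0) + 1)


def append_record(records, provider, title, **optional):
    """Append a new record at the bottom of the list and return it."""
    # Map lower-cased provider names to the spelling already in the file.
    known = {r["provider"].lower(): r["provider"] for r in records if "provider" in r}
    # Reuse the existing spelling when the new name differs only by case.
    provider = known.get(provider.strip().lower(), provider.strip())
    record = {"id": next_dataset_id(records), "provider": provider, "title": title}
    record.update(optional)
    records.append(record)
    return record
```

The provider lookup only catches differences in capitalization and surrounding whitespace; variants such as abbreviations still need a manual check.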
Other fields that may be included:
- alt_title -- list of alternative titles or abbreviations, aka "mentions"
- url -- URL for the main page describing the dataset
- doi -- a unique persistent identifier assigned by the data provider
- alt_ids -- other unique identifiers (alternative DOIs, etc.)
- description -- a brief (tweet sized) text description of the dataset
- date -- date of publication, which may help resolve conflicting identifiers
Example entry:
{
    "id": "dataset-058",
    "provider": "Bureau of Labor Statistics",
    "title": "Consumer Price Index",
    "alt_title": [
        "HEI"
    ],
    "url": "https://www.bls.gov/cpi/",
    "description": "The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services."
}
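Before committing, the rules above can be checked mechanically. The following is a minimal sketch under stated assumptions, not an official validator: it treats a list of strings (as in alt_title) as acceptable, flags dictionaries and other non-string values, and interprets "characters that will raise encoding errors" as anything outside ASCII, which may be stricter than intended. The function names are hypothetical.

```python
REQUIRED_FIELDS = ("provider", "title", "id")


def is_plain_text(value):
    """True for a string that encodes cleanly as ASCII (an assumed rule)."""
    if not isinstance(value, str):
        return False
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append("missing required field: " + field)
    for key, value in record.items():
        # Assumption: a list of strings (as in alt_title) is acceptable.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not is_plain_text(item):
                problems.append("non-string or non-ASCII value in: " + key)
                break
    return problems


def duplicate_ids(records):
    """Return ids that appear more than once, in order of repetition."""
    seen, dupes = set(), []
    for record in records:
        record_id = record.get("id")
        if record_id in seen:
            dupes.append(record_id)
        seen.add(record_id)
    return dupes
```

Running check_record over every entry and duplicate_ids over the whole list gives a quick report to review before opening a pull request.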
- Create a CSV file in which you'll document the publication metadata (title, url, doi, etc.). Be sure to keep track of the linkages with the dataset_id that you just created. Ultimately you will export the publication metadata to a JSON file; name that according to the data drop as well.
At a minimum, each record in the <your_unique_name>_publications.json file must have these required fields:
- title -- name of the publication
- url -- URL for the main page describing the publication
- related_dataset -- dataset_id from datasets.json
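The CSV-to-JSON export could look roughly like the sketch below. The CSV layout is an assumption: one row per publication-dataset linkage, with columns named title, url, doi, and dataset_id. The related_dataset shape (a list of objects with a dataset_id key) follows the example entry in this README. The function names are hypothetical.

```python
import csv
import json


def rows_to_publications(rows):
    """Group rows by publication, collecting the dataset_id linkages."""
    publications = {}
    for row in rows:
        key = (row["title"].strip(), row["url"].strip())
        pub = publications.setdefault(
            key, {"title": key[0], "url": key[1], "related_dataset": []}
        )
        # doi is optional; only write it when the cell is filled in.
        doi = (row.get("doi") or "").strip()
        if doi:
            pub["doi"] = doi
        link = {"dataset_id": row["dataset_id"].strip()}
        if link not in pub["related_dataset"]:
            pub["related_dataset"].append(link)
    return list(publications.values())


def export_publications(csv_path, json_path):
    """Read the working CSV and write the publications JSON file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        publications = rows_to_publications(csv.DictReader(handle))
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(publications, handle, indent=2)
    return publications
```

A publication linked to several datasets appears on several CSV rows and is merged into one JSON record with multiple related_dataset entries.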
Other fields that may be included:
- doi -- a unique persistent identifier assigned by the data provider
- title -- name of the dataset
- id -- a unique sequential identifier
Example entry in <your_unique_name>_publications.json:
{
    "title": "Design Issues in USDA's Supplemental Nutrition Assistance Program: Looking Ahead by Looking Back",
    "url": "https://www.ers.usda.gov/webdocs/publications/86924/err-243.pdf?v=43124",
    "related_dataset": [
        {"dataset_id": "dataset-026"}
    ]
}
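Since every related_dataset entry must point at an id in datasets.json, a quick cross-check can catch broken linkages. This is a sketch that assumes both files have already been loaded (for instance with json.load) into lists of records; unknown_dataset_ids is a hypothetical helper name.

```python
def unknown_dataset_ids(publications, datasets):
    """Map each publication title to the linked ids missing from datasets."""
    known = {dataset["id"] for dataset in datasets}
    missing = {}
    for pub in publications:
        ids = [link["dataset_id"] for link in pub.get("related_dataset", [])]
        unknown = [dataset_id for dataset_id in ids if dataset_id not in known]
        if unknown:
            missing[pub["title"]] = unknown
    return missing
```

An empty result means every linkage resolves; otherwise the returned titles show which records to fix or which datasets still need to be added.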