Explora Data Input Standards and Data Source Objectives

Richard Bruskiewich edited this page Aug 18, 2014 · 2 revisions

August 2014 - Preliminary Thoughts and Observations

The original (and baseline) Explora code accepts a simple CSV file as its input source. Each row is a separate record for a plant genetic resources (PGR) accession, and the columns correspond to three data dimensions:

  • Column 1 - A locally unique identifier, perhaps just a number for the accession in the input set
  • Columns 2 to n - Trait descriptors with continuous value ranges, so that n - 1 is the number of continuous variables
  • Columns n+1 to m - Trait descriptors with nominal (categorical) value ranges, so that m - n is the number of nominal variables

The continuous trait columns are presumed to precede the nominal value columns, and the user also explicitly tells Explora how many variables of each type are in the dataset, which may allow the software to ignore anything in the spreadsheet after column m.
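The column layout described above can be sketched as follows. This is a minimal illustration in Python, not the Explora implementation: the function name `read_explora_csv`, the parameter names and the sample data are all hypothetical, and the sketch assumes a header row and no missing values.

```python
import csv
import io


def read_explora_csv(text, n_continuous, n_nominal):
    """Split each row into (identifier, continuous values, nominal values).

    Hypothetical helper: assumes column 1 is the identifier, followed by
    n_continuous continuous columns and then n_nominal nominal columns.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # assumes the first row holds column names
    last = 1 + n_continuous + n_nominal  # this is "m" in the text above
    if len(header) < last:
        raise ValueError("file has fewer columns than the declared counts")
    records = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        accession_id = row[0]
        continuous = [float(v) for v in row[1:1 + n_continuous]]
        nominal = row[1 + n_continuous:last]
        records.append((accession_id, continuous, nominal))
    # Anything after column m is dropped
    return header[:last], records


# Made-up sample: two continuous traits, one nominal trait, one extra column
SAMPLE = """id,height_cm,seed_weight_g,flower_colour,notes
1,102.5,3.1,white,ignored
2,98.0,2.7,purple,also ignored
"""

columns, records = read_explora_csv(SAMPLE, n_continuous=2, n_nominal=1)
```

Here the trailing `notes` column is silently ignored because it lies beyond column m, which is the behaviour the paragraph above suggests is possible.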

Several thoughts fall out of the above:

  1. Could any format other than CSV be handled (for example, XML or RDF formatted data)?
  2. Could (should?) direct selection and importing of data from standard public sources (e.g. GENESYS?) be enabled?
  3. Could the adoption of common PGR data formats by such data sources accelerate adoption of the tool by the community?
  4. Is there any mechanism by which the input data could be made self-describing, so that the identifier, continuous and nominal data columns could be automatically extracted from the file (e.g. perhaps a fixed prefix on the column names would enable this kind of automation)?
  5. Should the accession identifiers have a more informative global meaning and syntax (e.g. URIs)?
  6. Should the trait values be constrained to map onto public trait ontology (e.g. crop ontology) values?
  7. Could (or should) additional meta-data (files) be provided or maintained in the system, for example to map passport data to accession identifiers, or crop ontology linkages to trait columns?
  8. Should Explora application user interaction eventually provide in situ support for the above accession and trait descriptor details (e.g. external clickable linkages to relevant online (meta-)data)?
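
As one illustration of point 4, a fixed column-name prefix could let software infer each column's role without the user supplying counts. The sketch below is only a possibility, not an agreed convention: the prefixes `id:`, `c:` and `n:`, the function name `classify_columns` and the sample header are all assumptions made for this example.

```python
# Hypothetical prefix convention; none of these prefixes is an Explora standard
PREFIXES = {"id:": "identifier", "c:": "continuous", "n:": "nominal"}


def classify_columns(header):
    """Group (index, bare name) pairs by the role implied by each prefix."""
    roles = {"identifier": [], "continuous": [], "nominal": [], "ignored": []}
    for index, name in enumerate(header):
        for prefix, role in PREFIXES.items():
            if name.startswith(prefix):
                roles[role].append((index, name[len(prefix):]))
                break
        else:
            # No recognised prefix: leave the column out of the analysis
            roles["ignored"].append((index, name))
    if len(roles["identifier"]) != 1:
        raise ValueError("expected exactly one identifier column")
    return roles


roles = classify_columns(
    ["id:accession", "c:height_cm", "n:flower_colour", "c:seed_weight_g", "notes"]
)
```

With such a convention the continuous and nominal columns would no longer need to be contiguous or ordered, since each column carries its own type.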