Skip to content

Parse tabular files (Excels, csv, tsv...) to add multiple metadata values to multiple iRODS data objects.

Notifications You must be signed in to change notification settings

kuleuven/mango-excels2metadata

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python module to extract metadata from tables

Use this module to process tabular files in which each row represents an iRODS data object and each column contains either an identifier or metadata to add to this data object. It supports plain text files and Excel files, which could be stored locally or in iRODS itself.

As always, create a virtual environment and install the dependencies described in the requirements file:

python -m venv venv
. venv/bin/activate
pip install -r requirements.txt

Usage

This module can run on the command line with two commands: setup and run.

The setup command takes as arguments the path to a tabular file (local or in iRODS) and the desired path for the output YAML, asks the user questions about how to parse the tabular file, and outputs a configuration file.

This configuration file can then be provided as the --config option to the run command in order to standardize tabular files and properly obtain paths to data objects and attach metadata to them based on the columns of these files.

Examples

A small csv file

The following file simulates having a small semicolon-separated file with absolute paths in a "dataobject" column and a few columns with metadata.

First, with the setup command, we answer a few questions on how to parse the tabular file and create a "test-config.yaml" configuration file that keeps track of the answers.

Then, with the run command, we use the information on the configuration YAML file to parse the tabular file and, because it's just a "dry run", we simulate adding the metadata to each data object. Note that this run command could then also be used on other tabular files with the same properties as the original one.

python metadata_from_tabular.py setup testdata/testdata.csv test-config.yaml --sep ";"
python metadata_from_tabular.py run testdata/testdata.csv --config test-config.yaml --dry-run

A larger Excel file with multiple sheets

In this second example the file is an Excel file with multiple sheets, including one that has no relevant metadata. Again, with the setup command we indicate how the Excel should be parsed and record the answers in a YAML configuration file. Then, with the run command we parse the Excel and simulate adding the metadata.

python metadata_from_tabular.py setup testdata/bigger-testdata.xlsx bigger-test-config.yaml
python metadata_from_tabular.py run testdata/bigger-testdata.xlsx --config bigger-test-config.yaml --dry-run

setup

The configuration file can be created as follows:

python metadata_from_tabular.py setup filename output_path

In this case filename is the path to a tabular file (csv, tsv, Excel...), stored either locally or in iRODS. If it lives in iRODS, the --irods flag should be used, so that an iRODS session is started:

python metadata_from_tabular.py setup /zone/home/project/path/to/tabular output_path --irods

If the tabular file is a plain text file, it is possible to specify a column separator with the --sep option, which has "," as a default. If a wrong separator is provided and the parser finds a single column, it will warn you and give you the possibility to correct it.

python metadata_from_tabular.py setup testdata/testdata.csv test-output.yml --sep ";"

If the file can be found and opened as a dataframe, the user will be prompted with questions that will later guide preprocessing of equivalent tabular files:

  • If there are multiple sheets in an Excel file, which one(s) should be used?
  • Which of the columns contains a unique identifier of the data objects that metadata has to be attached to?
  • If the unique identifier is not an absolute path, is it a relative path or part of filename? And if so, within which collection should the data objects be found?
  • Should any columns be whitelisted or blacklisted?

The final YAML will be printed on the console and saved as a file locally.

run

Given a path to a tabular file with metadata and a YAML with the settings to preprocess it, metadata can be added with the run command:

python metadata_from_tabular.py run path_to_tabular --config path/to/config.yml

For testing purposes, it is possible to use the --dry-run flag, which simulates the preprocessing and identification of metadata and prints a small report at the end. An iRODS session will be initiated always, so make sure you have a valid active iRODS Session.

python metadata_from_tabular.py run path_to_tabular --config path/to/config.yml --dry-run

It is not necessary to rerun both setup and run for each tabular file: if you have several tabular files with the same properties, and that thus can be described by the same YAML configuration file, you just need to run setup with one of them, and then run with each of the tabular files and the same configuration file.

About

Parse tabular files (Excels, csv, tsv...) to add multiple metadata values to multiple iRODS data objects.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages