diff --git a/doc/source/admin/data_tables.md b/doc/source/admin/data_tables.md new file mode 100644 index 000000000000..44f1cc008e72 --- /dev/null +++ b/doc/source/admin/data_tables.md @@ -0,0 +1,105 @@ +# Tool data + +Galaxy stores tool data in the path defined by `tool_data_path` (by default `tool-data/`). +It's possible to separate tool data of toolshed-installed tools by setting `shed_tool_data_path`. + +Tool data consists of: + +1. the actual data +2. one or more `loc` files +3. entries in a tool data table (config) file + + +## History + +In order to understand the naming and structure of these three components it might be of advantage +to look in the history. Tool data was organized in tabular `loc` that contained metadata and paths +of the data. Those files were installed with the tool and could be accessed with the +[`from_file`](https://docs.galaxyproject.org/en/master/dev/schema.html#from-file) mechanism from tools. +Since each tool version had it's own `loc` file the maintenance was difficult. With tool data tables +an additional abstraction layer was introduced that is used from tools via +[`from_datatable`](https://docs.galaxyproject.org/en/master/dev/schema.html#from-data-table). + +## Tool data + +This is the actual data that is stored by default in `tool_data_path`. It may be favorable to store the +actual tool data in a separate folder. For manually managed tool data this can be achieved by simply +storing the data in another folder. For data that is added by data managers this can be achieved by +setting `galaxy_data_manager_data_path`. + +## `loc` files + +In order to make tool data usable from Galaxy tools `loc` files are used. +Those are tabular (by default tab separated) files with the extension `.loc`. +Besides the actual paths, the entries can contain IDs, names, or other metadata +that can be used in tools to select reference data. The paths should be given as absolute paths, +but can also be given relative to the Galaxy root dir. +By default `loc` files are installed in `tool_data_path` (where also built-in `loc` files +are stored). By setting `shed_tool_data_path` this can be separated. + +## Tool data tables + +The tool data tables that should be used in a Galaxy instance are listed +in tool data table config files. In addition these contain metadata. + +Tool data table config files are XML files listing tool data table configurations: + +```xml + + .... + +``` + +An entry for a tool data looks like this + +```xml + + value, dbkey, name, path + +
+``` + +- `table`: `name`, `comment_char` (default `"#"`), `separator` (default `"\t"`), `allow_duplicate_entries` (default `True`), `empty_field_value` (default `""`) +- `columns`: a comma separated list of column names +- `file`: `path` (alternatively `url`, `from_config`) + +Tool data table entries for tools installed from a toolshed have an additional +element `tool_shed_repository` and sub-tags `tool_shed` +`repository_name`, `repository_owner`, `installed_changeset_revision`, e.g.: + +```xml + + value, name, date, path + + + toolshed.g2.bx.psu.edu + plasmidfinder + iuc + 7075b7a5441b + +
+``` + +The file path points to a `loc` file and can be given relative (to the +`tool_data_path`) or absolute. If a given relative path does not exist also the +base name is checked (many tools use something like `tool-data/xyz.loc` and +store example `loc` files in a `tool-data/` directory in the tool repository). +Currently also `.loc.sample` may be used in case the specified `.loc` is absent. + +Galaxy uses two tool data table config files: + +- `tool_data_table_config_path`: by default `tool_data_table_conf.xml` in Galaxy's `config/` directory. +- `shed_tool_data_table_config`: by default `shed_tool_data_table_conf.xml` in +Galaxy's `config/` directory. This file lists all tool data tables of tools +installed from a toolshed. Note that the entries are versioned, i.e. there is a +separate entry for each tool and tool version. + +The tool data table config files can (and do) contain multiple entries for the same data table +(identified by the same name). These content of the corresponding `loc` files are merged when +they are loaded. + +When a new tool is installed that uses a data table a new entry is added to +`shed_tool_data_table_config` and a `loc` file is placed in a versioned +subdirectory in `tool_data_path` (in a subdirectory that has the name of the +toolshed). By default this is `tool-data/toolshed.g2.bx.psu.edu/`. Note that +these directories will also contain tool data table config files, but they are unused. diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst index 2ae8d1af72a0..61ed7db63ba4 100644 --- a/doc/source/admin/index.rst +++ b/doc/source/admin/index.rst @@ -20,6 +20,7 @@ This documentation is in the midst of being ported and unified based on resource job_metrics authentication tool_panel + data_tables mq dependency_resolvers container_resolvers