From de34121ae1f74f72ce587af075b32c94a0fea04a Mon Sep 17 00:00:00 2001
From: Matthias Bernt
Date: Mon, 4 Mar 2024 16:03:30 +0100
Subject: [PATCH 1/4] add some admin facing docs on data tables

---
 doc/source/admin/data_tables.md | 93 +++++++++++++++++++++++++++++++++
 doc/source/admin/index.rst      |  1 +
 2 files changed, 94 insertions(+)
 create mode 100644 doc/source/admin/data_tables.md

diff --git a/doc/source/admin/data_tables.md b/doc/source/admin/data_tables.md
new file mode 100644
index 000000000000..a92291eef0cc
--- /dev/null
+++ b/doc/source/admin/data_tables.md
@@ -0,0 +1,93 @@
+# Tool data
+
+Galaxy stores tool data in the path defined by `tool_data_path` (by default `tool-data/`).
+It's possible to to separate tool data of shed installed tools by setting (`shed_tool_data_path`).
+
+Tool data consists of:
+
+1. the actual data
+2. a tool data table
+3. tool data config files
+
+## Tool data
+
+This is the actual data that is stored by default in `tool_data_path`. It may be favorable to store the
+actual tool data in a separate folder. For manually managed tool data this can be achieved by simply
+storing the data in another folder. For data that is added by data managers this can be achieved by
+setting `galaxy_data_manager_data_path`.
+
+## Tool data tables
+
+In order to make tool data usable from Galaxy tools so called tool data tables are used.
+Those are tabular (by default tab separated) files with the extension `.loc`.
+Besides the actual paths the entries can contain, e.g. IDs, names, or other
+that can be used in tools to select reference data. The paths should be given as absolute paths,
+but can also be given relative to the Galaxy root dir.
+By default tool data tables are installed in `tool_data_path` (where also built-in tool data tables
+are stored). By setting `shed_tool_data_path` this can be separated.
+
+## Tool data table config
+
+The tool data tables that should be used in a Galaxy instance are configured
+using tool data table config files. In addition these files contain some
+metadata.
+
+Tool data table config files are XML files listing tool data table configurations:
+
+```xml
+
+ ....
+
+```
+
+A tool data table configuration looks like this
+
+```xml
+
+ value, dbkey, name, path
+
+
+```
+
+- `table`: `name`, `comment_char` (default `"#"`), `separator` (default `"\t"`), `allow_duplicate_entries` (default `True`), `empty_field_value` (default `""`)
+- `columns`: a comma separated list of column names
+- `file`: `path` (alternatively `url`, `from_config`)
+
+Tool data table definitions for tools installed from a toolshed have an additional
+element `tool_shed_repository` and sub-tags `tool_shed`
+`repository_name`, `repository_owner`, `installed_changeset_revision`, e.g.:
+
+```xml
+
+ value, name, date, path
+
+
+ toolshed.g2.bx.psu.edu
+ plasmidfinder
+ iuc
+ 7075b7a5441b
+
+
+```
+
+The file path points to a data table (i.e. a `.loc` file) and can be given
+relative (to the `tool_data_path`) or absolute. If a given relative path does
+not exist also the base name is checked (many tools use something like
+`tool-data/xyz.loc` and store example `loc` files in a `tool-data/` directory in
+the tool repository).
+Currently also `.loc.sample` may be used in case the specified `.loc` is absent.
+
+Tool data table config files:
+
+- `tool_data_table_config_path`: by default `tool_data_table_conf.xml` in Galaxy's `config/` directory.
+- `shed_tool_data_table_config`: by default `shed_tool_data_table_conf.xml` in
+Galaxy's `config/` directory. This file lists all tool data tables of tools
+installed from a toolshed. Note that the entries are versioned, i.e. there is a
+separate entry for each tool and tool version. These content of the tool data
+tables are merged when they are loaded.
+
+When a new tool is installed that uses a data table a new entry is added to
+`shed_tool_data_table_config` and a `.loc` file is placed in a versioned
+subdirectory in `tool_data_path` (in a subdirectory that has the name of the
+toolshed). By default thus is `tool-data/toolshed.g2.bx.psu.edu/`. Note that
+these directories will also contain tool data config files, but they are unused.
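Illustration of the `.loc` format documented in the file added above: a minimal Python sketch, not part of the patch and not Galaxy's own loader. It assumes a tab-separated file with `#` comments (the defaults listed for `table`) and the column list `value, dbkey, name, path` from the example entry. The function name `read_loc`, the sample row, and its path are hypothetical.

```python
def read_loc(text, columns, separator="\t", comment_char="#"):
    """Parse the text of a loc file into a list of dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith(comment_char):
            continue
        fields = line.split(separator)
        # Assumption of this sketch: rows with too few fields are dropped.
        if len(fields) < len(columns):
            continue
        # zip() pairs columns with fields and ignores surplus fields.
        rows.append(dict(zip(columns, fields)))
    return rows


# Hypothetical loc content: one comment line and one data row.
SAMPLE = "#value\tdbkey\tname\tpath\nhg38\thg38\tHuman (hg38)\t/data/hg38/hg38.fa\n"
entries = read_loc(SAMPLE, ["value", "dbkey", "name", "path"])
```

Dropping short rows is a choice made for this sketch only; it says nothing about how Galaxy treats such rows.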
diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst
index 2ae8d1af72a0..61ed7db63ba4 100644
--- a/doc/source/admin/index.rst
+++ b/doc/source/admin/index.rst
@@ -20,6 +20,7 @@ This documentation is in the midst of being ported and unified based on resource
    job_metrics
    authentication
    tool_panel
+   data_tables
    mq
    dependency_resolvers
    container_resolvers

From f15426c56b83436effd93bfe54f5612561dfdc6e Mon Sep 17 00:00:00 2001
From: M Bernt
Date: Tue, 5 Mar 2024 08:28:56 +0100
Subject: [PATCH 2/4] Improve wording

Co-authored-by: Nicola Soranzo
---
 doc/source/admin/data_tables.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/source/admin/data_tables.md b/doc/source/admin/data_tables.md
index a92291eef0cc..8af597814119 100644
--- a/doc/source/admin/data_tables.md
+++ b/doc/source/admin/data_tables.md
@@ -1,7 +1,7 @@
 # Tool data
 
 Galaxy stores tool data in the path defined by `tool_data_path` (by default `tool-data/`).
-It's possible to to separate tool data of shed installed tools by setting (`shed_tool_data_path`).
+It's possible to separate tool data of ToolShed-installed tools by setting `shed_tool_data_path`.
 
 Tool data consists of:
 
@@ -20,13 +20,13 @@ setting `galaxy_data_manager_data_path`.
 
 In order to make tool data usable from Galaxy tools so called tool data tables are used.
 Those are tabular (by default tab separated) files with the extension `.loc`.
-Besides the actual paths the entries can contain, e.g. IDs, names, or other
+Besides the actual paths, the entries can contain IDs, names, or other information
 that can be used in tools to select reference data. The paths should be given as absolute paths,
 but can also be given relative to the Galaxy root dir.
 By default tool data tables are installed in `tool_data_path` (where also built-in tool data tables
 are stored). By setting `shed_tool_data_path` this can be separated.
 
-## Tool data table config
+## Tool data table config files
 
 The tool data tables that should be used in a Galaxy instance are configured
 using tool data table config files. In addition these files contain some
@@ -89,5 +89,5 @@ tables are merged when they are loaded.
 When a new tool is installed that uses a data table a new entry is added to
 `shed_tool_data_table_config` and a `.loc` file is placed in a versioned
 subdirectory in `tool_data_path` (in a subdirectory that has the name of the
-toolshed). By default thus is `tool-data/toolshed.g2.bx.psu.edu/`. Note that
-these directories will also contain tool data config files, but they are unused.
+toolshed). By default this is `tool-data/toolshed.g2.bx.psu.edu/`. Note that
+these directories will also contain tool data table config files, but they are unused.

From c964d89ba3195abb6d70d215ba5b85f29dd38bd4 Mon Sep 17 00:00:00 2001
From: Matthias Bernt
Date: Wed, 6 Mar 2024 13:15:21 +0100
Subject: [PATCH 3/4] try to use terms properly

loc files and tool data tables should now be used as they should be
---
 doc/source/admin/data_tables.md | 53 ++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/doc/source/admin/data_tables.md b/doc/source/admin/data_tables.md
index a92291eef0cc..9b88787fcd0d 100644
--- a/doc/source/admin/data_tables.md
+++ b/doc/source/admin/data_tables.md
@@ -6,8 +6,19 @@ It's possible to to separate tool data of shed installed tools by setting (`shed
 Tool data consists of:
 
 1. the actual data
-2. a tool data table
-3. tool data config files
+2. one or more so called `loc` files
+3. entries in a tool data table (config) file
+
+
+## History
+
+In order to understand the naming and structure of these three components it might be of advantage
+to look in the history. Tool data was organized in tabular `loc` that contained metadata and paths
+of the data. Those files were were installed with the tool and could be accessed with the
+[`from_file`](https://docs.galaxyproject.org/en/master/dev/schema.html#from-file) mechanism from tools.
+Since each tool version had it's own `loc` file the maintenance was difficult. With tool data tables
+an additional abstraction layer was introduced that is used from tools via
+[`from_datatable`](https://docs.galaxyproject.org/en/master/dev/schema.html#from-data-table).
 
 ## Tool data
 
@@ -16,20 +27,20 @@ actual tool data in a separate folder. For manually managed tool data this can b
 storing the data in another folder. For data that is added by data managers this can be achieved by
 setting `galaxy_data_manager_data_path`.
 
-## Tool data tables
+## `loc` files
 
-In order to make tool data usable from Galaxy tools so called tool data tables are used.
+In order to make tool data usable from Galaxy tools so called `loc` files are used.
 Those are tabular (by default tab separated) files with the extension `.loc`.
-Besides the actual paths the entries can contain, e.g. IDs, names, or other
+Besides the actual paths the entries can contain, e.g. IDs, names, or other metadata
 that can be used in tools to select reference data. The paths should be given as absolute paths,
 but can also be given relative to the Galaxy root dir.
-By default tool data tables are installed in `tool_data_path` (where also built-in tool data tables
+By default `loc` files are installed in `tool_data_path` (where also built-in `loc` files
 are stored). By setting `shed_tool_data_path` this can be separated.
 
-## Tool data table config
+## Tool data tables
 
-The tool data tables that should be used in a Galaxy instance are configured
-using tool data table config files. In addition these files contain some
+The tool data tables that should be used in a Galaxy instance are listed
+in tool data table config files. In addition these contain some
 metadata.
 
 Tool data table config files are XML files listing tool data table configurations:
@@ -40,7 +51,7 @@ Tool data table config files are XML files listing tool data table configuration
 
 ```
 
-A tool data table configuration looks like this
+An entry for a tool data looks like this
 
 ```xml
 
@@ -53,7 +64,7 @@ A tool data table configuration looks like this
 - `columns`: a comma separated list of column names
 - `file`: `path` (alternatively `url`, `from_config`)
 
-Tool data table definitions for tools installed from a toolshed have an additional
+Tool data table entries for tools installed from a toolshed have an additional
 element `tool_shed_repository` and sub-tags `tool_shed`
 `repository_name`, `repository_owner`, `installed_changeset_revision`, e.g.:
 
@@ -70,24 +81,26 @@ element `tool_shed_repository` and sub-tags `tool_shed`
 
 ```
 
-The file path points to a data table (i.e. a `.loc` file) and can be given
-relative (to the `tool_data_path`) or absolute. If a given relative path does
-not exist also the base name is checked (many tools use something like
-`tool-data/xyz.loc` and store example `loc` files in a `tool-data/` directory in
-the tool repository).
+The file path points to a `loc` file and can be given relative (to the
+`tool_data_path`) or absolute. If a given relative path does not exist also the
+base name is checked (many tools use something like `tool-data/xyz.loc` and
+store example `loc` files in a `tool-data/` directory in the tool repository).
 Currently also `.loc.sample` may be used in case the specified `.loc` is absent.
 
-Tool data table config files:
+Galaxy uses two tool data table config files:
 
 - `tool_data_table_config_path`: by default `tool_data_table_conf.xml` in Galaxy's `config/` directory.
 - `shed_tool_data_table_config`: by default `shed_tool_data_table_conf.xml` in
 Galaxy's `config/` directory. This file lists all tool data tables of tools
 installed from a toolshed. Note that the entries are versioned, i.e. there is a
-separate entry for each tool and tool version. These content of the tool data
-tables are merged when they are loaded.
+separate entry for each tool and tool version.
+
+The tool data table config files can (and do) contain multiple entries for the same data table
+(identified by the same name). These content of the corresponding `loc` files are merged when
+they are loaded.
 
 When a new tool is installed that uses a data table a new entry is added to
-`shed_tool_data_table_config` and a `.loc` file is placed in a versioned
+`shed_tool_data_table_config` and a `loc` file is placed in a versioned
 subdirectory in `tool_data_path` (in a subdirectory that has the name of the
 toolshed). By default thus is `tool-data/toolshed.g2.bx.psu.edu/`. Note that
 these directories will also contain tool data config files, but they are unused.
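Illustration of the merging described in this patch: a minimal Python sketch, not part of the patch and not Galaxy's implementation. It groups the `file` paths of all `table` entries that share a `name`, which gives the set of `loc` files whose contents would be merged. The `tables` root element, the table name `example_table`, and both paths are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical config with two versioned entries for the same data table.
CONFIG = """
<tables>
    <table name="example_table" comment_char="#">
        <columns>value, name, path</columns>
        <file path="v1/example.loc" />
    </table>
    <table name="example_table" comment_char="#">
        <columns>value, name, path</columns>
        <file path="v2/example.loc" />
    </table>
</tables>
"""


def loc_files_by_table(config_xml):
    """Map each table name to the loc file paths of all entries with that name."""
    grouped = defaultdict(list)
    root = ET.fromstring(config_xml)
    for table in root.findall("table"):
        for file_elem in table.findall("file"):
            path = file_elem.get("path")
            # Entries that use `url` or `from_config` instead are ignored here.
            if path is not None:
                grouped[table.get("name")].append(path)
    return dict(grouped)


merged = loc_files_by_table(CONFIG)
```

Here `merged` holds one key, `example_table`, with both paths in document order. Reading and combining the rows of those files is left out of the sketch.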
From abad770d5a7ff7a49bd6c2aae61f54d5ed05e644 Mon Sep 17 00:00:00 2001
From: Martin Cech
Date: Thu, 18 Apr 2024 18:55:38 +0200
Subject: [PATCH 4/4] change text in a minor way

---
 doc/source/admin/data_tables.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/doc/source/admin/data_tables.md b/doc/source/admin/data_tables.md
index 2004261fc813..44f1cc008e72 100644
--- a/doc/source/admin/data_tables.md
+++ b/doc/source/admin/data_tables.md
@@ -1,12 +1,12 @@
 # Tool data
 
 Galaxy stores tool data in the path defined by `tool_data_path` (by default `tool-data/`).
-It's possible to separate tool data of ToolShed-installed tools by setting `shed_tool_data_path`.
+It's possible to separate tool data of toolshed-installed tools by setting `shed_tool_data_path`.
 
 Tool data consists of:
 
 1. the actual data
-2. one or more so called `loc` files
+2. one or more `loc` files
 3. entries in a tool data table (config) file
 
 
@@ -14,7 +14,7 @@ Tool data consists of:
 
 In order to understand the naming and structure of these three components it might be of advantage
 to look in the history. Tool data was organized in tabular `loc` that contained metadata and paths
-of the data. Those files were were installed with the tool and could be accessed with the
+of the data. Those files were installed with the tool and could be accessed with the
 [`from_file`](https://docs.galaxyproject.org/en/master/dev/schema.html#from-file) mechanism from tools.
 Since each tool version had it's own `loc` file the maintenance was difficult. With tool data tables
 an additional abstraction layer was introduced that is used from tools via
@@ -29,7 +29,7 @@ setting `galaxy_data_manager_data_path`.
 
 ## `loc` files
 
-In order to make tool data usable from Galaxy tools so called `loc` files are used.
+In order to make tool data usable from Galaxy tools `loc` files are used.
 Those are tabular (by default tab separated) files with the extension `.loc`.
 Besides the actual paths, the entries can contain IDs, names, or other metadata
 that can be used in tools to select reference data. The paths should be given as absolute paths,
@@ -40,8 +40,7 @@ are stored). By setting `shed_tool_data_path` this can be separated.
 ## Tool data tables
 
 The tool data tables that should be used in a Galaxy instance are listed
-in tool data table config files. In addition these contain some
-metadata.
+in tool data table config files. In addition these contain metadata.
 
 Tool data table config files are XML files listing tool data table configurations: