IASO was the Greek goddess of cures, remedies and modes of healing. cmd-iaso is a command-line tool to help the curators of the identifiers.org registry. First, it provides the functionality to scrape data from the resource providers in the registry. With this information, the curator is then walked through an interactive curation process for the discovered issues. The goal of cmd-iaso is to aid the curators in upholding the health and integrity of the identifiers.org registry.
cmd-iaso was developed by Moritz Langenstein under the supervision of Manuel Bernal Llinares for the identifiers.org registry created by the European Bioinformatics Institute. The project is published under the MIT License.
This repository consists of the following main parts:
- the iaso Python package
- the cmd-iaso command-line tool entry point (located in iaso/cli)
- the cmd-iaso-docker.py wrapper to help running cmd-iaso inside a Docker container
- the iaso.plugins package path to register validator plugins for the tool
- the athena analysis Python+Rust package based on the metis Rust crate
The codebase is formatted using the black code style and tested using the tox automation project and pytest.
This overview will delve into the details of cmd-iaso, cmd-iaso-docker.py and iaso.plugins. However, we leave it up to the reader to look into the iaso codebase themselves should they wish to see some of the implementation details.
There are four ways to install this tool, each offering a different degree of isolation and control. All of these methods start by cloning the git repository using either HTTPS:
> git clone https://github.com/identifiers-org/cmd-iaso.git
or SSH:
> git clone [email protected]:identifiers-org/cmd-iaso.git
All methods apart from the Docker container require an installation of Python 3.7+ and pip. If you want to use athena analysis, you also require a stable Rust installation in your PATH.
To install the iaso package and cmd-iaso tool directly into your Python installation, you can use pip:
> pip install -e .
Note that this will also install all of the dependencies into your current Python environment. To create some isolation, you can use a Python virtual environment and install cmd-iaso in there.
This installation method is the most user-friendly, as it automatically takes care of any installation dependencies the package has, including athena analysis. Therefore, this method requires Rust to be installed in your PATH. For more control over this, use the installation based on Python setuptools described below.
To install the iaso package and cmd-iaso tool directly into your Python installation, you can make direct use of the setup.py script:
> python3 setup.py install
Note that this will also install all of the dependencies into your current Python environment. To create some isolation, you can use a Python virtual environment and install cmd-iaso in there.
This installation might skip some optional components if their installation dependencies are not already satisfied. If you want to install athena analysis with this command, Rust needs to be installed in your PATH. Furthermore, you need to install setuptools-rust in your Python environment using:
> pip install setuptools-rust
If you want to automatically install cmd-iaso inside a fresh and isolated virtual environment, you can simply run:
> make install
In contrast to the direct setuptools method, cmd-iaso will not be registered in your path automatically. To register the command-line extensions, you can run:
> source command-line-extensions.sh
This installation will skip athena analysis on its first installation. You will need to manually follow the pip or Python setuptools installation steps described above to reinstall cmd-iaso with athena analysis support in the newly created virtual Python environment.
If you have already installed Docker, you can simply run
> python3 cmd-iaso-docker.py
This command will build the Docker container during the first run. The Docker container will always be installed with athena analysis support. The cmd-iaso-docker.py wrapper mirrors the functionality of cmd-iaso, so every command that you can run as
> cmd-iaso COMMAND ARGS OPTIONS
can also be run as
> python3 cmd-iaso-docker.py COMMAND ARGS OPTIONS
There are a few small differences in semantics between running cmd-iaso and python3 cmd-iaso-docker.py, however. Firstly, all (file) paths mentioned in the arguments must already exist, which also means that in Docker mode the tool will always complain about overwriting existing files. Secondly, any environment variables visible to cmd-iaso, for instance through the .env file, will not be visible to the containerised tool. Lastly, if you want to use any custom curation validator plugins (see below), you will need to add a new layer to the Docker container to install them inside it as well. Otherwise, they will not be found by python3 cmd-iaso-docker.py.
It is also possible to manually run the docker run commands yourself using:
> docker run -it --init identifiersorg/cmd-iaso COMMAND ARGS OPTIONS
Please be aware that while this approach gives you more control over the Docker container, we cannot guarantee that the commands will run as expected.
cmd-iaso offers some shell completion functionality for bash, fish, PowerShell and zsh. If you want to install the shell completion, you can use
> cmd-iaso completion install [SHELL] [PATH] [--append/--overwrite] [-i/--case-insensitive/--no-case-insensitive]
If you do not specify SHELL explicitly, your current shell will be detected automatically and used instead. You can optionally also specify the PATH to which the completion script will be appended (--append) or which will be overwritten (--overwrite). Finally, if you want the completion to be case-insensitive, you can enable that via the -i or --case-insensitive option. To explicitly disable case-insensitive completion, you can provide the --no-case-insensitive flag.
If you do not want cmd-iaso to install the shell completion, you can simply use
> cmd-iaso completion show [SHELL] [-i/--case-insensitive/--no-case-insensitive]
to output the completion script to the terminal.
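For example, to install case-insensitive completion for bash, a command along the following lines should work (shown purely as an illustration of the options described above):
> cmd-iaso completion install bash -i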
cmd-iaso comes with many commands and options. While this document outlines their functionality, you can always provide the --help option to any command to read a description of the command and its available options. Most options have default values, while some always require a user-provided value. All options can be provided either via the command line or via environment variables. cmd-iaso also supports reading a .env file to get the values of the environment variables. Note that command-line options always overwrite environment variables. A default configuration is provided in config.default, which is automatically copied to .env by make install. Please refer to the --help pages to find out the names of the supported environment variables.
To print a description of your current runtime environment, you can run:
> cmd-iaso environment
To print the current status of the identifiers.org registry, you can use:
> cmd-iaso registry
Before performing curation of the resource providers in the identifiers.org registry, cmd-iaso needs to scrape some data. This section outlines how to configure and run the scraping pipeline.
[Optional]: Extracting LUIs from the load balancing logs of identifiers.org
If you want the data scraping to probe valid resource LUIs, you need to provide the tool with a list of them. One way to get LUIs that are heuristically more likely to be valid is to extract them from the load balancing logs of identifiers.org:
> cmd-iaso logs2luis LOGS VALID_NAMESPACE_IDS [--resolution-endpoint RESOLUTION_ENDPOINT]
Here, LOGS refers to the folder in which the logs are stored and VALID_NAMESPACE_IDS is the file path to which the list of extracted LUIs will be written. Optionally, --resolution-endpoint RESOLUTION_ENDPOINT can be provided to specify a custom resolution API endpoint, for instance in order not to overload the public one.
See the Makefile target download_latest_logs for an example of how to get the logs from the Kubernetes cluster in the cloud.
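For example, assuming the logs have been downloaded into a local logs directory and the extracted LUIs should be written to valid_namespace_ids.txt (both paths are just placeholders):
> cmd-iaso logs2luis logs valid_namespace_ids.txt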
cmd-iaso needs to know exactly which resource providers and LUIs it will probe during the scraping. To generate the jobs specification file, you can run:
> cmd-iaso jobs JOBS [--valid VALID] [--random RANDOM] [--pings PINGS] [--valid-namespace-ids VALID_NAMESPACE_IDS]
This command will attempt to use VALID valid LUIs for each resource provider, in addition to generating RANDOM random LUIs per provider. Iff VALID is greater than zero, you must also provide --valid-namespace-ids VALID_NAMESPACE_IDS, where VALID_NAMESPACE_IDS points to the file you generated using cmd-iaso logs2luis. Each job will be repeated PINGS times in the jobs list. The final list of jobs will be saved at the JOBS file path.
Note that the resulting jobs list of this command is random: both the random LUIs and the selection of valid LUIs are randomised on each run of this command. Furthermore, note that this command will attempt to use valid LUIs from a different namespace if some namespace does not have enough valid LUIs specified in VALID_NAMESPACE_IDS. Therefore, as long as there are enough LUIs in VALID_NAMESPACE_IDS overall, it will use more than VALID LUIs from some namespaces to compensate for others.
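For example, the following invocation (with placeholder values and file names) would request 50 valid and 50 random LUIs per provider and ping every job 5 times:
> cmd-iaso jobs jobs.json --valid 50 --random 50 --pings 5 --valid-namespace-ids valid_namespace_ids.txt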
cmd-iaso uses an HTTPS intercepting proxy to detect and flag some common error cases without exposing the rest of the scraping pipeline to them. While cmd-iaso scrape can launch its own proxy (see below), you can also launch your own:
> cmd-iaso proxy3 [--port PORT] [--timeout TIMEOUT] [--log null|stderr|proxy3.log]
PORT specifies the free port the proxy should run on. TIMEOUT specifies in seconds how long the proxy should wait internally for resources on the Internet to respond. It is recommended to choose a lower timeout for the proxy than for the scraping command. The --log option specifies which logging output will be used: 'null' discards all messages, 'stderr' redirects them to stderr and 'proxy3.log' appends them to the proxy3.log file in the current working directory. By default, all messages are discarded.
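For example, to run the proxy on port 8080 with a 10-second internal timeout and log messages to stderr (all values are merely illustrative):
> cmd-iaso proxy3 --port 8080 --timeout 10 --log stderr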
To run the data scraping pipeline, you must first create a new folder to save the collected data dumps in, for instance:
> mkdir dump
Now, you can run the data scraping command to run the jobs defined in the JOBS file and save the results in the DUMP folder:
> cmd-iaso scrape JOBS DUMP [--resume] [--proxy PROXY] [--chrome CHROME] [--workers WORKERS] [--timeout TIMEOUT] [--log null|stderr|scrape.log]
This command is highly customisable. Firstly, you can automatically launch a proxy (this is the default option but can also be requested explicitly using --proxy launch) or connect to an existing one by providing its address, e.g. --proxy localhost:8080. If a new proxy is launched, its log will be implicitly discarded. The --chrome option should be used with care, as it provides the path to the Chrome browser executable. If you do not provide this option, cmd-iaso will use a version of Chromium that is automatically downloaded if required. WORKERS specifies the number of processes that should be launched in parallel to work on different scraping jobs. Lastly, TIMEOUT specifies in seconds a baseline timeout that will be used to cancel scraping jobs that run for too long.
Running this command will take some time, so a progress bar is provided to keep the user informed. If you want to pause the scraping, you can interrupt it using CTRL-C or CMD-C depending on your operating system. The scraper will then shut down and wait for all running workers to complete. A paused scraping task can be resumed later on by passing the --resume flag to the command. Finally, the --log option specifies which logging output will be used: 'null' discards all messages, 'stderr' redirects them to stderr and 'scrape.log' appends them to the scrape.log file in the current working directory. By default, all messages are appended to scrape.log.
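For example, the following invocation (again with placeholder values) would work through the jobs in jobs.json using 32 parallel workers, a 30-second baseline timeout and the externally launched proxy from the previous section:
> cmd-iaso scrape jobs.json dump --proxy localhost:8080 --workers 32 --timeout 30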
The collected data dumps contain mostly raw information about the scraped resources. To collect and compress this data into a structured format that can be read by the curation process, you can run:
> cmd-iaso dump2datamine DUMP DATAMINE
which will read the data dumps from the DUMP folder and save the datamine to the DATAMINE file path.
The dump2datamine command also allows you to perform analysis on the scraped responses to determine whether the resource providers are working as expected. This working state is assessed by the information content of a resource (a toy sketch after the following list illustrates the idea):
- The information content of a resource is the maximum information content per LUI pinged during scraping, i.e. one working LUI is sufficient to be classified as working.
- Only the content which is deterministic per LUI is considered as informative, i.e. random or time-dependent elements are excluded.
- The information content of a LUI is the amount of information that is not shared with other LUIs. Longer segments of information are given a higher weight than shorter segments when measuring the amount of shared information. This definition means that any resource that always responds with the same or completely random responses will be classified as defunct. In contrast, if a resource provides deterministic, distinct responses for at least one LUI, its information content will be significantly higher.
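The following toy sketch only illustrates the intuition behind these rules; it is not the metis implementation, which operates on arbitrary segments rather than whitespace-separated tokens and uses a more refined weighting:

def toy_information_content(responses):
    # responses: maps each pinged LUI to the deterministic part of its
    # scraped response (random or time-dependent content already removed).
    def tokens(text):
        return set(text.split())

    scores = {}

    for lui, text in responses.items():
        # Content that also appears in the responses for other LUIs is
        # considered shared and therefore uninformative.
        shared = set()
        for other_lui, other_text in responses.items():
            if other_lui != lui:
                shared |= tokens(other_text)

        # Unique content is weighted by its length.
        scores[lui] = sum(len(token) for token in tokens(text) - shared)

    # One working LUI is sufficient: the resource score is the maximum.
    return max(scores.values(), default=0)

# Distinct deterministic responses per LUI score high ...
print(toy_information_content({"A": "entry A alpha", "B": "entry B beta"}))
# ... whereas identical (or, after filtering, empty) responses score zero.
print(toy_information_content({"A": "not found", "B": "not found"}))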
As the athena analysis is very computationally expensive, it is implemented in the Rust library crate metis. To enable this optional analysis, cmd-iaso must be installed with athena analysis support, as described in the installation guidelines outlined above. You can check whether athena analysis is available by running:
> cmd-iaso dump2datamine --check-athena
If the --analyse flag is passed to the dump2datamine command, the analysis will be performed and integrated with the normal dump compaction in the DATAMINE. The calculated information contents can then be checked during curation by enabling the information-content validator.
The identifiers.org registry might contain duplicate institution entries which refer to the same entity. In the old platform, a resource's institution was simply stored as a string. As a result of the migration from the old platform, many institution entries still have only their name field filled out, and some names are concatenations of multiple institutions. The institution deduplication command
> cmd-iaso dedup4institutions ACADEMINE
collects all existing institutions from the registry. It then attempts to link them to the entities they refer to. This process deduplicates the entries and disentangles concatenations of institution names. It also tries to fill in information about the institutions such as their name, official URL, ROR ID, country and a description. The results of this command are stored in the ACADEMINE file.
The primary purpose of cmd-iaso is to aid the curator in their curation process. The interactive curation is run either on the datamine file created from the data scraping pipeline using the cmd-iaso dump2datamine command, or on the academine file created from the institution deduplication using the cmd-iaso dedup4institutions command.
cmd-iaso uses validator plugins to provide customisable, modularised validation of the resource providers. Each validator is a subclass of the iaso.curation.validator.CurationValidator class:
from abc import ABC, abstractmethod
from typing import Union

import click


class CurationValidator(ABC):
    @classmethod
    def validate_params(cls, validator_name: str, **kwargs) -> "CurationValidator":
        """
        Overwrite this classmethod if your validator can take parameters.
        This method should either raise an exception or return a subclass of cls.
        """
        if len(kwargs) > 0:
            raise click.UsageError(
                click.style(
                    f"The validator {validator_name} does not accept any parameters.",
                    fg="red",
                )
            )

        return cls

    @staticmethod
    @abstractmethod
    def check_and_create(
        get_compact_identifier, valid_luis_threshold, random_luis_threshold, provider
    ) -> Union["CurationValidator", bool]:
        """
        Returns False iff this data_entry cannot be included during curation at all.
        Returns True iff this validator has found nothing to report on this data_entry.
        Returns an instance of the particular CurationValidator iff it found
        something to report about this data_entry.
        """
        pass

    @abstractmethod
    def format(self, formatter) -> None:
        pass
Here, get_compact_identifier is a function with the signature:
def get_compact_identifier(lui: str, provider_id: int) -> str:
    ...
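As an illustration, a hypothetical validator could be sketched as follows. The attributes accessed on provider and the method called on formatter are assumptions made purely for this example; consult the built-in validators shipped with the iaso package for the actual data layout and formatting API.

# my_module/my_validator.py -- a minimal, hypothetical validator sketch.
from iaso.curation.validator import CurationValidator


class MyValidator(CurationValidator):
    def __init__(self, slow_pings):
        self.slow_pings = slow_pings

    @staticmethod
    def check_and_create(
        get_compact_identifier, valid_luis_threshold, random_luis_threshold, provider
    ):
        # Hypothetical data access: collect all pings that took longer
        # than ten seconds to respond.
        slow_pings = [
            ping for ping in getattr(provider, "pings", []) if ping.duration > 10.0
        ]

        if len(slow_pings) == 0:
            # Nothing to report for this resource provider.
            return True

        return MyValidator(slow_pings)

    def format(self, formatter):
        # Hypothetical formatter call; the real formatting API is defined
        # by the informant implementations in the iaso package.
        formatter.format_json("slow pings", [ping.url for ping in self.slow_pings])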
Curation validators must be registered in the iaso.plugins module using setuptools entry points. For instance, to register a class MyValidator, you should write:
from setuptools import setup

setup(
    ...
    entry_points={
        "iaso.plugins": [
            "my-validator = my_module.my_validator:MyValidator",
        ],
    },
    ...
)
As a more general alternative, you can also use a pyproject.toml file to register your curation validator:
[project]
name = "my-curation-plugin"

[project.entry-points."iaso.plugins"]
my-validator = "my_module.my_validator:MyValidator"
cmd-iaso comes with the following validators by default:
- redirection-chain displays the entire redirection chain of a resource and, therefore, marks every resource as erroneous
- dns-error detects DNS errors caught by the scraping proxy
- ssl-error detects SSL errors caught by the scraping proxy
- invalid-response detects invalid HTTP responses
- http-status-error detects requests that resulted in HTTP error codes
- scheme-only-redirect detects redirects where only the scheme of the URL changed, e.g. http://url -> https://url
- information-content displays and puts into context the output of the athena analysis
To list all validators that are registered with cmd-iaso, you can use
> cmd-iaso curate --list-validators
The interactive curation tool is composed of three components, which can each run either in the terminal or in the Chrome browser. The selection is independent for each component to allow for maximum customisability. All of the component options can be set either to terminal or to chrome.
The Controller allows the curator to navigate through the resource providers which have been flagged as problematic. The controller component can be set by the --controller option.
The Navigator leads the curator to the provider's corresponding namespace page in the identifiers.org registry. If the navigator is in Chrome mode and the curator is logged in, the navigator will automatically enter edit mode for the relevant resource information. The navigator component can be set by the --navigator option.
The Informant formats and presents information about the discovered issues with each resource provider to the curator. The informant component can be set by the --informant option.
Iff any of the components are set to chrome, the curator must also provide the --chrome option to select how the curation pipeline should connect to Chrome. It can either launch a new instance or connect to an existing one if its address, e.g. localhost:9222, is provided. Note that in order to connect to a running Chrome browser, it must have been started with the --remote-debugging-port=PORT option, where PORT would be 9222 in this case.
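For instance, an existing Chrome instance listening on port 9222 could be started along these lines (the executable name varies between platforms and installations):
> google-chrome --remote-debugging-port=9222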
If you want to connect to a running Chrome browser instance on a different machine, for instance if you are calling cmd-iaso through SSH, we recommend taking a look at inlets, which allows you to "[e]xpose your local endpoints to the Internet or to another network, traversing firewalls and NAT".
All of these options can be provided via the command line or environment variables. Otherwise, the curator will be asked for their value via a prompt:
> cmd-iaso curate --controller CONTROLLER --navigator NAVIGATOR --informant INFORMANT [--chrome CHROME] [--tags TAGS] {-i TAG} [--statistics]
The curation process also allows the curator to tag identified issues. These tags are associated with a fingerprint of that issue and are stored across different curation sessions. If you want to change the location of this tags storage from its default of tags.gz, you can use the --tags TAGS option.
The tags are not only a great way to keep notes on recurring or unsolved issues, but they also allow you to temporarily hide the issues they tag. If you want to ignore any issues with a specific tag, you can pass -i TAG or --ignore TAG for every tag you wish to ignore. By default, the fixed and ignore tags are ignored. It is important to note that you can change which tags are ignored at any point during curation. You will have to reload an entry, however, for any change in the ignored tags to take effect.
If you just want to get an overview of all the issues identified, you can provide the --statistics flag. Instead of launching an interactive curation session, cmd-iaso will then only print a statistical summary. In this case, none of the --controller, --navigator, --informant or --chrome options need to be provided.
Curation is performed in sessions to enable the curator to pause and save their progress. Furthermore, they can then resume the curation later on. The settings on how the curation is run, e.g. whether in the terminal or the Chrome browser, are session-independent. In contrast, the information dump on which the curation is based is fixed per session. Settings which narrow down the set of issues that are reported are also saved with the session. The session also remembers the point at which the curator left off.
To start a new session for curating resource providers, you can use:
> cmd-iaso curate [...] start resources DATAMINE {-v VALIDATOR} [--valid-luis-threshold VALID_LUIS_THRESHOLD] [--random-luis-threshold RANDOM_LUIS_THRESHOLD] [--session SESSION]
This command starts a new session using the DATAMINE file created by the dump2datamine command and will save it either to the SESSION file path -- if provided -- or to the default resources_session.gz location. If the curator does not want to save the session, they can provide the --discard-session flag instead.
The -v VALIDATOR / --validate VALIDATOR option can be provided multiple times to explicitly name all validator modules which should be enabled in this session. By default, dns-error, invalid-response and http-status-error are enabled. Some validators support parameterisation using a named parameter list suffix of the form -v VALIDATOR:param=value,flag,param=value. For instance, the information-content validator supports a threshold parameter in the range [0.0, 1.0] to only report resource providers with an information content smaller than or equal to the threshold.
It is also possible to only report errors which occur with a high enough frequency. For instance, to only report errors on valid LUIs if they occur for more than 50% of the valid LUIs, you can specify --valid-luis-threshold 50. Similarly, you can specify --random-luis-threshold 50 to configure the same for randomly generated LUIs. By default, all errors on valid LUIs and no errors on random LUIs will be reported. Note that each validator can decide whether to abide by this setting.
The [...] between curate and resources refers to the general curation options discussed above.
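For example, a fully terminal-based session that enables the three default validators plus the information-content validator with a threshold of 0.8 could be started along these lines (the datamine file name is a placeholder):
> cmd-iaso curate --controller terminal --navigator terminal --informant terminal start resources datamine.json -v dns-error -v invalid-response -v http-status-error -v information-content:threshold=0.8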
To start a new session for curating institutions, you can use:
> cmd-iaso curate [...] start institutions ACADEMINE [--session SESSION]
This command starts a new session using the ACADEMINE file created by the dedup4institutions command and will save it either to the SESSION file path -- if provided -- or to the default institutions_session.gz location. If the curator does not want to save the session, they can provide the --discard-session flag instead.
The [...] between curate and institutions refers to the general curation options discussed above.
An existing session at the SESSION file path can be resumed using:
> cmd-iaso curate [...] resume resources/institutions SESSION
The [...] between curate and resume refers to the general curation options discussed above.
This command will also warn the curator if they have already completed curation on this session.