
Add Dataverse RDM repository integration #19367

Open · wants to merge 64 commits into dev


KaiOnGitHub

Dataverse RDM Repository Integration

This PR integrates Dataverse into Galaxy, mirroring the Invenio integration from PR #16381 and providing the same features, tailored to Dataverse.

Terminology

The terminology used across RDM systems (like Dataverse and Invenio) and Galaxy can be confusing. The table below clarifies these differences. To reduce ambiguity, the RDMFilesSource base class (_rdm.py) has been updated to use abstract terms. This description follows the abstract naming to avoid further confusion:

  • File: Corresponds to a "dataset" in Galaxy.
  • Container: Refers to a "history" in Galaxy or a "dataset" in Dataverse.

  Galaxy    Invenio    Dataverse    Abstract
  Dataset   File       File         File
  History   Record     Dataset      Container

Features

  • List and search files and containers in a connected Dataverse instance.
  • Download files from Dataverse.
  • Export a history to a new (draft) container in Dataverse (equivalent to "Export to new record" in Galaxy).
  • Export a history to an existing draft container in Dataverse.
  • (Re)import containers from Dataverse into Galaxy as a history.

Current Limitations

Missing User Context

When downloading files from external draft containers in Dataverse, the user_context is not passed to the _realize_to method of the BaseFilesSource class. This causes authentication failures, preventing downloads. A similar issue arises when using the "Export datasets" feature to send files from Galaxy to an external draft container.
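
To illustrate the failure mode, here is a minimal sketch (the class, method signatures, and preference key below are assumptions for illustration, not necessarily the actual plugin code): the API token is resolved from the user's preferences via user_context, so when user_context is not forwarded the request for a draft container goes out unauthenticated and Dataverse rejects it.

    # Illustrative sketch only; names and signatures are assumptions, not the plugin's actual code.
    from typing import Optional


    class DataverseLikeFilesSource:
        def _get_token(self, user_context) -> Optional[str]:
            # The API token lives in the user's preferences, so it can only be
            # resolved when a user_context is available.
            if user_context is None:
                return None
            return user_context.preferences.get("dataverse_sandbox|token")

        def _realize_to(self, source_path, native_path, user_context=None, opts=None):
            token = self._get_token(user_context)
            if token is None:
                # Draft containers are not publicly readable, so without a token
                # the download request is rejected by Dataverse. This is the
                # failure described above when user_context is not passed in.
                raise PermissionError(
                    f"Cannot download {source_path}: no API token available without user_context"
                )
            # ... an authenticated download of source_path to native_path would happen here ...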

Archiving and Reimporting Histories

While exporting a history as a ZIP file to Dataverse works, the archive is automatically extracted by Dataverse, with its contents stored individually. This creates an issue for Galaxy, as reimporting relies on the archive's existence. A workaround enables reimport by downloading the container as a ZIP if the original archive is missing. Note: exporting as .tar.gz is currently not working, as the Dataverse API returns an error (Failed to add file to dataset). The recommended format is "RO-Crate" (ZIP) in advanced export options.
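
For illustration, the workaround roughly amounts to a dataset-level ZIP download when the originally exported archive is no longer present. The sketch below is not the PR's exact implementation; it assumes the Dataverse Access API endpoint for downloading all files of a dataset.

    # Sketch of the fallback described above (not the PR's exact implementation).
    import requests


    def download_container_as_zip(base_url, persistent_id, api_token, dest_path):
        """Fetch all files of a Dataverse dataset as a single ZIP archive.

        Used as a fallback when the originally exported archive no longer exists
        because Dataverse unpacked it on upload.
        """
        resp = requests.get(
            f"{base_url}/api/access/dataset/:persistentId/",
            params={"persistentId": persistent_id},
            headers={"X-Dataverse-key": api_token},
            stream=True,
        )
        resp.raise_for_status()
        with open(dest_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
        return dest_path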

For Users

You can now use all the features listed above, which behave the same as those in the Invenio integration. UI workflows are demonstrated in the linked PR.

For Reviewers

Manual Testing Requirements

  1. (Optional) Create a Dataverse Instance:
    You may use the official demo for testing, as data is periodically deleted. Alternatively, set up a local instance using the quickstart guide.

  2. Create a Dataverse User and API Token:
    Generate a token via "Your username" → "My Data" → "API token", or directly at this URL: https://demo.dataverse.org/dataverseuser.xhtml?selectTab=apiTokenTab.
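
    (Optional) To verify that the token works, a quick call to the Dataverse native API is enough; here is a small Python sketch (adjust the URL for your instance):

    import requests

    DATAVERSE_URL = "https://demo.dataverse.org"
    API_TOKEN = "XXXX"  # the token generated above

    # Dataverse reads the API token from the X-Dataverse-key header.
    resp = requests.get(
        f"{DATAVERSE_URL}/api/users/:me",
        headers={"X-Dataverse-key": API_TOKEN},
    )
    resp.raise_for_status()
    print(resp.json()["data"])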

  3. Configure the Connection:
    Add this to your file_sources_conf.yml and fill it in with your data. Ideally, use a Vault to store the token, but for testing purposes you can also provide it directly in the config:

    - type: dataverse
      id: dataverse_sandbox
      doc: This is the sandbox instance of Dataverse. It is used for testing purposes only, content is NOT preserved.
      label: Dataverse Sandbox (use only for testing purposes)
      url: https://demo.dataverse.org
      token: XXXX
      public_name: ${user.preferences['dataverse_sandbox|public_name']}
      writable: true
  4. Add user config:
    Add a new entry to your user_preferences_extra_conf.yml so users can store their required Dataverse settings:

    dataverse_sandbox:
        description: Your Dataverse Integration Settings (TESTING ONLY)
        inputs:
            - name: token
              label: API Token used to create draft records and to upload files. You can manage your tokens at https://demo.dataverse.org/dataverseuser.xhtml?selectTab=apiTokenTab (Replace demo.dataverse.org with your Dataverse instance URL)
              type: secret
              # store: vault # Requires setting up vault_config_file in your galaxy.yml
              required: False
            - name: public_name
              label: Creator name to associate with new datasets (formatted as "Last name, First name"). If left blank "Anonymous Galaxy User" will be used. You can always change this by editing your dataset directly.
              type: text
              required: False
  5. Set up the Celery task runner:
    PR #14839 (Add task-based history export tracking) requires Celery (with a Redis backend) to be set up in your instance, so you will need something like this in your galaxy.yml:
    # For details, see Celery documentation at
    # https://docs.celeryq.dev/en/stable/userguide/configuration.html.
    celery_conf:
        result_backend: redis://127.0.0.1:6379/0
    #  task_routes:
    #    galaxy.fetch_data: galaxy.external
    #    galaxy.set_job_metadata: galaxy.external

    # Offload long-running tasks to a Celery task queue. Activate this
    # only if you have setup a Celery worker for Galaxy. For details, see
    # https://docs.galaxyproject.org/en/master/admin/production.html
    enable_celery_tasks: true

Also ensure that redis is installed in your Python environment (pip install redis).
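
You can quickly check that the configured Redis result backend is reachable with a minimal sketch like this (it assumes the redis:// URL from the galaxy.yml snippet above):

    # Minimal check that the Redis result backend configured above is reachable.
    import redis

    client = redis.Redis.from_url("redis://127.0.0.1:6379/0")
    client.ping()  # raises redis.exceptions.ConnectionError if Redis is not running
    print("Redis backend reachable")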

How to test the changes?

  • Instructions for manual testing:
    1. See the "Manual Testing Requirements" section above.
    2. See the Invenio PR (#16381) for the Galaxy UI workflows.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.
