medusa-project/metaslurp

Content slurper and access gateway
This is a getting-started guide for developers.


Dependencies

  • PostgreSQL
  • OpenSearch >= 1.x
  • Cantaloupe 4.1.x (image server)
    • Required for thumbnails but otherwise optional.
    • You can install and configure this yourself, but it will be easier to run a metaslurp-cantaloupe container in Docker.
  • metaslurper

Installation

1) Install rbenv:

$ brew install rbenv
$ brew install ruby-build
$ brew install rbenv-gemset --HEAD
$ rbenv init
$ rbenv rehash

2) Clone the repository and submodules:

$ git clone --recursive https://github.com/medusa-project/metaslurp.git
$ cd metaslurp

3) Install Ruby into rbenv:

$ rbenv install "$(< .ruby-version)"

(The $(< …) form is a bash/zsh shortcut for $(cat …); use the latter in other shells.)

4) Install Bundler

$ gem install bundler

5) Install the application gems:

$ bundle install

6) Configure the application

$ cp config/credentials/template.yml config/credentials/development.yml
$ cp config/credentials/template.yml config/credentials/test.yml

Fill in the new files, and do not commit them to version control.

7) Create and seed the database

$ bin/rails db:setup

(db:setup creates the database, loads the schema, and seeds it in one step.)

8) Configure OpenSearch

Support a single node

Uncomment discovery.type: single-node in config/opensearch.yml. Also add the following lines:

plugins.security.disabled: true
plugins.index_state_management.enabled: false
reindex.remote.whitelist: "localhost:*"
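Taken together, the development additions to config/opensearch.yml look like this (discovery.type is already present in the file, just commented out):

```yaml
discovery.type: single-node
plugins.security.disabled: true
plugins.index_state_management.enabled: false
reindex.remote.whitelist: "localhost:*"
```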

Install the analysis-icu plugin

$ bin/opensearch-plugin install analysis-icu

Start OpenSearch

$ bin/opensearch

To confirm that it's running, try to access http://localhost:9200.

Create the OpenSearch indexes

$ bin/rails opensearch:indexes:create[my_index]
$ bin/rails opensearch:indexes:create_alias[my_index,my_index_alias]

(my_index_alias is the value of the opensearch_index configuration key. In zsh, quote the task names, e.g. bin/rails "opensearch:indexes:create[my_index]", to prevent bracket globbing.)
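As a sanity check, you can confirm that the alias you created matches the opensearch_index key in your credentials. This sketch assumes the key appears as a top-level opensearch_index: line; the file path and value below are illustrative stand-ins for config/credentials/development.yml:

```shell
# Illustrative only: write a sample credentials file, then extract the
# configured alias name the same way you would from development.yml.
cat > /tmp/sample_credentials.yml <<'EOF'
opensearch_index: my_index_alias
EOF

# Pull out the value of the opensearch_index key
alias_name=$(awk -F': ' '/^opensearch_index:/ {print $2}' /tmp/sample_credentials.yml)
echo "$alias_name"
```

Running the same awk line against your real development.yml should print the alias you passed to opensearch:indexes:create_alias.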

9) Install Cantaloupe

Cantaloupe has several dependencies of its own and requires particular configuration and delegate method implementations to work with the application. Rather than documenting all of that here, see the README in the metaslurp-cantaloupe repository. It is recommended to clone that and run it locally using Docker.

Note that Cantaloupe plays a relatively minor role in the application (only rendering thumbnails) and it is perfectly possible to do 99% of development on Metaslurp without it running.

10) Start Metaslurp

$ bin/rails server

N.B.: On macOS, if you get an error like "+[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called", try setting this environment variable before starting the server:

$ export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Upgrading

Migrating the database schema

$ bin/rails db:migrate

Migrating the OpenSearch indexes

For the most part, once created, index schemas can't be modified. To migrate to an incompatible schema, the procedure would be something like:

  1. Update the index schema in app/search/index_schema.yml
  2. Create an index with the new schema: bin/rails opensearch:indexes:create[my_new_index]
  3. Populate the new index with documents. There are a couple of ways to do this:
    1. If the schema change is backward-compatible with the source documents already in the index, invoke bin/rails opensearch:indexes:reindex[my_current_index,my_new_index], which reindexes all source documents from the current index into the new one.
    2. Otherwise, reharvest everything into the new index. This can be accomplished by invoking the harvester with the SERVICE_SINK_METASLURP_INDEX environment variable set to the name of the index.

Because all of the above can be a huge pain, an effort has been made to design the index schema to be flexible enough to require migration as infrequently as possible.

Harvesting

In production, the various web-based buttons for initiating harvests trigger calls to the ECS API to start new harvesting tasks. This won't work in development. Instead, metaslurper should be invoked manually. Here is an example that will harvest the DLS into a local Metaslurp instance:

export SERVICE_SOURCE_DLS_KEY=dls
export SERVICE_SOURCE_DLS_ENDPOINT=https://digital.library.illinois.edu
# your NetID
export SERVICE_SOURCE_DLS_USERNAME=...
# your API key; see https://digital.library.illinois.edu/admin/users/{NetID}
export SERVICE_SOURCE_DLS_SECRET=...
export SERVICE_SINK_METASLURP_KEY=metaslurp
export SERVICE_SINK_METASLURP_ENDPOINT=http://localhost:3000
# username of a "non-human user"; see http://localhost:3000/admin/users
export SERVICE_SINK_METASLURP_USERNAME=...
# the above user's API key
export SERVICE_SINK_METASLURP_SECRET=...

(The variables must be exported, not just assigned, because metaslurper reads them from its own process environment.)

java -jar target/metaslurper-VERSION.jar \
    -source $SERVICE_SOURCE_DLS_KEY \
    -sink $SERVICE_SINK_METASLURP_KEY \
    -threads 2

(These environment variable values are just examples. The variables used in production are stored in Metaslurp's ECS task definition, which is Terraformed.)
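Because a missing variable typically only surfaces partway through a harvest, a small pre-flight check can save time. This is an illustrative helper, not part of metaslurper; the variable names match the example above:

```shell
# check_env: succeed only if every named environment variable is set
# and non-empty (illustrative helper, not part of metaslurper).
check_env() {
  for v in "$@"; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "missing: $v"
      return 1
    fi
  done
  echo "ok"
}

# Example: verify the sink variables before launching the harvester
export SERVICE_SINK_METASLURP_KEY=metaslurp
export SERVICE_SINK_METASLURP_ENDPOINT=http://localhost:3000
check_env SERVICE_SINK_METASLURP_KEY SERVICE_SINK_METASLURP_ENDPOINT
```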

See the metaslurper README for more information about using metaslurper.

Once a harvest is running, you can monitor it from the harvests page just like any other harvest.

Notes

Signing in locally

Sign in as admin with password [email protected].
