Indexing user-defined directories on a shared filesystem

ArtifactDB/SewerRat

Collecting random shit from a shared filesystem


Introduction

SewerRat retrieves user-supplied metadata from a shared filesystem and indexes it into a giant SQLite file. This allows users to easily search for files of interest generated by other users, typically in high performance computing (HPC) clusters associated with the shared filesystem. The aim is to promote discovery of analysis artifacts in an ergonomic manner: we do not require uploads to an external service, we do not impose schemas on the metadata format, and we re-use the existing storage facilities on the HPC cluster. SewerRat can be considered a much more relaxed version of the Gobbler that federates the storage across users.

For convenience, we'll assume that the URL to the SewerRat API is present in an environment variable named SEWER_RAT_URL. Readers should obtain an appropriate URL for their SewerRat deployment before trying the code examples below. Alternatively, readers can spin up their own instance on localhost by running the binaries here or building the executable from source with the usual go build . command.

Registering a directory

Initialization

Any directory can be indexed as long as (i) the requesting user has write access to it and (ii) the account running the SewerRat service has read access to it. To demonstrate, let's make a directory containing JSON-formatted metadata files. Other files may be present, of course, but SewerRat only cares about the metadata.

mkdir test 
echo '{ "title": "YAY", "description": "whee" }' > test/A.json
mkdir test/sub
echo '{ "authors": { "first": "Aaron", "last": "Lun" } }' > test/sub/A.json
echo '{ "foo": "bar", "gunk": [ "stuff", "blah" ] }' > test/sub/B.json

To start the registration process, we make a POST request to the /register/start endpoint. This should have a JSON-encoded request body that contains path, the absolute path to the directory that we want to register.

PWD=$(pwd)
curl -X POST -L "${SEWER_RAT_URL}/register/start" \
    -H "Content-Type: application/json" \
    -d '{ "path": "'"${PWD}"'/test" }' | jq
## {
##   "code": ".sewer_HP0JOaQ14NBadaLGDPjOW712S2SIA_u-9yQH6AKbaQ8",
##   "status": "PENDING"
## }

On success, this returns a PENDING status with a verification code. The caller is expected to verify that they have write access to the specified directory by creating a file with the same name as the verification code (i.e., .sewer_XXX) inside that directory.
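As a minimal sketch of this step, the snippet below pulls the code out of a response shaped like the one above and creates the matching empty file. The extract_code helper and the temporary directory are illustrative assumptions rather than part of SewerRat; jq -r '.code' would work just as well for the extraction.

```shell
# Hypothetical helper: print the "code" property of a /register/start response.
# Assumes the code contains no escaped characters, as in the example above.
extract_code() {
    printf '%s\n' "$1" | sed -n 's/.*"code"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# A response shaped like the example above (normally captured from curl).
response='{ "code": ".sewer_HP0JOaQ14NBadaLGDPjOW712S2SIA_u-9yQH6AKbaQ8", "status": "PENDING" }'
code=$(extract_code "${response}")

# Stand-in for the directory being registered.
dir=$(mktemp -d)

# Prove write access by creating an empty file named after the code.
touch "${dir}/${code}"
ls -a "${dir}"
```

In practice, dir would be the directory passed as path in the request.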

Verification

Once this is done, we call the /register/finish endpoint with a JSON-encoded request body that contains the same directory path in path. The body may also contain base, an array of strings containing the names of the metadata files in the directory to be indexed. If base is not provided and path has already been registered, the base associated with path's prior registration is re-used; otherwise, if path was not previously registered, only files named metadata.json will be indexed.

curl -X POST -L "${SEWER_RAT_URL}/register/finish" \
    -H "Content-Type: application/json" \
    -d '{ "path": "'"${PWD}"'/test", "base": [ "A.json", "B.json" ] }' | jq
## {
##   "comments": [],
##   "status": "SUCCESS"
## }

Upon receiving a valid request, SewerRat will walk recursively through the directory specified in path. It will identify all metadata files with the specified base names (i.e., A.json and B.json in our example above), parsing them as JSON for indexing. SewerRat will skip any problematic files that cannot be indexed due to, e.g., invalid JSON or insufficient permissions. The causes of any failures are reported in the comments array in the HTTP response.

On success, the metadata files in the specified directory will be incorporated into the SQLite index. We can then search on the contents of these files or fetch the contents of any file in the registered directory.

Indexing in detail

As mentioned above, SewerRat will recurse through the specified directory to find metadata files with the listed base names. Subdirectories with names starting with . are skipped during the recursive walk, so any metadata files therein will be ignored. This is generally a sensible choice as these directories usually do not contain any interesting (scientific) information. If any such subdirectory is relevant, a user can force SewerRat to include it in the index by passing its path directly as path. This is because leading dots are allowed in the components of the supplied path, just not in its subdirectories. Conversely, a user can force SewerRat to skip a particular subdirectory by placing a (possibly empty) .SewerRatignore file inside it.

Symbolic links in the specified directory are treated differently depending on their target. If the directory contains symbolic links to files, the contents of the target files can be indexed as long as the link has one of the base names. All file information (e.g., modification time, owner) is taken from the link target, not the link itself; SewerRat effectively treats the symbolic link as a proxy for the target file. If the directory contains symbolic links to other directories, these will not be recursively traversed.

Each identified metadata document is parsed as JSON and converted into tokens. For strings, we use an adaptation of the FTS5 Unicode61 tokenizer to break each string into tokens, i.e., strings are split into tokens at any character that is not a Unicode letter/number or a dash. For numbers and booleans, the string representation of the value is tokenized. Each token is stored in the index along with the JSON object property in which it was found, e.g., the token from the value "Chris" is associated with the property "b.c" in the document below.

{
    "a": "Aaron",
    "b": {
        "c": "Chris"
    }
}
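To make the splitting rule concrete, here is a rough approximation of it in shell. It is ASCII-only and ignores any other normalization that the real tokenizer may apply, so treat it as an illustration of the rule stated above rather than SewerRat's implementation. The tokenize name is made up for this example.

```shell
# Hypothetical helper: split a string into tokens at any character that is
# not an ASCII letter, digit or dash. Prints one token per line.
tokenize() {
    printf '%s\n' "$1" | LC_ALL=C tr -cs '[:alnum:]-' '\n' | sed '/^$/d'
}

tokenize 'Chris wrote stuff-blah, twice!'
```

Note that the dash is kept inside a token, so stuff-blah stays in one piece while the comma and exclamation mark act as separators.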

Automatic updates

SewerRat will periodically update the index by inspecting all of its registered directories for new content. If we added or modified a file with one of the registered names (e.g., A.json), SewerRat will (re-)index that file. Similarly, if we deleted a file, SewerRat will remove it from the index. This ensures that the information in the index reflects the directory contents on the filesystem. Users can also manually update a directory by repeating the process above to re-index the directory's contents.

Updates and symbolic links can occasionally interact in strange ways. Specifically, updates to the indexed information for symbolic links are based on the modification time of the link target. One can imagine a pathological case where a symbolic link is changed to a different target with the same modification time as the previous target, which will not be captured by SewerRat. Currently, this can only be resolved by deleting all affected symbolic links, re-registering the directory, and then restoring the links and re-registering again.

Deregistering

To remove files from the index, we use the same procedure as above but replace the /register/* endpoints with /deregister/*. The only potential difference is when the caller requests deregistration of a directory that does not exist. In this case, /deregister/start may return a SUCCESS status instead of PENDING, after which /deregister/finish does not need to be called.
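The deregistration steps might be wrapped as follows. This is only a sketch: path_body and deregister_step are hypothetical names, and the body is built with printf on the assumption that the path contains no double quotes or backslashes.

```shell
# Hypothetical helper: JSON request body for a directory path.
# Assumes the path contains no double quotes or backslashes.
path_body() {
    printf '{ "path": "%s" }' "$1"
}

# Hypothetical wrapper: POST the body to /deregister/start or /deregister/finish.
deregister_step() {
    curl -s -X POST -L "${SEWER_RAT_URL}/deregister/$1" \
        -H "Content-Type: application/json" \
        -d "$(path_body "$2")"
}

# Usage, mirroring the registration steps:
#   deregister_step start "$(pwd)/test"    # PENDING with a code, or SUCCESS if the directory is gone
#   touch "test/<code from the response>"  # only needed after PENDING
#   deregister_step finish "$(pwd)/test"
```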

Other comments

If an error is encountered in the /register/* or /deregister/* endpoints, the response usually has the application/json content type. The body encodes a JSON object with an ERROR status and a reason string property explaining the reason for the failure. That said, some error types (e.g., 404, 405) may instead return a text/plain content type with the reason directly in the response body.

Any failure to parse specific JSON files is not considered an error and will only show up in the comments of a successful response from /register/finish. This provides some robustness to partial writes or invalid files inside directories with complex internal structure.

Regardless of whether the registration is successful or not, the verification code file is no longer needed after a response is received. This can be deleted from the directory to reduce clutter.

We provide some small utility functions in scripts/functions.sh to perform the registration from the command line. The process should still be simple enough to implement equivalent functions in any language.

Querying the index

Making the request

We can query the SewerRat index to find files of interest based on the contents of the metadata, the user name of the file owner, the modification date, or any combination thereof. This is done by making a POST request to the /query endpoint of the SewerRat API, where the request body contains the JSON-encoded search parameters:

curl -X POST -L ${SEWER_RAT_URL}/query \
    -H "Content-Type: application/json" \
    -d '{ "type": "text", "text": "Aaron" }' | jq
## {
##   "results": [
##     {
##       "path": "/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/sub/A.json",
##       "user": "luna",
##       "time": 1709320903,
##       "metadata": {
##         "authors": {
##           "first": "Aaron",
##           "last": "Lun"
##         }
##       }
##     }
##   ]
## }

The request body should be a JSON-formatted "search clause" (see below for details). The response is a JSON object with the following properties:

  • results, an array of objects containing the matching metadata files, sorted by decreasing modification time. Each object has the following properties:
    • path, a string containing the path to the file.
    • user, the identity of the file owner.
    • time, the Unix time of most recent file modification.
    • metadata, the contents of the file.
  • (optional) next, a string containing the endpoint to use for the next page of results. A request to this endpoint should use the exact same request body to correctly obtain the next page. If next is not present, callers may assume that all results have already been obtained.

Callers can control the number of results to return in each page by setting the limit= query parameter. This should be a positive integer, up to a maximum of 100. Any value greater than 100 is ignored.
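A paginated search could be sketched as below, re-sending the same request body to each next endpoint until none is returned. The function names are hypothetical, jq is used as in the other examples, and we assume that next is a path that can be appended directly to SEWER_RAT_URL; adjust this if your deployment reports it differently.

```shell
# Hypothetical helper: print the "next" endpoint of a response read from
# stdin, or nothing if the property is absent.
next_endpoint() {
    jq -r '.next // empty'
}

# Hypothetical helper: print every result for a search clause, one JSON
# object per line, following "next" until it disappears.
# Assumes "next" is a path that can be appended to SEWER_RAT_URL.
query_all() {
    body=$1
    endpoint="/query?limit=${2:-100}"
    while [ -n "${endpoint}" ]; do
        page=$(curl -s -X POST -L "${SEWER_RAT_URL}${endpoint}" \
            -H "Content-Type: application/json" \
            -d "${body}") || return 1
        printf '%s\n' "${page}" | jq -c '.results[]?'
        endpoint=$(printf '%s\n' "${page}" | next_endpoint)
    done
}

# Usage: query_all '{ "type": "user", "user": "luna" }' 20
```

The loop also stops if a response has no next property for any other reason, e.g., an error body, so callers that need to distinguish the two cases should inspect the responses themselves.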

Defining search clauses

The request body should be a "search clause", a JSON object with the type string property. The nature of the search depends on the value of type:

  • For "text", SewerRat searches on the text (i.e., any string property) in the metadata file. The search clause should contain the following additional properties:
    • text, the search string. The tokenization process described above is applied to this string to create tokens. All tokens in text must be present in the metadata file in order for that file to be considered a match.
    • (optional) field, the name of the metadata property to be matched. Matches to tokens are only considered within the named property. Properties of nested objects can be specified via .-delimited names, e.g., authors.first. If field is not specified, matches are not restricted to any single property within a file.
    • (optional) is_pattern, a boolean indicating whether text is a wildcard-containing pattern. Currently supported wildcards are *, for any number of any characters; and ?, for a match to any single character. If true, wildcards will be preserved by tokenization and used for pattern matching to metadata-derived tokens. Defaults to false.
  • For "user", SewerRat searches on the user names of the file owners. The search clause should contain the user property, a string which contains the user name. A file is considered to be a match if the owning user is the same as that in user. Note that this only considers the most recent owner if the file was written by multiple people.
  • For "path", SewerRat searches on the path to each file. The search clause should contain the following additional properties:
    • path, a substring of the absolute path to each file. A file is considered to be a match if its path contains path as a substring.
    • (optional) is_prefix, a boolean indicating whether to search for absolute paths that start with path. Defaults to false.
    • (optional) is_suffix, a boolean indicating whether to search for absolute paths that end with path. Defaults to false.
    • (optional) is_pattern, a boolean indicating whether path is a wildcard-containing pattern, see the equivalent field for text. Defaults to false.
  • For "time", SewerRat searches on the latest modification time of each file. The search clause should contain the following additional properties:
    • time, an integer containing the Unix time. SewerRat searches for files that were modified before this time.
    • (optional) after, a boolean indicating whether to instead search for files that were modified after time.
  • For "and" and "or", SewerRat searches on a combination of other filters. The search clause should contain the children property, which is an array of other search clauses. A file is only considered to be a match if it matches all ("and") or any ("or") of the individual clauses in children.
  • For "not", SewerRat negates the filter. The search clause should contain the child property, which contains the search clause to be negated. A file is only considered to be a match if it does not match the clause in child.
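As an illustration of how these clause types nest, the sketch below combines a field-restricted text search, a user filter and a negated time filter. The values are taken from the earlier examples and sewer_query is a hypothetical wrapper, not something shipped with SewerRat.

```shell
# Files owned by "luna" that mention "Aaron" in authors.first and that do
# not match the "modified before this time" filter.
clause='{
  "type": "and",
  "children": [
    { "type": "text", "text": "Aaron", "field": "authors.first" },
    { "type": "user", "user": "luna" },
    { "type": "not", "child": { "type": "time", "time": 1709320903 } }
  ]
}'

# Hypothetical wrapper around the /query endpoint.
sewer_query() {
    curl -s -X POST -L "${SEWER_RAT_URL}/query" \
        -H "Content-Type: application/json" \
        -d "$1"
}

# Usage: sewer_query "${clause}" | jq
```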

Human-readable syntax for text queries

For text searches, we support a more human-readable syntax for boolean operations in the query. The search string below will look for all metadata documents that match foo or bar but not whee:

(foo OR bar) AND NOT whee

The AND, OR and NOT (note the all-caps!) are automatically translated to the corresponding search clauses. This can be combined with parentheses to control precedence; otherwise, AND takes precedence over OR, and NOT takes precedence over both. Note that any sequence of adjacent text terms is implicitly AND'd together, so the two expressions below are equivalent:

foo bar whee
foo AND bar AND whee
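For reference, the first search string above, (foo OR bar) AND NOT whee, expresses the same search as the hand-written clause below. This is our own expansion based on the clause definitions; the clause that SewerRat builds internally may be structured differently.

```shell
# Hand-written clause for: (foo OR bar) AND NOT whee
clause='{
  "type": "and",
  "children": [
    {
      "type": "or",
      "children": [
        { "type": "text", "text": "foo" },
        { "type": "text", "text": "bar" }
      ]
    },
    { "type": "not", "child": { "type": "text", "text": "whee" } }
  ]
}'
printf '%s\n' "${clause}"
```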

Users can prefix any sequence of text terms with the name of a metadata field, to only search for matches within that field of the metadata file. For example:

(title: prostate cancer) AND (genome: GRCh38 OR genome: GRCm38)

This also works for properties of JSON objects that are nested in other objects. Here, the name of the field is defined by concatenating all property names with an intervening period, e.g.:

publication.author.first_name: Aaron

Note that this scoping-by-field does not extend to the AND, OR and NOT keywords, e.g., title:foo OR bar will not limit the search for bar to the title field.

If a * or ? wildcard is present in a search term, pattern matching will be performed against the metadata-derived tokens. This only applies to the search clause immediately containing the term, e.g., foo* and bar will be used for pattern matching but whee and stuff will not.

(foo* bar) AND (whee stuff)

The human-friendly mode can be enabled by setting the translate=true query parameter in the request to the /query endpoint. The structure of the request body is unchanged except that any text field is assumed to contain a search string and will be translated into the relevant search clause.

curl -X POST -L "${SEWER_RAT_URL}/query?translate=true" \
    -H "Content-Type: application/json" \
    -d '{ "type": "text", "text": "Aaron OR stuff" }' | jq
## {
##   "results": [
##     {
##       "path": "/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/sub/B.json",
##       "user": "luna",
##       "time": 1711754321,
##       "metadata": {
##         "foo": "bar",
##         "gunk": [
##           "stuff",
##           "blah"
##         ]
##       }
##     },
##     {
##       "path": "/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/sub/A.json",
##       "user": "luna",
##       "time": 1711754321,
##       "metadata": {
##         "authors": {
##           "first": "Aaron",
##           "last": "Lun"
##         }
##       }
##     }
##   ]
## }

The html/ subdirectory contains a minimal search page that queries a local SewerRat instance using this syntax. Developers can copy this page and change the base_url to point to their production instance.

Accessing registered directories

Motivation

In general, users are expected to be operating on the same filesystem as the SewerRat API. This makes it trivial to access the contents of directories registered with SewerRat, as we expect each registered directory to be world-readable. For remote applications, the situation is more complicated as they are able to query the SewerRat index but cannot directly read from the filesystem. This section describes some API endpoints that fill this gap for remote access.

Listing directory contents

We can list the contents of a directory by making a GET request to the /list endpoint of the SewerRat API, where the URL-encoded path to the directory of interest is provided as a query parameter.

path=/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/
curl -L ${SEWER_RAT_URL}/list -G --data-urlencode "path=${path}" --data "recursive=true" | jq
## [
##   "A.json",
##   "hello.txt",
##   "sub/A.json",
##   "sub/B.json"
## ]

On success, the response contains a JSON-encoded array of strings, each of which is a relative path in the directory at path. The recursive= parameter specifies whether a recursive listing should be performed. If true, all paths refer to files; otherwise, the names of directories may be returned and will be suffixed with /. All symbolic links are reported as files in the response. Symbolic links to directories will not be recursively traversed, even if recursive=true.

On error, the exact response may either be text/plain content containing the error message directly, or application/json content encoding a JSON object with the reason for the error. If the path does not exist in the index, a standard 404 error is returned.

Fetching file contents

We can obtain the contents for a path inside any registered directory by making a GET request to the /retrieve/file endpoint of the SewerRat API, where the URL-encoded path of interest is provided as a query parameter. This is not limited to the registered metadata files - any file inside a registered directory can be extracted in this manner.

# Mocking up a non-metadata file.
echo "HELLO" > test/hello.txt

# Fetching it:
path=/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/hello.txt
curl -L ${SEWER_RAT_URL}/retrieve/file -G --data-urlencode "path=${path}"
## HELLO

On success, the contents of the target file are returned with a content type guessed from its name. If path is a symbolic link to a file, the contents of the target file will be returned by this endpoint.

On error, the exact response may either be text/plain content containing the error message directly, or application/json content encoding a JSON object with the reason for the error. If the path does not exist in the index, a standard 404 error is returned.

Fetching metadata

For the special case of a metadata file, we can alternatively obtain its contents by making a GET request to the /retrieve/metadata endpoint of the SewerRat API, where the URL-encoded path of interest is provided as a query parameter.

path=/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/A.json
curl -L ${SEWER_RAT_URL}/retrieve/metadata -G --data-urlencode "path=${path}" | jq
## {
##   "path": "/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/A.json",
##   "user": "luna",
##   "time": 1711754321,
##   "metadata": {
##     "title": "YAY",
##     "description": "whee"
##   }
## }

On success, this returns an object containing:

  • path, a string containing the path to the file.
  • user, the identity of the file owner.
  • time, the Unix time of most recent file modification.
  • metadata, the contents of the file.

If we do not actually need the metadata (e.g., we just want to check if the file exists), we can skip it by setting the metadata=false URL query parameter in our request.
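Such an existence check might look like the sketch below, which discards the body and inspects only the HTTP status. The is_indexed name is hypothetical, and we assume that a successful lookup is reported with status 200.

```shell
# Hypothetical helper: succeed if the path is a metadata file known to the index.
# Assumes a successful lookup is reported with HTTP status 200.
is_indexed() {
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 -L \
        "${SEWER_RAT_URL}/retrieve/metadata" -G \
        --data-urlencode "path=$1" \
        --data "metadata=false") || return 1
    [ "${status}" = "200" ]
}

# Usage:
#   if is_indexed "/path/to/A.json"; then echo "indexed"; fi
```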

On error, the exact response may either be text/plain content containing the error message directly, or application/json content encoding a JSON object with the reason for the error. If the path does not exist in the index, a standard 404 error is returned.

Identifying registered directories

We can determine which directories are actually registered by making a GET request to the /registered endpoint of the SewerRat API.

curl -L ${SEWER_RAT_URL}/registered | jq

On success, this returns an array of objects containing:

  • path, a string containing the path to the registered directory.
  • user, the identity of the user who registered this directory.
  • time, the Unix time of the registration.
  • names, the base names of the metadata files to be indexed in this directory.

This can be filtered by passing additional query parameters:

  • user=, which filters on the user.
  • contains_path=, which filters for registered paths that contain (i.e., are parents of) the specified path.
  • path_prefix=, which filters for registered paths that start with the specified prefix.

On error, the response may either be text/plain content containing the error message directly, or application/json content encoding a JSON object with the reason for the error.
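The filters can be combined in a single request. The wrapper below is a sketch with a hypothetical name; it passes each name=value argument through curl's --data-urlencode so that paths are encoded correctly.

```shell
# Hypothetical wrapper: list registered directories, optionally filtered.
# Each argument is a "name=value" filter such as "user=luna".
list_registered() {
    n=$#
    while [ "${n}" -gt 0 ]; do
        # Move each filter to the end of the list, preceded by --data-urlencode.
        set -- "$@" --data-urlencode "$1"
        shift
        n=$((n - 1))
    done
    curl -s -L "${SEWER_RAT_URL}/registered" -G "$@"
}

# Usage: list_registered "user=luna" "path_prefix=/Users/luna" | jq
```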

Administration

Clone this repository and build the binary. This assumes that Go version 1.20 or higher is available.

git clone https://github.com/ArtifactDB/SewerRat
cd SewerRat
go build

Then execute the SewerRat binary to spin up an instance. The -db flag specifies the location of the SQLite file (defaults to index.sqlite3) and -port specifies the port on which to listen for requests (defaults to 8080).

./SewerRat -db DBPATH -port PORT

If a SQLite file at DBPATH already exists, it will be used directly, so a SewerRat instance can be easily restarted with the same database.

SewerRat will periodically create a back-up of the index at DBPATH.backup. This can be used to manually recover from problems with the SQLite database by copying the backup to DBPATH and restarting the SewerRat instance.
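A manual recovery could be scripted roughly as follows. The restore_backup helper is hypothetical, and setting aside the -wal and -shm files is an extra precaution on our part, assuming the database runs in SQLite's write-ahead log mode. Stop the SewerRat instance before running it.

```shell
# Hypothetical helper: restore the index from its backup.
# Run this only while the SewerRat instance is stopped.
restore_backup() {
    db=$1
    if [ ! -f "${db}.backup" ]; then
        echo "no backup found at ${db}.backup" >&2
        return 1
    fi
    # Move the damaged database and any leftover SQLite write-ahead log files
    # out of the way so they are not mixed with the restored copy.
    for f in "${db}" "${db}-wal" "${db}-shm"; do
        if [ -f "${f}" ]; then
            mv "${f}" "${f}.broken"
        fi
    done
    cp "${db}.backup" "${db}"
}

# Usage: restore_backup index.sqlite3 && ./SewerRat -db index.sqlite3
```

Keeping the damaged files under a .broken suffix, rather than deleting them, leaves something to inspect if the cause of the problem is unclear.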

Additional arguments can be passed to ./SewerRat to control its behavior (check out ./SewerRat -h for details):

  • -backup controls the frequency of back-up creation. This defaults to 24 hours.
  • -update controls the frequency of index updates. This defaults to 24 hours.
  • -session specifies the lifetime of a registration session (i.e., the maximum time between starting and finishing the registration, see above). This defaults to 10 minutes.
  • -checkpoint specifies the frequency of SQLite checkpoints, to manually synchronize the write-ahead log with the SQLite database file. This defaults to 60 minutes.
  • -finish specifies the time spent polling for the verification code after a request has been made to /register/finish or /deregister/finish. A non-zero value is often necessary on network filesystems where newly written files do not immediately synchronize. This defaults to 30 seconds.
  • -prefix adds an extra prefix to all endpoints, e.g., to disambiguate between versions. For example, a prefix of api/v2 would change the list endpoint to /api/v2/list. This defaults to an empty string, i.e., no prefix.

🚨🚨🚨 IMPORTANT! 🚨🚨🚨 It is assumed that SewerRat runs under a service account with no access to credentials or other sensitive information. This is because users can, in their registered directories, craft symlinks to arbitrary locations that will be followed by SewerRat. Any file path that can be accessed by the service account should be assumed to be public when the SewerRat API is active.

Links

Clients to the SewerRat API are available in R and Python.

The Gobbler's registry can serve as a source of files for the SewerRat search index.