Dump / load complete collections (share between aleph instances) #1523
Replies: 5 comments
-
My team is interested in a collection / dataset backup & restore feature within Aleph. Our environment is such that we ingest data in one instance of Aleph and need to be able to export that data for import into another instance. Ideally we could accomplish this without having to reprocess (extract text & entities, link entities, etc.) each time we import. We could start by just re-importing the same source data at each of our deployments, but eventually we'll run into difficulties with large datasets and with processing that can only happen in one environment (some analytics are not available everywhere). For example, some of our datasets include files that total 500+ GB compressed, and we anticipate that not being the largest we'll see, so finding a clean way to support large datasets will be helpful. We also anticipate adding analytics to the ingest process (translation is one that comes to mind initially), where we'd want that data persisted in the export so it wouldn't need to be re-processed on import into the other system.

I know this is a relatively new GitHub issue -- has there been any movement on implementation or planning? We at PNNL will be working on an implementation that works in our system, and we'd be interested in helping with the open source feature development too.
-
To augment the uncompressed data structure proposed, I've been looking through the data & code to find what I think are the storage locations of the various pieces of data needed to export a collection:

- PostgreSQL:
- Elasticsearch:
- Documents:
Since I'm really looking through the code & data for the first time, I'm sure I'm missing something or have misunderstood what is stored where. Corrections are welcome. Without an existing collection export process, I'm assuming an import process would need to be written too -- is there already a way to ingest a document with an existing text extract and other processing done, or similarly a way to ingest non-file-based entities whose indexing is already done? Or maybe it would be better to rebuild the index? To figure this out I'll start by looking at the current ingest process.
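To make that concrete, here is a rough sketch of what an export along these lines could look like. The table names, index pattern and `collection_id` field are assumptions on my part, not Aleph's actual schema; the raw source files would be copied out of the blob archive separately.

```python
# Illustrative sketch only: table names, index pattern and the collection_id
# field are assumptions, not Aleph's real schema.
import json
import subprocess

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

COLLECTION_ID = 42                               # assumed collection to export
PG_TABLES = ["collection", "entity", "mapping"]  # assumed relevant tables
ES_INDEX = "aleph-entity-*"                      # assumed entity index pattern


def dump_postgres(out_dir: str) -> None:
    """Dump the metadata tables with pg_dump (data only, so the rows can be
    re-inserted into an existing schema; filtering to a single collection
    would need an extra step)."""
    args = ["pg_dump", "--data-only"]
    for table in PG_TABLES:
        args += ["--table", table]
    args += ["--file", f"{out_dir}/aleph.sql", "aleph"]
    subprocess.run(args, check=True)


def dump_index(es: Elasticsearch, out_path: str) -> None:
    """Scroll every indexed entity of the collection into a JSON-lines file."""
    query = {"query": {"term": {"collection_id": COLLECTION_ID}}}
    with open(out_path, "w") as fh:
        for hit in scan(es, index=ES_INDEX, query=query):
            fh.write(json.dumps(hit) + "\n")
```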
-
Hey @tjstavenger-pnnl, thanks for investigating this in such detail. Two quick points on the substance: you probably don't need to dump/restore the

We also need to accommodate the fact that a collection (and its mappings) would likely have a different ID in the two instances, so we'd need to rewrite some of the other entries that reference the collection ID accordingly.

Regarding the import process, I think the essential bit is to not actually trigger a full re-process of the data. So you'd likely want to restore the database tables first and then import the ES index dump as if it were a bulk API load. There are going to be a million edge cases to this, but I reckon it would leave things in a state where they can be searched and browsed nicely...
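A minimal sketch of that import step, assuming the ES dump is a JSON-lines file of `{_id, _source}` documents and that indexed entities carry a `collection_id` field (both assumptions, not confirmed details of Aleph's schema):

```python
# Rough sketch of the import side: rewrite references to the old collection ID
# and feed the dump straight to the bulk helper, without any re-processing.
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def load_dump(es: Elasticsearch, dump_path: str, old_id: int, new_id: int,
              index: str = "aleph-entity") -> None:
    """Re-index a dumped collection under the ID assigned on the target
    instance."""
    def actions():
        with open(dump_path) as fh:
            for line in fh:
                doc = json.loads(line)
                source = doc["_source"]
                # Rewrite the collection reference to the target instance's ID.
                if source.get("collection_id") == old_id:
                    source["collection_id"] = new_id
                yield {"_index": index, "_id": doc["_id"], "_source": source}

    bulk(es, actions())
```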
-
I was uncertain about the
-
The full text is normally in the FtM entities, and in the ES index. One question I have: if you have single files that are 500 GB, are these archives, or large tables? It's worth pointing out that ES has a document limit of 100 MB -- it just will not index more text than that for a single file. If what you're trying to do is run mappings of structured data into the system, then at that size I'd consider doing it via an external ETL process (we basically run a bunch of FtM loaders externally, and then load their data in via the bulk API).
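For the structured-data route, here is a small sketch of what such an external FtM loader could look like. The CSV columns and ID scheme are made up for illustration; the resulting JSON-lines file would then be pushed in through the bulk API.

```python
# Minimal sketch of an external ETL step: build FollowTheMoney entities
# outside Aleph and write them as JSON lines. The CSV layout is invented
# purely as an example.
import csv
import json

from followthemoney import model


def rows_to_entities(csv_path: str, out_path: str) -> None:
    with open(csv_path) as fh, open(out_path, "w") as out:
        for row in csv.DictReader(fh):
            person = model.make_entity("Person")
            # Derive a stable ID so repeated loads update rather than duplicate.
            person.make_id("example-registry", row["registration_no"])
            person.add("name", row["name"])
            person.add("nationality", row["country"])
            out.write(json.dumps(person.to_dict()) + "\n")
```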
-
As already discussed in the #datacommons channel in the Aleph Slack, here is a feature proposal – to discuss – for dumping complete collections out of Aleph and re-importing them into another Aleph instance (this could serve as a sharing / backup mechanism as well).
Such a dump of a collection could be exported from one Aleph instance and imported into another, or kept in sync between instances over time.
To discuss in this ticket:

What should such a dump look like, and what needs to be included to be able to re-create a collection in its dumped state without the need for Aleph's ingestors / index processing?

Idea:

Directory structure (unzipped): see the illustrative sketch at the end of this post.

Requirements:

- syncing (rsync for instance) for updated datasets should be possible as well

Looking forward to the discussion! 🙂
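As a concrete starting point for the directory-structure question above, here is a purely illustrative sketch (my assumption, not part of the original proposal): one possible layout for an unzipped dump plus a checksum manifest, so that incremental syncs (rsync, object storage, etc.) only need to transfer files that actually changed.

```python
# Purely illustrative: one possible (assumed) layout for an unzipped dump,
# plus a checksum manifest to support incremental syncing.
#
#   my-collection/
#     metadata.json      collection metadata (label, foreign_id, ...)
#     entities.ndjson    FtM entities / index documents, one per line
#     archive/           raw source files, addressed by content hash
import hashlib
import json
from pathlib import Path


def write_manifest(dump_dir: str) -> None:
    """Record a SHA-256 checksum for every file in the dump, so a sync tool
    can skip files that have not changed since the last export."""
    root = Path(dump_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != "manifest.json":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
```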