Dump / load complete collections (share between aleph instances) #1523
Replies: 5 comments
-
My team is interested in a collection / dataset backup & restore feature within Aleph. Our environment is such that we ingest data in one instance of Aleph and need to be able to export that data for import into another instance. Ideally we could accomplish this without having to reprocess (extract text & entities, link entities, etc.) each time we import. We could start by just re-importing the same source data at each of our deployments, but eventually we'll run into difficulties with large datasets and with processing that can only happen in one environment (some analytics are not available everywhere). For example, some of our datasets include files that total 500+ GB compressed, and we anticipate that not being the largest we'll see, so finding a clean way to support large datasets will be helpful. We also anticipate adding analytics to the ingest process (translation is one that comes to mind initially), where we'd want that data persisted in the export so it wouldn't need to be re-processed on import into the other system.

I know this is a relatively new GitHub issue -- has there been any movement on implementation or planning? We at PNNL will be working on an implementation that works in our system, and we'd be interested in helping with the open source feature development too.
-
To augment the uncompressed data structure proposed, I've been looking through the data & code to find what I think are the storage locations of the various pieces of data needed to export a collection:

- PostgreSQL:
- Elasticsearch:
- Documents:
Since I'm really looking through the code & data for the first time, I'm sure I'm missing something or have misunderstood what is stored where. Corrections are welcome. Without an existing collection export process, I'm assuming an import process would need to be written too -- is there already a way to ingest a document with an existing text extract and other processing done, or similarly a way to ingest non-file-based entities whose indexing is already done? Or maybe it would be better to rebuild the index? To figure this out I'll start by looking at the current ingest process.
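To make that concrete, here is a rough sketch of what an export along these lines could look like. The table names, index pattern and `collection_id` field are assumptions on my part, not Aleph's actual schema; the raw source files would be copied out of the blob archive separately.

```python
# Illustrative sketch only: table names, index pattern and the collection_id
# field are assumptions, not Aleph's real schema.
import json
import subprocess

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

COLLECTION_ID = 42                               # assumed collection to export
PG_TABLES = ["collection", "entity", "mapping"]  # assumed relevant tables
ES_INDEX = "aleph-entity-*"                      # assumed entity index pattern


def dump_postgres(out_dir: str) -> None:
    """Dump the metadata tables with pg_dump (data only, so the rows can be
    re-inserted into an existing schema; filtering to a single collection
    would need an extra step)."""
    args = ["pg_dump", "--data-only"]
    for table in PG_TABLES:
        args += ["--table", table]
    args += ["--file", f"{out_dir}/aleph.sql", "aleph"]
    subprocess.run(args, check=True)


def dump_index(es: Elasticsearch, out_path: str) -> None:
    """Scroll every indexed entity of the collection into a JSON-lines file."""
    query = {"query": {"term": {"collection_id": COLLECTION_ID}}}
    with open(out_path, "w") as fh:
        for hit in scan(es, index=ES_INDEX, query=query):
            fh.write(json.dumps(hit) + "\n")
```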
-
Hey @tjstavenger-pnnl, thanks for investigating this in such detail. Two quick points on the substance: you probably don't need to dump/restore the

We also need to accommodate the fact that a collection (and its mappings) would likely have a different ID in the two instances, so we'd need to rewrite some of the other entries that reference the collection ID accordingly.

Regarding the import process, I think the essential bit is to not actually trigger a full re-process of the data. So you'd likely want to restore the database tables first and then import the ES index dump as if it were a bulk API load. There are going to be a million edge cases to this, but I reckon it would leave things in a state where they can be searched and browsed nicely...
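A minimal sketch of that import step, assuming the ES dump is a JSON-lines file of `{_id, _source}` documents and that indexed entities carry a `collection_id` field (both assumptions, not confirmed details of Aleph's schema):

```python
# Rough sketch of the import side: rewrite references to the old collection ID
# and feed the dump straight to the bulk helper, without any re-processing.
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def load_dump(es: Elasticsearch, dump_path: str, old_id: int, new_id: int,
              index: str = "aleph-entity") -> None:
    """Re-index a dumped collection under the ID assigned on the target
    instance."""
    def actions():
        with open(dump_path) as fh:
            for line in fh:
                doc = json.loads(line)
                source = doc["_source"]
                # Rewrite the collection reference to the target instance's ID.
                if source.get("collection_id") == old_id:
                    source["collection_id"] = new_id
                yield {"_index": index, "_id": doc["_id"], "_source": source}

    bulk(es, actions())
```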
-
I was uncertain about the
-
The full text is normally in the FtM entities, and in the ES index. One question I have: if you have single files that are 500 GB, are these archives, or large tables? It's worth pointing out that ES has a document limit of 100 MB -- it just will not index more text than that for a single file. If what you're trying to do is run mappings of structured data into the system, then at that size I'd consider doing it via an external ETL process (we basically run a bunch of FtM loaders externally, and then load their data in via the bulk API).
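For the structured-data route, here is a small sketch of what such an external FtM loader could look like. The CSV columns and ID scheme are made up for illustration; the resulting JSON-lines file would then be pushed in through the bulk API.

```python
# Minimal sketch of an external ETL step: build FollowTheMoney entities
# outside Aleph and write them as JSON lines. The CSV layout is invented
# purely as an example.
import csv
import json

from followthemoney import model


def rows_to_entities(csv_path: str, out_path: str) -> None:
    with open(csv_path) as fh, open(out_path, "w") as out:
        for row in csv.DictReader(fh):
            person = model.make_entity("Person")
            # Derive a stable ID so repeated loads update rather than duplicate.
            person.make_id("example-registry", row["registration_no"])
            person.add("name", row["name"])
            person.add("nationality", row["country"])
            out.write(json.dumps(person.to_dict()) + "\n")
```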
-
As already discussed in the #datacommons channel in the Aleph Slack, here is a feature proposal – to discuss – for dumping complete collections out of Aleph and re-importing them into another Aleph instance (this could serve as a sharing / backup mechanism as well).
Such a dump of a collection could be exported from one Aleph instance and imported into another, or kept in sync between instances over time.
To discuss in this ticket:

What should such a dump look like, and what needs to be included to be able to re-create a collection in its dumped state without the need for Aleph's ingestors / index processing?

Idea:

Directory structure (unzipped): see the illustrative sketch at the end of this post.

Requirements:

- syncing (rsync for instance) for updated datasets should be possible as well

Looking forward to the discussion! 🙂
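As a concrete starting point for the directory-structure question above, here is a purely illustrative sketch (my assumption, not part of the original proposal): one possible layout for an unzipped dump plus a checksum manifest, so that incremental syncs (rsync, object storage, etc.) only need to transfer files that actually changed.

```python
# Purely illustrative: one possible (assumed) layout for an unzipped dump,
# plus a checksum manifest to support incremental syncing.
#
#   my-collection/
#     metadata.json      collection metadata (label, foreign_id, ...)
#     entities.ndjson    FtM entities / index documents, one per line
#     archive/           raw source files, addressed by content hash
import hashlib
import json
from pathlib import Path


def write_manifest(dump_dir: str) -> None:
    """Record a SHA-256 checksum for every file in the dump, so a sync tool
    can skip files that have not changed since the last export."""
    root = Path(dump_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != "manifest.json":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
```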