introduce different dataloader categories #8

motey · 2020-04-22T22:43:32Z

To prevent corrupted data and to enable possible new features, we need to categorize dataloaders.

non-idempotent dataloader

Dataloaders that run only once, initially. These are for static data such as gene databases.

idempotent dataloader

Dataloaders that will evolve and whose data will probably change, like the publication data in the CORD19 dataset, which is updated from time to time.

Whether a rerun is necessary could be decided by a changed Docker Hub image hash (i.e. a changed dataloader image).

service dataloaders

Dataloaders for data that will change regularly in any case, like COVID case statistics.
These dataloaders should run periodically.
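
The three categories above could be sketched roughly as follows. This is only an illustration, not existing CovidGraph code: the names `Category`, `LoaderState` and `should_run`, the stored digest of the last run, and the one-day default interval are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Category(Enum):
    # Hypothetical names for the three categories proposed in this issue.
    RUN_ONCE = "non-idempotent"   # runs once, initially
    RERUNNABLE = "idempotent"     # reruns when the loader image changes
    SERVICE = "service"           # reruns periodically


@dataclass
class LoaderState:
    category: Category
    image_digest: str                         # digest of the current loader image
    last_run_digest: Optional[str] = None     # digest used for the last run, if any
    last_run_at: Optional[datetime] = None    # time of the last run, if any
    interval: timedelta = timedelta(days=1)   # assumed default, SERVICE only


def should_run(state: LoaderState, now: datetime) -> bool:
    """Decide whether a dataloader should be started."""
    if state.last_run_at is None:
        return True  # every loader runs at least once
    if state.category is Category.RUN_ONCE:
        return False
    if state.category is Category.RERUNNABLE:
        # A changed image digest signals that a rerun is necessary.
        return state.image_digest != state.last_run_digest
    # SERVICE: rerun once the interval has elapsed.
    return now - state.last_run_at >= state.interval
```

How the current image digest is obtained (e.g. from the registry) is left out here on purpose.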

mpreusse · 2020-04-23T11:40:43Z

We should also consider that not all data loaders have simple update logic, i.e. some have to perform complex operations to determine the updates.

Example: The loading script that generates :Fragment nodes with sentences from full-text nodes (:BodyText, :PatentAbstract, etc.). This has to rerun whenever we have new text. But the text fragments have no primary key except for the sentence itself. The script would need to either check every existing full text to see whether all sentences exist (costly) or create the :Fragment nodes only for full-text nodes that have no :Fragment nodes yet (error-prone if the content of the full-text node changes).
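
To make the trade-off concrete, here is a small in-memory sketch of both options. Plain dicts stand in for the graph, and the function names and the naive sentence splitter are assumptions for illustration only, not the actual loading script.

```python
import re
from typing import Dict, List, Set


def split_sentences(text: str) -> List[str]:
    # Naive splitter for illustration: break after ., ! or ? plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def full_check(texts: Dict[str, str], fragments: Dict[str, Set[str]]) -> int:
    """Option 1: compare every sentence of every full-text node (costly)."""
    created = 0
    for node_id, text in texts.items():
        existing = fragments.setdefault(node_id, set())
        for sentence in split_sentences(text):
            if sentence not in existing:
                existing.add(sentence)
                created += 1
    return created


def only_unfragmented(texts: Dict[str, str], fragments: Dict[str, Set[str]]) -> int:
    """Option 2: only handle nodes without fragments (cheap, misses edits)."""
    created = 0
    for node_id, text in texts.items():
        if fragments.get(node_id):
            continue  # already has fragments, so a changed text goes unnoticed
        sentences = set(split_sentences(text))
        fragments[node_id] = sentences
        created += len(sentences)
    return created
```

Note that even option 1, as sketched, only adds missing sentences; it does not remove fragments of sentences that no longer appear in the text.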

Btw gene databases are not static 😄

motey · 2020-04-23T11:52:06Z

Btw gene databases are not static 😄

I was already afraid that's the case, but my brain just wouldn't come up with a good example at 1am :D

Example: The loading script that generates :Fragment nodes with sentence from full text nodes(:BodyText, :PatentAbstract etc ) [...]

imho the dataloader is the problem in this case :) What about a flag on text that has already been fragmented, or simple logic like "when text fragments are already on the node, no fragmenting is needed anymore"?
Changes to full-text nodes should be rather rare (and, if they happen, rather subtle).
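
A minimal sketch of the flag idea. Storing a hash of the text as the flag, instead of a plain boolean, is an extra assumption made here so that the rare changed full text would still be noticed; `text_fingerprint` and `needs_fragmenting` are hypothetical names.

```python
import hashlib
from typing import Optional


def text_fingerprint(text: str) -> str:
    # SHA-256 of the full text, stored on the node as the "fragmented" flag.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_fragmenting(text: str, stored_flag: Optional[str]) -> bool:
    # No flag yet: never fragmented. Different flag: the text has changed.
    return stored_flag != text_fingerprint(text)
```

The loader would then only fragment nodes for which `needs_fragmenting` returns true and write the new fingerprint back afterwards.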

motey added the labels "Tag: Documentation" (About CovidGraph Documentation), "Status: Suggested" (This issue is a suggestion for doing something new or different in CovidGraph) and "Tag: Help Wanted" (Extra attention is needed) on Apr 22, 2020

This was referenced Apr 22, 2020

Motherlode as a service #7

Open

Refactor/Rewrite motherlode #5

Open
