
Standoff property value retrieving performance #1578

Open
gfoo opened this issue Jan 21, 2020 · 12 comments

Comments

@gfoo

gfoo commented Jan 21, 2020

Our project contains huge transcriptions based on our own standoff mapping, and we get poor performance when retrieving the value of this property. I mainly use Gravsearch to retrieve data, but even with the v2/resources endpoint the performance is poor.
We are talking about 20 or 30 seconds to retrieve this resource.

If needed, @loicjaouen will provide our memory/CPU stack configuration.

I'm going to prepare a test case so that you can try to reproduce this performance problem on your side.

@gfoo gfoo changed the title Standoff property retrieving performance Standoff property value retrieving performance Jan 21, 2020
@benjamingeer

Have you read https://discuss.dasch.swiss/t/large-texts-and-xml-databases/134 ?

You have two options:

  1. Break your text into smaller pieces, instead of storing a huge text in a single TextValue.
  2. Wait until Knora supports storing text in an XML database.
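As a rough illustration of option 1, here is a minimal client-side sketch (Python) of splitting a long transcription into pieces of about 1000 words each, so that each piece can go into its own TextValue. The `chunk_text` helper, the file name, and the chunk size are assumptions for illustration, not part of the Knora API:

```python
def chunk_text(text: str, max_words: int = 1000) -> list[str]:
    """Split a text into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

with open("transcription.txt", encoding="utf-8") as f:
    chunks = chunk_text(f.read())

# Each chunk would then be stored as its own resource/TextValue,
# linked in reading order. Note: a real text with XML markup would
# have to be split at element boundaries, not on raw whitespace.
for n, chunk in enumerate(chunks, start=1):
    print(f"part {n}: {len(chunk.split())} words")
```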

@benjamingeer

I suggested the same thing to you last April:

#1293 (comment)

@gfoo
Author

gfoo commented Jan 22, 2020

Have you read https://discuss.dasch.swiss/t/large-texts-and-xml-databases/134 ?

No, sorry, I no longer have enough motivation or time to follow your upcoming developments; I'm just trying to find solutions with the existing Knora :)

I suggested the same thing to you last April:

Yep, I remember. With @mrivoal we thought about that, but it's not so easy for us to automatically split our users' data during the migration process from their MySQL db into Knora. And anyway, at the end of the day, they probably won't want to split their data :|

Just have a look at their work: http://lumieres.unil.ch/fiches/trans/1088/ (in edit mode; you need an account for that). They use CKEditor, which produces a kind of pseudo-HTML; we provided a standoff mapping and it works very well. It's a shame that, probably just for a few transcriptions, we get this kind of poor performance :(

@gfoo
Author

gfoo commented Jan 22, 2020

The test case, if you want to reproduce it: PerfTrans.zip

@gfoo
Author

gfoo commented Jan 22, 2020

@mrivoal The only solution I see right now is to ask them to split their existing transcriptions in their database before our final migration.

@benjamingeer the save process is also very slow. It is not a problem for our migration process, but it will probably be a problem in our web app client if the end user has to wait more than 30 seconds to save something... They haven't given us feedback about that yet, but they probably will in the near future!

@benjamingeer

the save process is also very slow

If you can split the text into smaller pieces, both saving and loading will be faster.

@mrivoal

mrivoal commented Jan 22, 2020

Yes, the modeling solution, as usual.
However, artificially splitting long editions that users can easily deal with in other tools (eXist-db) is not an acceptable solution (this is already the feedback we have from another of our edition projects).

Then I guess that, in the long run, Knora will have to store long texts in XML databases.

@benjamingeer

However, artificially splitting long editions that users can easily deal with in other tools (eXist-db) is not an acceptable solution (this is already the feedback we have from another of our edition projects).

It's a trade-off. If you can store texts in small enough pieces (1000 words is a good size if you have a lot of markup), you can store them as RDF, and get functionality that you wouldn't get by storing the text in eXist-db, like "find me a text that mentions a person who was born after 1720 and who was a student of Euler". (Maybe you could do that in eXist-db if you were willing to store all your data as XML.)

Otherwise, you can store the text in eXist-db: storage and retrieval will be faster, and some queries will be faster, but you will lose some search capabilities.

I think the best we can do is offer both options, and let each project decide which is best for them.
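To make that trade-off concrete, here is a hedged sketch of the kind of cross-metadata query mentioned above, submitted as a Gravsearch query from Python. The project ontology prefix and the property names (`mentionsPerson`, `wasStudentOf`, `hasName`, `hasBirthDate`) are hypothetical, the date-filter syntax is schematic, and the endpoint path is an assumption; only `knora-api:isMainResource` and the general CONSTRUCT shape come from Gravsearch itself. Check the Knora API docs for the exact route and date-comparison syntax.

```python
import requests

# Hypothetical project ontology and properties; illustrative only.
gravsearch = """
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ex: <http://0.0.0.0:3333/ontology/0001/example/simple/v2#>

CONSTRUCT {
    ?text knora-api:isMainResource true .
} WHERE {
    ?text a ex:Transcription .
    ?text ex:mentionsPerson ?person .
    ?person ex:wasStudentOf ?teacher .
    ?teacher ex:hasName "Euler" .
    ?person ex:hasBirthDate ?birthDate .
    # Schematic date filter; see the Gravsearch docs for the
    # actual Knora date-comparison syntax.
    FILTER(?birthDate > "GREGORIAN:1720"^^knora-api:Date)
}
"""

# Assumed endpoint path for Gravsearch queries.
resp = requests.post(
    "http://0.0.0.0:3333/v2/searchextended",
    data=gravsearch.encode("utf-8"),
    headers={"Content-Type": "application/sparql-query"},
)
print(resp.status_code)
```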

@mrivoal

mrivoal commented Jan 22, 2020

What would you consider "a lot of markup"?

@benjamingeer

What would you consider "a lot of markup"?

In the test I did, nearly every word had a tag. The more markup you have, the more triples have to be retrieved, and the slower it's going to be. If you have a big text with very little markup, GraphDB can still retrieve it pretty quickly.
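A back-of-envelope calculation (not a measurement) of why markup density dominates: each standoff tag is stored as its own set of triples (start/end position, tag class, index, and so on), so the number of triples to retrieve scales with the number of tags. The triples-per-tag figure below is an assumed order of magnitude, not taken from the Knora source:

```python
words = 50_000            # a large transcription
triples_per_tag = 8       # assumption: order of magnitude per standoff tag

# "Nearly every word had a tag":
dense = int(words * 1.0 * triples_per_tag)
print(f"dense markup:  ~{dense:,} standoff triples")   # ~400,000

# The same text with one tag per 100 words:
sparse = int(words * 0.01 * triples_per_tag)
print(f"sparse markup: ~{sparse:,} standoff triples")  # ~4,000
```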

@mrivoal

mrivoal commented Jan 22, 2020

Ok, thanks.

@benjamingeer

Just have a look at their work: http://lumieres.unil.ch/fiches/trans/1088/

That text has chapters. Why not store one chapter per resource? That would also make navigation and editing a lot easier. Do you really want to scroll through that much text on one HTML page?
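For a chapter-per-resource split, something like the following could work, assuming the transcription is available as XML with `<chapter>` elements (the element name and file layout are assumptions about this project's markup):

```python
import xml.etree.ElementTree as ET

tree = ET.parse("transcription.xml")
root = tree.getroot()

# Write each chapter to its own file, so each can be imported
# as a separate resource with its own TextValue.
for n, chapter in enumerate(root.iter("chapter"), start=1):
    ET.ElementTree(chapter).write(f"chapter_{n:03d}.xml", encoding="utf-8")
```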

@subotic subotic added this to the Backlog milestone Feb 7, 2020
@irinaschubert irinaschubert removed this from the Backlog milestone Dec 9, 2021