[OPTIMADE] Deep pagination of NOMAD OPTIMADE API by clients #45

Open · ml-evs opened this issue Feb 22, 2023 · 5 comments

Labels: enhancement (New feature or request)

@ml-evs commented Feb 22, 2023

Hi devs, as I've discussed in the past with @markus1978, we'd like to be able to do deep pagination of the NOMAD OPTIMADE API. Currently pagination is limited to the first 10000 results of a query, at which point it hits an ES limitation. This can be circumvented with some work on the ES backend (https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after), but I think this might be a common problem, so I would instead like to resolve it in my client code at optimade-python-tools.

The "fix" would be to paginate into a query until failure, then use the last page of results to generate a new query that limits results to the remaining pages. This requires a field to sort over, and ideally I would sort over id/immutable_id or _nmd_upload_time. Unfortunately these sorts are not available in the current NOMAD OPTIMADE API, though I think the default sort is _nmd_upload_time.

Of course, this workaround can lead to inconsistent data between paginated queries, but I think basically every other OPTIMADE API is already affected by this...

So, I have a few questions:

  1. Would it be possible to enable this kind of sort in the OPTIMADE API?
  2. Would you support the addition of a specific error type for this in the specification, so clients can detect it and apply a workaround?
  3. Do you want to help try to implement the suggested ES search_after fix within optimade-python-tools?
@markus1978 (Member)

You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.

The OPTIMADE standard offers page_above and page_below to do value-based pagination.
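For illustration (an assumption about future behaviour, since NOMAD does not support these parameters yet, per this thread), a `page_above` request might look like:

```python
# Hypothetical request shape for value-based pagination; `page_above` carries
# the sort value of the last entry on the previous page.
import requests

last_id = "abc-123"  # sort value of the final entry already received
params = {
    "sort": "id",
    "page_limit": 100,
    "page_above": last_id,  # server returns entries with id strictly above this
}
response = requests.get("https://example.org/optimade/v1/structures", params=params)
```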

We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations is implemented.

To implement this cleanly and easily, the OPT function handle_query_params should extract the page_above and page_below query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.
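For illustration, a rough sketch (an assumption, not the actual optimade-python-tools code) of how these parameters might be declared as FastAPI-style query parameters so that handle_query_params can pick them up:

```python
# Sketch only: class name, parameter types and defaults are assumptions,
# loosely following optimade-python-tools' query-parameter class style.
from typing import Optional
from fastapi import Query

class EntryListingQueryParams:
    def __init__(
        self,
        *,
        page_above: Optional[str] = Query(
            None, description="Return entries with sort-field values above this value."
        ),
        page_below: Optional[str] = Query(
            None, description="Return entries with sort-field values below this value."
        ),
    ):
        self.page_above = page_above
        self.page_below = page_below
```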

markus1978 added the enhancement label on Feb 23, 2023
@markus1978 (Member)

Maybe another remark. The 10000-entry limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why it's capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.
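For comparison, value-based pagination on the ES side with search_after avoids materialising deep offsets at all; a sketch with assumed connection details, index and field names:

```python
# Sketch only: "nomad_entries", "upload_time" and "entry_id" are assumed names;
# client usage shown for illustration (body-style search works on ES 7.x clients).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details
search_after = None
while True:
    body = {
        "size": 1000,
        # A unique tiebreaker field keeps the sort order total.
        "sort": [{"upload_time": "asc"}, {"entry_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    page = es.search(index="nomad_entries", body=body)
    hits = page["hits"]["hits"]
    if not hits:
        break
    # ... process hits ...
    search_after = hits[-1]["sort"]  # resume strictly after the last hit
```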

@ml-evs (Author) commented Feb 23, 2023

> You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.
>
> The OPTIMADE standard offers page_above and page_below to do value-based pagination.
>
> We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations is implemented.
>
> To implement this cleanly and easily, the OPT function handle_query_params should extract the page_above and page_below query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.

If that would work then let's do it! We only have number-based pagination via page_number and offset-based pagination via page_offset at the moment.

> Maybe another remark. The 10000-entry limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why it's capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.

As far as I understand it, my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after approach suggested by ES, just without access to the actual indexes?)

@markus1978 (Member)

> If that would work then let's do it! We only have number-based pagination via page_number and offset-based pagination via page_offset at the moment.

Well, we need to update our OPT version anyway, once you have this result-pydantic-validation-optional-switch in it. Does this exist now? If you could squeeze page_below into handle_query_params for the next release, I can do the ES part.

> As far as I understand it, my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after approach suggested by ES, just without access to the actual indexes?)

Yes, but under the hood it is translated into offset-based pagination in ES (this is the only thing implemented). In this discussion, offset-based and page-based are synonyms.

@ml-evs (Author) commented Feb 23, 2023

> Well, we need to update our OPT version anyway, once you have this result-pydantic-validation-optional-switch in it. Does this exist now? If you could squeeze page_below into handle_query_params for the next release, I can do the ES part.

I'll see what I can do -- I can at least patch in the query params for now and release as you suggest. I never ended up making a PR for disabling validation as I couldn't generate a test case with any significant performance improvement, but in the case of allowing "slightly" wrong data through it is useful ;)
