[OPTIMADE] Deep pagination of NOMAD OPTIMADE API by clients #45

Open · ml-evs opened this issue Feb 22, 2023 · 5 comments

Labels: enhancement (New feature or request)

@ml-evs commented Feb 22, 2023

Hi devs, as I've discussed in the past with @markus1978, we'd like to be able to do deep pagination of the NOMAD OPTIMADE API. Currently pagination is limited to the first 10000 results of a query, at which point it hits an ES limitation. This can be circumvented with some work on the ES backend (https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after), but I think this might be a common problem, so I would instead like to resolve it in my client code at optimade-python-tools.

The "fix" would be to paginate into a query until failure, then use the last page of results to generate a new query that limits results to the remaining pages. This requires a field to sort over, and ideally I would sort over id/immutable_id or _nmd_upload_time. Unfortunately these sorts are not available in the current NOMAD OPTIMADE API, though I think the default sort is _nmd_upload_time.

Of course, this workaround can lead to inconsistent data between paginated queries, but I think basically every other OPTIMADE API is already affected by this...

So, I have a few questions:

  1. Would it be possible to enable this kind of sort in the OPTIMADE API?
  2. Would you support the addition of a specific error type for this in the specification, so clients can detect it and apply a workaround?
  3. Do you want to help try to implement the suggested ES search_after fix within optimade-python-tools?
@markus1978 (Member)

You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.

The OPTIMADE standard offers page_above and page_below to do value-based pagination.
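For illustration (an assumption about future behaviour, since NOMAD does not support these parameters yet, per this thread), a `page_above` request might look like:

```python
# Hypothetical request shape for value-based pagination; `page_above` carries
# the sort value of the last entry on the previous page.
import requests

last_id = "abc-123"  # sort value of the final entry already received
params = {
    "sort": "id",
    "page_limit": 100,
    "page_above": last_id,  # server returns entries with id strictly above this
}
response = requests.get("https://example.org/optimade/v1/structures", params=params)
```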

We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations is implemented.

To implement this cleanly and easily, the OPT function handle_query_params should extract the page_above and page_below query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.
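For illustration, a rough sketch (an assumption, not the actual optimade-python-tools code) of how these parameters might be declared as FastAPI-style query parameters so that handle_query_params can pick them up:

```python
# Sketch only: class name, parameter types and defaults are assumptions,
# loosely following optimade-python-tools' query-parameter class style.
from typing import Optional
from fastapi import Query

class EntryListingQueryParams:
    def __init__(
        self,
        *,
        page_above: Optional[str] = Query(
            None, description="Return entries with sort-field values above this value."
        ),
        page_below: Optional[str] = Query(
            None, description="Return entries with sort-field values below this value."
        ),
    ):
        self.page_above = page_above
        self.page_below = page_below
```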

markus1978 added the enhancement label on Feb 23, 2023
@markus1978 (Member)

Maybe another remark. The 10000-entry limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why it's capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.
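For comparison, value-based pagination on the ES side with search_after avoids materialising deep offsets at all; a sketch with assumed connection details, index and field names:

```python
# Sketch only: "nomad_entries", "upload_time" and "entry_id" are assumed names;
# client usage shown for illustration (body-style search works on ES 7.x clients).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection details
search_after = None
while True:
    body = {
        "size": 1000,
        # A unique tiebreaker field keeps the sort order total.
        "sort": [{"upload_time": "asc"}, {"entry_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    page = es.search(index="nomad_entries", body=body)
    hits = page["hits"]["hits"]
    if not hits:
        break
    # ... process hits ...
    search_after = hits[-1]["sort"]  # resume strictly after the last hit
```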

@ml-evs (Author) commented Feb 23, 2023

> You do not need to cheat it like this. You can use value-based pagination in elasticsearch with unlimited depth.
>
> The OPTIMADE standard offers page_above and page_below to do value-based pagination.
>
> We could put this into our OPTIMADE implementation (I guess also into the ES impl of OPT). Currently only offset-based pagination with the described limitations is implemented.
>
> To implement this cleanly and easily, the OPT function handle_query_params should extract the page_above and page_below query parameters. Currently this is not in there? At least in the OPT version we use, I can't find it.

If that would work then let's do it! We only have number-based pagination via page_number and offset-based pagination via page_offset at the moment.

> Maybe another remark. The 10000-entry limit has a reason. To do offsets, ES internally always has to produce the full result set. This is why it's capped. The larger your offset, the more ES has to scrape through. Therefore, your "fix" would also be quite stressful for our ES.

As far as I understand it, my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after approach suggested by ES, just without access to the actual indexes?)

@markus1978 (Member)

> If that would work then let's do it! We only have number-based pagination via page_number and offset-based pagination via page_offset at the moment.

Well, we need to update our OPT version anyway, once you have this result-pydantic-validation-optional-switch in it. Does this exist now? If you could squeeze page_below into handle_query_params for the next release, I can do the ES part.

> As far as I understand it, my fix is not offset-based? I am just paging through what I can, then generating a new filter with offset 0 and starting again. It would be stressful in that every set of 10000 results will need a new query, but that doesn't seem too bad to me? (In fact I think my fix maps pretty much onto the search_after approach suggested by ES, just without access to the actual indexes?)

Yes, but under the hood it is translated into offset-based pagination in ES (this is the only thing implemented). In this discussion, offset-based and page-based are synonyms.

@ml-evs (Author) commented Feb 23, 2023

> Well, we need to update our OPT version anyway, once you have this result-pydantic-validation-optional-switch in it. Does this exist now? If you could squeeze page_below into handle_query_params for the next release, I can do the ES part.

I'll see what I can do -- I can at least patch in the query params for now and release as you suggest. I never ended up making a PR for disabling validation as I couldn't generate a test case with any significant performance improvement, but in the case of allowing "slightly" wrong data through it is useful ;)
