Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Don't scroll if max docs < scroll size (update by query/delete by query) #13704

Open
luisfavila opened this issue May 16, 2024 · 7 comments
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance

Comments

@luisfavila
Copy link

luisfavila commented May 16, 2024

Is your feature request related to a problem? Please describe

Using update_by_query and delete_by_query over a small amount of docs, where the max docs is less than scroll size, still uses a scroll and contributes to the max_open_scroll_context limit.

Describe the solution you'd like

Don't scroll unless necessary. This has been patched in ES as of 7.17.0

Related component

Search:Performance

Describe alternatives you've considered

I'm aware of #12923 which should also eventually solve this problem, but it may take a while to get there. Would be great to have this patched before then.

Additional context

No response

@luisfavila luisfavila added enhancement Enhancement or improvement to existing feature or request untriaged labels May 16, 2024
@dblock
Copy link
Member

dblock commented May 16, 2024

@luisfavila If you bring in a fix, please make sure not to be copying/looking at non-APLv2 code.

@harshavamsi
Copy link
Contributor

@luisfavila will this help speed up those queries or just bring down the scroll count?

@luisfavila
Copy link
Author

@harshavamsi I'm not entirely sure whether it speeds up the queries. It helps immensively if you're running a big number of queries that affect a small number of documents (<10k) as you won't hit the 500 max within the scroll window defined. I'm also under the impression scrolls aren't cleared when the query finishes, only when they expire, but am not entirely sure (would have to check the actual code)

@msfroh
Copy link
Collaborator

msfroh commented May 22, 2024

Discussed this briefly in our weekly search meetup (https://www.meetup.com/opensearch/events/300759986/).

I'm not sure if we have a great way of preventing the scroll from being created in the first place, but if the scrolls aren't being cleaned up once we hit the end (especially for a delete_by_query or update_by_query), we should fix that.

Why can't we prevent the scroll from being created in the first place?

As far as I know, the flow should be roughly:

  1. Create a scroll to paginate through all docs with a consistent view of the world.
  2. Get page 1 of results. No pages left.
  3. Ideally, remove the scroll.

If we fetch page 1 and it's not the last page and then create a scroll, it might not match the same reader state as the page that we fetched. That could lead to bizarre outcomes (docs skipped or visited twice). AFAIK, that's why we always create the scroll first.

@getsaurabh02
Copy link
Member

Will it also be useful to identify if its technically feasible to know upfront if there are less number of documents - based on if max docs is less than scroll size, then skip using the scroll altogether and avoid creating the max_open_scroll_context? Else we fallback to aggressively clean up the context as Froh suggested above.
@harshavamsi do you think we can quickly validate that?

@msfroh
Copy link
Collaborator

msfroh commented May 22, 2024

Oh... I was thinking about the internals of how a search works anyway. For every search, we create a SearchContext that persists from the query phase to the fetch phase. A scroll or a point-in-time search are essentially just fancy ways of keeping the SearchContext around even after returning results.

Maybe the solution is to fix the scroll logic to let the SearchContext get closed anyway (after the fetch phase) if the query phase comes back with fewer than scroll_size results? Functionally, it would be like preventing the scroll from being created.

@luisfavila
Copy link
Author

@msfroh Would it still add to the scroll context limit that way? It'd be nice if it didn't unless it needs to, so there's no actual limit on how many UpdateByQuery / DeleteByQuery you can run concurrently, given they don't need to scroll.

Separately, is this still planned to be implemented?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance
Projects
Status: Next (Next Quarter)
Development

No branches or pull requests

5 participants