
Improve performance in sync client #446

Open
navneet1v opened this issue Jul 20, 2023 · 34 comments
Labels
enhancement New feature or request performance Make it fast.

Comments

navneet1v commented Jul 20, 2023

What is the bug?

I am using opensearch-py client version 2.2.0 to search a cluster. I started using multi-threading to push more and more requests to the cluster, but after 4 threads we are not able to increase the throughput. I see no throttling happening on the server; the client is simply not able to push more requests.

How can one reproduce the bug?

Create an OpenSearch cluster and use the opensearch-py client with multi-threading to run queries. We tested with 1, 2, 4, 8, 16, 32, 48, and 96 threads.
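A minimal sketch of the kind of harness used to reproduce this, assuming a shared sync client hammered from a thread pool. The host, index name, and query below are placeholders, not values from the original report:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_benchmark(search_fn, n_threads, n_requests):
    """Fire n_requests calls of search_fn across n_threads; return requests/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map is lazy; wrap in list() to wait for all requests to finish
        list(pool.map(lambda _: search_fn(), range(n_requests)))
    return n_requests / (time.perf_counter() - start)

def main():
    # Requires a running cluster; host, index, and query are placeholders.
    from opensearchpy import OpenSearch
    client = OpenSearch(hosts=["https://localhost:9200"])
    query = {"query": {"match_all": {}}}
    for n in (1, 2, 4, 8, 16, 32, 48, 96):
        rps = run_benchmark(
            lambda: client.search(index="my-index", body=query), n, 200)
        print(f"{n:3d} threads: {rps:.1f} req/s")
```

The symptom described in this issue is that `rps` stops growing past roughly 4 threads even though the server is not saturated.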

What is the expected behavior?

The OpenSearch client should be able to push through more and more requests, and if the OpenSearch cluster cannot keep up, we should get a throttling exception.

What is your host/environment?

Linux.

Do you have any screenshots?

NA

Do you have any additional context?

I tried increasing the worker pool size from 10 to 20, but that didn't help.

@navneet1v navneet1v added bug Something isn't working untriaged Need triage labels Jul 20, 2023
@navneet1v navneet1v changed the title from "[BUG]" to "[BUG] OpenSearch Client not able push more request to OpenSearch Cluster" Jul 20, 2023
@saimedhi saimedhi removed the untriaged Need triage label Jul 20, 2023
dtaivpp commented Jul 21, 2023

Hey, what are your cluster's index settings? I would expect this with default cluster settings, as shards in an index are thread-bound as well. A default index will have 3 shards, and that's the max number of threads that can be used for ingestion.
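If shard count is the ceiling, the usual workaround is to create the index with an explicit primary shard count. A small sketch with a hypothetical index name (the shard/replica numbers are illustrative, not a recommendation):

```python
def create_index_body(n_shards, n_replicas=1):
    """Settings body for an index with an explicit primary shard count."""
    return {
        "settings": {
            "index": {
                "number_of_shards": n_shards,
                "number_of_replicas": n_replicas,
            }
        }
    }

# With an opensearch-py client (index name is a placeholder):
#   client.indices.create(index="my-index", body=create_index_body(8))
```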

dblock (Member) commented Jul 21, 2023

Are you spawning threads and creating instances of the client? Can you please post a repro? Is your actual usage of the client sync or async?

I do think that a client should be able to push the server into a state where it starts returning 429s.

navneet1v (Author)

@dblock It's the same client instance being used across all the threads.

@dtaivpp, I checked the cluster and CPU utilization was not going above 60% even at the maximum thread count. That is why I think the problem is the client not being able to push the requests.

I also tried changing pool_maxsize from 10 to 20; even then there was no increase in the requests that could be sent to the cluster.

dblock (Member) commented Jul 24, 2023

Are you using a sync client or an async client? My guess is that you're using a sync client.

I wrote a benchmark that compares synchronous, threaded, and asynchronous I/O in opensearch-py. I was not able to push the synchronous nor the threaded version of the client past a certain limit, no matter the pool_maxsize or other settings I tried. The client does use a connection pool, but it's probably just too busy doing other work on a single thread blocked by the GIL, so spawning more threads doesn't help.

Sync vs. threaded inserts make no difference. For 5x50 inserts async outperforms sync 1.4x.

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃          Benchmark ┃ Min     ┃ Max     ┃ Mean    ┃ Min (+)         ┃ Max (+)         ┃ Mean (+)        ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Sync vs. threaded. │ 1.387   │ 1.604   │ 1.450   │ 1.377 (1.0x)    │ 1.424 (1.1x)    │ 1.399 (1.0x)    │
│    Sync vs. async. │ 1.312   │ 1.479   │ 1.400   │ 0.960 (1.4x)    │ 1.047 (1.4x)    │ 0.994 (1.4x)    │
└────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

5x250 items, 2.5x improvement

┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Sync vs. async. │ 4.426   │ 4.994   │ 4.694   │ 1.798 (2.5x)    │ 1.992 (2.5x)    │ 1.894 (2.5x)    │
└─────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

5x1000 items, 3.5x improvement

┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Sync vs. async. │ 16.629  │ 17.493  │ 16.999  │ 4.863 (3.4x)    │ 5.025 (3.5x)    │ 4.980 (3.4x)    │
└─────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘
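The sync-vs-async gap above can be illustrated in miniature without a cluster. In this toy sketch, `time.sleep` stands in for a blocking urllib3 call and `asyncio.sleep` for non-blocking socket I/O; it is not the actual benchmark code:

```python
import asyncio
import time

async def run_async(n, latency):
    """n overlapping awaits; total time stays near one request's latency."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(latency) for _ in range(n)))
    return time.perf_counter() - start

def run_sync(n, latency):
    """n back-to-back blocking waits; total time grows linearly with n."""
    start = time.perf_counter()
    for _ in range(n):
        time.sleep(latency)
    return time.perf_counter() - start

sync_t = run_sync(10, 0.02)                  # roughly n * latency
async_t = asyncio.run(run_async(10, 0.02))   # roughly one latency
```

The async variant overlaps its waits, which is why the real benchmarks above scale with batch size while the sync client does not.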

We may be able to improve the synchronous client performance on insert, but I don't (yet) know how.

Xtansia (Contributor) commented Jul 24, 2023

I have a (not very scientific) flame chart I recorded of a synchronous client doing some search requests with SigV4. Obviously these numbers are very dependent on the specific request, etc., but this was done on large machines, so hardware shouldn't be a bottleneck.

[flame chart image: profile_q]

dblock (Member) commented Jul 24, 2023

Here's the sync/async benchmark profile, it's actually stuck doing synchronous I/O.

[screenshots: benchmark profiles, Jul 24 2023]

navneet1v (Author)

Here's the sync/async benchmark profile, it's actually stuck doing synchronous I/O.

[screenshots: benchmark profiles, Jul 24 2023]

This is strange. Why is it stuck in synchronous I/O? My understanding was that I/O would be done asynchronously, even with a sync client.

dblock (Member) commented Jul 24, 2023

This is strange. Why is it stuck in synchronous I/O? My understanding was that I/O would be done asynchronously, even with a sync client.

Unfortunately not, and it's by design. The default Urllib3HttpConnection uses urllib3, and the requests are synchronous. This is why aiohttp was introduced, but it requires your code to become async to take advantage of it.
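For reference, a minimal sketch of the async path using a generic coroutine helper. The commented usage below assumes `opensearch-py` with its async extra installed; the host and index names are placeholders:

```python
import asyncio

async def search_many(client, index, body, n):
    """Issue n concurrent searches against one async client."""
    return await asyncio.gather(
        *(client.search(index=index, body=body) for _ in range(n)))

# With a real cluster (needs `pip install opensearch-py[async]`; host and
# index are placeholders):
#
#   from opensearchpy import AsyncOpenSearch
#   client = AsyncOpenSearch(hosts=["https://localhost:9200"])
#   hits = asyncio.run(search_many(client, "my-index",
#                                  {"query": {"match_all": {}}}, 100))
#   asyncio.run(client.close())
```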

navneet1v (Author)

@dblock can we use some other implementation of HttpConnection? We generally use RequestsHttpConnection; does that have the same problem?

navneet1v (Author)

@dblock here is my thinking: as a user of the OpenSearch client, I feel the sync client is easy to use, and most services do the same thing. But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy.

dblock (Member) commented Jul 24, 2023

@dblock can we use some other implementation of HttpConnection? We generally use RequestsHttpConnection; does that have the same problem?

Yes it does. That one does allow changing pool_maxsize, but that has little effect on throughput.

dblock (Member) commented Jul 24, 2023

@dblock here is my thinking: as a user of the OpenSearch client, I feel the sync client is easy to use, and most services do the same thing. But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy.

What are other open-source client examples that talk HTTP that don't exhibit this kind of problem?

PS: I did find https://number1.co.za/how-to-speed-up-http-calls-in-python-with-examples/, which claims that requests-futures can get close to async performance. That could be an idea for a reimplementation, but it ain't a small effort.
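The requests-futures idea boils down to running blocking calls on a thread pool and handing back futures. A hypothetical sketch of that pattern wrapped around a client object (`FutureClient` is an illustration, not an opensearch-py or requests-futures API):

```python
from concurrent.futures import ThreadPoolExecutor

class FutureClient:
    """Hypothetical wrapper: run blocking client calls on a thread pool and
    hand back futures, roughly what requests-futures does for requests."""

    def __init__(self, client, max_workers=8):
        self._client = client
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def search(self, **kwargs):
        # Returns immediately; call .result() on the future to block.
        return self._pool.submit(self._client.search, **kwargs)
```

The caller fires off N searches, then collects `.result()` on each future, so the waits overlap without the caller's code becoming async.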

navneet1v (Author)

@dblock can we check Milvus?

dblock (Member) commented Jul 24, 2023

@dblock can we check Milvus?

Seems to be using gRPC, a whole other ball game.

navneet1v (Author)

Let me check if I can find any other clients that are not gRPC-based. This seems to be a standard problem, and a fix should exist.

wbeckler (Contributor)

Can you run multiple Python processes with multiple clients? There is also bulk, which helps max out the connection.

navneet1v (Author)

@wbeckler No, we don't want to run multiple clients, as that defeats the purpose of what we are trying to benchmark.

Also, the requests we are sending are queries, not indexing, so bulk is not an option.

dblock (Member) commented Jul 26, 2023

Thanks @navneet1v. I do think the problem is clear: we want the client to not block on I/O across requests.

@dblock dblock changed the title from "[BUG] OpenSearch Client not able push more request to OpenSearch Cluster" to "[BUG] Blocking I/O across requests" Jul 26, 2023
navneet1v (Author)

Thanks @navneet1v. I do think the problem is clear: we want the client to not block on I/O across requests.

Yes that is correct.

navneet1v (Author)

@dblock @wbeckler is anything required from my side to prioritize this issue?

wbeckler (Contributor)

Is it okay to use python multiprocessing instead of multithreading in your use case?
https://docs.python.org/3/library/multiprocessing.html

navneet1v (Author)

@wbeckler we can use that, but the sync client is easy to use and is already integrated into various ANN benchmarks; that is why we want to keep things simple.

wbeckler (Contributor)

I'm suggesting that the benchmarking or high-volume query application rely on multiprocessing; that way the Python client would not be blocking on I/O. So instead of multithread: {do python client stuff}, you would have multiprocess: {do python client stuff}.

I am not a python expert, so I might be way off here, but that's how it seems it should work from my naive perspective.
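A sketch of what that multiprocessing approach could look like, assuming one client per worker process (host, index, and process count are placeholders):

```python
import multiprocessing as mp

def split_workload(items, n_chunks):
    """Deal items round-robin so each process gets its own slice."""
    return [items[i::n_chunks] for i in range(n_chunks)]

_client = None  # one client per worker process, never shared across processes

def _init_worker():
    # Each process builds its own client, so no connection pool or GIL is shared.
    global _client
    from opensearchpy import OpenSearch
    _client = OpenSearch(hosts=["https://localhost:9200"])  # placeholder host

def _run_queries(queries):
    return [_client.search(index="my-index", body=q) for q in queries]

# Usage (requires a running cluster):
#   chunks = split_workload(all_queries, 8)
#   with mp.Pool(processes=8, initializer=_init_worker) as pool:
#       results = pool.map(_run_queries, chunks)
```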

@dblock dblock self-assigned this Oct 3, 2023
@dblock dblock removed the good first issue Good for newcomers label Oct 3, 2023
dblock (Member) commented Oct 3, 2023

I am going to take another look at this, assigning to myself.

dblock (Member) commented Oct 7, 2023

Updated benchmarks in https://github.com/dblock/opensearch-py/tree/benchmarks/samples/benchmarks.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
|---|---|---|---|---|---|---|
| 1 client vs. more clients (async) | 1.076 | 1.076 | 1.076 | 0.787 (1.4x) | 0.787 (1.4x) | 0.787 (1.4x) |
| 1 thread vs. 2 threads (sync) | 6.013 | 6.013 | 6.013 | 3.629 (1.7x) | 3.629 (1.7x) | 3.629 (1.7x) |
| 1 thread vs. 8 threads (sync) | 6.031 | 6.031 | 6.031 | 1.186 (5.1x) | 1.186 (5.1x) | 1.186 (5.1x) |
| 1 thread vs. 32 threads (sync) | 5.777 | 5.777 | 5.777 | 1.058 (5.5x) | 1.058 (5.5x) | 1.058 (5.5x) |

We are getting an improvement by using threads (1.7x with 2, 5.1x with 8, with diminishing returns further, 5.5x with 32).

If we set the pool_maxsize to the number of threads this gets improved up to a certain amount.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
|---|---|---|---|---|---|---|
| 1 client vs. more clients (async) | 1.401 | 1.401 | 1.401 | 0.788 (1.8x) | 0.788 (1.8x) | 0.788 (1.8x) |
| 1 thread vs. 2 threads (sync) | 7.355 | 7.355 | 7.355 | 2.835 (2.6x) | 2.835 (2.6x) | 2.835 (2.6x) |
| 1 thread vs. 4 threads (sync) | 6.616 | 6.616 | 6.616 | 1.777 (3.7x) | 1.777 (3.7x) | 1.777 (3.7x) |
| 1 thread vs. 8 threads (sync) | 6.340 | 6.340 | 6.340 | 1.272 (5.0x) | 1.272 (5.0x) | 1.272 (5.0x) |
| 1 thread vs. 32 threads (sync) | 6.608 | 6.608 | 6.608 | 1.076 (6.1x) | 1.076 (6.1x) | 1.076 (6.1x) |

Sync vs. async. Async really just depends on the pool size (3 iterations).

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
|---|---|---|---|---|---|---|
| async (1 client) vs. sync (8 threads) | 3.560 | 3.560 | 3.560 | 3.210 (1.1x) | 3.210 (1.1x) | 3.210 (1.1x) |
| async (8 clients) vs. sync (8 threads) | 3.526 | 3.526 | 3.526 | 2.342 (1.5x) | 2.342 (1.5x) | 2.342 (1.5x) |
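Sizing the pool to the thread count is a one-line constructor change. A small sketch of that setup; the host is a placeholder, while `pool_maxsize` is the real opensearch-py option discussed in this thread:

```python
def client_kwargs(n_threads, host="https://localhost:9200"):
    """Constructor kwargs with the connection pool sized to the thread count."""
    return {"hosts": [host], "pool_maxsize": n_threads}

def make_client(n_threads):
    from opensearchpy import OpenSearch  # requires opensearch-py
    return OpenSearch(**client_kwargs(n_threads))
```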

dblock (Member) commented Oct 12, 2023

I've spent a lot of time on the synchronous client, having written multiple benchmarks and digging through the code to understand how pooling works. I've varied number of threads, number of clients, search and ingestion, data sizes, switching between the default Urllib3HttpConnection, or the available RequestsHttpConnection connection classes.

The simplest benchmark is to do info() in a loop 100 times. With more threads and the max pool size set to the number of threads, the client improves throughput roughly 5x and 3x, depending on the connection class.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
|---|---|---|---|---|---|---|
| 1 thread vs. 32 threads (sync), RequestsHttpConnection | 30.104 | 30.104 | 30.104 | 5.697 (5.3x) | 5.697 (5.3x) | 5.697 (5.3x) |
| 1 thread vs. 32 threads (sync), Urllib3HttpConnection | 9.646 | 9.646 | 9.646 | 2.748 (3.5x) | 2.748 (3.5x) | 2.748 (3.5x) |

This does show that RequestsHttpConnection is a lot slower. @navneet1v suggests we should stop recommending it as the default in the documentation, which seems right.

Overall, these benchmarks are pretty conclusive that the client does not have the limitation as described and we need to dig deeper into how it's being used to see where the bottleneck is.

navneet1v (Author) commented Oct 12, 2023

@dblock Thanks for doing the deep dive. Most of the time, as users, we go by the examples. If changing the connection class is the only problem, we should just fix the examples and our documentation.

Also, can we mark RequestsHttpConnection as deprecated and document that using this connection class can degrade performance?

dblock (Member) commented Oct 12, 2023

In #535 I documented the connection pools and made sure pool_maxsize works for both implementations (it doesn't improve benchmarks by much).

I don't think we should deprecate RequestsHttpConnection, even if it's slower. If an application already fully depends on requests, you will not want to take a dependency on another HTTP library. Similarly, using async in this client is much faster than Urllib3HttpConnection, but we aren't deprecating the synchronous implementation.

navneet1v (Author) commented Oct 12, 2023

In #535 I documented the connection pools and made sure pool_maxsize works for both implementations (it doesn't improve benchmarks by much).

I don't think we should deprecate RequestsHttpConnection, even if it's slower. If an application already fully depends on requests, you will not want to take a dependency on another HTTP library. Similarly, using async in this client is much faster than Urllib3HttpConnection, but we aren't deprecating the synchronous implementation.

Makes sense. My whole idea was that we should document our recommendation for best performance, along with the challenges of using other connection classes; RequestsHttpConnection doesn't perform well, per our benchmarks.

VachaShah (Collaborator)

I've spent a lot of time on the synchronous client, having written multiple benchmarks and digging through the code to understand how pooling works. I've varied number of threads, number of clients, search and ingestion, data sizes, switching between the default Urllib3HttpConnection, or the available RequestsHttpConnection connection classes.

The simplest benchmark is to do info() in a loop 100 times. The client improves throughput 5x and 3x with different connection classes on more threads and setting the max pool size to the number of threads.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
|---|---|---|---|---|---|---|
| 1 thread vs. 32 threads (sync), RequestsHttpConnection | 30.104 | 30.104 | 30.104 | 5.697 (5.3x) | 5.697 (5.3x) | 5.697 (5.3x) |
| 1 thread vs. 32 threads (sync), Urllib3HttpConnection | 9.646 | 9.646 | 9.646 | 2.748 (3.5x) | 2.748 (3.5x) | 2.748 (3.5x) |

This does show that RequestsHttpConnection is a lot slower. @navneet1v suggests we should stop recommending it as the default in the documentation, which seems right.

Overall, these benchmarks are pretty conclusive that the client does not have the limitation as described and we need to dig deeper into how it's being used to see where the bottleneck is.

It's great to see benchmarks for this client! @dblock, is there any way we can make these benchmarks part of this repo or part of opensearch-benchmark?

dblock (Member) commented Oct 13, 2023

It's great to see benchmarks for this client! @dblock, is there any way we can make these benchmarks part of this repo or part of opensearch-benchmark?

#537

@dblock dblock changed the title from "[BUG] Blocking I/O across requests" to "[BUG] Degraded performance in sync client" Nov 10, 2023
dblock (Member) commented Nov 10, 2023

I renamed the issue to "Improve performance in sync client".

@navneet1v I know you've been running some anecdotal benchmarks with k-nn. What are your most recent numbers between RequestsHttpConnection, Urllib3HttpConnection, multiple clients and async?

@dblock dblock changed the title from "[BUG] Degraded performance in sync client" to "Improve performance in sync client" Nov 10, 2023
@dblock dblock added enhancement New feature or request and removed bug Something isn't working labels Nov 10, 2023
navneet1v (Author) commented Nov 14, 2023

@dblock Here are the results.

  1. For multi-threading, changing the connection class from RequestsHttpConnection to Urllib3HttpConnection helped improve latency if you do more work than just the search in the same thread (i.e., parsing plus some extra computation).
  2. If you only do the search, there is no latency difference with or without RequestsHttpConnection.
  3. Multiprocessing code works better with the RequestsHttpConnection class as compared to Urllib3.

With Request HTTP Connection Client

Multi Threading

With Parsing
average latency: 108.16474408659936
p50 Latency(sec): 104.17016150003633
p90 Latency(sec): 156.73814179999113
p99 Latency(sec): 215.99599194000638
average took time latency: 12.9771
p50 took Latency(ms): 11.0
p90 took Latency(ms): 18.0
p99 took Latency(ms): 42.0

Without parsing logic

average latency: 55.61106889247895
p50 Latency(sec): 52.75726318359375
p90 Latency(sec): 76.60882472991945
p99 Latency(sec): 114.45076704025274
average took time latency: 25721.5
p50 took Latency(ms): 23.0
p90 took Latency(ms): 42.0
p99 took Latency(ms): 73.0

Multi Processing

With Parsing
average latency: 41.32260456085205
p50 Latency(ms): 39.499521255493164
p90 Latency(ms): 54.305076599121094
p99 Latency(ms): 85.85322380065924
p50 took Latency(ms): 16.0
p90 took Latency(ms): 28.0
p99 took Latency(ms): 49.0


Without parsing logic
average latency: 52.904314446449284
p50 Latency(ms): 50.8497953414917
p90 Latency(ms): 71.18616104125977
p99 Latency(ms): 100.52832365036012
p50 took Latency(ms): 22.0
p90 took Latency(ms): 39.0
p99 took Latency(ms): 63.0

With UrlLib3

Multi Threading

With Parsing
average latency: 79.73963953840034
p50 Latency(sec): 77.58072550001316
p90 Latency(sec): 114.84195310003997
p99 Latency(sec): 156.2950457700498
average took time latency: 11.8679
p50 took Latency(ms): 11.0
p90 took Latency(ms): 14.0
p99 took Latency(ms): 29.0

Without parsing logic
average latency: 54.77132937908173
p50 Latency(sec): 51.97501182556152
p90 Latency(sec): 75.43172836303712
p99 Latency(sec): 111.9501113891602
average took time latency: 25799.7
p50 took Latency(ms): 23.0
p90 took Latency(ms): 41.0
p99 took Latency(ms): 74.0

Multi Processing

With Parsing
average latency: 42.12491910457611
p50 Latency(ms): 40.169358253479004
p90 Latency(ms): 56.438875198364265
p99 Latency(ms): 84.30349588394165
p50 took Latency(ms): 16.0
p90 took Latency(ms): 30.0
p99 took Latency(ms): 51.01000000000022

Without parsing logic
average latency: 53.499785208702086
p50 Latency(ms): 50.940632820129395
p90 Latency(ms): 72.27611541748048
p99 Latency(ms): 106.08061790466309
p50 took Latency(ms): 23.0
p90 took Latency(ms): 40.0
p99 took Latency(ms): 66.01000000000022

dblock (Member) commented Nov 14, 2023

Thanks, this is helpful as a comparison of multithreading vs. multiprocessing, and thank you for the conclusions; I agree with them. Here's a more readable version.

| | requests, threads | requests, processes | urllib3, threads | urllib3, processes |
|---|---|---|---|---|
| **with parsing** | | | | |
| average | 108.16 | 41.32 | 79.74 | 42.12 |
| p50 | 104.17 | 39.50 | 77.58 | 40.17 |
| p90 | 156.74 | 54.31 | 114.84 | 56.44 |
| p99 | 216.00 | 85.85 | 156.30 | 84.30 |
| average took | 12.98 | | 11.87 | |
| p50 took | 11.00 | 16.00 | 11.00 | 16.00 |
| p90 took | 18.00 | 28.00 | 14.00 | 30.00 |
| p99 took | 42.00 | 49.00 | 29.00 | 51.01 |
| **without parsing** | | | | |
| average | 55.61 | 52.90 | 54.77 | 53.50 |
| p50 | 52.76 | 50.85 | 51.98 | 50.94 |
| p90 | 76.61 | 71.19 | 75.43 | 72.28 |
| p99 | 114.45 | 100.53 | 111.95 | 106.08 |
| average took | 25.72 | | 25.80 | |
| p50 took | 23.00 | 22.00 | 23.00 | 23.00 |
| p90 took | 42.00 | 39.00 | 41.00 | 40.00 |
| p99 took | 73.00 | 63.00 | 74.00 | 66.01 |

tl;dr: with urllib3 we're seeing a 30% improvement over where this issue started.

We can conclude that urllib3 is 14% better than requests in pure transport. If we are doing more work, urllib3 performs about 30% better than requests (likely because it blocks less). Finally, because of the additional processing, multithreading is roughly 2x slower than multiprocessing with urllib3.

We need to move to proper benchmarking from here, I think, with the goal of reducing the 156ms p99 in multithreading above closer to the 84.30ms of multiprocessing.

We have 3 kinds of requests possible:

  1. GET without content (e.g. info())
  2. POST with a large payload sent, a small payload returned, e.g. bulk()
  3. POST with a small payload sent, a large payload returned, e.g. search()

It's clear from @navneet1v's numbers that the blocking nature of the client has an outsized impact on overall processing. So, apart from pure client transport benchmarking along the three variations above, we will need to add processing to simulate realistic scenarios.
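The three request shapes above could be captured in a small benchmark matrix; a sketch where the index name, bulk payload, and page size are placeholder assumptions:

```python
# Each scenario takes a client and issues one representative request.
BULK_PAYLOAD = (
    '{"index": {"_index": "my-index"}}\n'
    '{"field": "value"}\n'
)

SCENARIOS = {
    "get_no_content":  lambda c: c.info(),                 # 1. GET, no body
    "post_large_send": lambda c: c.bulk(body=BULK_PAYLOAD),  # 2. big send, small reply
    "post_large_recv": lambda c: c.search(                 # 3. small send, big reply
        index="my-index",
        body={"size": 1000, "query": {"match_all": {}}}),
}
```

Running each scenario with and without simulated response processing would separate pure transport cost from the blocking behavior this thread identified.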

7 participants