[BUG] Poor scaling within a single node while running vector search workload #531

Closed
layavadi opened this issue May 10, 2024 · 17 comments
Labels
question Further information is requested

Comments

@layavadi

Describe the bug
While running the OpenSearch Benchmark tool with the vectorsearch workload, it was noticed that having multiple shards or multiple segments results in poorer performance than 1 shard / 1 segment. CPU utilisation with multiple shards (single segment per shard) is much higher than with 1 shard / 1 segment, but performance (response time and throughput) was worse than the 1 shard / 1 segment config.

To Reproduce
A single r6i.xlarge data node, with data node pods deployed on Kubernetes through the OpenSearch Helm chart. Each data node has the following settings:
opensearchJavaOpts: "-Xms6G -Xmx6G"
resources:
  requests:
    cpu: "3000m"
    memory: "8Gi"
service:
  type: LoadBalancer
persistence:
  size: 51Gi

And the benchmark parameter file is as follows (using the lucene engine with l2 space type for vectors):
{
  "target_index_name": "target_index",
  "target_field_name": "target_field",
  "target_index_body": "indices/lucene-index.json",
  "target_index_primary_shards": 1,
  "target_index_dimension": 768,
  "target_index_space_type": "l2",
  "target_index_bulk_size": 100,
  "target_index_bulk_index_data_set_format": "hdf5",
  "target_index_bulk_index_data_set_corpus": "cohere-1m",
  "target_index_bulk_indexing_clients": 10,
  "target_index_max_num_segments": 1,
  "hnsw_ef_search": 256,
  "hnsw_ef_construction": 256,
  "query_k": 100,
  "query_body": {
    "docvalue_fields": ["_id"],
    "stored_fields": "none"
  },
  "query_data_set_format": "hdf5",
  "query_data_set_corpus": "cohere-1m",
  "query_count": 10000,
  "search_clients": 2
}

opensearch-benchmark execute-test --target-hosts $ENDPOINT --workload vectorsearch --workload-params ${PARAMS_FILE} --pipeline benchmark-only --kill-running-processes --client-options=basic_auth_user:admin,basic_auth_password:Clouder@4213,verify_certs:false

Expected behavior
It is expected to scale linearly as we add more clients with more shards. But 1 shard / 1 segment only consumes 1 core, compared to 3 shards / 1 segment which consumes more than 3 cores, yet the latter results in poorer performance than 1 shard / 1 segment.

Logs
If applicable, add logs to help explain your problem.

More Context (please complete the following information):

  • Workload: cohere-1M
  • Service: OpenSearch
  • Version: 2.12

Additional context
Add any other context about the problem here.

@layavadi layavadi added bug Something isn't working untriaged labels May 10, 2024
@IanHoang
Collaborator

Hi @layavadi, to get better clarity on this, are you seeing similar issues when running tests on workloads other than vectorsearch (such as NYC Taxis, PMC, or http_logs)?

@layavadi
Author

layavadi commented May 14, 2024 via email

@gkamat
Collaborator

gkamat commented May 15, 2024

There are multiple facets to this issue:

  • Issues in this repository pertain only to OSB. If OSB is scaling properly for the 1-shard, 1-segment case and simply changing the backend configuration results in a performance change, you should open an issue on the server side with the vector search team.
  • The 3 shard case probably has 3 segments in total, i.e., each shard with 1 segment, rather than 1 segment in total. Is that not the case?
  • How much data has been ingested in these tests? Does the 3-shard setup have more vectors?
  • Is there any deleted data?
  • Is one of the shards hot?
  • Have you run any query profiles?

Most likely, this item should be followed up with the vector search team as indicated above. Please close this issue if that is the case.
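
For reference, a minimal Python sketch (hypothetical endpoint and credentials, using the standard _cat APIs) of how the data-volume, deleted-docs, and per-shard distribution questions above can be checked:

# Minimal sketch (hypothetical endpoint/credentials) of checking the questions
# above with the standard _cat APIs: docs ingested, deleted docs, and how
# documents and segments are spread across shards.
import requests

HOST = "https://localhost:9200"   # assumption: replace with your cluster endpoint
AUTH = ("admin", "<password>")    # assumption: same basic auth used for OSB
INDEX = "target_index"

for path in (
    f"_cat/indices/{INDEX}?v&h=index,pri,docs.count,docs.deleted,store.size",
    f"_cat/shards/{INDEX}?v&h=index,shard,prirep,docs,store,node",
    f"_cat/segments/{INDEX}?v&h=index,shard,segment,docs.count,size",
):
    # verify=False only because the reproduction above uses verify_certs:false
    print(requests.get(f"{HOST}/{path}", auth=AUTH, verify=False).text)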

@layavadi
Author

Hi @layavadi, to get better clarity on this, are you seeing similar issues when running tests on workloads other than vectorsearch (such as NYC Taxis, PMC, or http_logs)?

@IanHoang Is there a way to run only "search" operations in NYC Taxis? Do I have to modify the default.json to make it happen?

@layavadi
Author

There are multiple facets to this issue:

  • Issues in this repository pertain only to OSB. If OSB is scaling properly for the 1-shard, 1-segment case and simply changing the backend configuration results in a performance change, you should open an issue on the server side with the vector search team.
  • The 3 shard case probably has 3 segments in total, i.e., each shard with 1 segment, rather than 1 segment in total. Is that not the case?
  • How much data has been ingested in these tests? Does the 3-shard setup have more vectors?
  • Is there any deleted data?
  • Is one of the shards hot?
  • Have you run any query profiles?

Most likely, this item should be followed up with the vector search team as indicated above. Please close this issue if that is the case.

  • I have informed the vector search team about the performance. I just want to make sure that on the OSB side it can push the queries without serialising the clients. Also, is there a way to run an NYC Taxis search-only workload? vectorsearch provides that.

  • Yes, the 3-shard case has 3 segments, 1 for each shard.

  • Cohere 1M data is what is used to populate the 3 shards. We haven't changed the insertion volume; it was kept constant between the various configurations.

  • No. The workload is loaded once, and then the benchmark is run with the search-only option.

  • As per the metrics, no; queries are equally distributed between the shards.

  • I am currently running the cluster in EKS. Performance Analyzer is not supported yet in a Kubernetes environment. Any suggestion here will help.

@layavadi
Author

@gkamat I did try with NYC Taxis. With just 1 client I was able to saturate the CPU, so it is not an issue with the benchmark. As you mentioned, it is with vectorsearch itself.

@layavadi
Author

@VijayanB Looks like both nmslib and faiss have the same issue.

@VijayanB
Member

VijayanB commented Jun 4, 2024

@navneet1v Have you seen this pattern in your experiments?

@navneet1v

@layavadi when we are talking about throughput, is this indexing throughput or search throughput?

@layavadi
Author

layavadi commented Jun 5, 2024 via email

@navneet1v

It is expected to scale linearly as we add more clients with more shards. But 1 shard / 1 segment only consumes 1 core, compared to 3 shards / 1 segment which consumes more than 3 cores, yet the latter results in poorer performance than 1 shard / 1 segment.

@layavadi if this is search, then what you are getting is expected for a single node, where 1 shard / 1 segment will get more throughput than 3 shards / 1 segment per shard. Below are some of the things to keep in mind here:

  1. When you have 1 shard, OpenSearch has an optimization built in which reduces the number of round trips and does query and fetch in a single call (Code Ref). This is true for all indices in OpenSearch.
  2. With 1 node and more shards, more CPUs (1 per shard) are utilized, and it is 1 JVM now handling more queries (1 per shard); hence more GC cycles will happen, which will also lead to lower throughput.
  3. Another thing to note here is that with more shards, each shard returns its own size number of results, and then at the coordinator node we need to pick the top results; some CPU cycles are also lost there (see the sketch after this list).
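
For illustration, a minimal Python sketch (not OpenSearch's actual implementation) of the coordinator-side merge described in point 3: each shard returns its own top-k hits, and the coordinating node has to re-merge them into a global top-k, which is extra work the single-shard case avoids.

# Minimal sketch: merging per-shard top-k hit lists into a global top-k.
# With 1 shard the shard-local top-k is already the answer (and the query and
# fetch phases can be combined into a single round trip); with 3 shards the
# coordinator receives 3 * k candidates and must merge them.
import heapq

def merge_top_k(per_shard_hits, k):
    # per_shard_hits: list of per-shard lists of (score, doc_id), sorted by score desc
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit[0], reverse=True)
    return [hit for _, hit in zip(range(k), merged)]

# Same six documents, either in one shard or spread across three shards.
one_shard = [[(0.91, "d1"), (0.89, "d4"), (0.88, "d2"), (0.80, "d3"), (0.70, "d5"), (0.60, "d6")]]
three_shards = [
    [(0.91, "d1"), (0.80, "d3")],
    [(0.89, "d4"), (0.70, "d5")],
    [(0.88, "d2"), (0.60, "d6")],
]
print(merge_top_k(one_shard, 2))     # [(0.91, 'd1'), (0.89, 'd4')]
print(merge_top_k(three_shards, 2))  # same answer, but 6 candidates had to be merged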

I am not sure where you got the expectation that more shards will lead to better performance; that is true if you have replicas.

Please let me know if you have more questions. Removing the bug label as this is not a bug.

@navneet1v navneet1v added question Further information is requested and removed bug Something isn't working labels Jun 5, 2024
@layavadi
Author

layavadi commented Jun 5, 2024

@navneet1v Thanks for the clarification. If that is the case, then as we scale clients, throughput should increase in the case of 1 shard / 1 segment. However, throughput flattens and CPU utilisation is limited to just 1 core. Is it the case that when multiple clients are searching on a 1 shard / 1 segment index, they all serialise through 1 CPU? BTW I did try 1 shard / 1 segment with 1 replica on a second node. It didn't scale either.

@layavadi
Author

@navneet1v is there any limit on search threads based on shards or CPU cores? Is it possible that with one shard we are using only 1 thread for searches?

@navneet1v

@navneet1v is there any limit on search threads based on shards or CPU cores? Is it possible that with one shard we are using only 1 thread for searches?

No, there is no limit like this. Every search request you send is picked up by 1 search thread from the search thread pool. The search thread pool size is ((# of cores * 3) / 2) + 1. So I would check whether you are going beyond this number of search clients.

I would also check if there is any bottleneck on the client machine which you are using to send the requests. You can check this by ensuring your client machine's cores > the number of search clients you are setting in OSB.
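
For illustration, a quick Python sketch of those two checks, based on the pool-size formula quoted above. The 4-core figures are assumptions here (an r6i.xlarge has 4 vCPUs), and the real pool size depends on the processors actually allocated to the OpenSearch process.

# Back-of-the-envelope check: search thread pool size vs. the number of OSB
# search clients, using the formula quoted above. Core counts are assumptions.
def search_thread_pool_size(cores: int) -> int:
    return (cores * 3) // 2 + 1

node_cores = 4       # r6i.xlarge data node (CPU request is 3000m, so it may effectively see fewer)
client_cores = 4     # load-generation machine
search_clients = 2   # "search_clients" from the workload params above

print(search_thread_pool_size(node_cores))   # 7 -> 2 search clients is well below the pool size
print(search_clients <= client_cores)        # True -> the client machine should not be the bottleneck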

BTW I did try 1 shard / 1 segment with 1 replica on a second node. It didn't scale either.

This should not happen; it points me towards the client machine not being able to send enough traffic to the OpenSearch cluster.

@layavadi
Author

@navneet1v The client machine has 4 cores. I had shown the config to @prudhvigodithi before, and the client was not saturated. I tried with the NYC Taxis workload and it works fine. With 3 shards I am able to saturate the CPU but with poor performance for the same number of clients, whereas with the same number of clients I am not able to saturate beyond 1 core on the OpenSearch node. Are there any search metrics I can get to understand what is happening?

@layavadi
Author

@navneet1v and others, I think the issue is with Prometheus monitoring. Because of the short duration of the test, which is within 1m, the 5m average was discarding the peaks and I got misled by the node exporter metrics. My sincere apologies. I logged into the node and did real-time monitoring, which showed all 4 cores being used. Thanks again for jumping in. Sorry for the false alarm.
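
For illustration, a minimal Python sketch (made-up numbers, not the actual node-exporter data) of the smoothing effect described above: a roughly one-minute, 4-core burst shows up clearly in a 1m average but is almost flattened away by a 5m average.

# Made-up CPU samples: one scrape every 15s, a ~1-minute burst of 4 busy cores
# in an otherwise idle 10-minute window.
samples_per_min = 4
idle, busy = 0.05, 4.0
series = [idle] * 20 + [busy] * samples_per_min + [idle] * 16

def window_avg(series, minutes, end):
    n = minutes * samples_per_min
    window = series[end - n:end]
    return sum(window) / len(window)

end = 24                           # index just after the burst ends
print(window_avg(series, 1, end))  # 4.0  -> a 1m average shows all 4 cores busy
print(window_avg(series, 5, end))  # 0.84 -> a 5m average hides most of the peak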

@navneet1v

@navneet1v and others, I think the issue is with Prometheus monitoring. Because of the short duration of the test, which is within 1m, the 5m average was discarding the peaks and I got misled by the node exporter metrics. My sincere apologies. I logged into the node and did real-time monitoring, which showed all 4 cores being used. Thanks again for jumping in. Sorry for the false alarm.

No problem, I can understand, been there. Happy your issue is resolved.
