CRT library uses massive CPU on large instances #836
Comments
Can you provide a minimal reproduction code sample that produces the same CPU spikes that you're seeing? Does this only happen when you're running on a large instance? |
Sorry it took so long, work got busy and it was hard to make this minimal. I think there are two things at play here:
|
@LarryFinn Thank you for the code. Can you please explain a bit more about the exact problem that you are facing? Is it causing the application to crash, or is it slow, etc.?
Each client of CRT will create N threads, but they are green threads and will not have much overhead. Number of threads doesn't correlate with the concurrency. You can limit the overall concurrency by lowering maxConcurrency and targetThroughput. So, even though CRT will create N threads, it should not have much overhead as we will be limiting the concurrency per client.
Any reason you need multiple S3 clients? It is not a best practice, since each client creates its own thread pool and other resources, which leads to duplicated work and resources. The CRT client is designed to maximize performance with a single client, and having multiple clients will not improve performance.
Yes, clients do need to acquire JNI locks. Is the problem that multiple clients are just waiting to acquire the lock? |
It seems like the clients are waiting to acquire JNI locks. That's all I can discern from the profiling information. When I don't use multiple clients, our code actually gets stuck a bit. Even if using multiple clients isn't a best practice, I wouldn't expect the CPU jump. |
Can you explain a bit more about the use case, workload and the error that you are getting? I think it might be more useful to focus on getting one client to work properly.
That will depend on how many clients are getting created in parallel. Each client will have its own thread pool and will use the CPU. So N clients will have N times the resource usage. |
I'm not an expert on the C code, but it definitely looks like the Datadog profiler is showing a massive amount of time and CPU spent in locking, which doesn't make sense, even with more threads doing more things
Yeah, we store a lot of results in S3, and what we do is run set operations across multiple S3 files. So as we download S3 files we process them and upload the results. It's hard to say what happens with a single shared client, but sometimes it just gets stuck. Java thread dumps don't really show anything too interesting, and I'm not sure if there is a better way to debug what's going on |
@waahm7 i created a simple example to show issues we've been having. it's contrived but whatever
but this one doesnt
My thought is the AWS threadpool is smaller than 10 (or whatever number of uploads) so this gets blocked. if i do a small number like 3, it is fine |
Let's focus on the single client result in hanging first! So, I tried your code sample with minor change to add the executors
And I cannot reproduce the hang. What kind of hang are you seeing? Is it hanging while waiting for the futures to complete? If so, can you provide any trace-level logs from CRT? (Refer to here to enable the log) |
@TingDaoK you didn't use the CRT client in your example.
Gist of crt log is here https://gist.github.com/LarryFinn/833db0013c982e2392279d6e2b15dc54 |
My gut says this has something to do with how uploads work and the event loop pool in the CRT client. If you look at the gist, 14 of them start up (which is the number of cores on my laptop), and my gut says that because I'm doing 15 uploads that block each other, the event loop pool cannot handle that |
Yeah, sorry, I made the mistake of using the default Java client instead of the CRT client. Starting from the example I have, I made a couple of modifications and made sure I am using the CRT client with the following code:
I did reproduce the hang when the number of threads in the pool is lower than the number of concurrent requests, but with 20 threads, which equals the number of concurrent requests, it didn't hang. I believe the cause is that each request submits a task to the executor: https://github.com/aws/aws-sdk-java-v2/blob/master/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/async/InputStreamWithExecutorAsyncRequestBody.java#L80 and if the pool has far fewer threads than there are requests, the tasks will block each other. |
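For readers following along, the starvation described in this comment can be modelled with plain JDK classes. This is only a sketch under assumptions, not the SDK's code: the class name `PoolStarvationDemo` and the latch-based "every task needs the others to be running" dependency are made up for illustration, standing in for uploads whose executor tasks block each other.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model: not part of aws-crt-java or the AWS SDK.
public class PoolStarvationDemo {

    /**
     * Submits `tasks` tasks that each wait until all of them have started.
     * Returns true if every task finished within the timeout.
     */
    public static boolean allTasksFinish(int poolSize, int tasks, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allStarted = new CountDownLatch(tasks);
        CountDownLatch done = new CountDownLatch(tasks);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.submit(() -> {
                    allStarted.countDown();
                    try {
                        // Assumption: a task can only progress while all the others run too.
                        allStarted.await();
                        done.countDown();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt any task still parked on the latch
        }
    }

    public static void main(String[] args) {
        System.out.println("pool=20, tasks=20 finished: " + allTasksFinish(20, 20, 2000));
        System.out.println("pool=4,  tasks=20 finished: " + allTasksFinish(4, 20, 500));
    }
}
```

With as many threads as tasks, every task gets a thread and the run completes; with a smaller pool, the queued tasks never start, so the running ones wait until the timeout.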
@TingDaoK which version of the libraries are you using? im still seeing it get stuck for some number of concurrent sends regardless of threadpool size (i was originally using a cached threadpool anyway, so that size was moot) |
I modified the loop to have a little more output (im doing 8 concurrent writes)
and the output looks like
|
I am using the latest version of crt java, but I also tried the one you mentioned. It could be related to your file size. How large is the data you are trying to upload? |
code du -h /tmp/TEXT-subscribers-100.csv |
I did reproduce the issue when I bumped up the file size. I will reach out to the Java SDK team for their support on this. Quick question: does your use case require using PipedInputStream and PipedOutputStream to provide data asynchronously? |
One of our big use-cases needs this, or something like this, yes |
Okay, big thanks to @zoewangg for helping debug this. Firstly, in your code sample:
You are writing into all the streams from the main thread. But CRT has scheduling logic that tries to get one part for each request and then focuses on the first request before working on the others, and CRT keeps reading until it has enough data for the first request to send a part. So, in the main thread, you are writing the same amount of data into all the streams. In the end, the CRT client works on the requests in its own order, and (I believe) the Java side blocks on threads/resources because CRT requests data that has not been provided yet, while the main thread is trying to provide data that CRT is not consuming, which leads to the hang. Apparently my reproduction code, which tries to send the requests from another thread pool, has a bug that leads to multiple threads writing into the same stream. Sorry for the confusion. The proper way to provide data for the CRT client is something like:
So, for each stream, submit a task that reads from the file and writes to the stream. That way, when CRT asks for data, there will be a thread to provide it that is not blocked on the others. You can check out some sample code here |
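The original sample for this comment is not preserved here, so the following is only a JDK-level sketch of the idea of one writer task per stream. The class `WriterPerStreamDemo` is hypothetical, and the in-order reader loop is an assumed stand-in for a consumer (such as the CRT client) that drains requests in its own order. No AWS SDK calls are involved.

```java
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical model: the reader loop stands in for the consumer of the streams.
public class WriterPerStreamDemo {

    /** Starts one writer task per pipe, drains the pipes in order, returns bytes read. */
    public static long run(int streams, int bytesPerStream) {
        ExecutorService writers = Executors.newFixedThreadPool(streams);
        List<PipedInputStream> inputs = new ArrayList<>();
        List<Future<?>> writerResults = new ArrayList<>();
        try {
            for (int i = 0; i < streams; i++) {
                PipedOutputStream out = new PipedOutputStream();
                inputs.add(new PipedInputStream(out)); // default 1 KiB pipe buffer
                writerResults.add(writers.submit(() -> {
                    // Each writer only ever blocks on its own pipe.
                    try (PipedOutputStream o = out) {
                        byte[] chunk = new byte[512];
                        for (int written = 0; written < bytesPerStream; ) {
                            int n = Math.min(chunk.length, bytesPerStream - written);
                            o.write(chunk, 0, n);
                            written += n;
                        }
                    }
                    return null;
                }));
            }
            long total = 0;
            byte[] buf = new byte[4096];
            for (InputStream in : inputs) { // finish stream 0 before touching stream 1, etc.
                try (InputStream s = in) {
                    for (int n; (n = s.read(buf)) != -1; ) {
                        total += n;
                    }
                }
            }
            for (Future<?> f : writerResults) {
                f.get(5, TimeUnit.SECONDS); // surface writer failures
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("demo failed", e);
        } finally {
            writers.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes read: " + run(8, 10_000));
    }
}
```

Because every pipe has its own writer, a writer whose stream is not being read yet simply waits on its own small pipe buffer and does not hold up the stream the reader is currently draining.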
hi @TingDaoK thanks for getting back to me! |
I believe your example here:
is writing the same line to all the streams. But, to write different lines to different streams, you can still do that by editing how the write works.
Note: you will also need to update your content length carefully. But anyhow, the idea is the streams should NOT block each other. |
@TingDaoK oh sorry you're absolutely right about what the example code is doing, this example i gave is a bit different than what we actually do so i got confused. in your prior comment you wrote The reason i am kinda digging into this is because our actual use case is a bit more complicated. we could buffer some amount of the data in memory to make sure the first concurrent send/request completes its part. im just a little hesitant about buffering a lot of stuff into memory since we deal with very big datasets. |
I still don't fully understand which resource blocks the program and leads to the hang. The CRT event loop pool size is half the number of CPUs. The part size is 8 MB by default, and it's configurable. I think that when the order in which you provide the data differs from the order in which the client asks for it, you cannot avoid buffering the data in memory. Either it is buffered before you submit it to the client, or the client has to buffer the data somewhere. In C, we do have an alternative API to provide data asynchronously, which can put some requests on hold and work on other requests. It has not been integrated with Java yet, but it has been integrated with mountpoint, https://github.com/awslabs/mountpoint-s3, which may be a tool you will be interested in. If you only want Java support, maybe create a feature request, and we can prioritize it. With the current API, we cannot change the scheduling logic on the client, so in your case you will have to buffer the data. |
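To make the buffering trade-off concrete, here is a small JDK-only sketch, assuming the data goes through PipedInputStream/PipedOutputStream as earlier in the thread. The class `BufferedPipesDemo` is hypothetical. It sizes each pipe to hold its whole stream, so a single thread can write round-robin to every stream before anything is read, at the cost of `streams * bytesPerStream` bytes of heap.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

// Hypothetical model of "buffer before handing data to the client".
public class BufferedPipesDemo {

    /**
     * Writes round-robin into `streams` pipes from the calling thread, then
     * drains them in order. This only works because each pipe can hold all
     * of its stream's data. Returns the number of bytes read back.
     */
    public static long run(int streams, int bytesPerStream) {
        try {
            PipedOutputStream[] outs = new PipedOutputStream[streams];
            PipedInputStream[] ins = new PipedInputStream[streams];
            for (int i = 0; i < streams; i++) {
                outs[i] = new PipedOutputStream();
                // Pipe buffer holds the whole stream: memory cost is streams * bytesPerStream.
                ins[i] = new PipedInputStream(outs[i], bytesPerStream);
            }
            byte[] chunk = new byte[512];
            for (int written = 0; written < bytesPerStream; written += chunk.length) {
                int n = Math.min(chunk.length, bytesPerStream - written);
                for (PipedOutputStream out : outs) {
                    out.write(chunk, 0, n); // never blocks: the pipe has room for everything
                }
            }
            for (PipedOutputStream out : outs) {
                out.close();
            }
            long total = 0;
            byte[] buf = new byte[4096];
            for (PipedInputStream in : ins) { // reader's order differs from write order
                for (int r; (r = in.read(buf)) != -1; ) {
                    total += r;
                }
                in.close();
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes read: " + run(8, 4096));
    }
}
```

With the default 1 KiB pipe buffer, the same single-threaded writer would block as soon as one pipe filled up, which is the shape of the hang discussed above. For very large datasets the memory cost of this approach is the concern raised in the previous comment.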
Ah cool, thanks so much for the info! is there any way to dig into the original question about pthread lock system calls when using multiple s3crt clients or is the answer just "don't use multiple clients"? |
You can create multiple clients, but with multiple clients the number of network event loops will increase, as each client will try to use the network bandwidth to meet its target throughput. So it's not recommended. And with multiple clients, the scheduling logic will be the same: each client will work first on the first request assigned to it. I guess you can try to have one client for downloads and multiple clients for uploads, with only one request per upload client. You will also need to configure the target throughput for each client so that the total matches the network bandwidth you have. |
Describe the bug
We recently upgraded our crt library from 0.24.0 to 0.31.1 and noticed massive cpu spikes. It seems that the library tries to use all the cpu the machine has, which is beyond what the kube container is allocated. I've tried toggling maxConcurrency and targetThroughputInGbps but it has no effect. I've decreased the "apparent" cpu count in java by using -XX:ActiveProcessorCount=10. The very odd part is in datadog the cpu spikes seem to be from lock functions. Attached are two files during a load test. You can see the change in cpu time for the underlying library
Expected Behavior
Should not utilize all cpu available
Current Behavior
Utilizes all cpu available
Reproduction Steps
We see this in load tests, i am not sure how to reproduce it reliably
Possible Solution
No response
Additional Information/Context
No response
aws-crt-java version used
0.31.1
Java version used
17
Operating System and version
Ubuntu 20.04.6 x86_64