UCXUnreachable when running a benchmark with UCX_TLS=sm #1006
Hi @pentschev, thanks for the quick response! Actually, I am trying to limit communication to shared memory intentionally (to compare it with our hand-written shared-memory-based data transfer for low-latency communication). I am running the benchmark on a single node, so it should theoretically work? Also, if I add `tcp,sm`, how can I ensure that sm is used for the communication (I am not interested in TCP-based transport performance right now)?
Indeed this is possible with UCX, but the benchmark isn't prepared for that case, unfortunately. IIUC, tools like

Would you mind sharing a bit more about what you're trying to do, in particular what you mean by "hand written"? Depending on what you want to do, UCX-Py may not be the most performant; for instance, the benchmark runs by default with
What's the best way to use shm? I also made a simple script using `create_endpoint` and a listener, but it hit basically the same error. Is it safe to assume ucx-py doesn't support shm and that I should instead use the core or C++ APIs (https://github.com/rapidsai/ucxx)? Also, if I use `tcp,sm`, can I assume shared memory is actually used? I have the impression this will only use TCP?
We currently use gRPC for data transfer, and for some low-overhead use cases we'd like to skip it and simply use shared memory (basically read and write to a buffer in shared memory) for sending/receiving the data. The details are a bit hard to explain, but it's something like this
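The "hand written" pattern described above can be sketched with nothing but the Python standard library: one side creates a named shared-memory segment and writes the payload in, the other side attaches to the segment by name and copies the payload out. This is only a conceptual illustration of the baseline being compared against, not Ray's or UCX's actual mechanism; all names here are illustrative.

```python
# Illustrative only: a minimal hand-rolled shared-memory transfer using the
# Python stdlib, roughly the pattern being compared against UCX's sm transport.
from multiprocessing import shared_memory

payload = b"hello from the writer"

# Writer side: create a named shared-memory segment and copy the payload in.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[: len(payload)] = payload

# Reader side: attach to the same segment by name and copy the payload out.
# In a real setup the name would be exchanged over some control channel.
reader = shared_memory.SharedMemory(name=shm.name)
received = bytes(reader.buf[: len(payload)])

reader.close()
shm.close()
shm.unlink()

print(received.decode())
```

In practice the two halves would run in separate processes; the segment name is the only piece of state that has to be communicated out of band.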
In terms of data size, it varies, but we are planning to microbenchmark with 1KB~100MB. I don't actually expect "faster" performance; I think close-enough performance should be good enough.
Using a listener wouldn't work; a listener must use an IP address to establish connections. For the UCX-Py benchmark we would thus be unable to use a UCX listener, but would instead need to establish endpoints using the remote worker's address, similar to how this test handles endpoints/data transfer. We would first need to exchange worker addresses, either via IPC (e.g., a Python

Please note that you would also need to specify

I could probably attempt doing that next week, but cannot promise that.

EDIT: apologies, I hit the wrong button and closed/posted the response too soon; I'll follow up with the rest in the next comment.
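The address-exchange step described above can be sketched with a plain IPC channel: each side serializes its worker address (an opaque byte blob) and publishes it to the peer before any UCX endpoint exists. Placeholder bytes stand in for the real address blob here, and the endpoint-creation step is only referenced in a comment; this is a sketch of the pattern, not working UCX code.

```python
# Sketch of exchanging worker addresses over IPC before creating endpoints.
# The address bytes below are placeholders, not real UCX worker addresses.
from multiprocessing import Queue

server_addr_q: Queue = Queue()

# "Server" side: obtain the worker address as bytes and publish it.
# With ucx-py this blob would come from the worker-address API.
server_worker_address = b"\x01\x02..."  # placeholder for the real address blob
server_addr_q.put(bytes(server_worker_address))

# "Client" side: fetch the peer's address, then create an endpoint from it
# (with ucx-py/ucxx, via the create-endpoint-from-worker-address API).
peer_address = server_addr_q.get()
assert isinstance(peer_address, bytes)
```

In a real benchmark the two sides would be separate processes, and any IPC mechanism (queue, pipe, file, or even a plain socket used only for this handshake) works for the exchange.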
Apologies for the late reply to the second part, I got sidetracked by other errands.
This will depend on your system. For instance, if workers are running on separate NUMA nodes, then SM will not be used. One way I know you can verify this is to have a UCX debug-mode build and run with
I'm assuming you're doing that for Ray, is that right? The details resemble ray-project/ray#30094 a lot.
If that is for Ray and I remember correctly, the communication backend is implemented in C(++), isn't it? If that's the case, I would absolutely point you to https://github.com/rapidsai/ucxx instead, as there you can use its C++ backend explicitly and completely forget about any Python code. Additionally, all of the UCX-Py API is now implemented on top of UCXX and is part of that repo as well, so UCX-Py is expected to be deprecated/archived in the next 3-6 months in favor of UCXX.

If you need Python, and more importantly if you need Python async, then performance for small messages may be significantly impacted, but performance for large messages should be affected much less (I believe in the 1-2% range when compared to pure C).

EDIT: Forgot to mention initially that we need to disable endpoint error handling for SM. I have now opened #1007 to add that as an argument and updated the instructions above accordingly.
Yes, it is for Ray! And ray-project/ray#30094 is great. Thanks for the great context there.
That is great to know. And yeah, Ray is written in C++, and I would eventually expect to use the C (or C++, if it exists) APIs. It's great that there's already a C++ API! I wanted to use ucx-py for quick prototyping & perf benchmarking, but it makes sense that there's high overhead due to Python.
Is it specific to the Python API? Or is it something I should keep in mind while using UCXX? I will play with UCXX and get back to you with more questions! As I mentioned, the first goal is to do benchmarking (sm, tcp, RDMA, and maybe GPU) and learn the APIs better. I have a couple of last questions.
You should keep that in mind. This is a limitation in UCX: some transports (like shared memory) do not support endpoint error handling, which we enable by default here for InfiniBand use cases, where having no error handling may cause deadlocks at shutdown. I will port #1007 to UCXX as well.
The only other public channel we have is the RAPIDS Slack. For very technical questions GitHub is preferred, though, as we keep a public record people can find.
Unfortunately I'm not aware of any plans to support EFA. However, UCX is an open standard, and if there's interest from AWS or other members of the community to implement EFA support in UCX, that is certainly welcome. If that ever happens, UCXX would support it out of the box. Also, there are no plans for UCXX to support other transports on its own; the intent is to always go through UCX and let it be the communication engine.
Hi @pentschev! I was able to start playing with ucxx, and the initial performance seems pretty good! I am going to play with it more and start benchmarking more seriously, but I'd also like to ask a couple more questions:
I am also very new to this HPC style of communication, so please bear with me if I'm asking a bad question!
TagSend and TagRecv are both non-blocking (they return a request object), so you must progress the worker until the requests complete, e.g.:

```cpp
auto requests = std::vector<std::shared_ptr<Request>>{};
requests.push_back(/* the request returned by tagSend/tagRecv */);
while (!std::all_of(requests.cbegin(), requests.cend(),
                    [](auto r) { return r->isCompleted(); })) {
  worker->progress();                  // polling mode
  // worker->progressWorkerEvent(-1);  // blocking mode
  // worker->waitProgress();           // wait mode
}
```

The docstrings of the various progress functions on the worker object are probably your best bet right now.
If you pass a device pointer to tagsend/tagrecv and your UCX install is built with the appropriate device transports enabled, then this should "just work".
I suspect it depends a bit on your use case. Active messages will always allocate new buffers for the incoming message on the receiver (so UCX arranges to send a header along with the message that indicates how many bytes, and so forth). If you already know the size of the buffer on the receive side (and/or you want to reuse buffers), then the tag API is better. The core implementation in UCX uses the same transports under the hood, so the raw transfer of bits should be about the same performance. Active messages have a bit more book-keeping, I think (not a lot), which you might have done anyway in a distributed algorithm.
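The distinction above can be illustrated with a toy model: in tag-matched receive, the receiver pre-posts a buffer keyed by a tag, so when a message with that tag arrives it is copied into the already-allocated buffer rather than into fresh memory. This is a conceptual sketch only, not how UCX implements tag matching internally; all names are made up.

```python
# Toy model of a tag-matched receive: the receiver pre-posts a buffer for a
# known tag, so no allocation happens when the message arrives. Conceptual
# illustration only; not UCX's actual tag-matching implementation.

posted_buffers: dict[int, bytearray] = {}

def tag_recv_post(tag: int, nbytes: int) -> bytearray:
    """Pre-allocate (or reuse) a receive buffer for a given tag."""
    return posted_buffers.setdefault(tag, bytearray(nbytes))

def deliver(tag: int, data: bytes) -> None:
    """Simulate message arrival: copy into the matching posted buffer."""
    buf = posted_buffers[tag]   # a matching posted receive must exist
    buf[: len(data)] = data     # no new allocation at delivery time

buf = tag_recv_post(tag=42, nbytes=16)
deliver(tag=42, data=b"payload")
print(bytes(buf[:7]))
```

An active-message model, by contrast, would allocate a fresh buffer per incoming message based on the size carried in its header, which is the extra book-keeping mentioned above.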
I tried running a benchmark using
And it seems like the client cannot reach the server for some reason:
When I enable debug logs, I also see
Have you guys seen any similar issue before?