Long-Tail Requests #9976
Comments
@Clownier can you pls try the following:
Thank you for your advice.
Describe the bug
We use the RDMA transport provided by UCX on a RoCE v2 network and observe long-tail requests that take 1 to 2 seconds. Their frequency appears to increase with system load and with the size of the requested I/Os. We have also noticed that the tx_pause counter on the eth interfaces and the ecn and cnp counters on bond1 keep increasing even when traffic is relatively low (under 500MB per machine). Is this behavior normal?
Our architecture is shown in the attached diagram. The Server and Client ends are wrappers we built on top of UCX, and the main APIs used are ucp_tag_send_nb and ucp_tag_recv_nb. Each Server end holds connections to multiple Client ends. The long-tail issue occurs mainly between the Middle and Tail processes, both of which use the rc_x transport.
The request-response exchange between Client and Server is built on UCX tag matching. The Client first sends a request header that carries a specific Tag, then sends the data for that request using that Tag. When the Server receives the request header, it parses the Tag and posts a receive for the data with that Tag. The Server then sends a Response back to the Client, also in the form of a request header. All request headers are matched on a fixed Tag whose first bit is set to 1.
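To make the tag scheme above concrete, here is a minimal sketch of how the header and data tags could be laid out. It does not call UCX and is not our production code. It assumes that "first bit" means the most significant bit of the 64-bit tag, and the names `HDR_TAG`, `request_header_t`, `make_data_tag` and `is_header_tag` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: "first bit" is the most significant bit of the 64-bit tag. */
#define HDR_TAG_BIT (UINT64_C(1) << 63)

/* Hypothetical fixed tag used by every request/response header message. */
#define HDR_TAG HDR_TAG_BIT

/* Hypothetical header body: tells the receiver which tag the data will use. */
typedef struct {
    uint64_t data_tag; /* tag the payload is sent with (header bit clear) */
    uint64_t data_len; /* payload length in bytes */
} request_header_t;

/* Build the tag for a request's payload; the header bit must stay clear
 * so that data messages never match the fixed header receive. */
static uint64_t make_data_tag(uint64_t request_id)
{
    assert((request_id & HDR_TAG_BIT) == 0); /* id must fit in 63 bits */
    return request_id;
}

/* Returns nonzero if the tag belongs to a header message. */
static int is_header_tag(uint64_t tag)
{
    return (tag & HDR_TAG_BIT) != 0;
}
```

With a layout like this, the receiver can keep one receive posted for `HDR_TAG` and post a second receive for `data_tag` only after it has parsed a header.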
After troubleshooting, we found that the long tail is mainly caused by the Server failing to receive data. For roughly 2 seconds, the Server's Progress function keeps receiving new request headers, but it neither receives the data nor completes sending the Response (that is, the send callback is not invoked).
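For reference, the kind of bookkeeping that exposes such a stall can be sketched as below: each posted operation is timestamped, completion callbacks clear the entry, and the progress loop counts operations that have been pending longer than a threshold. This is an illustrative sketch only. It is independent of UCX, and `stall_tracker_t`, `tracker_post`, `tracker_complete` and `tracker_count_stalled` are hypothetical names, not UCX APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 1024 /* hypothetical upper bound on in-flight operations */

typedef struct {
    uint64_t tag;       /* tag of the posted send or receive */
    uint64_t posted_ns; /* timestamp when the operation was posted */
    int      in_use;    /* nonzero while the operation is outstanding */
} pending_op_t;

typedef struct {
    pending_op_t ops[MAX_PENDING];
} stall_tracker_t;

/* Record an operation at post time; returns its slot, or -1 if full. */
static int tracker_post(stall_tracker_t *t, uint64_t tag, uint64_t now_ns)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_PENDING; ++i) {
        if (!t->ops[i].in_use) {
            t->ops[i].tag       = tag;
            t->ops[i].posted_ns = now_ns;
            t->ops[i].in_use    = 1;
            return i;
        }
    }
    return -1;
}

/* Call from the completion callback to mark the operation finished. */
static void tracker_complete(stall_tracker_t *t, int slot)
{
    if (slot >= 0 && slot < MAX_PENDING) {
        t->ops[slot].in_use = 0;
    }
}

/* Count operations that have been outstanding longer than threshold_ns. */
static size_t tracker_count_stalled(const stall_tracker_t *t,
                                    uint64_t now_ns, uint64_t threshold_ns)
{
    size_t stalled = 0;
    for (int i = 0; i < MAX_PENDING; ++i) {
        const pending_op_t *op = &t->ops[i];
        if (op->in_use && now_ns > op->posted_ns &&
            now_ns - op->posted_ns > threshold_ns) {
            ++stalled;
        }
    }
    return stalled;
}
```

Calling `tracker_count_stalled` once per progress iteration with a threshold of about one second would flag the window in which headers still arrive while data receives and Response sends stay pending.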
Steps to Reproduce
Command line
UCX version used: v1.12.0
UCX configure flags
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
Any UCX environment variables used
Setup and versions
CentOS Linux release 7.2.1511 (Core) 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Additional information (depending on the issue)