diff --git a/doc/plots/lat-msgsize.p b/doc/plots/lat-msgsize.p index d2606d2..82b958e 100644 --- a/doc/plots/lat-msgsize.p +++ b/doc/plots/lat-msgsize.p @@ -59,8 +59,8 @@ plot $bufread with linespoints pt 11 ps 1.5 title "Buffered Read (BR)", \ $dirread with linespoints pt 9 ps 1.5 title "Direct Read (DR)", \ $writeOff with linespoints pt 5 ps 1.5 title "Buffered Write Offset (BR-Off)", \ $writeRev with linespoints pt 4 ps 1.5 title "Buffered Write Reverse (BR-Rev)", \ + $send with linespoints pt 7 ps 1.5 lc rgb "dark-orange" title "Send-Receive (SR)", \ $median with linespoints pt 13 ps 1.5 title "Direct Write (DW)", \ - $send with linespoints pt 7 ps 1.5 title "Send-Receive (SR)", \ diff --git a/doc/thesis/background.tex b/doc/thesis/background.tex index b337447..74e661a 100644 --- a/doc/thesis/background.tex +++ b/doc/thesis/background.tex @@ -3,7 +3,7 @@ \section{RDMA} \label{sec:rdma} Remote Direct Memory Access (RDMA) is a network mechanism that allows moving buffers between applications over the network. The main difference to traditional network protocols like TCP/IP is that it is able to completely bypass the hosts kernel and even circumvents the CPU for data transfer. This allows applications using RDMA to achieve latencies as low as 2 $\mu s$ -and throughputs of up to 100 $Gb/s$, all while having a smaller CPU footprint. +and throughputs of up to 100 $Gbit/s$, all while having a smaller CPU footprint. \paragraph{} While initially developed as part of the \emph{InfiniBand} network protocol, which completely replaces the OSI @@ -114,14 +114,15 @@ \subsection{Verbs API} \begin{itemize} \item \textbf{Send (with Immediate):} Transfers data from the senders memory to a prepared memory region at the receiver. \item \textbf{Receive:} Prepares a memory region to receive data through the send verb. - \item \textbf{Write (with Immediate):} Copies data from the senders memory to known memory location at the receiver without any + \item \textbf{Write (with Immediate):} Copies data from the senders memory to a known memory location at the receiver without any interaction from the remote CPU. \item \textbf{Read:} Copies data from remote memory to a local buffer without any inteaction from the remote CPU. \item \textbf{Atomics:} Two different atomic operations. Compare and Swap (CAS) and Fetch and Add (FAA). They can access 64-bit values in the remote memory. \end{itemize} -\paragraph{} Like traditional socket these QPs come in different transport modes: Reliable Connection (RC), Unreliable Connection (UC), +\paragraph{} Like traditional sockets, these QPs come in different transport modes: Reliable Connection (RC), +Unreliable Connection (UC), and Unreliable Datagram (UD). While UD supports sending to arbitrary other endpoints, similarly to a UDP socket, RC and UC need to establish a one to one connection between Queue Pairs, similarly to TCP sockets. Only RC supports all verbs and we will focus on this transport mode. @@ -129,7 +130,7 @@ \subsection{Verbs API} \paragraph{} Queue Pairs give us a direct connection to the RNIC. A QP essentially consists of two queues that allow us to issue verbs directly to the RNIC. The \emph{Send Queue} is used to issue Send, Write, Read, and Atomic verbs, and the -\emph{Receive Queue} which is used to issue a Receive verb. These verbs are issued by pushing a \emph{Work Request (WR)} +\emph{Receive Queue} is used to issue Receive verbs. These verbs are issued by pushing a \mbox{\emph{Work Request~(WR)}} to the respective queue.
A work request is simply a struct that contains an id, the type of verb to issue, and all necessary additional information to perform it. The RNIC will pop the WR from the queue and execute the corresponding action. @@ -158,10 +159,12 @@ \subsection{Verbs API} \end{figure} -\paragraph{} This gives us an asynchronous interface. Issuing a work request for a Write does not mean that the Write was -performed, but simply that the RNIC will eventually process this request. To signal completion of certain work requests there +\paragraph{} This gives us an asynchronous interface. Issuing a work request for a Write operation +does not mean that the Write was +performed, but simply that the RNIC will eventually process this request. To signal the completion of a +work request there is an additional type of queue called the \emph{Completion Queue (CQ)}. There needs to be a CQ associated with each Send and -Receive Queue. When the RNIC completes a work request it will enqueue a \emph{Completion Queue Entry~(CQE)} to the respective +Receive Queue. When the RNIC completes a work request it will enqueue a \mbox{\emph{Completion Queue Entry~(CQE)}} to the respective CQ. This CQE informs the application whether the request was processed successfully. The application can match CQEs to previously issued work requests by the ID it provided during issuing. It is also possible to post an \emph{unsignaled} network operation that does not generate a CQE after its completion. @@ -169,7 +172,7 @@ \subsection{Verbs API} \paragraph{} All locally and remotely accessible memory needs to be previously registered for the RNIC to be able write to or read from it. We call these preregistered regions of memory \emph{Memory Regions (MRs)}. -Registering memory pins it so that it is not swapped out by the host. The process of registering is generally orders of magnitude slower +Registering memory pins it so that it is not swapped out by the host. The process of registering is orders of magnitude slower then data operations like sending or writing. So in general all accessed memory is registered at connection setup. Henceforth, we will assume these memory regions to be registered if not specified otherwise. @@ -225,7 +228,7 @@ \subsubsection{Send / Receive} \label{sec:bg:send} \paragraph{} To better understand this communication model we will walk through the operations involved in sending a single message from a system A to another system B. We assume that the two nodes have already setup a connection. In this -thesis we will not go into the details of connections setup. Each nodes has prepared a QP and associated a completion +thesis we will not go into the details of connection setup. Each node has prepared a QP and associated a completion queue to it. Both systems have registered a MR of at least the size of the message to be sent. \begin{enumerate} @@ -238,7 +241,7 @@ \subsubsection{Send / Receive} \label{sec:bg:send} System B now polls its CQ until it receives a CQE for its issued receive request. - \item System A can initiate the transfer by posting a \emph{Send Request}. To do this it copies a work request to + \item System A initiates the transfer by posting a \emph{Send Request}. To do this it copies a work request to the Send Queue. This request contains a pointer to its local memory containing the message to be sent and its size. This request notifies the NIC to initiate the transfer. It also starts polling its CQ to notice the completion of the send request.
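\paragraph{} To make these two building blocks more concrete, the following sketch shows how posting a send request and polling for its completion could look with the \code{libibverbs} API. The names \code{qp}, \code{cq}, \code{mr} and \code{wr\_id} are placeholders for objects created during connection setup; this is an illustrative sketch with all error handling omitted, not the exact code used in this thesis.

\begin{lstlisting}
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: issue a Send verb and busy-poll its Completion Queue.
 * qp, cq and mr are assumed to be set up during connection setup. */
static void send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                          struct ibv_mr *mr, uint32_t msg_len,
                          uint64_t wr_id)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)mr->addr;  /* registered send buffer        */
    sge.length = msg_len;
    sge.lkey   = mr->lkey;

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = wr_id;             /* used to match the CQE later   */
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED; /* request a CQE for this WR     */

    ibv_post_send(qp, &wr, &bad_wr);   /* push the WR to the Send Queue */

    struct ibv_wc wc;                  /* poll until the CQE arrives    */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    /* on success: wc.status == IBV_WC_SUCCESS and wc.wr_id == wr_id    */
}
\end{lstlisting}

The receiving side mirrors this by posting receive buffers with \code{ibv\_post\_recv} to the Receive Queue before the message arrives.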
@@ -298,17 +301,18 @@ \subsubsection{Write} \label{sec:bg:write} \label{fig:seq-wrt} \end{figure} -Figure~\ref{fig:seq-wrt} show the operations involved in writing data to the remote using RDMA write. It is generally very +Figure~\ref{fig:seq-wrt} shows the operations involved in writing data to remote memory using RDMA write. It is generally very similar to the send and receive sequence presented in the previous section. The sending CPU still issues a work request which is handled by the NIC and is notified of its completion through the completion queue. The main difference is that the remote system does not need to post a receive buffer and there is no CQE generated at the remote. Also, the work request is a structured differently. It does not only contain a pointer to the local send buffer but also provides the remote address to write it to. -\paragraph{} The standard RDMA write does not generate a completion entry at the receiver, which is generally more efficient. +\paragraph{} The standard RDMA write does not generate a completion queue entry at the receiver, +which is generally more efficient. However, sometimes it is very helpful for the receiver to be notified of a completed write. For this purpose the Verbs API also provides a related operation called \emph{Write with Immediate}. -This operation works very similarly to a normal RDMA write, but it generates a completion entry at the receiver, in the same way +This operation works very similarly to a normal RDMA write, but it generates a CQE at the receiver, in the same way the send verb does. This means it will also consume a posted receive request, so the receiver needs to post a receive request prior to the transfer. Write with Immediate however will not write any data to the associated receive buffer. diff --git a/doc/thesis/buffered_read.tex b/doc/thesis/buffered_read.tex index 1732e4f..a6de1db 100644 --- a/doc/thesis/buffered_read.tex +++ b/doc/thesis/buffered_read.tex @@ -2,11 +2,11 @@ \section{Buffered Read}\label{sec:conn:buf_read} The idea of a buffered read protocol is to have a ring-buffer at the sender from which the receiver fetches the messages using -RDMA reads. There are multiple different ways to implement such a protocol, with the main variations being in, how to notify +RDMA reads. There are multiple different ways to implement such a protocol, with the main differences being how to notify the receiver of new messages, where to transfer them to, and how to acknowledge to the sender that a message has been processed. \paragraph{} We decided to focus on an implementation which gives us a \emph{Passive Sender} and allows for -\emph{Variable Message Sizes}. We decided to stick with the basic interface defined in Section~\ref{sec:protocols}. This +\emph{Variable Message Sizes}. We stuck with the basic interface defined in Section~\ref{sec:protocols}. This results in a system with two ring-buffers, illustrated in Figure~\ref{fig:buf_read_struct}. \begin{figure}[!ht] @@ -65,7 +65,7 @@ \section{Buffered Read}\label{sec:conn:buf_read} \label{fig:buf_read_struct} \end{figure} -\subsubsection{Protocol} +\subsection{Protocol} As mentioned the sender of our buffered read connection is entirely passive, that means after the connection setup the sending CPU does not issue any RDMA operations. The only thing it needs to do to send is to check its head if there is enough space, copy the message to its tail, and update the tail offset. It also prepends the message size when writing.
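\paragraph{} As a rough sketch of this sending path, and under the assumption of monotonically increasing \code{head} and \code{tail} byte offsets whose wrap-around is handled elsewhere (for example with a doubly mapped buffer), the passive sender could look as follows. The names are illustrative and not taken from our implementation.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the passive sender of a buffered read connection. The
 * receiver later fetches the range [head, tail) with RDMA reads and
 * frees it by advancing head; no RDMA operation is issued here. */
static bool buffered_read_send(char *ring, uint32_t buf_size,
                               uint32_t head, uint32_t *tail,
                               const char *msg, uint32_t len)
{
    uint32_t needed = sizeof(uint32_t) + len; /* size prefix + payload */
    uint32_t used   = *tail - head;           /* bytes not yet freed   */
    if (buf_size - used < needed)             /* not enough free space */
        return false;

    uint32_t off = *tail % buf_size;          /* wrap-around assumed safe,
                                                 e.g. via double mapping */
    memcpy(ring + off, &len, sizeof(uint32_t));      /* prepend the size */
    memcpy(ring + off + sizeof(uint32_t), msg, len); /* copy the payload */

    *tail += needed; /* publish the new tail; read remotely by the receiver */
    return true;
}
\end{lstlisting}

The size prefix is what later allows the receiver to split a fetched buffer section back into individual messages.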
@@ -84,7 +84,7 @@ \subsubsection{Protocol} return the next message with the length at the beginning of the buffer and update the \code{read\_pointer}. If there are no transmitted messages to return, the receiver updates the \code{remote\_tail} with an RDMA read. If the remote -tail has not updated, it retries until it has. With the updated tail the receiver issues an RDMA read for the whole buffer +tail has not been updated, it retries until it has. With the updated tail the receiver issues an RDMA read for the whole buffer section between the read pointer and the updated tail. It then returns the next message in the newly transmitted section. This gives us an interesting characteristic of most receives being very cheap, while some are very expensive as they need to @@ -109,5 +109,5 @@ \subsection{Feature Analysis} \paragraph{} For future work it would also be interesting to implement a system where multiple threads can write to and receive -from a single connection by atomically updating head or tail pointers, or we could explore the possibilities to share send or +from a single connection by atomically updating head or tail pointers, or we could explore the possibilities of sharing send or receive buffers using atomic operations. The current implementation, however, does not provide any kind of \emph{Resource Sharing} diff --git a/doc/thesis/conclusion.tex b/doc/thesis/conclusion.tex index 3c202af..b996d2c 100644 --- a/doc/thesis/conclusion.tex +++ b/doc/thesis/conclusion.tex @@ -18,7 +18,7 @@ \section{Conclusion} \paragraph{} We analysed the presented protocols and evaluated more then just raw performance. We looked at other features that can be critical for applications. These features include achieving effective memory usage by allowing -variable message sizes, avoiding additional copying by being truly zero-copy, or one slow processing message not +variable message sizes, avoiding additional copying by being truly zero-copy, or one slowly processed message not being able to stall the whole connection. \paragraph{} We introduced a new performance model for RDMA based message passing protocols that allows us to diff --git a/doc/thesis/direct_read.tex b/doc/thesis/direct_read.tex index 27af137..5264161 100644 --- a/doc/thesis/direct_read.tex +++ b/doc/thesis/direct_read.tex @@ -1,9 +1,9 @@ \section{Direct Read} \label{sec:conn:direct_read} -In Section~\ref{sec:conn:direct_write} we discussed how we can possibly avoid an additional copy at the receiver by giving +In Section~\ref{sec:conn:direct_write} we discussed how we can possibly avoid an additional copy at the receiver, by giving the sender information which allows him to potentially write the data to the correct final memory location. The next logical -step is to let the receiver decide for each message where to write it to. We can achieve this by our implementation of a +step is to let the receiver decide for each message where to write it to. We can achieve this with our implementation of a \emph{Direct Read Connection}. \paragraph{} The core idea of a direct read protocol is that instead of directly sending a message through a send or write @@ -24,8 +24,8 @@ \subsection{Protocol} \paragraph{} To wait for the transfer to be completed, and for the buffer to be able to be reused, we can not simply wait for the completion event of the send, like we do for the send or write based connections. We need to wait for the receiver to explicitly signal that the buffer was transfered.
We append a signaling byte at the end of the send buffer. -When sending this byte, will be set to 0 and we can wait for the transport to be completed by polling this byte until the -receiver will update it. +When sending, this byte is set to 0 and we can wait for the transport to be completed by polling this byte until the +receiver updates it. This push based implementation introduces little additional complexity, but there are other ways to implement such signaling. The signaling bit forces us to use a specific memory arrangement, which could prevent us to send data directly @@ -39,8 +39,8 @@ \subsection{Protocol} It is crucial that we do not block until the read is completed to get reasonable performance. This means the receiver has a slightly different interface than the previously presented connections. We split the receive call into a \code{RequestAsync} and a \code{Wait} method. The \code{RequestAsync} takes a receive buffer -to read into. It will wait for an incoming read request and issue the corresponding read. It uses the same increasing -\code{wr\_id} approach we use for sending with which the \code{Wait} method can wait for the read to complete. This approach +to read into. It waits for an incoming read request and issues the corresponding read. It uses the same increasing +\code{wr\_id} approach we use for sending, with which the \code{Wait} method can wait for the read to complete. This approach allows us to pipeline receives the same way we pipeline sends. \paragraph{} As soon as the transfer is complete, the receiver updates the corresponding signaling bit using an RDMA write. diff --git a/doc/thesis/direct_write.tex b/doc/thesis/direct_write.tex index 453ec8f..ea43653 100644 --- a/doc/thesis/direct_write.tex +++ b/doc/thesis/direct_write.tex @@ -5,7 +5,7 @@ \section{Direct Write}\label{sec:conn:direct_write} regions for the sender to write to. \paragraph{} For this thesis we implemented something very reminiscent of the send receive protocol. The core idea is for -the receiver to send information on prepared receive buffers to the sender. The sender will then use these buffers in order. +the receiver to send information on prepared receive buffers to the sender. The sender will then use these buffers. @@ -104,19 +104,20 @@ \subsection{Protocol} order, for any modern systems~\cite{herd, farm}. This guarantees us that the complete message has been written as soon a we see an update to the last byte. -\subsection{Features Analysis} +\subsection{Feature Analysis} With our Direct Write connection we essentially rebuilt the send and receive verbs using only RDMA writes. This gives us similar features but also allows for more control over the protocol. This could potentially allow us to extend it for specific systems. \paragraph{} Our current implementation fulfills our requirements for being \emph{non-blocking}, the same way the send-receive -protocol does. It however does not provide any \emph{interrupts} and we did not explore any \emph{resource sharing} approaches. +protocol does. It, however, does not provide any \emph{interrupts} and we did not explore any \emph{resource sharing} approaches. As it stands, it also does not provide \emph{True Zero Copy} capabilities and it lacks support for \emph{variable message sizes}. -\paragraph{} We do however think that this protocol can be adapted to enable more features. +\paragraph{} Nevertheless, we do think that this protocol can be adapted to enable more features.
+By adding metadata when posting a buffer we could enable \emph{True Zero Copy} capabilities for certain applications, by performing some routing decisions at the sender and being able to directly write to the correct final memory location. And with a more sophisticated buffer management one could more effectively utilize the available memory, by supporting diff --git a/doc/thesis/evaluation.tex b/doc/thesis/evaluation.tex index 43520aa..a61eb62 100644 --- a/doc/thesis/evaluation.tex +++ b/doc/thesis/evaluation.tex @@ -113,7 +113,7 @@ \subsection{Bandwidth} The direct-write protocol, however, seems to be limited by the sending NIC instead or possibly by the PCIe bus. The CPU needs to wait for the operation to be completed, so we are not limited by our ability to issue work requests. This results in up to 30\% lower throughput for medium sized messages. We suspect that the large -amount of returning writes interferes with outgoing writes and increases the per message overhead for the NIC. +number of returning writes interferes with outgoing writes and increases the per message overhead for the NIC. \paragraph{} Figure~\ref{fig:plot-bw-bw} shows the 1:1 bandwidth for all buffered write variants. We only show the measurements @@ -172,7 +172,7 @@ \subsection{Bandwidth} \emph{read} acknowledger, we see a significant drop in performance when sending messages that are slightly larger than 4090 bytes. With the added 6 byte of protocol overhead, the performance degradation happens when the total write is larger than 4KB. -Both the pagesize as well as the MTU is 4KB. Bandwidth then seems to linearly increase and will drop again very similarly +Both the pagesize and the MTU are 4KB. Bandwidth then seems to linearly increase and will drop again very similarly for all multiples of 4 KB. Interestingly, this cannot be observed with other sender implementations and the effect is greatly @@ -242,7 +242,7 @@ \subsection{N:N} We can see that for large messages we are bottlenecked by the network bandwidth of 100 Gbit/s. For smaller messages, the throughput first increases linearly until we hit a bottleneck. Interestingly, for a message size of 16 bytes we are able to send over twice the amount of messages per second compared to a message size of 512 bytes. -We suspect this to be caused by NIC level optimizations for small messages, such as in-line receiving~\cite{anuj-guide} which +We suspect this to be caused by NIC level optimizations for small messages, such as inline receiving~\cite{anuj-guide} which is supported by Mellanox NICs for messages up to 64 bytes. The bottleneck seems to be the total throughput of the receiving NIC. This is significantly higher than the maximum throughput seen for a single connection, this can be attributed to the usage of multiple processing units~\cite{anuj-guide}. @@ -257,8 +257,8 @@ \subsection{N:N} to be a limit of the receivers NIC. \paragraph{} Figure~\ref{fig:plot-wdir-bw-threads} shows the N:N bandwidth for the direct-write protocol. When using 8192 -byte messages, we are again limited by the maximum link speed. When using 512 byte messages we seem to be bottlenecked at around -70 Gbit/s, and for small messages we are capped at around 6 Gbit/s, which is very close to the performance of the send-receive +byte messages, we are again limited by the maximum link speed.
When using 512 byte messages, we seem to be bottlenecked at around +70 Gbit/s, and for small messages, we are capped at around 6 Gbit/s, which is very close to the performance of the send-receive protocol. However, the 70 Gbit/s limit for 512 byte messages is significantly less than the 80 Gbit/s maximum we saw for the @@ -298,8 +298,8 @@ \subsection{N:N} \paragraph{} Figure \ref{fig:plot-write-bw-thread} shows the total throughput of all buffered write implementations with varying number of connections and the three different message sizes. -We can again see that for large messages we keep being bottlenecked by the network bandwidth of 100 Gbit/s, regardless of -the sender. However, the write offset (BW-Off) implementation shows a small overhead and does not quite achieve the same +For large messages, we are bottlenecked by the network bandwidth of 100 Gbit/s, regardless of +the sender. However, the write offset (BW-Off) implementation shows a very small overhead and does not quite achieve the same throughput as the other two implementations. For smaller messages, we see a linear increase in throughput with increasing number of connections until we hit a bottleneck. @@ -330,7 +330,8 @@ \subsection{N:N} \caption{N:N Bandwidth Read-Based Protocols} \end{figure} -\paragraph{}Figure~\ref{fig:plot-dirread-bw-threads} shows the direct-read protocol's N:N performance. The results look very similar to +\paragraph{}Figure~\ref{fig:plot-dirread-bw-threads} shows the direct-read protocol's N:N performance. +The results look very similar to the measurements of other connections. There is a linear increase in performance caused by the ability to post more work requests as well as the ability to utilize more NIC processing units.~\cite{anuj-guide} @@ -343,7 +344,7 @@ \subsection{N:N} \paragraph{} Figure~\ref{fig:plot-bufread-bw-threads} shows the buffered-read protocol's N:N performance. We see drastically improved performance compared to the single connection evaluation. This is mainly caused by the fact that we now have multiple concurrent active read operations. Through the inbuilt message batching we achieve significantly higher throughput for small -messages compared to the other connection types we evaluated. Throughput for all message sizes grow linearly with increasing +messages compared to the other connection types we evaluated. Throughput for all message sizes grows linearly with increasing number of connections and are only limited by the line rate or eventually by the individual copying of buffers at the sender and its function call overhead. @@ -358,7 +359,7 @@ \subsection{N:N} direct-write protocol with its returning writes performed very similarly to the write-offset or direct-read protocol. One thing the buffered read connection has shown us is that to achieve optimal bandwidth for small messages, sending side and, -whenever possible, application level batching is necessary. We also saw that SRQs, while saving memory usage, have +whenever possible, application level batching is necessary. We also saw that SRQs, while reducing memory usage, have significant performance limitations. @@ -373,7 +374,7 @@ \subsection{N:1} round robin over the $N$ open connections. As we did for the N:N experiments, we evaluate the throughput for three different message sizes: 16 bytes, -512 bytes, and 8192 bytes. We do not perform any sender side batching but allow for sufficient unacknowledged messages +512 bytes, and 8192 bytes. 
We do not perform any sender side batching but allow for sufficient unacknowledged messages. In our plots, we report the sum of all connection throughputs. @@ -406,7 +407,7 @@ \subsection{N:1} \paragraph{} Figure~\ref{fig:plot-sndrcv-bw-n1-nosrq} shows the throughput for the send-receive protocol and a single receiver. We use the single receive approach described in Section~\ref{sec:conn:send}, which allows us to route all completion events for multiple QPs to a single receiving thread. To prevent Reader-Not-Ready (RNR) errors, which happen when the receiving CPU is -unable to repost receive buffers quickly enough, we limit the sender to a stable sending rate. +unable to repost receive buffers quickly enough, we limit the senders to a stable sending rate. For 16 byte messages, we seem to be limited at around 11 MOp/s, while for 512 bytes we are limited at around 16 MOp/s. Both bottlenecks are caused by the receiving CPU. We expect the difference in sustainable message rates to be the result of inline receives for @@ -414,11 +415,12 @@ \subsection{N:1} CPU which is already the bottleneck in this situation. -However for both smaller message sizes we see a drop in performance when further increasing the number of sender. We explain +However, for both smaller message sizes we see a drop in performance when further increasing the number of senders. +We explain this by increased cache misses, caused by having to access more QPs and the linearly growing number of receive buffers. \paragraph{} Figure~\ref{fig:plot-sndrcv-bw-n1-srq} shows the same plot while using a shared receive queue for all QPs. For -large messages, we are again only limited by the link speed. +large messages, we are only limited by the link speed. For small messages of 32 bytes or lower, we see similar throughput as without using a shared receive queue. We are still limited by the receiving CPU. We do, however, not see any performance drops when using an increasing amount of senders. This @@ -473,30 +475,30 @@ \subsection{N:1} \paragraph{} Figure~\ref{fig:plot-write-bw-n1} shows the N:1 throughput for all buffered-write protocols as well as the two shared-write implementations. -For large messages of size 8196, throughput is again limited by the link speed for all buffered-write implementations, +For large messages of size 8196, throughput is once again limited by the link speed for all buffered-write implementations, with the same slight overhead for the write offset (BW-Off) implementation which we have seen in all previous bandwidth plots. For smaller messages, the buffered-write protocols are limited by the receiver's CPU. We can avoid any RNR errors for the write immediate sender by limiting the ring-buffer size. This way, we do not have to artificially limit the sender as we did for the send-receive protocol. For both 16 byte as well as 512 byte messages, the write reverse \mbox{(BW-Rev)} implementation achieves up to 20\% higher throughput -compared to the write immediate (BW-Imm) protocol. This can be explained by the additional which receive buffer reposting the -receiving CPU needs to perform when using write with immediate. +compared to the write immediate (BW-Imm) protocol. This can be explained by the additional receive buffer reposting +which the receiving CPU needs to perform when using write with immediate. The write offset (BW-Off) implementation achieves similar performance to BW-Imm.
Interestingly, BW-Off seems to reach slightly higher performance when opening more than six concurrent connections with both 512 byte as well as 16 byte sized messages. As of -today, we do not have a clear explanation for it. Our current best guess is that with more than six connections the +today, we do not have a clear explanation for this. Our current best guess is that with more than six connections the metadata is spread over more than one cache line. This can result in less cache invalidation and in turn in lower receiver CPU usage and better overall performance. Further research into this is necessary. -\paragraph{} Figure~\ref{fig:plot-write-bw-n1} also evaluates the shared write protocol (SW). We can see that we are strongly limited +\paragraph{} Figure~\ref{fig:plot-write-bw-n1} also evaluates the \mbox{shared write protocol (SW)}. We can see that we are strongly limited by our two phase approach for all message sizes. That means, we see a linear increase in bandwidth with increasing number of senders as we are able to issue more requests in parallel. For the large message size of 8 KB, the total throughput increases linearly until we hit the line rate with 10 -active senders. The use of device memory \mbox{(SW-DM)} allows us to saturate the link with only nine connections. +active senders. The use of \mbox{device memory (SW-DM)} allows us to saturate the link with only nine connections. This is not a large difference and we are able to take full advantage of the bandwidth with or without the usage of device memory. @@ -528,7 +530,7 @@ \subsection{N:1} \end{figure} Figure~\ref{fig:plot-dirread-bw-n1} shows the direct-read protocol's throughput with a single receiver and varying number of senders. -We are again very much limited by the receiving CPU in this case as the receiver already needs to do the heavy lifting for +We are very much limited by the receiving CPU in this case as the receiver already needs to do the heavy lifting for this protocol. For smaller messages, there is no throughput improvements at all with increasing number of connections as we already have @@ -548,9 +550,9 @@ \subsection{N:1} by increased cache misses, caused by the growing total buffer space and the usage of multiple QPs. -\paragraph{} Unsurprisingly, when developing a protocol for a N:1 communication, it is vitally important to reduce the +\paragraph{} Unsurprisingly, when developing a protocol for an N:1 communication pattern, it is vitally important to reduce the involvement of the receiving CPU in the transmission. This makes read based protocols unsuitable. Interestingly, in our -evaluation the send-receive protocol outperformed both the direct as well as buffered-read protocols. But we need to +evaluation the send-receive protocol outperformed both the direct-write as well as buffered-write protocols. But we need to keep in mind that we should be able to drastically improve the direct-write protocol's performance by reposting buffers in batches or by redesigning this reposting entirely to reduce the number of returning writes. diff --git a/doc/thesis/intro.tex b/doc/thesis/intro.tex index eb39325..27e9d84 100644 --- a/doc/thesis/intro.tex +++ b/doc/thesis/intro.tex @@ -3,25 +3,26 @@ \section{Introduction} Remote Direct Memory Access (RMDA) is a powerful communication mechanism that offers the potential for exceptional performance. RDMA allows one machine to directly access the memory of a remote machine across the network without the interaction of the remote CPU.
This gives developers a plethora of options to implement communication protocols. However, using these options -effectively is nontrivial and the observed performance can vary greatly for seemingly minor differences. +effectively is not trivial and the observed performance can vary greatly for seemingly minor differences. -Existing research either primarily focus on evaluating very low level verb performance~\cite{anuj-guide} or focus strongly on +Existing research either primarily focuses on evaluating very low level verb performance~\cite{anuj-guide} or focuses +strongly on Remote Procedure Calls (RPCs)~\cite{eval-mpp} often comparing the observed performance to using remote data structures~\cite{fasst, rpc-vs-rdma}. Nearly all of them employ naive message passing protocols using either send receive or RDMA writes with ring-buffers~\cite{rdma-fast-dbms} or \emph{mailboxes}~\cite{ziegler2020rdma} and do not take full advantage of -features offered by modern RDMA-capable network controllers. Further hardly any work looks into the usage of shared receive +features offered by modern RDMA-capable network controllers. Further, hardly any work looks into the usage of shared receive queues, memory fences, or atomics for resource sharing. \paragraph{}In this thesis we implement and evaluate various different message passing protocols. We show that there are a many ways to implement data exchange connections using less used RDMA features such as \emph{shared receive queues}, \emph{reads}, \emph{memory fences}, and \emph{RDMA atomics}. We also show that even common approaches such as ring-buffer based protocols -can be implemented in multiple ways giving us different performance characteristics and features. +can be implemented in multiple ways, giving us different performance characteristics and features. \begin{itemize} \item We focus on implementing message passing protocols, without limiting us to RPCs. We believe this gives engineers building blocks to develop more sophisticated protocols without micro-benchmarking basic verbs. - \item We define other connection features outside of raw performance which have been relevant for applications such as + \item We define connection features outside of raw performance which have been relevant for applications, such as efficient resource usage. \item We implement and evaluate different message passing protocols. We reason why we explicitly implemented these protocols and evaluate them for different communication patterns. diff --git a/doc/thesis/model.tex b/doc/thesis/model.tex index 6a1d23b..029f86f 100644 --- a/doc/thesis/model.tex +++ b/doc/thesis/model.tex @@ -101,7 +101,7 @@ \section{Performance Model}\label{sec:perf-model} \label{sec:model} to receive into (e.g. post receive). This needs to happen sometime before receiving the next message. \end{itemize} -\paragraph{} This model does not take the MTU into account. While some prior work use a slightly different model parameters +\paragraph{} This model does not take the MTU into account. While some prior work uses slightly different model parameters when sending messages over the MTU~\cite{dare}, which span multiple transmission units, we found our model to work well enough to understand most of our results. @@ -167,7 +167,7 @@ \subsection{Evaluating the Model} linearly with the message size $k$. It is worth noting here, that our reported network latency $L$ includes both the PCI latency as well as the latency -of the switch between our two nodes.
The omission of the PCI latency is also the reason that $g$ can be less than $2L$, +of the switch between our two nodes. The PCI latency is also the reason that $g$ can be less than $2L$, as it does not include a PCI round-trip. \paragraph{} For the network bandwidth G, we also report a batched and unbatched estimation. We use the unbatched estimation @@ -250,7 +250,7 @@ \subsubsection{Predicting Bandwidth} increasing throughput we predict until we reach a message size of 2 KB where we start to be limited by the NIC. \paragraph{} We can avoid this bottleneck by introducing batching. The verbs API allows us to post multiple work requests at -the same time. This \emph{Doorbell batching} reduces the number of generated MMIOs~\cite{anuj-guide} and reduces the CPU +the same time. This \emph{doorbell batching} reduces the number of generated MMIOs~\cite{anuj-guide} and reduces the CPU load. We measured that batching can reduce the sending CPU overhead $o_{snd}$ by up to a factor of 10. When introducing doorbell batching to the send-receive protocol, we are never bottlenecked by the sending CPU. diff --git a/doc/thesis/protocols.tex b/doc/thesis/protocols.tex index 1943200..c7283b4 100644 --- a/doc/thesis/protocols.tex +++ b/doc/thesis/protocols.tex @@ -1,10 +1,10 @@ \section{Data Exchange Protocols}\label{sec:protocols} -The goal of this work is design, implement, and study a comprehensive list of RDMA based protocols for data exchange. +The goal of this work is to design, implement, and study a comprehensive list of RDMA based protocols for data exchange. However, without a clear definition of \emph{data exchange} the number of possible protocols is virtually endless. In this section we limit our definition of \emph{data exchange protocol} to that of a more clearly defined message passing protocol, which reduces the size of our design space and allows us to compare protocols. We then introduce -features outside of raw performance that have been relevant for real-world applications. Finally we introduce six +features outside of raw performance that have been relevant for real-world applications. Finally, we introduce six protocols that all use different RDMA features and design approaches to implement message passing protocols with different capabilities and performance guarantees. @@ -13,14 +13,14 @@ \subsection{Definition}\label{sec:proto-def} Most research focuses on RPCs \cite{anuj-guide, fasst, herd}, remote datastructure access \cite{pilaf, farm}, or replications of socket like interfaces \cite{socksdirect}. In this thesis we look at message passing protocols with fairly few guarantees. We believe with this more general communication model we can give engineers and researchers a better understanding of the -building blocks that can be used to build more specific connection protocols, like RPCs, while not only +building blocks that can be used to build more specific connection protocols, like RPCs, while not only micro-benchmarking RDMA verbs. \begin{defn} We define a message passing protocol $P$ as an algorithm for moving a message $m$ that resides in the memory of node $N_a$ to another node $N_b$. The transfer must be initiated by $N_a$. After the transfer, $N_b$ must know that the -message was fully received and $N_a$ must know that it can reuse $m$. +message was fully received and $N_a$ must know that it can reuse local resources that were used to send $m$.
\end{defn} We explicitly have no synchronisation requirements, so the sender does not need to know whether the receiver has actually @@ -33,7 +33,7 @@ \subsection{Definition}\label{sec:proto-def} \subsection{Features} \label{sec:features} -There is more to data exchange protocols than raw throughput or latency. Sometimes it is more important for an application +There is more to data exchange protocols than raw throughput or latency. Sometimes, it is more important for an application that the connection is \emph{non-blocking} or that the memory requirements do not grow too large, even if that protocol wastes a few microseconds. @@ -45,7 +45,7 @@ \subsection{Features} \label{sec:features} strictly speaking not zero-copy, as the receiver will always have to copy the data from the buffer to its actual destination. This is can be especially important when, for example, transferring a large amount of data that does not have to be further processed. -\paragraph{Variable Message Size} The protocol allows us to send messages with different sizes, without using the +\paragraph{Variable Message Size} The protocol allows us to send messages of different sizes, without using the complete buffer at the receiver. Ring-buffers, for example, can be designed to only use the necessary space per message while send receive will always use the complete receive buffer, no matter how small the message actually is. Variable message sizes allow us to avoid memory fragmentation and reduce total memory usage in general. @@ -56,7 +56,7 @@ \subsection{Features} \label{sec:features} reduces the necessary CPU usage to a minimum and can be useful for applications with heterogeneous CPU requirements. \paragraph{Interrupts} There exists some kind of notification system, that allows the receiver to be notified -of an incoming message without having to constantly poll for it. While polling gives us better performance in almost all cases +of an incoming message without having to constantly poll for it. While polling gives us better performance in almost all cases, constantly polling wastes a lot of CPU cycles when we do not receive a lot of messages. \paragraph{Resource Sharing} Multiple connections can share resources, especially memory. We are using \emph{reliable connections} @@ -66,7 +66,7 @@ \subsection{Features} \label{sec:features} \paragraph{Non-Blocking} By non-blocking we mean that a single not processed message cannot block a complete connection. It is very common in systems to distribute incoming messages to different threads. It is important that a single slow running -task cannot completely block the system of making progress. The way we use it a ring-buffer is a good example of a blocking +task cannot completely block the system from making progress. The way we use it, a ring-buffer is a good example of blocking behaviour. If a single buffer segment is not marked as available to be reused it may block the whole buffer. A send receive based connection however is non-blocking. As long as there is at least one posted receive buffer, the connection is able to make progress.
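\paragraph{} The \emph{Interrupts} feature described above maps quite directly onto completion channels in the \code{libibverbs} API: instead of busy-polling a CQ, the receiver arms it and blocks until the RNIC raises an event. The following sketch illustrates the idea; \code{ctx} and \code{cq\_depth} are placeholders and error handling is omitted.

\begin{lstlisting}
#include <infiniband/verbs.h>

/* Sketch: event-driven receiving instead of busy polling.
 * ctx is an opened device context; cq_depth is a placeholder. */
static void wait_for_completion(struct ibv_context *ctx, int cq_depth)
{
    struct ibv_comp_channel *ch = ibv_create_comp_channel(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, cq_depth, NULL, ch, 0);

    ibv_req_notify_cq(cq, 0);               /* arm the CQ                 */

    struct ibv_cq *ev_cq;
    void *ev_ctx;
    ibv_get_cq_event(ch, &ev_cq, &ev_ctx);  /* blocks until a CQE arrives */
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0);            /* re-arm before draining     */

    struct ibv_wc wc;                       /* drain all outstanding CQEs */
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        /* handle wc.wr_id / wc.status */
    }
}
\end{lstlisting}

This trades some latency for CPU cycles, which is exactly the polling trade-off discussed above.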
@@ -102,7 +102,7 @@ \subsection{Design Space} \label{sec:proto-ds} Shared-Write (SW) & & \checkmark & & \checkmark & & \checkmark \\ \hline \hline - Direct-Read (DR) & \checkmark & \checkmark & & \checkmark & \checkmark & \\ + Direct-Read (DR) & \checkmark & \checkmark & & \checkmark & \checkmark & (\checkmark)\\ \hline Buffered-Read (BR) & & \checkmark & Sender & & & \\ \hline @@ -134,7 +134,8 @@ \subsubsection{Send-Based Protocols} \subsubsection{Write-Based Protocols} The \emph{Write} verb is a lot less restrictive. It allows the sender to write to an arbitrary -location in the receivers memory, without any interaction of the receivers CPU. This also means we need to solve some of the +location in the receiver's memory, without any interaction of the receiver's CPU. +This also means we need to solve some of the problems which the \emph{Send} verb solved for us. \begin{itemize} @@ -156,7 +157,7 @@ \subsubsection{Write-Based Protocols} \paragraph{} The \emph{Unbuffered Write Protocol} avoids the additional copy which is usually necessary when -using a \emph{Buffered Write Protocol}. The receiver should be able to decide the target location for each message. +using a \emph{Buffered Write Protocol}. The receiver should be able to choose the target location for each message. That means, there is a communication overhead for each message, as the sender either needs to query the receiver where to write to or the receiver needs to preemptively send locations to write to. For a more detailed analysis of \emph{Unbuffered Write Protocols} see Section \ref{sec:conn:direct_write}. @@ -168,7 +169,7 @@ \subsubsection{Write-Based Protocols} \subsubsection{Read-Based Protocols} The \emph{Read} verb is generally very similar to the \emph{Write} verbs. This time it allows the receiver to read from -an arbitrary location in the senders memory, without any interaction of the senders CPU. Any protocol +an arbitrary location in the sender's memory, without any interaction of the sender's CPU. Any protocol using \emph{Read} as the data transfer verb needs to solve two problems: \begin{itemize} @@ -203,10 +204,10 @@ \subsubsection{Sender} vital to avoid unnecessary copies. Also, it is very important to not block on a single transfer, but to give an RNIC multiple requests to work on and fill up its processing pipeline. -This results in the interface shown in Listing~\ref{list:sender}. Where we send a message by providing a reference +This results in the interface shown in Listing~\ref{list:sender}. We send a message by providing a reference to a previously allocated memory region, which asynchronously starts the message transfer. This asynchronous approach allows us to start multiple concurrent transmissions. Throughout this thesis, we will talk about \emph{unacknowledged messages}, which -are messages that have been sent but its transfer has not yet been completed. These unacknowledged messages are very +are messages that have been sent but their transfers have not yet been completed. These unacknowledged messages are very important to achieve the best possible performance. 
\begin{figure}[htp] @@ -249,7 +250,7 @@ \subsubsection{Receiver} ReceiveRegion Receive(); // Marks the previously received region - // to be reused by the protocol + // as ready to be reused by the protocol void Free(ReceiveRegion reg); } \end{lstlisting} diff --git a/doc/thesis/related_work.tex b/doc/thesis/related_work.tex index b5ae7f8..d238091 100644 --- a/doc/thesis/related_work.tex +++ b/doc/thesis/related_work.tex @@ -1,20 +1,20 @@ \section{Related Work} In recent years there has been a lot of work on making effective use of RDMA and on avoiding pitfalls of designing -RDMA enabled protocols. Most of it focuses on either low level verb performance or design and evaluate much higher level -abstractions that mostly focus on RPC performance or the performance of a whole system. +RDMA enabled protocols. Most of it focuses either on low level verb performance or on designing and evaluating +much higher level abstractions that mostly focus on RPC performance or the performance of a whole system. -\paragraph{} We give an overview of related work that discuss how RDMA can be used effectively as well as an overview of +\paragraph{} We give an overview of related works that discuss how RDMA can be used effectively as well as an overview of some recent systems that try to utilize these capabilities. \subsection{Using RDMA Effectively} -Design Guidelines for High Performance RDMA Systems~\cite{anuj-guide} provides a very low level guide on how to +\emph{Design Guidelines for High Performance RDMA Systems}~\cite{anuj-guide} provides a very low level guide on how to effectively use RDMA enabled networks. It focuses on specific optimizations to make and pitfalls to avoid in order to achieve the best possible performance. \paragraph{} There are many papers focusing on RPCs. Some try to systematically compare RPC protocol -approaches~\cite{ziegler2020rdma,Huang2019AnEO}. These gives us valuable insight in verb performance and scalability, but +approaches~\cite{ziegler2020rdma,Huang2019AnEO}. These give us valuable insight into verb performance and scalability, but they all use very simplified protocols and a very limited design space. The Remote Fetching Paradigm~(RFP)~\cite{rfp} exposes an RPC interface while utilizing RDMAs asymmetric performance @@ -24,12 +24,12 @@ \subsection{Using RDMA Effectively} ScaleRPC~\cite{scal-rdma-rpc} and other works~\cite{fasst, rfp, herd} address the scalability of RDMA for RPCs when using \emph{Reliable Connections}. They point out that the main scalability problem of reliable connections stem from limited -RNIC and CPU caches, resulting in cache thrashing with a large amount of active Queue Pairs(QPs). ScaleRPC uses RDMA writes to +RNIC and CPU caches, resulting in cache thrashing with a large number of active Queue Pairs~(QPs). ScaleRPC uses RDMA writes to transmit both requests as well as responses. It addresses scaling problems with temporal slicing. -\paragraph{} There has been quite a lot of research into designing fast and scalable RDMA system. This thesis aims to take +\paragraph{} There has been quite a lot of research into designing fast and scalable RDMA systems. This thesis aims to take a step back and take a broader look at RDMA based protocols for data exchange. A recent master thesis of the University of Waterloo~\cite{sharma2020design} saw a similar need for a design space analysis. They look at the design space of \emph{flow structures}, a much narrower protocol definition compared to our definition.
But the work provides a similar analysis of possible @@ -47,7 +47,7 @@ \subsection{RDMA Systems} RPC like approaches for \code{DELETE} and \code{PUT}. While Pilaf relies on send and receive verbs, FaRM uses a ring-buffer based protocol using RDMA writes. HERD~\cite{herd} uses a fully server-driven approach and issues all key-value operations through an RPC interface. For this -RPC protocol clients issue requests using RDMA writes to server-polled memory region. The server processes the request and +RPC protocol, clients issue requests using RDMA writes to server-polled memory regions. The server processes the request and replies using the RDMA send verb and unreliable datagrams. Other works design new database systems~\cite{dbrackjoin} or propose new approaches to message brokers~\cite{broker} @@ -63,7 +63,8 @@ \subsection{RDMA Systems} that does not have RDMA at its core, but rather uses it as another features to speed up the system. -This thesis is a good starting point for engineers building similar systems to design an appropriate RDMA-based protocol. +This thesis is a good starting point for engineers building similar systems to design an appropriate RDMA-based protocol, +without abandoning the classical message passing interface. diff --git a/doc/thesis/sendrcv.tex b/doc/thesis/sendrcv.tex index e2aa5b2..48527c4 100644 --- a/doc/thesis/sendrcv.tex +++ b/doc/thesis/sendrcv.tex @@ -28,8 +28,10 @@ \subsection{Protocol} numerous unacknowledged messages which improves performance drastically. \paragraph{} It is important to note that in production systems there needs to be a way for the sender to notice whether -enough receive buffers are ready. If there is no posted receive buffer available when a message is received, the receiver generates -a so-called \emph{Reader Not Ready} error. This will either cause a large back-off for the sender or even cause the connection +enough receive buffers are ready. If there is no posted receive buffer available when a message is received, +the receiver generates +a so-called \emph{Receiver Not Ready} error. This will either cause a large back-off for the sender +or even cause the connection to break. We observed this problem specifically for N:1 communication. This can be mitigated to some extend by optimizing receiving and @@ -46,7 +48,8 @@ \subsection{Extensions} \subsubsection{Inline Sending} The send verb has a slight variation called \emph{inline send}. Inline sending means that instead of simply referencing the payload in a work request, the sending CPU directly copies it to the RNIC using MMIO. This prevents -the NIC from having to issue an addition DMA and can reduce latency for very small messages~\cite{anuj-guide}. It does however +the NIC from having to issue an additional DMA and can reduce latency for very small messages~\cite{anuj-guide}. It does, +however, increase the load for the sending CPU. As we will see, the sending CPU is oftentimes the bottleneck so we did not further evaluate inline sending in this thesis. It can, however, be a viable optimization for small messages. @@ -63,7 +66,8 @@ \subsubsection{Shared Receive Queues} the total request volume is quite small (i.e., we expect only burst of $k$ messages but from different nodes at different times). We can reduce the total memory usage by using \emph{Shared Receive Queues~(SRQ)}. As the name already tells us SRQs allow us -share receive queues between multiple QPs.
This means, we can reuse a single receive queue for multiple connections, allowing +to share receive queues between multiple QPs. This means, we can reuse a single receive queue +for multiple connections, allowing multiple receivers to share the same receive buffers. This means the total memory usage does not grow with the number of open connections, but stays constant. @@ -79,7 +83,7 @@ \subsubsection{Shared Receive Queues} \subsubsection{Single Receiver} It is very common to have an N:1 communication pattern where a single server receives messages from multiple clients. This -could be achieved by simply round-robin over the $N$ connections. For this connection however we used the fact that we can +could be achieved by simply round-robin over the $N$ connections. For this connection, however, we used the fact that we can associate a single completion queue with multiple queue pairs. This means, if we are in a \emph{single receiver mode}, all receive completion events will end up in a single CQ. Allowing us to poll a single queue to receive a message from $N$ different sender. diff --git a/doc/thesis/shared_write.tex b/doc/thesis/shared_write.tex index 8ba25e9..9a4d574 100644 --- a/doc/thesis/shared_write.tex +++ b/doc/thesis/shared_write.tex @@ -2,9 +2,10 @@ \section{Shared Write} \label{sec:conn:shared_write} In Section~\ref{sec:conn:buf_write} we presented multiple ways to build a message passing protocol based on the RDMA write verb and a ring-buffer. In this section we present a way to share a single ring-buffer for multiple connections. -\paragraph{} We only implement a single variant based on the \emph{Write with Immediate} verb and focus on a N:1 connection pattern only, -where multiple sender transmit messages to a single receiver. -We point out that there are multiple different approaches that use the same atomics based reserving we present below. One +\paragraph{} We only implement a single variant based on the \emph{Write with Immediate} verb and +focus on a N:1 connection pattern only, +where multiple senders transmit messages to a single receiver. +We point out that there are multiple different approaches that use the same atomics based reservation we present below. One could use the same \emph{write offset} or \emph{write reverse} approach presented earlier and multiple receivers could utilize the same ring-buffer when using \emph{write with immediate}. A deep dive in all these variations, however, is out of scope of this thesis. @@ -19,7 +20,7 @@ \subsection{Protocol} \item In the second phase the sender writes to the reserved space. \end{itemize} \subsubsection{Reservation} -To allow for multiple sender to write to a shared ring-buffer we need to coordinate them. We need a system that +To allow for multiple senders to write to a shared ring-buffer, we need to coordinate them. We need a system that assigns a unique destination addresses to each message. This translates to implementing a \emph{sequencer} that issues increasing addresses based on the message size. This can either be implemented using atomic operations or by using RPCs and handling the sequencing at the receive's CPU. An RPC based approach is expected to provide us with higher @@ -49,7 +50,7 @@ \subsubsection{Transmission} to the receiver. We decided to do this using the \emph{Write with Immediate} verb. This allows us to notify the receiver with completion event. 
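\paragraph{} Returning to the reservation phase described above: with the atomics based sequencer, a sender can reserve space in the shared ring-buffer with a single remote fetch-and-add on a 64-bit tail counter at the receiver. The following is only a sketch; \code{old\_tail\_mr}, \code{remote\_tail\_addr} and \code{remote\_tail\_rkey} are placeholders exchanged at connection setup, and the remote counter has to reside in a memory region registered with \code{IBV\_ACCESS\_REMOTE\_ATOMIC}.

\begin{lstlisting}
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the reservation phase of the shared write protocol: the
 * sender reserves msg_size bytes by atomically adding to the remote
 * tail counter; the returned old value is the reserved write offset. */
static int reserve_slot(struct ibv_qp *qp, struct ibv_mr *old_tail_mr,
                        uint64_t remote_tail_addr, uint32_t remote_tail_rkey,
                        uint64_t msg_size)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)old_tail_mr->addr; /* receives the old tail */
    sge.length = sizeof(uint64_t);
    sge.lkey   = old_tail_mr->lkey;

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_tail_addr; /* 8-byte aligned counter */
    wr.wr.atomic.rkey        = remote_tail_rkey;
    wr.wr.atomic.compare_add = msg_size;         /* bytes to reserve       */

    return ibv_post_send(qp, &wr, &bad_wr);
    /* after the corresponding CQE, *(uint64_t *)old_tail_mr->addr holds
     * the offset the sender may write to in the second phase. */
}
\end{lstlisting}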
-One thing we need to note is that this time the completion events do not have to arrive in the same oder as they appear in +This time the completion events do not have to arrive in the same order as they appear in the ring-buffer. That means we need to send the offset of the corresponding buffer segment as immediate data. @@ -63,8 +64,10 @@ \subsection{Device Memory} reduce latency and does not have to make any PCI accesses. From this we especially expect atomic operations to be a lot faster when using device memory. -\paragraph{} Quick micro benchmarks reaffirm these expectations. In our case maximum throughput of fetch and add operations -increased form around 2 MOps to nearly 8 MOps. This means by using device memory for the ring-buffer metadata we can increase +\paragraph{} Quick micro-benchmarks reaffirm these expectations. In our case the maximum throughput of +fetch and add operations +increased from around 2 MOp/s to nearly 8 MOp/s. This means that by using device memory for the ring-buffer +metadata we can increase the throughput of our reservation phase by nearly 300\%. We can also achieve a minor speed-up for reading the head update after reserving a buffer segment. @@ -74,7 +77,7 @@ \subsection{Feature Analysis} The shared write protocol allows us to share a single ring-buffer between multiple connections by using atomic operations. This gives \emph{resource sharing} capabilities to a buffered read protocol, while keeping the features of -\emph{variable message size}. This allows us to have a constant memory requirement for a lot of connections, +\emph{variable message size}. It allows us to have a constant memory requirement for a lot of connections, resulting in very effective use of buffer space. diff --git a/doc/thesis/thesis.tex b/doc/thesis/thesis.tex index 0b8d05c..f435ccc 100644 --- a/doc/thesis/thesis.tex +++ b/doc/thesis/thesis.tex @@ -123,12 +123,11 @@ \tableofcontents \pagebreak \input{intro.tex} - +\pagebreak +\input{background.tex} \pagebreak \input{related_work.tex} -\pagebreak -\input{background.tex} diff --git a/doc/thesis/write.tex b/doc/thesis/write.tex index 92eeaaa..1c6d3b6 100644 --- a/doc/thesis/write.tex +++ b/doc/thesis/write.tex @@ -3,7 +3,7 @@ \section{Buffered Write} \label{sec:conn:buf_write} The basic idea of a buffered write protocol is to have a ring-buffer at the receiver which the sender writes to using RDMA write operations. This is a widely used approach and can be found in many systems~\cite{herd, scal-rdma-rpc}. -One goal of this section is to demonstrate that there are many different ways to implement such a buffer write +One goal of this section is to demonstrate that there are many different ways to implement such a buffered write protocol, which all can have different features and performance characteristics. @@ -30,7 +30,7 @@ \section{Buffered Write} \label{sec:conn:buf_write} \item \textbf{Read Acknowledging}, where the sender issues an RDMA read operation to check for available space. \end{itemize} -This essentially gives us 6 different versions of our buffered write protocol. +This essentially gives us six different versions of our buffered write protocol.
\subsection{Ring-Buffer} \label{sec:conn:write:buf} \begin{figure}[!ht] @@ -104,7 +104,7 @@ \subsection{Ring-Buffer} \label{sec:conn:write:buf} \end{tikzpicture} \end{center} -\caption{Ring-buffer with twice mapped memory to allows DMA writes over the end} +\caption{Ring-buffer and the three different offsets, with a visualisation of double mapped memory.} \label{fig:ringbuffer} \end{figure} @@ -125,7 +125,6 @@ \subsection{Ring-Buffer} \label{sec:conn:write:buf} \paragraph{}\emph{Receiving} from a buffered read protocol will return a pointer to a segment of the buffer. \emph{Freeing} a message marks this buffer segment as being free to be reused. - This is where the difference between the \emph{head} and the \emph{read pointer} becomes apparent. The read pointer points to the next message the receiver has not read, while the head points to the oldest message the receiver has not freed. Reading and freeing messages should work interleaved, so we need some kind of management of free buffer to update the head @@ -172,7 +171,7 @@ \subsection{Sender} \label{sec:conn:write:sender} \subsubsection{Write Immediate} We introduced \emph{Write with Immediate} in Section \ref{sec:bg:write}, which allows us to implement our sending -and receiving in a very similar way to the send connection presented in the last section. +and receiving in a very similar way to the send connection presented earlier. \paragraph{} Write with Immediate sends a 32 bit value while also writing to the specified location. More importantly it consumes a receive buffer and creates a completion event at the receiver. This completion event also contains the number @@ -182,7 +181,7 @@ \subsubsection{Write Immediate} order as the completion events in the queue. We can reuse both the receive buffer management and batching from the send connection. - +\pagebreak \subsubsection{Write Offset} One key reason to use the write instead of send verb is that we do not need to generate a completion event at the receiver. @@ -192,15 +191,15 @@ \subsubsection{Write Offset} \paragraph{} One way to implement this is by having additional metadata that allows the receiver to notice incoming data. We implement such a protocol with what we call \emph{Write Offset}. The idea is that together with each message, the sender also updates a metadata region at the receiver containing the \emph{tail}. The receive can then notice new incoming messages -by polling this tail and comparing it to the \emph{Read Pointer}. We write the size of the message as the first 4 bytes +by polling this tail and comparing it to the \emph{read pointer}. We write the size of the message as the first 4 bytes that we write to the ring-buffer. This means to send a message of size $s$ the sender prepends the size $s$ to the message and writes it to the tail of the buffer. It then writes the updated tail to the metadata region at the sender. Of these two RDMA writes only the latter needs to be signaled and we can issue both of the work requests at the same time, mitigating the impact of having to perform two writes for a single message. The receiver will always poll its local copy of the tail. As soon as the tail is -not equal to the \emph{Read Pointer} it knows that there are outstanding messages. We can read the first 4 bytes to get the size -of the next message and can then read it and update the \emph{Read Pointer}. +not equal to the read pointer it knows that there are outstanding messages. 
We can read the first 4 bytes to get the size +of the next message and can then read it and update the read pointer. \paragraph{} This connection has obvious drawbacks, as we need to issue two writes for a single message. But this way we circumvent the need of receive buffers and end up with a completely \emph{passive receiver}. @@ -321,7 +320,7 @@ \subsubsection{Send Acknowledging} us from easily using this connection bidirectionally when coupled with the write immediate sender implementation, as both incoming messages as well as tail updates will generate completion events. -\subsection{Features Analysis} +\subsection{Feature Analysis} Ring-buffer based connections are often used because of their apparent speed, but they also have a few additional features.