Final corrections
Final corrections
Glorfischi authored and Glorfischi committed Feb 2, 2021
1 parent efbed03 commit af8ed17
Showing 15 changed files with 126 additions and 111 deletions.
2 changes: 1 addition & 1 deletion doc/plots/lat-msgsize.p
@@ -59,8 +59,8 @@ plot $bufread with linespoints pt 11 ps 1.5 title "Buffered Read (BR)", \
$dirread with linespoints pt 9 ps 1.5 title "Direct Read (DR)", \
$writeOff with linespoints pt 5 ps 1.5 title "Buffered Write Offset (BR-Off)", \
$writeRev with linespoints pt 4 ps 1.5 title "Buffered Write Reverse (BR-Rev)", \
- $send with linespoints pt 7 ps 1.5 lc rgb "dark-orange" title "Send-Receive (SR)" , \
$median with linespoints pt 13 ps 1.5 title "Direct Write (DW)", \
+ $send with linespoints pt 7 ps 1.5 title "Send-Receive (SR)", \



30 changes: 17 additions & 13 deletions doc/thesis/background.tex
@@ -3,7 +3,7 @@ \section{RDMA} \label{sec:rdma}
Remote Direct Memory Access (RDMA) is a network mechanism that allows moving buffers between applications over the network.
The main difference to traditional network protocols like TCP/IP is that it is able to completely bypass the hosts kernel
and even circumvents the CPU for data transfer. This allows applications using RDMA to achieve latencies as low as 2 $\mu s$
- and throughputs of up to 100 $Gb/s$, all while having a smaller CPU footprint.
+ and throughputs of up to 100 $Gbit/s$, all while having a smaller CPU footprint.


\paragraph{} While initially developed as part of the \emph{InfiniBand} network protocol, which completely replaces the OSI
@@ -114,22 +114,23 @@ \subsection{Verbs API}
\begin{itemize}
\item \textbf{Send (with Immediate):} Transfers data from the senders memory to a prepared memory region at the receiver.
\item \textbf{Receive:} Prepares a memory region to receive data through the send verb.
- \item \textbf{Write (with Immediate):} Copies data from the senders memory to known memory location at the receiver without any
+ \item \textbf{Write (with Immediate):} Copies data from the senders memory to a known memory location at the receiver without any
interaction from the remote CPU.
\item \textbf{Read:} Copies data from remote memory to a local buffer without any inteaction from the remote CPU.
\item \textbf{Atomics:} Two different atomic operations. Compare and Swap (CAS) and Fetch and Add (FAA). They can access 64-bit
values in the remote memory.
\end{itemize}
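To make the two atomic verbs concrete, here is a minimal model of their semantics (plain Python rather than the C verbs API; the function names and the one-element-list stand-in for remote memory are purely illustrative):

```python
# Toy model of the two RDMA atomic verbs operating on a 64-bit value.
# On real hardware the RNIC performs these on registered remote memory;
# here "remote memory" is a one-element list.

MASK64 = (1 << 64) - 1

def compare_and_swap(mem, expected, desired):
    """CAS: store `desired` only if the current value equals `expected`.
    Always returns the value found before the operation."""
    old = mem[0]
    if old == expected:
        mem[0] = desired & MASK64
    return old

def fetch_and_add(mem, delta):
    """FAA: add `delta` to the value, returning the value before the add."""
    old = mem[0]
    mem[0] = (old + delta) & MASK64
    return old
```

In both cases the pre-operation value is what the issuing side gets back in its local buffer, which is what makes these verbs usable for remote locks and counters.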

- \paragraph{} Like traditional socket these QPs come in different transport modes: Reliable Connection (RC), Unreliable Connection (UC),
+ \paragraph{} Like traditional sockets, these QPs come in different transport modes: Reliable Connection (RC),
+ Unreliable Connection (UC),
and Unreliable Datagram (UD). While UD supports sending to arbitrary other endpoints, similarly to a UDP socket, RC and UC
need to establish a one to one connection between Queue Pairs, similarly to TCP sockets. Only RC supports all
verbs and we will focus on this transport mode.


\paragraph{} Queue Pairs give us a direct connection to the RNIC. A QP essentially consists of two queues that allow us to
issue verbs directly to the RNIC. The \emph{Send Queue} is used to issue Send, Write, Read, and Atomic verbs, and the
- \emph{Receive Queue} which is used to issue a Receive verb. These verbs are issued by pushing a \emph{Work Request (WR)}
+ \emph{Receive Queue} is used to issue Receive verbs. These verbs are issued by pushing a \mbox{\emph{Work Request~(WR)}}
to the respective queue. A work request is simply a struct that contains an id, the type of verb to issue, and all necessary
additional information to perform it. The RNIC will pop the WR from the queue and execute the corresponding action.
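The queue mechanics just described can be sketched as a toy model (Python for brevity; the real work-request structs are `ibv_send_wr`/`ibv_recv_wr` in the C verbs API, and the field names below are simplified assumptions):

```python
from dataclasses import dataclass, field
from collections import deque

# Toy model of a Work Request and of a Queue Pair's two queues. The field
# names are simplified stand-ins for the C verbs structs.

@dataclass
class WorkRequest:
    wr_id: int           # caller-chosen id, echoed back on completion
    opcode: str          # which verb to issue: "SEND", "WRITE", "READ", ...
    local_addr: int = 0  # additional information needed to execute the verb
    length: int = 0

@dataclass
class QueuePair:
    send_queue: deque = field(default_factory=deque)  # Send/Write/Read/Atomic WRs
    recv_queue: deque = field(default_factory=deque)  # Receive WRs

    def post_send(self, wr: WorkRequest):
        self.send_queue.append(wr)        # the application pushes ...

    def post_recv(self, wr: WorkRequest):
        self.recv_queue.append(wr)

    def rnic_pop_send(self) -> WorkRequest:
        return self.send_queue.popleft()  # ... and the RNIC pops and executes
```

The important point the model captures is that posting a request and executing it are decoupled: the application only ever touches the queues.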

@@ -158,18 +159,20 @@ \subsection{Verbs API}
\end{figure}


- \paragraph{} This gives us an asynchronous interface. Issuing a work request for a Write does not mean that the Write was
- performed, but simply that the RNIC will eventually process this request. To signal completion of certain work requests there
+ \paragraph{} This gives us an asynchronous interface. Issuing a work request for a Write operation
+ does not mean that the Write was
+ performed, but simply that the RNIC will eventually process this request. To signal the completion of a
+ work request there
is an additional type of queue called the \emph{Completion Queue (CQ)}. There needs to be a CQ associated with each Send and
- Receive Queue. When the RNIC completes a work request it will enqueue a \emph{Completion Queue Entry~(CQE)} to the respective
+ Receive Queue. When the RNIC completes a work request it will enqueue a \mbox{\emph{Completion Queue Entry~(CQE)}} to the respective
CQ. This CQE informs the application whether the request was processed successfully. The application can match CQEs to
previously issued work requests by the ID it provided during issuing. It is also possible to post an \emph{unsignaled} network
operation that does not generate a CQE after its completion.
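A minimal sketch of how CQEs are matched back to work requests by id (an illustrative Python model, not the verbs API; the `signaled` flag stands in for posting unsignaled operations):

```python
from collections import deque

# Sketch of completion signalling: the "RNIC" drains the send queue and, for
# signaled requests only, enqueues a completion entry carrying the same id.
# The application then matches CQEs to its outstanding requests by that id.

def process_send_queue(send_queue, completion_queue):
    while send_queue:
        wr = send_queue.popleft()
        # ... the actual data transfer would happen here ...
        if wr.get("signaled", True):  # unsignaled WRs generate no CQE
            completion_queue.append({"wr_id": wr["wr_id"], "status": "OK"})

def poll_cq(completion_queue, pending_ids):
    """Drain the CQ, returning the ids of requests now known to be complete."""
    completed = []
    while completion_queue:
        cqe = completion_queue.popleft()
        if cqe["wr_id"] in pending_ids:
            pending_ids.discard(cqe["wr_id"])
            completed.append(cqe["wr_id"])
    return completed
```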


\paragraph{} All locally and remotely accessible memory needs to be previously registered for the RNIC to be able write to
or read from it. We call these preregistered regions of memory \emph{Memory Regions (MRs)}.
- Registering memory pins it so that it is not swapped out by the host. The process of registering is generally orders of magnitude slower
+ Registering memory pins it so that it is not swapped out by the host. The process of registering is orders of magnitude slower
then data operations like sending or writing. So in general all accessed memory is registered at connection setup. Henceforth,
we will assume these memory regions to be registered if not specified otherwise.

@@ -225,7 +228,7 @@ \subsubsection{Send / Receive} \label{sec:bg:send}

\paragraph{} To better understand this communication model we will walk through the operations involved in sending a single
message from a system A to another system B. We assume that the two nodes have already setup a connection. In this
- thesis we will not go into the details of connections setup. Each nodes has prepared a QP and associated a completion
+ thesis we will not go into the details of connection setup. Each node has prepared a QP and associated a completion
queue to it. Both systems have registered a MR of at least the size of the message to be sent.

\begin{enumerate}
@@ -238,7 +241,7 @@ \subsubsection{Send / Receive} \label{sec:bg:send}

System B now polls its CQ until it receives a CQE for its issued receive request.

- \item System A can initiate the transfer by posting a \emph{Send Request}. To do this it copies a work request to
+ \item System A initiates the transfer by posting a \emph{Send Request}. To do this it copies a work request to
the Send Queue. This request contains a pointer to its local memory containing the message to be sent
and its size. This request notifies the NIC to initiate the transfer.
It also starts polling its CQ to notice the completion of the send request.
@@ -298,17 +301,18 @@ \subsubsection{Write} \label{sec:bg:write}
\label{fig:seq-wrt}
\end{figure}

- Figure~\ref{fig:seq-wrt} show the operations involved in writing data to the remote using RDMA write. It is generally very
+ Figure~\ref{fig:seq-wrt} shows the operations involved in writing data to remote memory using RDMA write. It is generally very
similar to the send and receive sequence presented in the previous section. The sending CPU still issues a work request which
is handled by the NIC and is notified of its completion through the completion queue.
The main difference is that the remote system does not need to post a receive buffer and there is no CQE generated at the
remote. Also, the work request is a structured differently. It does not only contain a pointer to the local send buffer but
also provides the remote address to write it to.
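A toy model of the extra information a write work request carries, namely the remote destination address (all names are illustrative; on real hardware the RNIC performs this copy without involving the remote CPU, which a local function obviously cannot capture):

```python
# Toy model of an RDMA write work request: unlike a send, it names the remote
# destination itself. "Remote memory" is a plain bytearray here.

def rdma_write(remote_mem, wr):
    data = wr["local_buf"]
    off = wr["remote_addr"]                 # chosen by the sender, not by a posted receive
    remote_mem[off:off + len(data)] = data  # no CQE is generated at the remote

remote = bytearray(16)
rdma_write(remote, {"wr_id": 7, "local_buf": b"hello", "remote_addr": 4})
```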

- \paragraph{} The standard RDMA write does not generate a completion entry at the receiver, which is generally more efficient.
+ \paragraph{} The standard RDMA write does not generate a completion queue entry at the receiver,
+ which is generally more efficient.
However, sometimes it is very helpful for the receiver to be notified of a completed write. For this purpose the Verbs API
also provides a related operation called \emph{Write with Immediate}.
- This operation works very similarly to a normal RDMA write, but it generates a completion entry at the receiver, in the same way
+ This operation works very similarly to a normal RDMA write, but it generates a CQE at the receiver, in the same way
the send verb does. This means it will also consume a posted receive request, so the receiver needs to post a receive request
prior to the transfer. Write with Immediate however will not write any data to the associated receive buffer.

10 changes: 5 additions & 5 deletions doc/thesis/buffered_read.tex
@@ -2,11 +2,11 @@
\section{Buffered Read}\label{sec:conn:buf_read}

The idea of a buffered read protocol is to have a ring-buffer at the sender from which the receiver fetches the messages using
- RDMA reads. There are multiple different ways to implement such a protocol, with the main variations being in, how to notify
+ RDMA reads. There are multiple different ways to implement such a protocol, with the main difference being how to notify
the receiver of new messages, where to transfer them to, and how to acknowledge to the sender that a message has been processed.

\paragraph{} We decided to focus on an implementation which gives us a \emph{Passive Sender} and allows for
- \emph{Variable Message Sizes}. We decided to stick with the basic interface defined in Section~\ref{sec:protocols}. This
+ \emph{Variable Message Sizes}. We stuck with the basic interface defined in Section~\ref{sec:protocols}. This
results in a system with two ring-buffers, illustrated in Figure~\ref{fig:buf_read_struct}.

\begin{figure}[!ht]
@@ -65,7 +65,7 @@ \section{Buffered Read}\label{sec:conn:buf_read}
\label{fig:buf_read_struct}
\end{figure}

- \subsubsection{Protocol}
+ \subsection{Protocol}
As mentioned the sender of our buffered read connection is entirely passive, that means after the connection setup the sending
CPU does not issue any RDMA operations. The only thing it needs to do to send is to check its head if there is enough space,
copy the message to its tail, and update the tail offset. It also prepends the message size when writing.
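The sender's three steps (check space against the head, copy with a prepended length, bump the tail) can be sketched as a simplified ring-buffer model. This is Python with illustrative names; the 4-byte little-endian length prefix and the one-byte gap are our own assumptions, not necessarily the thesis implementation:

```python
# Simplified model of the passive sender's ring buffer. In the real protocol
# the head is advanced remotely by the receiver; the sender only reads it.

class SendRing:
    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0  # oldest unread byte, advanced by the receiver
        self.tail = 0  # next free byte, advanced only by the sender

    def free_space(self):
        # keep one byte unused so a full buffer is distinguishable from empty
        return self.size - ((self.tail - self.head) % self.size) - 1

    def send(self, msg):
        entry = len(msg).to_bytes(4, "little") + msg  # prepend the message size
        if len(entry) > self.free_space():
            return False  # receiver has not freed enough space yet
        for b in entry:   # copy with wrap-around
            self.buf[self.tail] = b
            self.tail = (self.tail + 1) % self.size
        return True
```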
@@ -84,7 +84,7 @@ \subsubsection{Protocol}
return the next message with the length at the beginning of the buffer and update the \code{read\_pointer}.

If there are no transmitted messages to return, the receiver updates the \code{remote\_tail} with an RDMA read. If the remote
- tail has not updated, it retries until it has. With the updated tail the receiver issues an RDMA read for the whole buffer
+ tail has not been updated, it retries until it has. With the updated tail the receiver issues an RDMA read for the whole buffer
section between the read pointer and the updated tail. It then returns the next message in the newly transmitted section.
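The receiver's side of this logic can be sketched as a simplified model (wrap-around is omitted for brevity; the two callbacks stand in for the RDMA reads, and all names are illustrative):

```python
# Sketch of the buffered-read receiver: a local mirror of the sender's buffer,
# a read pointer, and a cached copy of the remote tail.

class ReadReceiver:
    def __init__(self, size, fetch_tail, fetch_range):
        self.local = bytearray(size)
        self.read_pointer = 0
        self.remote_tail = 0
        self.fetch_tail = fetch_tail    # models the RDMA read of the tail offset
        self.fetch_range = fetch_range  # models the RDMA read of a buffer section

    def receive(self):
        if self.read_pointer == self.remote_tail:
            # Nothing buffered locally: re-read the remote tail until it moves,
            # then fetch the whole section between read pointer and new tail.
            tail = self.fetch_tail()
            while tail == self.remote_tail:
                tail = self.fetch_tail()
            self.local[self.read_pointer:tail] = self.fetch_range(self.read_pointer, tail)
            self.remote_tail = tail
        # Return the next length-prefixed message from the local mirror.
        n = int.from_bytes(self.local[self.read_pointer:self.read_pointer + 4], "little")
        msg = bytes(self.local[self.read_pointer + 4:self.read_pointer + 4 + n])
        self.read_pointer += 4 + n
        return msg
```

The model shows the cost asymmetry directly: most calls return from the local mirror, and only the occasional call pays for two remote reads.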

This gives us an interesting characteristic of most receives being very cheap, while some are very expensive as they need to
@@ -109,5 +109,5 @@ \subsection{Feature Analysis}


\paragraph{} For future work it would also be interesting to implement a system where multiple threads can write to and receive
- from a single connection by atomically updating head or tail pointers, or we could explore the possibilities to share send or
+ from a single connection by atomically updating head or tail pointers, or we could explore the possibilities of sharing send or
receive buffers using atomic operations. The current implementation, however, does not provide any kind of \emph{Resource Sharing}
2 changes: 1 addition & 1 deletion doc/thesis/conclusion.tex
@@ -18,7 +18,7 @@ \section{Conclusion}

\paragraph{} We analysed the presented protocols and evaluated more then just raw performance. We looked at other
features that can be critical for applications. These features include achieving effective memory usage by allowing
- variable message sizes, avoiding additional copying by being truly zero-copy, or one slow processing message not
+ variable message sizes, avoiding additional copying by being truly zero-copy, or one slowly processing message not
being able to stall the whole connection.

\paragraph{} We introduced a new performance model for RDMA based message passing protocols that allows us to
12 changes: 6 additions & 6 deletions doc/thesis/direct_read.tex
@@ -1,9 +1,9 @@

\section{Direct Read} \label{sec:conn:direct_read}

- In Section~\ref{sec:conn:direct_write} we discussed how we can possibly avoid an additional copy at the receiver by giving
+ In Section~\ref{sec:conn:direct_write} we discussed how we can possibly avoid an additional copy at the receiver, by giving
the sender information which allows him to potentially write the data to the correct final memory location. The next logical
- step is to let the receiver decide for each message where to write it to. We can achieve this by our implementation of a
+ step is to let the receiver decide for each message where to write it to. We can achieve this with our implementation of a
\emph{Direct Read Connection}.

\paragraph{} The core idea of a direct read protocol is that instead of directly sending a message through a send or write
@@ -24,8 +24,8 @@ \subsection{Protocol}
\paragraph{} To wait for the transfer to be completed, and for the buffer to be able to be reused, we can not simply wait
for the completion event of the send, like we do for the send or write based connections. We need to wait for the receiver
to explicitly signal that the buffer was transfered. We append a signaling byte at the end of the send buffer.
- When sending this byte, will be set to 0 and we can wait for the transport to be completed by polling this byte until the
- receiver will update it.
+ When sending, this byte is set to 0 and we can wait for the transport to be completed by polling this byte until the
+ receiver updates it.
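The signaling-byte handshake can be sketched as follows (a thread stands in for the remote side's RDMA write and read; buffer size, sleep times, and names are arbitrary assumptions):

```python
import threading
import time

# Sketch of the signaling byte: the sender zeroes the byte just past the
# payload and spins on it; the receiver flips it to 1 once the buffer has
# been fetched, signalling that the send buffer may be reused.

def send_and_wait(buf, payload):
    buf[:len(payload)] = payload
    buf[len(payload)] = 0            # signaling byte, cleared when sending
    while buf[len(payload)] == 0:    # poll until the receiver updates it
        time.sleep(0.001)

def remote_receiver(buf, n, out):
    time.sleep(0.01)                 # pretend the RDMA read takes a while
    out.append(bytes(buf[:n]))
    buf[n] = 1                       # signal: transfer complete

buf, received = bytearray(8), []
t = threading.Thread(target=remote_receiver, args=(buf, 5, received))
t.start()
send_and_wait(buf, b"hello")
t.join()
```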

This push based implementation introduces little additional complexity, but there are other ways to implement such
signaling. The signaling bit forces us to use a specific memory arrangement, which could prevent us to send data directly
@@ -39,8 +39,8 @@ \subsection{Protocol}
It is crucial that we do not block until the read is completed to get reasonable performance. This means the receiver has a
slightly different interface than the previously presented connections. We split the receive call into a
\code{RequestAsync} and a \code{Wait} method. The \code{RequestAsync} takes a receive buffer
- to read into. It will wait for an incoming read request and issue the corresponding read. It uses the same increasing
- \code{wr\_id} approach we use for sending with which the \code{Wait} method can wait for the read to complete. This approach
+ to read into. It waits for an incoming read request and issues the corresponding read. It uses the same increasing
+ \code{wr\_id} approach we use for sending, with which the \code{Wait} method can wait for the read to complete. This approach
allows us to pipeline receives the same way we pipeline sends.

\paragraph{} As soon as the transfer is complete, the receiver updates the corresponding signaling bit using an RDMA write.
9 changes: 5 additions & 4 deletions doc/thesis/direct_write.tex
@@ -5,7 +5,7 @@ \section{Direct Write}\label{sec:conn:direct_write}
regions for the sender to write to.

\paragraph{} For this thesis we implemented something very reminiscent of the send receive protocol. The core idea is for
- the receiver to send information on prepared receive buffers to the sender. The sender will then use these buffers in order.
+ the receiver to send information on prepared receive buffers to the sender. The sender will then use these buffers.



@@ -104,19 +104,20 @@ \subsection{Protocol}
order, for any modern systems~\cite{herd, farm}. This guarantees us that the complete message has been written as soon a we
see an update to the last byte.
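The last-byte check can be sketched as follows (illustrative only; it assumes the receive slot is zeroed before reuse and that the message's final byte is guaranteed non-zero):

```python
# Because RNIC writes become visible in increasing address order, a non-zero
# last byte implies the rest of the fixed-size message has already landed.

def message_complete(slot):
    return slot[-1] != 0   # the receiver polls only this byte

slot = bytearray(8)        # zeroed receive slot, no message yet
incomplete = message_complete(slot)
slot[:] = b"payload\x01"   # writer places a non-zero byte last
```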

- \subsection{Features Analysis}
+ \subsection{Feature Analysis}

With our Direct Write connection we essentially rebuilt the send and receive verbs using only RDMA writes. This gives us
similar features but also allows for more control over the protocol. This could potentially allow us to extend it for
specific systems.

\paragraph{} Our current implementation fulfills our requirements for being \emph{non-blocking}, the same way the send-receive
- protocol does. It however does not provide any \emph{interrupts} and we did not explore any \emph{resource sharing} approaches.
+ protocol does. It, however, does not provide any \emph{interrupts} and we did not explore any \emph{resource sharing} approaches.
As it stands, it also does not provide \emph{True Zero Copy} capabilities and it lacks support for
\emph{variable message sizes}.


- \paragraph{} We do however think that this protocol can be adapted to enable more features. By adding metadata when posting a
+ \paragraph{} Nevertheless, we do think that this protocol can be adapted to enable more features.
+ By adding metadata when posting a
buffer we could enable \emph{True Zero Copy} capabilities for certain applications, by performing some routing decisions at
the sender and being able to directly write to the correct final memory location.
And with a more sophisticated buffer management one could more effectively utilize the available memory, by supporting
