Final corrections
Final corrections
Glorfischi authored and Glorfischi committed Feb 2, 2021
1 parent efbed03 commit af8ed17
Showing 15 changed files with 126 additions and 111 deletions.
2 changes: 1 addition & 1 deletion doc/plots/lat-msgsize.p
@@ -59,8 +59,8 @@ plot $bufread with linespoints pt 11 ps 1.5 title "Buffered Read (BR)", \
$dirread with linespoints pt 9 ps 1.5 title "Direct Read (DR)", \
$writeOff with linespoints pt 5 ps 1.5 title "Buffered Write Offset (BR-Off)", \
$writeRev with linespoints pt 4 ps 1.5 title "Buffered Write Reverse (BR-Rev)", \
- $send with linespoints pt 7 ps 1.5 lc rgb "dark-orange" title "Send-Receive (SR)" , \
$median with linespoints pt 13 ps 1.5 title "Direct Write (DW)", \
+ $send with linespoints pt 7 ps 1.5 title "Send-Receive (SR)", \



30 changes: 17 additions & 13 deletions doc/thesis/background.tex
@@ -3,7 +3,7 @@ \section{RDMA} \label{sec:rdma}
Remote Direct Memory Access (RDMA) is a network mechanism that allows moving buffers between applications over the network.
The main difference to traditional network protocols like TCP/IP is that it is able to completely bypass the hosts kernel
and even circumvents the CPU for data transfer. This allows applications using RDMA to achieve latencies as low as 2 $\mu s$
- and throughputs of up to 100 $Gb/s$, all while having a smaller CPU footprint.
+ and throughputs of up to 100 $Gbit/s$, all while having a smaller CPU footprint.


\paragraph{} While initially developed as part of the \emph{InfiniBand} network protocol, which completely replaces the OSI
@@ -114,22 +114,23 @@ \subsection{Verbs API}
\begin{itemize}
\item \textbf{Send (with Immediate):} Transfers data from the senders memory to a prepared memory region at the receiver.
\item \textbf{Receive:} Prepares a memory region to receive data through the send verb.
- \item \textbf{Write (with Immediate):} Copies data from the senders memory to known memory location at the receiver without any
+ \item \textbf{Write (with Immediate):} Copies data from the senders memory to a known memory location at the receiver without any
interaction from the remote CPU.
\item \textbf{Read:} Copies data from remote memory to a local buffer without any inteaction from the remote CPU.
\item \textbf{Atomics:} Two different atomic operations. Compare and Swap (CAS) and Fetch and Add (FAA). They can access 64-bit
values in the remote memory.
\end{itemize}
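To make the two atomic verbs concrete, here is a minimal model of their semantics (plain Python rather than the C verbs API; the function names and the one-element-list stand-in for remote memory are purely illustrative):

```python
# Toy model of the two RDMA atomic verbs operating on a 64-bit value.
# On real hardware the RNIC performs these on registered remote memory;
# here "remote memory" is a one-element list.

MASK64 = (1 << 64) - 1

def compare_and_swap(mem, expected, desired):
    """CAS: store `desired` only if the current value equals `expected`.
    Always returns the value found before the operation."""
    old = mem[0]
    if old == expected:
        mem[0] = desired & MASK64
    return old

def fetch_and_add(mem, delta):
    """FAA: add `delta` to the value, returning the value before the add."""
    old = mem[0]
    mem[0] = (old + delta) & MASK64
    return old
```

In both cases the pre-operation value is what the issuing side gets back in its local buffer, which is what makes these verbs usable for remote locks and counters.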

- \paragraph{} Like traditional socket these QPs come in different transport modes: Reliable Connection (RC), Unreliable Connection (UC),
+ \paragraph{} Like traditional sockets, these QPs come in different transport modes: Reliable Connection (RC),
+ Unreliable Connection (UC),
and Unreliable Datagram (UD). While UD supports sending to arbitrary other endpoints, similarly to a UDP socket, RC and UC
need to establish a one to one connection between Queue Pairs, similarly to TCP sockets. Only RC supports all
verbs and we will focus on this transport mode.


\paragraph{} Queue Pairs give us a direct connection to the RNIC. A QP essentially consists of two queues that allow us to
issue verbs directly to the RNIC. The \emph{Send Queue} is used to issue Send, Write, Read, and Atomic verbs, and the
- \emph{Receive Queue} which is used to issue a Receive verb. These verbs are issued by pushing a \emph{Work Request (WR)}
+ \emph{Receive Queue} is used to issue Receive verbs. These verbs are issued by pushing a \mbox{\emph{Work Request~(WR)}}
to the respective queue. A work request is simply a struct that contains an id, the type of verb to issue, and all necessary
additional information to perform it. The RNIC will pop the WR from the queue and execute the corresponding action.
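The queue mechanics just described can be sketched as a toy model (Python for brevity; the real work-request structs are `ibv_send_wr`/`ibv_recv_wr` in the C verbs API, and the field names below are simplified assumptions):

```python
from dataclasses import dataclass, field
from collections import deque

# Toy model of a Work Request and of a Queue Pair's two queues. The field
# names are simplified stand-ins for the C verbs structs.

@dataclass
class WorkRequest:
    wr_id: int           # caller-chosen id, echoed back on completion
    opcode: str          # which verb to issue: "SEND", "WRITE", "READ", ...
    local_addr: int = 0  # additional information needed to execute the verb
    length: int = 0

@dataclass
class QueuePair:
    send_queue: deque = field(default_factory=deque)  # Send/Write/Read/Atomic WRs
    recv_queue: deque = field(default_factory=deque)  # Receive WRs

    def post_send(self, wr: WorkRequest):
        self.send_queue.append(wr)        # the application pushes ...

    def post_recv(self, wr: WorkRequest):
        self.recv_queue.append(wr)

    def rnic_pop_send(self) -> WorkRequest:
        return self.send_queue.popleft()  # ... and the RNIC pops and executes
```

The important point the model captures is that posting a request and executing it are decoupled: the application only ever touches the queues.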

@@ -158,18 +159,20 @@ \subsection{Verbs API}
\end{figure}


- \paragraph{} This gives us an asynchronous interface. Issuing a work request for a Write does not mean that the Write was
- performed, but simply that the RNIC will eventually process this request. To signal completion of certain work requests there
+ \paragraph{} This gives us an asynchronous interface. Issuing a work request for a Write operation
+ does not mean that the Write was
+ performed, but simply that the RNIC will eventually process this request. To signal the completion of a
+ work request there
is an additional type of queue called the \emph{Completion Queue (CQ)}. There needs to be a CQ associated with each Send and
- Receive Queue. When the RNIC completes a work request it will enqueue a \emph{Completion Queue Entry~(CQE)} to the respective
+ Receive Queue. When the RNIC completes a work request it will enqueue a \mbox{\emph{Completion Queue Entry~(CQE)}} to the respective
CQ. This CQE informs the application whether the request was processed successfully. The application can match CQEs to
previously issued work requests by the ID it provided during issuing. It is also possible to post an \emph{unsignaled} network
operation that does not generate a CQE after its completion.
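A minimal sketch of how CQEs are matched back to work requests by id (an illustrative Python model, not the verbs API; the `signaled` flag stands in for posting unsignaled operations):

```python
from collections import deque

# Sketch of completion signalling: the "RNIC" drains the send queue and, for
# signaled requests only, enqueues a completion entry carrying the same id.
# The application then matches CQEs to its outstanding requests by that id.

def process_send_queue(send_queue, completion_queue):
    while send_queue:
        wr = send_queue.popleft()
        # ... the actual data transfer would happen here ...
        if wr.get("signaled", True):  # unsignaled WRs generate no CQE
            completion_queue.append({"wr_id": wr["wr_id"], "status": "OK"})

def poll_cq(completion_queue, pending_ids):
    """Drain the CQ, returning the ids of requests now known to be complete."""
    completed = []
    while completion_queue:
        cqe = completion_queue.popleft()
        if cqe["wr_id"] in pending_ids:
            pending_ids.discard(cqe["wr_id"])
            completed.append(cqe["wr_id"])
    return completed
```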


\paragraph{} All locally and remotely accessible memory needs to be previously registered for the RNIC to be able write to
or read from it. We call these preregistered regions of memory \emph{Memory Regions (MRs)}.
- Registering memory pins it so that it is not swapped out by the host. The process of registering is generally orders of magnitude slower
+ Registering memory pins it so that it is not swapped out by the host. The process of registering is orders of magnitude slower
then data operations like sending or writing. So in general all accessed memory is registered at connection setup. Henceforth,
we will assume these memory regions to be registered if not specified otherwise.

@@ -225,7 +228,7 @@ \subsubsection{Send / Receive} \label{sec:bg:send}

\paragraph{} To better understand this communication model we will walk through the operations involved in sending a single
message from a system A to another system B. We assume that the two nodes have already setup a connection. In this
- thesis we will not go into the details of connections setup. Each nodes has prepared a QP and associated a completion
+ thesis we will not go into the details of connection setup. Each node has prepared a QP and associated a completion
queue to it. Both systems have registered a MR of at least the size of the message to be sent.

\begin{enumerate}
@@ -238,7 +241,7 @@ \subsubsection{Send / Receive} \label{sec:bg:send}

System B now polls its CQ until it receives a CQE for its issued receive request.

- \item System A can initiate the transfer by posting a \emph{Send Request}. To do this it copies a work request to
+ \item System A initiates the transfer by posting a \emph{Send Request}. To do this it copies a work request to
the Send Queue. This request contains a pointer to its local memory containing the message to be sent
and its size. This request notifies the NIC to initiate the transfer.
It also starts polling its CQ to notice the completion of the send request.
@@ -298,17 +301,18 @@ \subsubsection{Write} \label{sec:bg:write}
\label{fig:seq-wrt}
\end{figure}

- Figure~\ref{fig:seq-wrt} show the operations involved in writing data to the remote using RDMA write. It is generally very
+ Figure~\ref{fig:seq-wrt} shows the operations involved in writing data to remote memory using RDMA write. It is generally very
similar to the send and receive sequence presented in the previous section. The sending CPU still issues a work request which
is handled by the NIC and is notified of its completion through the completion queue.
The main difference is that the remote system does not need to post a receive buffer and there is no CQE generated at the
remote. Also, the work request is a structured differently. It does not only contain a pointer to the local send buffer but
also provides the remote address to write it to.
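A toy model of the extra information a write work request carries, namely the remote destination address (all names are illustrative; on real hardware the RNIC performs this copy without involving the remote CPU, which a local function obviously cannot capture):

```python
# Toy model of an RDMA write work request: unlike a send, it names the remote
# destination itself. "Remote memory" is a plain bytearray here.

def rdma_write(remote_mem, wr):
    data = wr["local_buf"]
    off = wr["remote_addr"]                 # chosen by the sender, not by a posted receive
    remote_mem[off:off + len(data)] = data  # no CQE is generated at the remote

remote = bytearray(16)
rdma_write(remote, {"wr_id": 7, "local_buf": b"hello", "remote_addr": 4})
```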

- \paragraph{} The standard RDMA write does not generate a completion entry at the receiver, which is generally more efficient.
+ \paragraph{} The standard RDMA write does not generate a completion queue entry at the receiver,
+ which is generally more efficient.
However, sometimes it is very helpful for the receiver to be notified of a completed write. For this purpose the Verbs API
also provides a related operation called \emph{Write with Immediate}.
- This operation works very similarly to a normal RDMA write, but it generates a completion entry at the receiver, in the same way
+ This operation works very similarly to a normal RDMA write, but it generates a CQE at the receiver, in the same way
the send verb does. This means it will also consume a posted receive request, so the receiver needs to post a receive request
prior to the transfer. Write with Immediate however will not write any data to the associated receive buffer.

10 changes: 5 additions & 5 deletions doc/thesis/buffered_read.tex
@@ -2,11 +2,11 @@
\section{Buffered Read}\label{sec:conn:buf_read}

The idea of a buffered read protocol is to have a ring-buffer at the sender from which the receiver fetches the messages using
- RDMA reads. There are multiple different ways to implement such a protocol, with the main variations being in, how to notify
+ RDMA reads. There are multiple different ways to implement such a protocol, with the main difference being how to notify
the receiver of new messages, where to transfer them to, and how to acknowledge to the sender that a message has been processed.

\paragraph{} We decided to focus on an implementation which gives us a \emph{Passive Sender} and allows for
- \emph{Variable Message Sizes}. We decided to stick with the basic interface defined in Section~\ref{sec:protocols}. This
+ \emph{Variable Message Sizes}. We stuck with the basic interface defined in Section~\ref{sec:protocols}. This
results in a system with two ring-buffers, illustrated in Figure~\ref{fig:buf_read_struct}.

\begin{figure}[!ht]
@@ -65,7 +65,7 @@ \section{Buffered Read}\label{sec:conn:buf_read}
\label{fig:buf_read_struct}
\end{figure}

- \subsubsection{Protocol}
+ \subsection{Protocol}
As mentioned the sender of our buffered read connection is entirely passive, that means after the connection setup the sending
CPU does not issue any RDMA operations. The only thing it needs to do to send is to check its head if there is enough space,
copy the message to its tail, and update the tail offset. It also prepends the message size when writing.
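The sender's three steps (check space against the head, copy with a prepended length, bump the tail) can be sketched as a simplified ring-buffer model. This is Python with illustrative names; the 4-byte little-endian length prefix and the one-byte gap are our own assumptions, not necessarily the thesis implementation:

```python
# Simplified model of the passive sender's ring buffer. In the real protocol
# the head is advanced remotely by the receiver; the sender only reads it.

class SendRing:
    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0  # oldest unread byte, advanced by the receiver
        self.tail = 0  # next free byte, advanced only by the sender

    def free_space(self):
        # keep one byte unused so a full buffer is distinguishable from empty
        return self.size - ((self.tail - self.head) % self.size) - 1

    def send(self, msg):
        entry = len(msg).to_bytes(4, "little") + msg  # prepend the message size
        if len(entry) > self.free_space():
            return False  # receiver has not freed enough space yet
        for b in entry:   # copy with wrap-around
            self.buf[self.tail] = b
            self.tail = (self.tail + 1) % self.size
        return True
```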
@@ -84,7 +84,7 @@ \subsubsection{Protocol}
return the next message with the length at the beginning of the buffer and update the \code{read\_pointer}.

If there are no transmitted messages to return, the receiver updates the \code{remote\_tail} with an RDMA read. If the remote
- tail has not updated, it retries until it has. With the updated tail the receiver issues an RDMA read for the whole buffer
+ tail has not been updated, it retries until it has. With the updated tail the receiver issues an RDMA read for the whole buffer
section between the read pointer and the updated tail. It then returns the next message in the newly transmitted section.
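The receiver's side of this logic can be sketched as a simplified model (wrap-around is omitted for brevity; the two callbacks stand in for the RDMA reads, and all names are illustrative):

```python
# Sketch of the buffered-read receiver: a local mirror of the sender's buffer,
# a read pointer, and a cached copy of the remote tail.

class ReadReceiver:
    def __init__(self, size, fetch_tail, fetch_range):
        self.local = bytearray(size)
        self.read_pointer = 0
        self.remote_tail = 0
        self.fetch_tail = fetch_tail    # models the RDMA read of the tail offset
        self.fetch_range = fetch_range  # models the RDMA read of a buffer section

    def receive(self):
        if self.read_pointer == self.remote_tail:
            # Nothing buffered locally: re-read the remote tail until it moves,
            # then fetch the whole section between read pointer and new tail.
            tail = self.fetch_tail()
            while tail == self.remote_tail:
                tail = self.fetch_tail()
            self.local[self.read_pointer:tail] = self.fetch_range(self.read_pointer, tail)
            self.remote_tail = tail
        # Return the next length-prefixed message from the local mirror.
        n = int.from_bytes(self.local[self.read_pointer:self.read_pointer + 4], "little")
        msg = bytes(self.local[self.read_pointer + 4:self.read_pointer + 4 + n])
        self.read_pointer += 4 + n
        return msg
```

The model shows the cost asymmetry directly: most calls return from the local mirror, and only the occasional call pays for two remote reads.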

This gives us an interesting characteristic of most receives being very cheap, while some are very expensive as they need to
@@ -109,5 +109,5 @@ \subsection{Feature Analysis}


\paragraph{} For future work it would also be interesting to implement a system where multiple threads can write to and receive
- from a single connection by atomically updating head or tail pointers, or we could explore the possibilities to share send or
+ from a single connection by atomically updating head or tail pointers, or we could explore the possibilities of sharing send or
receive buffers using atomic operations. The current implementation, however, does not provide any kind of \emph{Resource Sharing}
2 changes: 1 addition & 1 deletion doc/thesis/conclusion.tex
@@ -18,7 +18,7 @@ \section{Conclusion}

\paragraph{} We analysed the presented protocols and evaluated more then just raw performance. We looked at other
features that can be critical for applications. These features include achieving effective memory usage by allowing
- variable message sizes, avoiding additional copying by being truly zero-copy, or one slow processing message not
+ variable message sizes, avoiding additional copying by being truly zero-copy, or one slowly processing message not
being able to stall the whole connection.

\paragraph{} We introduced a new performance model for RDMA based message passing protocols that allows us to
12 changes: 6 additions & 6 deletions doc/thesis/direct_read.tex
@@ -1,9 +1,9 @@

\section{Direct Read} \label{sec:conn:direct_read}

- In Section~\ref{sec:conn:direct_write} we discussed how we can possibly avoid an additional copy at the receiver by giving
+ In Section~\ref{sec:conn:direct_write} we discussed how we can possibly avoid an additional copy at the receiver, by giving
the sender information which allows him to potentially write the data to the correct final memory location. The next logical
- step is to let the receiver decide for each message where to write it to. We can achieve this by our implementation of a
+ step is to let the receiver decide for each message where to write it to. We can achieve this with our implementation of a
\emph{Direct Read Connection}.

\paragraph{} The core idea of a direct read protocol is that instead of directly sending a message through a send or write
@@ -24,8 +24,8 @@ \subsection{Protocol}
\paragraph{} To wait for the transfer to be completed, and for the buffer to be able to be reused, we can not simply wait
for the completion event of the send, like we do for the send or write based connections. We need to wait for the receiver
to explicitly signal that the buffer was transfered. We append a signaling byte at the end of the send buffer.
- When sending this byte, will be set to 0 and we can wait for the transport to be completed by polling this byte until the
- receiver will update it.
+ When sending, this byte is set to 0 and we can wait for the transport to be completed by polling this byte until the
+ receiver updates it.
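The signaling-byte handshake can be sketched as follows (a thread stands in for the remote side's RDMA write and read; buffer size, sleep times, and names are arbitrary assumptions):

```python
import threading
import time

# Sketch of the signaling byte: the sender zeroes the byte just past the
# payload and spins on it; the receiver flips it to 1 once the buffer has
# been fetched, signalling that the send buffer may be reused.

def send_and_wait(buf, payload):
    buf[:len(payload)] = payload
    buf[len(payload)] = 0            # signaling byte, cleared when sending
    while buf[len(payload)] == 0:    # poll until the receiver updates it
        time.sleep(0.001)

def remote_receiver(buf, n, out):
    time.sleep(0.01)                 # pretend the RDMA read takes a while
    out.append(bytes(buf[:n]))
    buf[n] = 1                       # signal: transfer complete

buf, received = bytearray(8), []
t = threading.Thread(target=remote_receiver, args=(buf, 5, received))
t.start()
send_and_wait(buf, b"hello")
t.join()
```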

This push based implementation introduces little additional complexity, but there are other ways to implement such
signaling. The signaling bit forces us to use a specific memory arrangement, which could prevent us to send data directly
@@ -39,8 +39,8 @@ \subsection{Protocol}
It is crucial that we do not block until the read is completed to get reasonable performance. This means the receiver has a
slightly different interface than the previously presented connections. We split the receive call into a
\code{RequestAsync} and a \code{Wait} method. The \code{RequestAsync} takes a receive buffer
- to read into. It will wait for an incoming read request and issue the corresponding read. It uses the same increasing
- \code{wr\_id} approach we use for sending with which the \code{Wait} method can wait for the read to complete. This approach
+ to read into. It waits for an incoming read request and issues the corresponding read. It uses the same increasing
+ \code{wr\_id} approach we use for sending, with which the \code{Wait} method can wait for the read to complete. This approach
allows us to pipeline receives the same way we pipeline sends.

\paragraph{} As soon as the transfer is complete, the receiver updates the corresponding signaling bit using an RDMA write.
9 changes: 5 additions & 4 deletions doc/thesis/direct_write.tex
@@ -5,7 +5,7 @@ \section{Direct Write}\label{sec:conn:direct_write}
regions for the sender to write to.

\paragraph{} For this thesis we implemented something very reminiscent of the send receive protocol. The core idea is for
- the receiver to send information on prepared receive buffers to the sender. The sender will then use these buffers in order.
+ the receiver to send information on prepared receive buffers to the sender. The sender will then use these buffers.



@@ -104,19 +104,20 @@ \subsection{Protocol}
order, for any modern systems~\cite{herd, farm}. This guarantees us that the complete message has been written as soon a we
see an update to the last byte.
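The last-byte check can be sketched as follows (illustrative only; it assumes the receive slot is zeroed before reuse and that the message's final byte is guaranteed non-zero):

```python
# Because RNIC writes become visible in increasing address order, a non-zero
# last byte implies the rest of the fixed-size message has already landed.

def message_complete(slot):
    return slot[-1] != 0   # the receiver polls only this byte

slot = bytearray(8)        # zeroed receive slot, no message yet
incomplete = message_complete(slot)
slot[:] = b"payload\x01"   # writer places a non-zero byte last
```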

- \subsection{Features Analysis}
+ \subsection{Feature Analysis}

With our Direct Write connection we essentially rebuilt the send and receive verbs using only RDMA writes. This gives us
similar features but also allows for more control over the protocol. This could potentially allow us to extend it for
specific systems.

\paragraph{} Our current implementation fulfills our requirements for being \emph{non-blocking}, the same way the send-receive
- protocol does. It however does not provide any \emph{interrupts} and we did not explore any \emph{resource sharing} approaches.
+ protocol does. It, however, does not provide any \emph{interrupts} and we did not explore any \emph{resource sharing} approaches.
As it stands, it also does not provide \emph{True Zero Copy} capabilities and it lacks support for
\emph{variable message sizes}.


- \paragraph{} We do however think that this protocol can be adapted to enable more features. By adding metadata when posting a
+ \paragraph{} Nevertheless, we do think that this protocol can be adapted to enable more features.
+ By adding metadata when posting a
buffer we could enable \emph{True Zero Copy} capabilities for certain applications, by performing some routing decisions at
the sender and being able to directly write to the correct final memory location.
And with a more sophisticated buffer management one could more effectively utilize the available memory, by supporting
