Improve GPU offloading mechanisms #6

Open

achimnol opened this issue May 5, 2015 · 0 comments

achimnol commented May 5, 2015

The current GPU offloading implementation has several limitations:

  • For subsequent offloadable elements, we should just skip the "aggregation" phase for the previous "batch of batches" that was already offloaded.
  • The aggregation phase and/or the load balancer should perform "adaptive batching" -- when there are only a few packets/batches, we should stick to CPUs instead of GPUs.
    • Currently there is no way to compare the queue lengths of CPUs and GPUs, because CPUs do not have processing input queues at all! However, we could determine whether to use CPUs or GPUs by inspecting the packet aggregation array in the RX phase, as SSLShader and Kargus do.
    • We need to combine "opportunistic/dynamic" offloading with our adaptive load balancing algorithm.
  • The current aggregation phase only counts the number of batches; we need to do this more intelligently -- for example, use total payload sizes for variable-length datablocks and the number of packets for fixed-length datablocks. (Of course, such differentiation must be kept lightweight; see the sketch below.)
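A minimal sketch of what such a datablock-aware, adaptive counter could look like; the `DatablockKind`, `AggStats`, and threshold names are hypothetical and only illustrate weighting variable-length datablocks by payload bytes and fixed-length ones by packet count:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration: weight the aggregation counter differently
// depending on the datablock layout, instead of counting batches only.
enum class DatablockKind { FIXED_LENGTH, VARIABLE_LENGTH };

struct AggStats {
    uint64_t weighted_units = 0;   // packets or payload bytes, depending on kind
    uint64_t num_batches    = 0;
};

// Accumulate one batch into the aggregation statistics.
inline void account_batch(AggStats &st, DatablockKind kind,
                          size_t num_packets, size_t total_payload_bytes)
{
    st.num_batches++;
    st.weighted_units += (kind == DatablockKind::FIXED_LENGTH)
                         ? num_packets
                         : total_payload_bytes;
}

// Adaptive decision: offload only when enough work has accumulated.
// The threshold would have to be tuned per datablock type.
inline bool should_offload(const AggStats &st, uint64_t offload_threshold)
{
    return st.weighted_units >= offload_threshold;
}
```

An adaptive load balancer could then compare weighted_units against a per-datablock threshold to decide between the CPU path and the GPU offload path.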
@achimnol achimnol changed the title Improving GPU offloading mechanisms Improve GPU offloading mechanisms May 7, 2015
achimnol added a commit that referenced this issue May 21, 2015
 * This prevents potential bugs due to mismatches between actual system
   parameter values (given by the configs) and their limits (used by
   data structures).  I had a problem with the COPROC_PPDEPTH value: it
   was set to 64 while its limit was 32, and thus offloading never
   happened!
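A minimal sketch of the kind of startup validation this implies, assuming a compile-time limit and a runtime config value; the identifiers below are placeholders, not NBA's actual config API:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical limit as a compile-time constant (e.g., it sizes a fixed array).
static constexpr unsigned MAX_COPROC_PPDEPTH = 32;

// Validate a runtime config value against the compile-time limit at startup,
// instead of silently exceeding it and breaking offloading.
inline unsigned validate_coproc_ppdepth(unsigned configured)
{
    if (configured > MAX_COPROC_PPDEPTH) {
        fprintf(stderr,
                "config error: coproc_ppdepth=%u exceeds limit %u\n",
                configured, MAX_COPROC_PPDEPTH);
        exit(EXIT_FAILURE);
    }
    return configured;
}
```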
achimnol added a commit that referenced this issue May 26, 2015
 * When using heterogeneous GPUs, there are multiple occurrences of CUDA
   kernels for different PTX assembly versions.  I don't know exactly how
   nvcc treats "static const" variables when generating multiple
   cubin binaries, but we can just choose NOT to depend on such
   behaviour for device-side datablock indices.
achimnol added a commit that referenced this issue May 27, 2015
 * Yes, now CUDA (>= 6.5) supports C++11 and we can get rid of the
   macro value mismatch!
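A minimal sketch of the macro-to-C++11 migration this enables; the names below are placeholders rather than the actual project macros:

```cpp
// Before: a preprocessor macro whose value can silently diverge between
// translation units or between host and device code.
// #define NBA_EXAMPLE_DEPTH 32

// After: a typed compile-time constant shared by host and device code,
// so any mismatch is caught as a single definition rather than two macros.
constexpr unsigned kExampleDepth = 32;

static_assert(kExampleDepth > 0, "depth must be positive");
```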
achimnol added a commit that referenced this issue May 28, 2015
 * We need to return the pointer "before" adding the newly allocated
   size (in lib/mempool.hh); see the sketch after this commit message.

 * Now the GPU does not crash, but it still hangs after processing the
   first offload task.
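A minimal sketch of the bump-allocator pattern behind this fix, assuming a simplified mempool with a single cursor; the real lib/mempool.hh interface may differ:

```cpp
#include <cstddef>
#include <cstdint>

// Simplified bump allocator: the bug class described above is returning
// the cursor *after* advancing it, which hands out the wrong address.
struct SimpleMempool {
    uint8_t *base;
    size_t   capacity;
    size_t   used = 0;

    void *alloc(size_t size) {
        if (used + size > capacity)
            return nullptr;              // out of space
        void *ptr = base + used;         // take the pointer BEFORE advancing
        used += size;                    // then add the newly allocated size
        return ptr;
    }
};
```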
achimnol added a commit that referenced this issue Sep 13, 2015
…ation branch.

 * Backported changes

   - Change NBA_MAX_PROCESSOR_TYPES to NBA_MAX_COPROCESSOR_TYPES
     to count coprocessors from zero instead of one.

   - Add xmem argument to FixedRing to allow use of externally allocated
     memory area for the ring.

   - Add graph analysis for datablock reuses.
     It shows when to preproc/postproc datablocks during element graph
     initialization.  (Not working yet...)

     . Also add check_preproc(), check_postproc(), check_postproc_all()
       methods to ElementGraph for later use.

   - Refactor and optimize scanning of schedulable elements.

   - Refactor OffloadableElement to make it schedulable for
     consistency.  This moves task preparation code into
     OffloadableElement from ElementGraph.

   - Remove support for the "sleepy" IO loop.

 * Excluded changes

   - Change the IO loop to not consume all received packets,
     but instead to call comp_process_batch() only once per iteration.
     Use the number of packets exceeding the computation batch size
     to reduce IO polling overheads.

     => Rejected since it actually reduces the performance by about 10%
        with CPU-only configurations.

 * New changes

   - Move invocations of elemgraph->flush_* methods into the ev_check event
     handler for brevity and a reduced possibility of mistakes.

 * Performance impacts

   - There is no degradation of CPU-only or GPU-only performance
     compared to the previous commit.
achimnol added a commit that referenced this issue Dec 24, 2015
 * Merry Christmas!

 * Adds "io_base" concept to pipeline multiple offload tasks in each
   worker thread and allow reuse of datablocks in subsequent offloadable
   elements.

   - Unlike the historical initial implementation, we now reuse
     offload task objects WITHOUT re-aggregation of batches between
     subsequent offloadable elements.

   - For this, elementgraph->tasks now holds both PacketBatch and
     OffloadTask using a bitmask type specifier on void* pointers in the
     ring (see the pointer-tagging sketch after this commit message).
     Depending on the task type, ElementGraph now chooses whether
     to run the normal pipeline loop or to feed offloadable elements so
     that they begin the offloading process immediately.

 * Preserves generalization of batching schemes.
   (Yes, it was a huge manual merge job..)

 * TODO: GPU versions do not work yet (as always expected). Let's debug!
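A minimal sketch of the pointer-tagging idea described above, assuming ring entries are at least 2-byte aligned so the low bit of each void* is free; the tag values and helper names are illustrative, not the project's actual ones:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative tag values stored in the low bit of a void* kept in the ring.
enum : uintptr_t { TAG_PACKET_BATCH = 0x0, TAG_OFFLOAD_TASK = 0x1, TAG_MASK = 0x1 };

// Encode a pointer together with its type before pushing it into the ring.
inline void *tag_ptr(void *p, uintptr_t tag) {
    assert((reinterpret_cast<uintptr_t>(p) & TAG_MASK) == 0);  // requires alignment
    return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(p) | tag);
}

// Decode: recover the type specifier and the original pointer.
inline uintptr_t ptr_tag(void *p) {
    return reinterpret_cast<uintptr_t>(p) & TAG_MASK;
}
inline void *untag_ptr(void *p) {
    return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(p) & ~TAG_MASK);
}
```

On the consumer side, the dispatcher would check ptr_tag() and either run the normal pipeline loop for a PacketBatch or feed the OffloadTask directly to its offloadable element.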
achimnol added a commit that referenced this issue Dec 27, 2015
achimnol added a commit that referenced this issue Dec 28, 2015
 * Minor optimization using unlikely, based on perf monitoring results
   (about 3%).

 * IPv4/IPv6 works well as before, but IPsec still hangs.

 * Removes dead code.
achimnol added a commit that referenced this issue Jan 10, 2016
 * IPsec works well with task/datablock reuse optimization.
   - However, this refactored architecture has high libev/syscall
     overheads (> 10% CPU cycles of worker threads).
   - TODO: optimize it...

 * When reusing tasks,
   - we should keep the task itself (do not free it!) and
   - we should update task->elem as well as task->tracker.element.

 * There was a serious bug where GPU input buffers reused for outputs
   (for cases when roi/wri is WHOLE_PACKET) were not actually included
   in device-to-host copies, resulting in NO take-back of computation
   results.
   - Currently we allocate an output buffer explicitly without such
     buffer reuse optimization.
   - TODO: reuse the input buffer and include its offsets/lengths in the
     coalescing of d2h copies (see the sketch below).
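A minimal sketch of what coalescing device-to-host copy ranges could look like; the CopyRange type and merging rule are assumptions for illustration, not the actual offloadtask.cc code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One (device offset, length) range that must be copied back to the host.
struct CopyRange {
    size_t offset;
    size_t length;
};

// Merge adjacent or overlapping ranges so that fewer, larger d2h copies
// are issued instead of one copy per datablock or packet.
inline std::vector<CopyRange> coalesce(std::vector<CopyRange> ranges) {
    if (ranges.empty()) return ranges;
    std::sort(ranges.begin(), ranges.end(),
              [](const CopyRange &a, const CopyRange &b) { return a.offset < b.offset; });
    std::vector<CopyRange> merged{ranges.front()};
    for (size_t i = 1; i < ranges.size(); i++) {
        CopyRange &last = merged.back();
        if (ranges[i].offset <= last.offset + last.length) {
            last.length = std::max(last.length,
                                   ranges[i].offset + ranges[i].length - last.offset);
        } else {
            merged.push_back(ranges[i]);
        }
    }
    return merged;
}
```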
achimnol added a commit that referenced this issue Jan 12, 2016
 * Uses the latest version of libev (4.2.22 instead of 4.2.15)

   - To use default settings, install it inside /usr/local.
     (Run ./configure --prefix=/usr/local && make && sudo make install)

   - This does not show significant performance changes.

 * Uses blocking calls to ev_run() when waiting for new batch objects to
   become available, and invokes ev_break() when we release batch objects
   (see the sketch after this commit message).

   - This reduces excessive ev_run() polling overheads and improves the
     performance by 30%.  (Still, the absolute number is too low...)

 * Uses already-dynamic-cast references in scan_offloadable_elements().

   - This reduces the CPU cycles used by scan_offloadable_elements(),
     but still yields no performance gain.

 * Tried no-opping GPU kernel execution, but it does not give
   any performance improvements.

   - This means that the current bottleneck is not the computation itself.
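A minimal sketch of the blocking-wait pattern with libev, assuming the releasing side signals availability through an ev_async watcher; the watcher and function names are illustrative, not NBA's actual structures:

```cpp
#include <ev.h>

static struct ev_loop *loop;
static ev_async batch_avail_watcher;

// Runs inside the loop when the releasing side signals availability:
// break out of the blocking ev_run() so the waiter can retry allocation.
static void on_batch_available(struct ev_loop *l, ev_async *w, int revents) {
    ev_break(l, EVBREAK_ONE);
}

// Waiting side: instead of spinning with EVRUN_NOWAIT, block until signaled.
static void wait_for_batch_objects(void) {
    ev_run(loop, 0);             // blocks until ev_break() is called
}

// Releasing side: wake up the blocked ev_run().
static void release_batch_object(void) {
    // ... return the batch object to its pool ...
    ev_async_send(loop, &batch_avail_watcher);
}

// Setup (once, during initialization).
static void init_waiter(void) {
    loop = ev_loop_new(EVFLAG_AUTO);
    ev_async_init(&batch_avail_watcher, on_batch_available);
    ev_async_start(loop, &batch_avail_watcher);
}
```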
achimnol added a commit that referenced this issue Jan 13, 2016
 * In libev, check watchers are called when there ARE pending events to
   execute, while prepare watchers are called when ev_run() is about to
   BLOCK due to no pending events.
   The timing when we need to flush any pending tasks is the latter,
   not the former.

   In my tests, using check watchers occasionally gives performance
   fluctuations of more than 50% (about once per 5-10 seconds), while
   prepare watchers do not show such symptoms.

 * Removes no-longer-used semi-coalesced code in offloadtask.cc.
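A minimal sketch of registering such a flush hook as an ev_prepare watcher; the flush call inside the callback is a placeholder for the actual element-graph flush methods:

```cpp
#include <ev.h>

static ev_prepare flush_watcher;

// Runs right before ev_run() would block: flush any tasks that are still
// pending so they are not delayed until the next wake-up.
static void flush_pending_tasks(struct ev_loop *loop, ev_prepare *w, int revents) {
    // elemgraph->flush_delayed_batches();   // placeholder for the real flush calls
}

static void install_flush_hook(struct ev_loop *loop) {
    ev_prepare_init(&flush_watcher, flush_pending_tasks);
    ev_prepare_start(loop, &flush_watcher);
}
```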
achimnol added a commit that referenced this issue Jan 13, 2016
 * No significant performance changes.
achimnol added a commit that referenced this issue Jan 13, 2016
 * Some operations were not being executed properly when checking the
   TASK_PREPARED status in send_offload_task_to_device().

 * Still, it needs to be improved further.
achimnol added a commit that referenced this issue Jan 13, 2016
achimnol added a commit that referenced this issue Jan 13, 2016
 * Now the performance is affected by the computation.
achimnol added a commit that referenced this issue Jan 14, 2016
 * Uses DPDK's memzone instead of CUDA's portable buffers.

 * Keep it as an optional feature; the default is not to use it.
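A minimal sketch contrasting the two allocation paths mentioned above; the memzone name, size, and socket arguments are placeholders:

```cpp
#include <cstddef>
#include <rte_memzone.h>
#include <cuda_runtime.h>

// Option A (CUDA portable pinned buffer): host memory visible to all CUDA contexts.
void *alloc_portable_buffer(size_t size) {
    void *ptr = nullptr;
    cudaHostAlloc(&ptr, size, cudaHostAllocPortable);  // error checking omitted
    return ptr;
}

// Option B (DPDK memzone): reserve hugepage-backed memory on a NUMA socket,
// which can then be registered/pinned for the GPU as needed.
const struct rte_memzone *alloc_memzone_buffer(size_t size, int socket_id) {
    return rte_memzone_reserve("offload_io_buffer", size, socket_id, 0);
}
```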
achimnol added a commit that referenced this issue Jan 18, 2016
 * Removes "expect_false" on EVRUN_NOWAIT in libev's ev_run() code.
   It cuts down the CPU cycle usage by ev_run() to half!
   (Still, our performance bottleneck is not on the libev itself.
   When we observe high CPU percentage on libev, it means that the CPU
   is wasting its cycles.)

 * Limits NBA_MAX_IO_BASES from 17 to 1.
   This reduces the size of the memory area used by both the CPU and the GPU.
   It means that we now have bottlenecks in the memory/cache subsystems.

   - Adds a blocking ev_run() call to wait for io_bases to become
     available, using the same technique as for waiting on batch pools:
     ev_run() <=> ev_break() pairs.

   - This provides performance improvements.

 * Increases offset sizes from uint16_t to uint32_t for running
   IPsec with MTU-sized packets, where offsets may exceed 65535.
   This has been the main reason for frequent errors when running IPsec
   with large packets (>= 512 bytes).

   - This decreases performance.

 => The above performance improvement and degradation compensate for each
    other, so there is no performance change compared to the previous
    commit.

 * Reduces memory footprint by using a variable-sized array in datablock
   arguments.  However, this does not yield significant performance
   changes because we already have "full" aggregated batches when
   offloading IPsec encryption due to computation/memory bottlenecks.
achimnol added a commit that referenced this issue Jan 22, 2016
 * This restores former performance improvements!
   (Now ~3.4 Gbps per node with IPsec@64B, previously ~2.6 Gbps)
achimnol added a commit that referenced this issue Jan 22, 2016
 * This eases experiments that compare performance across different offset sizes.

 * Confirmed performance drop when dev_offset_t is ShiftedInt<uint32_t, 0>.
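A minimal sketch of what a ShiftedInt-style compact offset type might look like; this is an assumption about its intent (storing offsets divided by 2^Shift in a narrow integer), not the project's actual definition:

```cpp
#include <cstdint>

// Hypothetical compact offset: stores value >> Shift in a narrow integer,
// trading granularity for a smaller per-offset memory footprint.
template <typename StorageT, unsigned Shift>
struct ShiftedIntSketch {
    StorageT raw;

    ShiftedIntSketch(uint64_t value = 0) : raw(static_cast<StorageT>(value >> Shift)) {}
    operator uint64_t() const { return static_cast<uint64_t>(raw) << Shift; }
};

// e.g., ShiftedIntSketch<uint16_t, 2> can represent offsets up to 262140
// at 4-byte granularity, while ShiftedIntSketch<uint32_t, 0> stores exact
// offsets but doubles the per-offset storage size.
```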
achimnol added a commit that referenced this issue Feb 8, 2016
 * Tried to support OpenCL, but confirmed that we need OpenCL 2.0+.
   (which is not supported by the current generation of Xeon Phi...)

   - Related code will be rewritten someday using SVM (shared virtual
     memory) APIs in OpenCL 2.0+.

 * Reduced memory footprint of batch_ids array passed to the device.

 * Rolled back the ev_prepare -> ev_check watcher type change (c732a25),
   as it broke CPU-only cross-node IP forwarding scenarios. :(

 * TODO: fix IPsec GPU-only mode...
achimnol added a commit that referenced this issue Feb 11, 2016
 * Enforce the same alignment of data structures shared by the host CPU and
   CUDA GPUs using the C++11 "alignas" keyword (see the sketch after this
   commit message).

 * Fix wrong uses of pkt_idx where they should be item_idx.
   (Note that IPsec parallelizes by the unit of "blocks", which are
    16-byte slices of packets.)

 * Remove some unnecessary branches in IPsecAES kernels.

 * Let the CUDA engine ignore "cudaErrorCudartUnloading", which
   may be returned from API calls during program termination.

 * Now the performance is half of the CPU version with 64-B packets.
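A minimal sketch of the alignas pattern referenced above; the struct and field names are illustrative, not the actual shared datablock argument layout:

```cpp
#include <cstdint>

// Host and device compilers may pick different natural alignments for the
// same struct; forcing one explicitly keeps the layouts byte-compatible.
struct alignas(8) datablock_kernel_arg_sketch {
    uint32_t total_item_count;
    uint32_t item_size;
    void    *buffer_base;
};

static_assert(alignof(datablock_kernel_arg_sketch) == 8,
              "host-side alignment must match the device-side definition");
```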
achimnol added a commit that referenced this issue Feb 24, 2016
 * It means that the CPU version and the GPU version yield the same
   results for the same inputs.
achimnol added a commit that referenced this issue Mar 5, 2016
 * Replaces them with the new accumulated index calculator.