Improve GPU offloading mechanisms #6

Open

achimnol opened this issue May 5, 2015 · 0 comments

achimnol commented May 5, 2015

The current GPU offloading implementation has several limitations:

  • For subsequent offloadable elements, we should just skip the "aggregation" phase for the previous "batch of batches" that was already offloaded.
  • The aggregation phase and/or the load balancer should perform "adaptive batching" -- when there are only a few packets/batches, we should stick to CPUs instead of GPUs.
    • Currently there is no way to compare the queue lengths of CPUs and GPUs, because CPUs do not have processing input queues at all! However, we could determine whether to use CPUs or GPUs by inspecting the packet aggregation array in the RX phase, as SSLShader and Kargus do.
    • We need to combine "opportunistic/dynamic" offloading with our adaptive load balancing algorithm.
  • The current aggregation phase only counts the number of batches; we need to do this more intelligently -- for example, use total payload sizes for variable-length datablocks and the number of packets for fixed-length datablocks. (Of course, such differentiation must be kept lightweight; see the sketch below.)
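A minimal sketch of what such a datablock-aware, adaptive counter could look like; the `DatablockKind`, `AggStats`, and threshold names are hypothetical and only illustrate weighting variable-length datablocks by payload bytes and fixed-length ones by packet count:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration: weight the aggregation counter differently
// depending on the datablock layout, instead of counting batches only.
enum class DatablockKind { FIXED_LENGTH, VARIABLE_LENGTH };

struct AggStats {
    uint64_t weighted_units = 0;   // packets or payload bytes, depending on kind
    uint64_t num_batches    = 0;
};

// Accumulate one batch into the aggregation statistics.
inline void account_batch(AggStats &st, DatablockKind kind,
                          size_t num_packets, size_t total_payload_bytes)
{
    st.num_batches++;
    st.weighted_units += (kind == DatablockKind::FIXED_LENGTH)
                         ? num_packets
                         : total_payload_bytes;
}

// Adaptive decision: offload only when enough work has accumulated.
// The threshold would have to be tuned per datablock type.
inline bool should_offload(const AggStats &st, uint64_t offload_threshold)
{
    return st.weighted_units >= offload_threshold;
}
```

An adaptive load balancer could then compare weighted_units against a per-datablock threshold to decide between the CPU path and the GPU offload path.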
@achimnol achimnol changed the title Improving GPU offloading mechanisms Improve GPU offloading mechanisms May 7, 2015
achimnol added a commit that referenced this issue May 21, 2015
 * This prevents potential bugs due to mismatches between actual system
   parameter values (given by the configs) and their limits (used by
   data structures).  I had a problem with the COPROC_PPDEPTH value: it
   was set to 64 while its limit was 32, and thus offloading never
   happened!
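A minimal sketch of the kind of startup validation this implies, assuming a compile-time limit and a runtime config value; the identifiers below are placeholders, not NBA's actual config API:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical limit as a compile-time constant (e.g., it sizes a fixed array).
static constexpr unsigned MAX_COPROC_PPDEPTH = 32;

// Validate a runtime config value against the compile-time limit at startup,
// instead of silently exceeding it and breaking offloading.
inline unsigned validate_coproc_ppdepth(unsigned configured)
{
    if (configured > MAX_COPROC_PPDEPTH) {
        fprintf(stderr,
                "config error: coproc_ppdepth=%u exceeds limit %u\n",
                configured, MAX_COPROC_PPDEPTH);
        exit(EXIT_FAILURE);
    }
    return configured;
}
```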
achimnol added a commit that referenced this issue May 26, 2015
 * When using heterogeneous GPUs, there are multiple occurrences of CUDA
   kernels for different PTX assembly versions.  I don't know exactly how
   nvcc treats "static const" variables when generating multiple
   cubin binaries, but we can just choose NOT to depend on such
   behaviour for device-side datablock indices.
achimnol added a commit that referenced this issue May 27, 2015
 * Yes, now CUDA (>= 6.5) supports C++11 and we can get rid of the
   macro value mismatch!
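A minimal sketch of the macro-to-C++11 migration this enables; the names below are placeholders rather than the actual project macros:

```cpp
// Before: a preprocessor macro whose value can silently diverge between
// translation units or between host and device code.
// #define NBA_EXAMPLE_DEPTH 32

// After: a typed compile-time constant shared by host and device code,
// so any mismatch is caught as a single definition rather than two macros.
constexpr unsigned kExampleDepth = 32;

static_assert(kExampleDepth > 0, "depth must be positive");
```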
achimnol added a commit that referenced this issue May 28, 2015
 * We need to return the pointer "before" adding the newly allocated
   size (in lib/mempool.hh); see the sketch after this commit message.

 * Now the GPU does not crash, but it still hangs after processing the
   first offload task.
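A minimal sketch of the bump-allocator pattern behind this fix, assuming a simplified mempool with a single cursor; the real lib/mempool.hh interface may differ:

```cpp
#include <cstddef>
#include <cstdint>

// Simplified bump allocator: the bug class described above is returning
// the cursor *after* advancing it, which hands out the wrong address.
struct SimpleMempool {
    uint8_t *base;
    size_t   capacity;
    size_t   used = 0;

    void *alloc(size_t size) {
        if (used + size > capacity)
            return nullptr;              // out of space
        void *ptr = base + used;         // take the pointer BEFORE advancing
        used += size;                    // then add the newly allocated size
        return ptr;
    }
};
```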
achimnol added a commit that referenced this issue Sep 13, 2015
…ation branch.

 * Backported changes

   - Change NBA_MAX_PROCESSOR_TYPES to NBA_MAX_COPROCESSOR_TYPES
     to count coprocessors from zero instead of one.

   - Add xmem argument to FixedRing to allow use of externally allocated
     memory area for the ring.

   - Add graph analysis for datablock reuses.
     It shows when to preproc/postproc datablocks during element graph
     initialization.  (Not working yet...)

     . Also add check_preproc(), check_postproc(), check_postproc_all()
       methods to ElementGraph for later use.

   - Refactor and optimize scanning of schedulable elements.

   - Refactor OffloadableElement to make it schedulable for
     consistency.  This moves task preparation code into
     OffloadableElement from ElementGraph.

   - Remove support for the "sleepy" IO loop.

 * Excluded changes

   - Change the IO loop to not consume all received packets,
     but instead to call comp_process_batch() only once per iteration.
     Use the number of packets exceeding the computation batch size
     to reduce IO polling overheads.

     => Rejected since it actually reduces the performance by about 10%
        with CPU-only configurations.

 * New changes

   - Move invocations of elemgraph->flush_* methods into the ev_check event
     handler for brevity and a reduced possibility of mistakes.

 * Performance impacts

   - There is no degradation of CPU-only or GPU-only performance
     compared to the previous commit.
achimnol added a commit that referenced this issue Dec 24, 2015
 * Merry Christmas!

 * Adds "io_base" concept to pipeline multiple offload tasks in each
   worker thread and allow reuse of datablocks in subsequent offloadable
   elements.

   - Unlike the historical initial implementation, we now reuse
     offload task objects WITHOUT re-aggregation of batches between
     subsequent offloadable elements.

   - For this, elementgraph->tasks now holds both PacketBatch and
     OffloadTask using a bitmask type specifier on void* pointers in the
     ring (see the pointer-tagging sketch after this commit message).
     Depending on the task type, ElementGraph now chooses whether
     to run the normal pipeline loop or to feed offloadable elements so
     that they begin the offloading process immediately.

 * Preserves generalization of batching schemes.
   (Yes, it was a huge manual merge job..)

 * TODO: GPU versions do not work yet (as always expected). Let's debug!
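A minimal sketch of the pointer-tagging idea described above, assuming ring entries are at least 2-byte aligned so the low bit of each void* is free; the tag values and helper names are illustrative, not the project's actual ones:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative tag values stored in the low bit of a void* kept in the ring.
enum : uintptr_t { TAG_PACKET_BATCH = 0x0, TAG_OFFLOAD_TASK = 0x1, TAG_MASK = 0x1 };

// Encode a pointer together with its type before pushing it into the ring.
inline void *tag_ptr(void *p, uintptr_t tag) {
    assert((reinterpret_cast<uintptr_t>(p) & TAG_MASK) == 0);  // requires alignment
    return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(p) | tag);
}

// Decode: recover the type specifier and the original pointer.
inline uintptr_t ptr_tag(void *p) {
    return reinterpret_cast<uintptr_t>(p) & TAG_MASK;
}
inline void *untag_ptr(void *p) {
    return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(p) & ~TAG_MASK);
}
```

On the consumer side, the dispatcher would check ptr_tag() and either run the normal pipeline loop for a PacketBatch or feed the OffloadTask directly to its offloadable element.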
achimnol added a commit that referenced this issue Dec 27, 2015
achimnol added a commit that referenced this issue Dec 28, 2015
 * Minor optimization using unlikely, based on perf monitoring results
   (about 3%).

 * IPv4/IPv6 works well as before, but IPsec still hangs.

 * Removes dead code.
achimnol added a commit that referenced this issue Jan 10, 2016
 * IPsec works well with task/datablock reuse optimization.
   - However, this refactored architecture has high libev/syscall
     overheads (> 10% CPU cycles of worker threads).
   - TODO: optimize it...

 * When reusing tasks,
   - we should keep the task itself (do not free it!) and
   - we should update task->elem as well as task->tracker.element.

 * There was a serious bug where GPU input buffers reused for outputs
   (for cases when roi/wri is WHOLE_PACKET) were not actually included
   in device-to-host copies, resulting in NO take-back of computation
   results.
   - Currently we allocate an output buffer explicitly without such
     buffer reuse optimization.
   - TODO: reuse the input buffer and include its offsets/lengths in the
     coalescing of d2h copies (see the sketch below).
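A minimal sketch of what coalescing device-to-host copy ranges could look like; the CopyRange type and merging rule are assumptions for illustration, not the actual offloadtask.cc code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One (device offset, length) range that must be copied back to the host.
struct CopyRange {
    size_t offset;
    size_t length;
};

// Merge adjacent or overlapping ranges so that fewer, larger d2h copies
// are issued instead of one copy per datablock or packet.
inline std::vector<CopyRange> coalesce(std::vector<CopyRange> ranges) {
    if (ranges.empty()) return ranges;
    std::sort(ranges.begin(), ranges.end(),
              [](const CopyRange &a, const CopyRange &b) { return a.offset < b.offset; });
    std::vector<CopyRange> merged{ranges.front()};
    for (size_t i = 1; i < ranges.size(); i++) {
        CopyRange &last = merged.back();
        if (ranges[i].offset <= last.offset + last.length) {
            last.length = std::max(last.length,
                                   ranges[i].offset + ranges[i].length - last.offset);
        } else {
            merged.push_back(ranges[i]);
        }
    }
    return merged;
}
```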
achimnol added a commit that referenced this issue Jan 12, 2016
 * Uses the latest version of libev (4.2.22 instead of 4.2.15)

   - To use default settings, install it inside /usr/local.
     (Run ./configure --prefix=/usr/local && make && sudo make install)

   - This does not show significant performance changes.

 * Uses blocking calls to ev_run() when waiting for new batch objects to
   become available, and invokes ev_break() when we release batch objects
   (see the sketch after this commit message).

   - This reduces excessive ev_run() polling overheads and improves the
     performance by 30%.  (Still, the absolute number is too low...)

 * Uses already-dynamic-cast references in scan_offloadable_elements().

   - This reduces the CPU cycles used by scan_offloadable_elements(),
     but still yields no performance gain.

 * Tried no-opping GPU kernel execution, but it does not give
   any performance improvements.

   - This means that the current bottleneck is not the computation itself.
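A minimal sketch of the blocking-wait pattern with libev, assuming the releasing side signals availability through an ev_async watcher; the watcher and function names are illustrative, not NBA's actual structures:

```cpp
#include <ev.h>

static struct ev_loop *loop;
static ev_async batch_avail_watcher;

// Runs inside the loop when the releasing side signals availability:
// break out of the blocking ev_run() so the waiter can retry allocation.
static void on_batch_available(struct ev_loop *l, ev_async *w, int revents) {
    ev_break(l, EVBREAK_ONE);
}

// Waiting side: instead of spinning with EVRUN_NOWAIT, block until signaled.
static void wait_for_batch_objects(void) {
    ev_run(loop, 0);             // blocks until ev_break() is called
}

// Releasing side: wake up the blocked ev_run().
static void release_batch_object(void) {
    // ... return the batch object to its pool ...
    ev_async_send(loop, &batch_avail_watcher);
}

// Setup (once, during initialization).
static void init_waiter(void) {
    loop = ev_loop_new(EVFLAG_AUTO);
    ev_async_init(&batch_avail_watcher, on_batch_available);
    ev_async_start(loop, &batch_avail_watcher);
}
```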
achimnol added a commit that referenced this issue Jan 13, 2016
 * In libev, check watchers are called when there ARE pending events to
   execute, while prepare watchers are called when ev_run() is about to
   BLOCK due to no pending events.
   The timing when we need to flush any pending tasks is the latter,
   not the former.

   In my tests, using check watchers occasionally gives performance
   fluctuations of more than 50% (about once per 5-10 seconds), while
   prepare watchers do not show such symptoms.

 * Removes no-longer-used semi-coalesced code in offloadtask.cc.
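A minimal sketch of registering such a flush hook as an ev_prepare watcher; the flush call inside the callback is a placeholder for the actual element-graph flush methods:

```cpp
#include <ev.h>

static ev_prepare flush_watcher;

// Runs right before ev_run() would block: flush any tasks that are still
// pending so they are not delayed until the next wake-up.
static void flush_pending_tasks(struct ev_loop *loop, ev_prepare *w, int revents) {
    // elemgraph->flush_delayed_batches();   // placeholder for the real flush calls
}

static void install_flush_hook(struct ev_loop *loop) {
    ev_prepare_init(&flush_watcher, flush_pending_tasks);
    ev_prepare_start(loop, &flush_watcher);
}
```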
achimnol added a commit that referenced this issue Jan 13, 2016
 * No significant performance changes.
achimnol added a commit that referenced this issue Jan 13, 2016
 * Some operations were not being executed properly when checking the
   TASK_PREPARED status in send_offload_task_to_device().

 * Still, it needs to be improved further.
achimnol added a commit that referenced this issue Jan 13, 2016
achimnol added a commit that referenced this issue Jan 13, 2016
 * Now the performance is affected by the computation.
achimnol added a commit that referenced this issue Jan 14, 2016
 * Uses DPDK's memzone instead of CUDA's portable buffers.

 * Keep it as an optional feature; the default is not to use it.
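A minimal sketch contrasting the two allocation paths mentioned above; the memzone name, size, and socket arguments are placeholders:

```cpp
#include <cstddef>
#include <rte_memzone.h>
#include <cuda_runtime.h>

// Option A (CUDA portable pinned buffer): host memory visible to all CUDA contexts.
void *alloc_portable_buffer(size_t size) {
    void *ptr = nullptr;
    cudaHostAlloc(&ptr, size, cudaHostAllocPortable);  // error checking omitted
    return ptr;
}

// Option B (DPDK memzone): reserve hugepage-backed memory on a NUMA socket,
// which can then be registered/pinned for the GPU as needed.
const struct rte_memzone *alloc_memzone_buffer(size_t size, int socket_id) {
    return rte_memzone_reserve("offload_io_buffer", size, socket_id, 0);
}
```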
achimnol added a commit that referenced this issue Jan 18, 2016
 * Removes "expect_false" on EVRUN_NOWAIT in libev's ev_run() code.
   It cuts down the CPU cycle usage by ev_run() to half!
   (Still, our performance bottleneck is not on the libev itself.
   When we observe high CPU percentage on libev, it means that the CPU
   is wasting its cycles.)

 * Limits NBA_MAX_IO_BASES from 17 to 1.
   This reduces the size of the memory area used by both the CPU and the GPU.
   It means that we now have bottlenecks in the memory/cache subsystems.

   - Adds a blocking ev_run() call to wait for io_bases to become
     available, using the same technique as for waiting on batch pools:
     ev_run() <=> ev_break() pairs.

   - This provides performance improvements.

 * Increases offset sizes from uint16_t to uint32_t for running
   IPsec with MTU-sized packets, where offsets may exceed 65535.
   This has been the main reason for frequent errors when running IPsec
   with large packets (>= 512 bytes).

   - This decreases performance.

 => The above performance improvement and degradation compensate for each
    other, so there is no performance change compared to the previous
    commit.

 * Reduces memory footprint by using a variable-sized array in datablock
   arguments.  However, this does not yield significant performance
   changes because we already have "full" aggregated batches when
   offloading IPsec encryption due to computation/memory bottlenecks.
achimnol added a commit that referenced this issue Jan 22, 2016
 * This restores former performance improvements!
   (Now ~3.4 Gbps per node with IPsec@64B, previously ~2.6 Gbps)
achimnol added a commit that referenced this issue Jan 22, 2016
 * This eases experiments that compare performance across different offset sizes.

 * Confirmed performance drop when dev_offset_t is ShiftedInt<uint32_t, 0>.
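A minimal sketch of what a ShiftedInt-style compact offset type might look like; this is an assumption about its intent (storing offsets divided by 2^Shift in a narrow integer), not the project's actual definition:

```cpp
#include <cstdint>

// Hypothetical compact offset: stores value >> Shift in a narrow integer,
// trading granularity for a smaller per-offset memory footprint.
template <typename StorageT, unsigned Shift>
struct ShiftedIntSketch {
    StorageT raw;

    ShiftedIntSketch(uint64_t value = 0) : raw(static_cast<StorageT>(value >> Shift)) {}
    operator uint64_t() const { return static_cast<uint64_t>(raw) << Shift; }
};

// e.g., ShiftedIntSketch<uint16_t, 2> can represent offsets up to 262140
// at 4-byte granularity, while ShiftedIntSketch<uint32_t, 0> stores exact
// offsets but doubles the per-offset storage size.
```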
achimnol added a commit that referenced this issue Feb 8, 2016
 * Tried to support OpenCL, but confirmed that we need OpenCL 2.0+.
   (which is not supported by the current generation of Xeon Phi...)

   - Related code will be rewritten someday using SVM (shared virtual
     memory) APIs in OpenCL 2.0+.

 * Reduced memory footprint of batch_ids array passed to the device.

 * Rolled back the ev_prepare -> ev_check watcher type change (c732a25),
   as it broke CPU-only cross-node IP forwarding scenarios. :(

 * TODO: fix IPsec GPU-only mode...
achimnol added a commit that referenced this issue Feb 11, 2016
 * Enforce the same alignment of data structures shared by the host CPU and
   CUDA GPUs using the C++11 "alignas" keyword (see the sketch after this
   commit message).

 * Fix wrong uses of pkt_idx where they should be item_idx.
   (Note that IPsec parallelizes by the unit of "blocks", which are
    16-byte slices of packets.)

 * Remove some unnecessary branches in IPsecAES kernels.

 * Let the CUDA engine ignore "cudaErrorCudartUnloading", which
   may be returned from API calls during program termination.

 * Now the performance is half of the CPU version with 64-B packets.
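A minimal sketch of the alignas pattern referenced above; the struct and field names are illustrative, not the actual shared datablock argument layout:

```cpp
#include <cstdint>

// Host and device compilers may pick different natural alignments for the
// same struct; forcing one explicitly keeps the layouts byte-compatible.
struct alignas(8) datablock_kernel_arg_sketch {
    uint32_t total_item_count;
    uint32_t item_size;
    void    *buffer_base;
};

static_assert(alignof(datablock_kernel_arg_sketch) == 8,
              "host-side alignment must match the device-side definition");
```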
achimnol added a commit that referenced this issue Feb 24, 2016
 * It means that the CPU version and the GPU version yield the same
   results for the same inputs.
achimnol added a commit that referenced this issue Mar 5, 2016
 * Replaces them with the new accumulated index calculator.