Improve GPU offloading mechanisms #6

The current GPU offloading implementation has several limitations:
achimnol changed the title from "Improving GPU offloading mechanisms" to "Improve GPU offloading mechanisms" on May 7, 2015
achimnol added a commit that referenced this issue on May 13, 2015
achimnol added a commit that referenced this issue on May 21, 2015
* This prevents potential bugs due to mismatches between actual system parameter values (given by the configs) and their limits (used by data structures). I had a problem with the COPROC_PPDEPTH value: it was set to 64 while its limit was 32, and thus offloading never happened!
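A minimal sketch of the kind of startup guard this commit suggests; the constant and function names here are illustrative, not NBA's actual identifiers:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical compile-time limit baked into the data structures.
static constexpr unsigned MAX_COPROC_PPDEPTH = 32;

// Validate the runtime-configured value against the compile-time limit
// at startup, instead of silently disabling offloading later.
static unsigned load_coproc_ppdepth(unsigned configured)
{
    if (configured > MAX_COPROC_PPDEPTH) {
        std::fprintf(stderr, "COPROC_PPDEPTH=%u exceeds limit %u; aborting.\n",
                     configured, MAX_COPROC_PPDEPTH);
        std::abort();
    }
    return configured;
}

int main()
{
    unsigned depth = load_coproc_ppdepth(64);  // would abort here: 64 > 32
    (void)depth;
    return 0;
}
```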
achimnol added a commit that referenced this issue on May 26, 2015
achimnol added a commit that referenced this issue on May 26, 2015
* When using heterogeneous GPUs, there are multiple occurrences of CUDA kernels for the different PTX assembly versions. I don't know exactly how nvcc treats "static const" variables when generating multiple cubin binaries, but we can simply choose NOT to depend on such behaviour for device-side datablock indices.
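A hedged sketch of the workaround described: pass the datablock index as a kernel argument, so nothing depends on how nvcc materializes static const variables in each per-architecture cubin. All names here are illustrative:

```cuda
#include <cstdio>

// Instead of relying on a device-side "static const" index that may be
// materialized differently in each per-architecture cubin, pass the
// datablock index explicitly as a kernel argument.
__global__ void example_kernel(const int *datablock_lens, int dbid, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == 0)
        *out = datablock_lens[dbid];  // index comes uniformly from the host
}

int main()
{
    int h_len[4] = {10, 20, 30, 40};
    int *d_len, *d_out, h_out = 0;
    cudaMalloc(&d_len, sizeof(h_len));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_len, h_len, sizeof(h_len), cudaMemcpyHostToDevice);
    example_kernel<<<1, 32>>>(d_len, /*dbid=*/2, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("len = %d\n", h_out);  // prints 30
    cudaFree(d_len);
    cudaFree(d_out);
    return 0;
}
```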
achimnol added a commit that referenced this issue on May 27, 2015
* Yes, now CUDA (>= 6.5) supports C++11 and we can get rid of the macro value mismatch!
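Presumably the fix replaces duplicated macros with a single C++11 constexpr visible to both host and device code; a minimal sketch with an illustrative constant name:

```cuda
#include <cstdio>

// With C++11 in CUDA >= 6.5, one constexpr seen by both host (.cc) and
// device (.cu) code replaces duplicated #define values that could
// silently drift apart between translation units.
constexpr unsigned kCoprocPpdepth = 32;  // illustrative name
static_assert(kCoprocPpdepth <= 32, "ppdepth exceeds data-structure limit");

__global__ void read_depth(unsigned *out)
{
    if (threadIdx.x == 0)
        *out = kCoprocPpdepth;  // same compile-time value as the host sees
}

int main()
{
    unsigned *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(unsigned));
    read_depth<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(unsigned), cudaMemcpyDeviceToHost);
    std::printf("device sees %u, host sees %u\n", h_out, kCoprocPpdepth);
    cudaFree(d_out);
    return 0;
}
```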
achimnol added a commit that referenced this issue on May 27, 2015
achimnol added a commit that referenced this issue on May 28, 2015
* We need to return the pointer "before" adding the newly allocated size (in lib/mempool.hh); see the sketch below.
* Now the GPU does not crash, but it still hangs after processing the first offload task.
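The first bullet describes the classic bump-allocator mistake. A minimal sketch of the corrected pattern, as an illustration rather than NBA's lib/mempool.hh verbatim:

```cpp
#include <cstddef>
#include <cassert>

// Minimal fixed-area bump allocator illustrating the fix: capture the
// current offset as the result *before* advancing it by the request size.
class MemoryPool {
public:
    MemoryPool(char *base, size_t len) : base_(base), len_(len), cur_(0) {}

    void *alloc(size_t size) {
        if (cur_ + size > len_) return nullptr;  // out of space
        void *p = base_ + cur_;  // return value taken BEFORE the bump
        cur_ += size;            // then advance the cursor
        return p;
    }

private:
    char *base_;
    size_t len_;
    size_t cur_;
};

int main() {
    char arena[1024];
    MemoryPool pool(arena, sizeof(arena));
    void *a = pool.alloc(64);
    void *b = pool.alloc(64);
    assert(a == arena && b == arena + 64);  // distinct, ordered chunks
    return 0;
}
```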
achimnol added a commit that referenced this issue on Sep 13, 2015
…ation branch.
* Backported changes
  - Change NBA_MAX_PROCESSOR_TYPES to NBA_MAX_COPROCESSOR_TYPES to count coprocessors from zero instead of one.
  - Add an xmem argument to FixedRing to allow use of an externally allocated memory area for the ring (see the sketch after this list).
  - Add graph analysis for datablock reuse. It shows when to preproc/postproc datablocks during element graph initialization. (Not working yet...) Also add check_preproc(), check_postproc(), and check_postproc_all() methods to ElementGraph for later use.
  - Refactor and optimize the scanning of schedulable elements.
  - Refactor OffloadableElement to make it schedulable for consistency. This moves the task preparation code from ElementGraph into OffloadableElement.
  - Remove support for the "sleepy" IO loop.
* Excluded changes
  - Change the IO loop to not consume all received packets, but instead call comp_process_batch() only once per iteration, using the number of packets exceeding the computation batch size to reduce IO polling overheads. => Rejected since it actually reduces performance by about 10% with CPU-only configurations.
* New changes
  - Move the invocations of the elemgraph->flush_* methods into the ev_check event handler for brevity and a reduced possibility of mistakes.
* Performance impacts
  - There is no degradation of CPU-only and GPU-only performance compared to the previous commit.
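A hedged sketch of what an xmem-style constructor can look like: the ring either owns its storage or wraps a caller-provided area (e.g., carved from a NUMA-local pool). This illustrates the idea and is not NBA's FixedRing source:

```cpp
#include <cstddef>

// Fixed-capacity ring that can use caller-provided storage ("xmem") so
// the buffer can live in an externally managed (e.g., NUMA-pinned) area.
template <typename T>
class FixedRing {
public:
    // xmem == nullptr -> allocate internally; otherwise wrap xmem,
    // which must hold at least max_items elements of T.
    explicit FixedRing(size_t max_items, T *xmem = nullptr)
        : cap_(max_items), head_(0), count_(0),
          owns_(xmem == nullptr),
          buf_(owns_ ? new T[max_items] : xmem) {}

    ~FixedRing() { if (owns_) delete[] buf_; }

    bool push_back(const T &v) {
        if (count_ == cap_) return false;
        buf_[(head_ + count_) % cap_] = v;
        ++count_;
        return true;
    }

    bool pop_front(T &out) {
        if (count_ == 0) return false;
        out = buf_[head_];
        head_ = (head_ + 1) % cap_;
        --count_;
        return true;
    }

private:
    size_t cap_, head_, count_;
    bool owns_;
    T *buf_;
};
```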
achimnol added a commit that referenced this issue on Sep 13, 2015
achimnol added a commit that referenced this issue on Dec 24, 2015
* Merry Christmas!
* Adds the "io_base" concept to pipeline multiple offload tasks in each worker thread and to allow reuse of datablocks in subsequent offloadable elements.
  - Differently from the historical initial implementation, we now reuse offload task objects WITHOUT re-aggregation of batches between subsequent offloadable elements.
  - For this, elementgraph->tasks now holds both PacketBatch and OffloadTask objects, using a bitmask type specifier on the void* pointers in the ring (see the tagged-pointer sketch after this list). Depending on the task type, ElementGraph chooses whether to run the normal pipeline loop or to feed offloadable elements so that they begin the offloading process immediately.
* Preserves the generalization of batching schemes. (Yes, it was a huge manual merge job...)
* TODO: GPU versions do not work yet (as always expected). Let's debug!
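A minimal sketch of the tagged-pointer technique described above: one low bit of an aligned void* distinguishes the two object kinds stored in the same ring. The type names mirror the message, but the encoding details are assumptions:

```cpp
#include <cstdint>
#include <cassert>

struct PacketBatch { void *pkts; };    // stand-ins; both are >= 8-byte
struct OffloadTask { void *dev_ctx; }; // aligned, so low bits are free

// Bit 0 of each void* in the ring encodes the object kind.
constexpr uintptr_t kTypeMask = 0x1;   // 0 = PacketBatch, 1 = OffloadTask

inline void *tag(OffloadTask *t) {
    return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(t) | kTypeMask);
}
inline void *tag(PacketBatch *b) {
    return b;                           // tag bit stays 0
}
inline bool is_task(void *p) {
    return (reinterpret_cast<uintptr_t>(p) & kTypeMask) != 0;
}
inline OffloadTask *untag_task(void *p) {
    return reinterpret_cast<OffloadTask *>(
        reinterpret_cast<uintptr_t>(p) & ~kTypeMask);
}

int main() {
    OffloadTask t{};
    void *slot = tag(&t);               // what would be pushed into the ring
    assert(is_task(slot) && untag_task(slot) == &t);
    return 0;
}
```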
achimnol added a commit that referenced this issue on Dec 27, 2015
achimnol added a commit that referenced this issue on Dec 28, 2015
* Minor optimization using unlikely() based on perf monitoring results (about ~3%; see the sketch below).
* IPv4/IPv6 forwarding works as well as before, but IPsec still hangs.
* Removes dead code.
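For reference, the usual definition of such a hint on top of GCC/Clang's __builtin_expect; this is the common idiom, not necessarily NBA's exact macro:

```cpp
#include <cstdio>

// Branch-prediction hints: tell the compiler which way a branch almost
// always goes so it can lay out the hot path contiguously.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int err)
{
    if (unlikely(err != 0)) {      // rare error path moved out of line
        std::fprintf(stderr, "error %d\n", err);
        return -1;
    }
    return 0;                      // hot path falls through
}

int main() { return process(0); }
```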
achimnol added a commit that referenced this issue on Jan 10, 2016
* IPsec works well with the task/datablock reuse optimization.
  - However, this refactored architecture has high libev/syscall overheads (> 10% of worker threads' CPU cycles).
  - TODO: optimize it...
* When reusing tasks (see the sketch below),
  - we should keep the task itself (do not free it!), and
  - we should update task->elem as well as task->tracker.element.
* There was a serious bug: reused GPU input buffers used for outputs (when roi/wri is WHOLE_PACKET) were not actually included in the device-to-host copies, resulting in NO take-back of computation results.
  - Currently we allocate an output buffer explicitly, without the buffer reuse optimization.
  - TODO: reuse the input buffer and include its offsets/lengths in the coalescing of d2h copies.
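A hedged sketch of the reuse rule stated above: recycling a task means rebinding both element references instead of freeing the object. The field names follow the message; the surrounding types are simplified stand-ins:

```cpp
// Illustration only: the field names follow the commit message, but the
// surrounding types are simplified stand-ins for NBA's real classes.
struct Element;

struct BatchTracker {
    Element *element;   // element currently responsible for the batch
};

struct OffloadTask {
    Element *elem;      // offloadable element whose kernel this task runs
    BatchTracker tracker;
};

// Reuse a finished task for the next offloadable element in the path:
// keep the object alive (do NOT free it) and rebind both references.
inline void rebind_task(OffloadTask *task, Element *next_elem)
{
    task->elem = next_elem;
    task->tracker.element = next_elem;  // keep the tracker consistent too
}
```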
achimnol added a commit that referenced this issue on Jan 12, 2016
* Uses the latest version of libev (4.22 instead of 4.15).
  - To use the default settings, install it under /usr/local. (Run ./configure --prefix=/usr/local && make && sudo make install)
  - This does not show significant performance changes.
* Uses blocking calls to ev_run() when waiting for new batch objects to become available, and invokes ev_break() when we release batch objects (see the sketch below).
  - This reduces excessive ev_run() polling overheads and improves performance by 30%. (Still, the absolute number is too low...)
* Uses already-dynamic-casted references in scan_offloadable_elements().
  - This reduces the CPU cycles used by scan_offloadable_elements(), but still yields no performance gains.
* Tried no-opping GPU kernel execution, but it does not give any performance improvements.
  - This means that the current bottleneck is not the computation itself.
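A sketch of the blocking-wait pattern from the second bullet, using libev's ev_run()/ev_break() with an ev_async wake-up; this is an illustration under assumed names, not NBA's exact code:

```cpp
#include <ev.h>

// A consumer blocks inside ev_run() (no busy polling) until a producer
// signals that a batch object has been returned to the pool.
static ev_async batch_returned;

static void on_batch_returned(struct ev_loop *loop, ev_async *w, int revents)
{
    // A batch was released back to the pool; stop blocking so the
    // caller of wait_for_batch() can retry its allocation.
    ev_break(loop, EVBREAK_ALL);
}

void wait_for_batch(struct ev_loop *loop)
{
    ev_run(loop, 0);  // blocks until on_batch_returned() fires
}

void release_batch(struct ev_loop *loop /*, PacketBatch *batch */)
{
    // ... return the batch to its pool, then wake any blocked waiter.
    ev_async_send(loop, &batch_returned);
}

int main()
{
    struct ev_loop *loop = EV_DEFAULT;
    ev_async_init(&batch_returned, on_batch_returned);
    ev_async_start(loop, &batch_returned);
    // worker threads would call wait_for_batch()/release_batch() here
    return 0;
}
```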
achimnol added a commit that referenced this issue on Jan 13, 2016
* In libev, check watchers are called when there ARE pending events to execute, while prepare watchers are called when ev_run() is about to BLOCK due to no pending events. The moment we need to flush any pending tasks is the latter, not the former (see the sketch below). In my tests, using check watchers occasionally gives performance fluctuations of more than 50% (about once per 5-10 seconds), while prepare watchers show no such symptoms.
* Removes no-longer-used semi-coalesced code in offloadtask.cc.
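A minimal illustration of the watcher choice: an ev_prepare watcher fires just before ev_run() blocks, which is the right moment to flush pending offload tasks. The flush call itself is a project-specific placeholder:

```cpp
#include <ev.h>

// Runs just before ev_run() would block -- i.e., when no events are
// pending -- which is exactly when still-queued offload tasks should
// be flushed to the device.
static void flush_pending_tasks(struct ev_loop *loop, ev_prepare *w, int revents)
{
    // elemgraph->flush_offloaded_tasks();  // placeholder for the real call
}

int main()
{
    struct ev_loop *loop = EV_DEFAULT;
    ev_prepare flusher;
    ev_prepare_init(&flusher, flush_pending_tasks);
    ev_prepare_start(loop, &flusher);
    ev_run(loop, 0);
    return 0;
}
```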
achimnol added a commit that referenced this issue on Jan 13, 2016
achimnol added a commit that referenced this issue on Jan 13, 2016
achimnol added a commit that referenced this issue on Jan 13, 2016
* Some operations were not being executed properly when checking the TASK_PREPARED status in send_offload_task_to_device().
* Still, it needs further improvement...
achimnol added a commit that referenced this issue on Jan 13, 2016
achimnol added a commit that referenced this issue on Jan 13, 2016
* Now the performance is affected by the computation itself.
achimnol added a commit that referenced this issue on Jan 14, 2016
* Uses DPDK's memzone instead of CUDA's portable buffers (see the sketch below).
* Kept as an optional feature; the default is not to use it.
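A hedged sketch contrasting the two allocation paths, assuming DPDK's EAL is already initialized; the zone name and the helper are illustrative:

```cpp
#include <rte_eal.h>
#include <rte_memzone.h>
#include <cuda_runtime.h>

// Illustrative helper: choose between a DPDK memzone and a CUDA
// portable (page-locked) host buffer for the same I/O memory area.
void *alloc_io_buffer(size_t len, int socket_id, bool use_memzone)
{
    if (use_memzone) {
        // DPDK path: hugepage-backed, NUMA-local named zone.
        const struct rte_memzone *mz =
            rte_memzone_reserve("io_base_pool", len, socket_id, 0);
        return mz ? mz->addr : nullptr;
    } else {
        // CUDA path: page-locked host memory visible to all contexts.
        void *ptr = nullptr;
        if (cudaHostAlloc(&ptr, len, cudaHostAllocPortable) != cudaSuccess)
            return nullptr;
        return ptr;
    }
}
```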
achimnol added a commit that referenced this issue on Jan 18, 2016
* Removes "expect_false" on EVRUN_NOWAIT in libev's ev_run() code. It cuts down the CPU cycle usage by ev_run() to half! (Still, our performance bottleneck is not on the libev itself. When we observe high CPU percentage on libev, it means that the CPU is wasting its cycles.) * Limits NBA_MAX_IO_BASES from 17 to 1. This reduces the size of memory area used by both CPU and GPU. It meas that we now have bottlenecks in memory/cache subsystems. - Adds a blocking ev_run() call to wait io_bases to become available, using the same technique for waiting batch pools: ev_run() <=> ev_break() pairs - This provides performance improvements. * Increases offset sizes from uint16_t to uint32_t for when running IPsec with MTU-sized packets, where offsets may exceed 65535. This has been the main reason of frequent errors when running IPsec with large-size packets (>= 512 bytes). - This decreases performance. => Above two performance improvements/degradation compensate each other. So there is no performance change compared to the previous commit. * Reduces memory footprint by using variable-sized array in datablock arguments. However, this does not yield significant performance changes because we already have "full" aggregated batches when offloading IPsec encryption due to computation/memory bottlenecks.
achimnol added a commit that referenced this issue on Jan 22, 2016
achimnol added a commit that referenced this issue on Jan 22, 2016
* This eases experiments comparing performance across different offset sizes.
* Confirmed a performance drop when dev_offset_t is ShiftedInt<uint32_t, 0> (see the sketch below).
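A plausible reconstruction of such a ShiftedInt type, storing offset >> Shift so that aligned offsets fit a narrower integer; this is an assumption-laden sketch, not NBA's actual shiftedint implementation:

```cpp
#include <cstdint>
#include <cassert>

// Stores offset >> Shift so that, e.g., 8-byte-aligned offsets fit into
// a narrower integer. Shift = 0 stores the raw value, so
// ShiftedInt<uint32_t, 0> doubles the per-offset footprint vs uint16_t.
template <typename UInt, unsigned Shift>
class ShiftedInt {
public:
    ShiftedInt(uint64_t v = 0) : stored_(static_cast<UInt>(v >> Shift)) {
        assert((v & ((1u << Shift) - 1)) == 0);  // value must be aligned
    }
    operator uint64_t() const {
        return static_cast<uint64_t>(stored_) << Shift;
    }
private:
    UInt stored_;
};

int main() {
    ShiftedInt<uint16_t, 3> off(96000);  // 96000 = 12000 << 3, fits 16 bits
    assert(static_cast<uint64_t>(off) == 96000);
    return 0;
}
```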
achimnol added a commit that referenced this issue on Feb 8, 2016
* Tried to support OpenCL, but confirmed that we need OpenCL 2.0+ (which is not supported by the current generation of Xeon Phi...).
  - The related code will be rewritten someday using the SVM (shared virtual memory) APIs in OpenCL 2.0+ (see the sketch below).
* Reduced the memory footprint of the batch_ids array passed to the device.
* Rolled back the ev_prepare -> ev_check watcher type change (c732a25), as it had broken CPU-only cross-node IP forwarding scenarios. :(
* TODO: fix IPsec GPU-only mode...
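For context, a sketch of the OpenCL 2.0 SVM calls alluded to; it assumes an existing context and kernel, and is not code from this repository:

```cpp
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>

// Coarse-grained SVM lets the host and device share one pointer,
// similar in spirit to CUDA's host-mapped buffers.
void *alloc_shared(cl_context ctx, size_t len)
{
    // SVM buffer readable and writable by both host and device.
    return clSVMAlloc(ctx, CL_MEM_READ_WRITE, len, /*alignment=*/0);
}

cl_int bind_shared(cl_kernel kernel, cl_uint arg_idx, void *svm_ptr)
{
    // SVM pointers are bound with a dedicated API, not clSetKernelArg().
    return clSetKernelArgSVMPointer(kernel, arg_idx, svm_ptr);
}
```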
achimnol added a commit that referenced this issue on Feb 11, 2016
* Enforce the same alignment of data structures shared by the host CPU and CUDA GPUs using the "alignas" C++11 keyword (see the sketch below).
* Fix wrong uses of pkt_idx where they should be item_idx. (Note that IPsec parallelizes by the unit of "blocks", which are 16-byte slices of packets.)
* Remove some unnecessary branches in the IPsecAES kernels.
* Let the CUDA engine ignore "cudaErrorCudartUnloading", which may be returned from API calls during program termination.
* Now the performance is half of the CPU version with 64-B packets.
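A hedged sketch of the first and fourth bullets: alignas pins a shared struct's layout, and an error-check wrapper treats cudaErrorCudartUnloading as benign during teardown. The struct and its fields are illustrative:

```cuda
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Structures copied between host and device must have identical layout
// on both sides; alignas pins the alignment explicitly instead of
// relying on each compiler's defaults.
struct alignas(8) datablock_arg_example {
    uint32_t total_item_count;
    uint16_t item_size;
    // ...
};

// Error-check wrapper that ignores cudaErrorCudartUnloading: during
// process teardown the runtime may already be unloading itself.
static void cuda_check(cudaError_t rc, const char *what)
{
    if (rc != cudaSuccess && rc != cudaErrorCudartUnloading) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(rc));
        std::abort();
    }
}

int main()
{
    void *p = nullptr;
    cuda_check(cudaMalloc(&p, sizeof(datablock_arg_example)), "cudaMalloc");
    cuda_check(cudaFree(p), "cudaFree");
    return 0;
}
```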
achimnol added a commit that referenced this issue on Feb 24, 2016
achimnol added a commit that referenced this issue on Feb 24, 2016
achimnol added a commit that referenced this issue on Feb 24, 2016
achimnol added a commit that referenced this issue on Mar 5, 2016
* Replaces them with the new accumulated index calculator.
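The excerpt does not say what "them" refers to, but an accumulated index calculator here plausibly means a prefix sum over per-batch item counts that maps a flat item index back to (batch, local index). A speculative sketch under that assumption:

```cpp
#include <cstddef>
#include <cassert>
#include <vector>

// Speculative sketch: accumulate per-batch item counts into a prefix
// sum so that a flat (global) item index can be mapped back to its
// batch and the index within that batch.
struct AccumIdx {
    std::vector<size_t> acc;  // acc[i] = total items in batches [0, i)

    explicit AccumIdx(const std::vector<size_t> &counts)
        : acc(counts.size() + 1, 0)
    {
        for (size_t i = 0; i < counts.size(); ++i)
            acc[i + 1] = acc[i] + counts[i];
    }

    // Maps a global item index to (batch index, local item index).
    void locate(size_t global, size_t &batch, size_t &local) const {
        batch = 0;
        while (acc[batch + 1] <= global) ++batch;  // few batches: linear scan
        local = global - acc[batch];
    }
};

int main() {
    AccumIdx idx({4, 2, 5});  // three batches with 4, 2, and 5 items
    size_t b, l;
    idx.locate(5, b, l);      // global item 5 -> batch 1, local item 1
    assert(b == 1 && l == 1);
    return 0;
}
```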
achimnol added a commit that referenced this issue on Mar 7, 2016