Optimized "blitter" routine written in assembler [wip] #719
Conversation
This is a simple placeholder implementation for an optimized bit-blitting API.
Update the vhost-user code to perform Snabb<->VM memory copies via the lib.blit module. This allows experimental optimizations with local changes to the blit module and essentially separates "virtio vring processing" and "virtio memory copies" into two problems that can be profiled and optimized independently. This is work in progress: care must be taken not to let the guest see that packets are available until the blit.barrier() operation has executed, and I think this will require moving the ring index updates.
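A rough sketch of the intended calling pattern, based on the description above (blit.barrier() is mentioned there; blit.copy, guest_buffer and publish_avail_index are names I am assuming purely for illustration, not taken from the patch):

local blit = require("lib.blit")

-- Hypothetical transmit path: queue all copies, then flush them with
-- blit.barrier() before the guest is allowed to see the new descriptors.
local function transmit_to_guest (vring, packets)
   for _, p in ipairs(packets) do
      -- blit.copy is an assumed name: queue a guest-memory copy that may
      -- be deferred until the next barrier.
      blit.copy(guest_buffer(vring, p), p.data, p.length)
   end
   -- Ensure every queued copy has actually been performed before the
   -- ring index update makes the packets visible to the guest.
   blit.barrier()
   publish_avail_index(vring)  -- assumed helper for the ring index update
end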
Work in progress / not complete:
- Rounds all copies up to 32 bytes.
- Fails the NFV benchmark test.
The lib.blit API is now implemented by an assembler routine that batches copies together. This is a work in progress due to one major restriction: copy length has to be a multiple of 32 bytes.
| xor rax, rax
|->copy:
| vmovdqu ymm0, [rsi+rax]
| vmovdqu [rdi+rax], ymm0
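For context, a hedged sketch of how this inner loop might be completed with its loop control, assuming the destination, source and 32-byte-rounded length arrive in rdi, rsi and rdx (the register assignments are my assumption, not taken from the patch):

| xor rax, rax              // rax = running byte offset
|->copy:
| vmovdqu ymm0, [rsi+rax]   // load 32 bytes from the source
| vmovdqu [rdi+rax], ymm0   // store them to the destination
| add rax, 32               // advance one 32-byte chunk
| cmp rax, rdx              // rdx = length, rounded up to 32 bytes
| jb ->copy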
You might want to experiment with unrolling this manually. I got some significant speedups by having more loads in flight.
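For reference, a straightforward 2x unrolling of the body, keeping the original load/store interleaving, might look like the sketch below (assuming the same rdi/rsi/rdx layout as above and a length that is a multiple of 64 bytes); the reordering suggested further down goes a step further by grouping the loads:

|// hypothetical 2x unrolling, same load/store order as the original
|->copy:
| vmovdqu ymm0, [rsi+rax]
| vmovdqu [rdi+rax], ymm0
| vmovdqu ymm1, [rsi+rax+32]
| vmovdqu [rdi+rax+32], ymm1
| add rax, 64
| cmp rax, rdx
| jb ->copy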
It is quite delicate :-). I started out with an unrolled version of the inner loop and then found that the looping version delivered the same performance. There have been other very innocent-looking code variations that were much slower, though. I want to use the PMU to explore these differences.
I would like to try unrolling the outer loop, though, to see if copying several packets in parallel could help.
Yes, delicate indeed :) One thing to try, instead of doing load, store, load, store, is to do load, load, store, store. That was what worked best for me. Good luck :)
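Applied to the unrolled sketch above, that reordering would look roughly like this (same assumed register layout; both loads are issued before either store):

|// hypothetical reordered body: load, load, store, store
|->copy:
| vmovdqu ymm0, [rsi+rax]
| vmovdqu ymm1, [rsi+rax+32]
| vmovdqu [rdi+rax], ymm0
| vmovdqu [rdi+rax+32], ymm1
| add rax, 64
| cmp rax, rdx
| jb ->copy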
Looks nice. See also Igalia#204, specifically https://github.com/Igalia/snabbswitch/blob/wip-multi-copy/src/apps/lwaftr/multi_copy.dasl.
This is an experiment towards doubling the performance of virtio-net copies (#710) with an optimized blitter routine (#711) written in AVX assembler (compatible with Sandy Bridge onwards).
Caveats and notes:
To sanity-check the performance I tried three scenarios: the master branch (baseline), a hack that skips data copies entirely (maximum possible speedup), and the actual asm blitter code.
Testing with 128-byte packets on lugano-1, I see these results:

I interpret this to mean that the copy performance is more than doubled, i.e. with this optimization we are achieving more than half of the maximum possible speedup.
The challenges I see now are:
Likely we also need to update the DPDK-VM benchmark to test with a more interesting variation of packet sizes so that we don't accidentally optimize for the "packet size is always a power of 2" special case that we would never see in real life :-).