Optimized "blitter" routine written in assembler [wip] #719
Conversation
This is a simple placeholder implementation for an optimized bit-blitting API.
Update the vhost-user code to perform Snabb<->VM memory copies via the lib.blit module. This allows experimental optimizations with local changes to the blit module and essentially separates "virtio vring processing" and "virtio memory copies" into two problems that can be profiled and optimized independently. This is work in progress: care must be taken not to let the guest see that packets are available until the blit.barrier() operation has executed, and I think this will require moving the ring index updates.
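A rough sketch of the intended calling pattern, based on the description above (blit.barrier() is mentioned there; blit.copy, guest_buffer and publish_avail_index are names I am assuming purely for illustration, not taken from the patch):

local blit = require("lib.blit")

-- Hypothetical transmit path: queue all copies, then flush them with
-- blit.barrier() before the guest is allowed to see the new descriptors.
local function transmit_to_guest (vring, packets)
   for _, p in ipairs(packets) do
      -- blit.copy is an assumed name: queue a guest-memory copy that may
      -- be deferred until the next barrier.
      blit.copy(guest_buffer(vring, p), p.data, p.length)
   end
   -- Ensure every queued copy has actually been performed before the
   -- ring index update makes the packets visible to the guest.
   blit.barrier()
   publish_avail_index(vring)  -- assumed helper for the ring index update
end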
Work in progress / not complete:
- Rounds all copies up to 32 bytes.
- Fails the NFV benchmark test.
The lib.blit API is now implemented by an assembler routine that batches copies together. This is a work in progress due to one major restriction: copy length has to be a multiple of 32 bytes.
| xor rax, rax
|->copy:
| vmovdqu ymm0, [rsi+rax]
| vmovdqu [rdi+rax], ymm0
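For context, a hedged sketch of how this inner loop might be completed with its loop control, assuming the destination, source and 32-byte-rounded length arrive in rdi, rsi and rdx (the register assignments are my assumption, not taken from the patch):

| xor rax, rax              // rax = running byte offset
|->copy:
| vmovdqu ymm0, [rsi+rax]   // load 32 bytes from the source
| vmovdqu [rdi+rax], ymm0   // store them to the destination
| add rax, 32               // advance one 32-byte chunk
| cmp rax, rdx              // rdx = length, rounded up to 32 bytes
| jb ->copy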
You might want to experiment with unrolling this manually. I got some significant speedups by having more loads in flight.
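For reference, a straightforward 2x unrolling of the body, keeping the original load/store interleaving, might look like the sketch below (assuming the same rdi/rsi/rdx layout as above and a length that is a multiple of 64 bytes); the reordering suggested further down goes a step further by grouping the loads:

|// hypothetical 2x unrolling, same load/store order as the original
|->copy:
| vmovdqu ymm0, [rsi+rax]
| vmovdqu [rdi+rax], ymm0
| vmovdqu ymm1, [rsi+rax+32]
| vmovdqu [rdi+rax+32], ymm1
| add rax, 64
| cmp rax, rdx
| jb ->copy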
It is quite delicate :-). I started out with an unrolled version of the inner loop and then found that the looping version delivered the same performance. There have been other very innocent-looking code variations that were much slower, though. I want to use the PMU to explore these differences.
I would like to try unrolling the outer loop, though, to see if copying several packets in parallel could help.
Yes, delicate indeed :) One thing to try, instead of doing load, store, load, store, is to do load, load, store, store. That was what worked best for me. Good luck :)
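Applied to the unrolled sketch above, that reordering would look roughly like this (same assumed register layout; both loads are issued before either store):

|// hypothetical reordered body: load, load, store, store
|->copy:
| vmovdqu ymm0, [rsi+rax]
| vmovdqu ymm1, [rsi+rax+32]
| vmovdqu [rdi+rax], ymm0
| vmovdqu [rdi+rax+32], ymm1
| add rax, 64
| cmp rax, rdx
| jb ->copy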
Looks nice. See also Igalia#204, specifically https://github.com/Igalia/snabbswitch/blob/wip-multi-copy/src/apps/lwaftr/multi_copy.dasl.
This is an experiment towards doubling the performance of virtio-net copies (#710) with an optimized blitter routine (#711) written in AVX assembler (compatible with Sandy Bridge onwards).
Caveats and notes:
To sanity-check the performance I tried three scenarios: the master branch (baseline), a hack that skips data copies entirely (maximum possible speedup), and the actual asm blitter code.
Testing with 128-byte packets on lugano-1, I see these results:

I interpret this to mean that the copy performance is more than doubled, i.e. with this optimization we are achieving more than half of the maximum possible speedup.
The challenges I see now are:
Likely we also need to update the DPDK-VM benchmark to test with a more interesting variation of packet sizes so that we don't accidentally optimize for the "packet size is always a power of 2" special case that we would never see in real life :-).