[RFC] iostream: extended read_exactly2 interface with alignment #5
base: ceph-octopus-19.06.0-45-g7744693c
Update: the writer already does padding on the wire in v2, so there is no need for the reader to do the same thing.
include/seastar/core/iostream.hh
Outdated
// can work with user provided buffer pointer with less copy.
return get().then([buf, size] (auto read_buf) mutable {
auto len_needs_copy = std::min(read_buf.size(), size);
std::copy(read_buf.get_write(), read_buf.get_write() + len_needs_copy, buf);
This is contrary to DPDK, where the network stack is responsible for providing the memory. Delegating this responsibility to the user directly translates into an obligatory memcpy, even regardless of the contiguity imposed by returning a temporary_buffer instance.
@@ -214,6 +230,8 @@ public:
input_stream(input_stream&&) = default;
input_stream& operator=(input_stream&&) = default;
future<temporary_buffer<CharType>> read_exactly(size_t n);
static constexpr uint16_t DEFAULT_ALIGNMENT = alignof(void*);
future<temporary_buffer<CharType>> read_exactly2(size_t n, uint16_t alignment = DEFAULT_ALIGNMENT);
I'm afraid that using read_exactly2() to read big chunks will impose memcpy for DPDK due to the contiguity requirement. The new method returns temporary_buffer, which means only one data pointer and one data size.
If we expect fragmented payloads from DPDK, we should expect a lot of memcpy from read_exactly2().
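To illustrate the concern, here is a minimal standalone sketch (not Seastar code) of what returning a single pointer and size implies when the payload arrives in several fragments. The `fragment` struct and `coalesce` function are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one received fragment (for example, one segment
// handed over by the network stack). Not a Seastar or DPDK type.
struct fragment {
    const char* data;
    std::size_t size;
};

// Coalesce all fragments into one contiguous buffer, which is what a return
// type with a single (pointer, size) pair forces. Returns the bytes copied.
std::size_t coalesce(const std::vector<fragment>& frags, std::vector<char>& out) {
    std::size_t total = 0;
    for (const auto& f : frags) {
        total += f.size;
    }
    out.resize(total);
    std::size_t offset = 0;
    for (const auto& f : frags) {
        if (f.size == 0) {
            continue;  // nothing to copy for an empty fragment
        }
        // Every payload byte is copied once to make the result contiguous.
        std::memcpy(out.data() + offset, f.data, f.size);
        offset += f.size;
    }
    assert(offset == total);
    return offset;
}
```

A fragmented return type (a list of such fragments) would avoid this copy, at the cost of requiring the consumer to handle non-contiguous data.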
No, if crimson-OSD supports fragmented payloads (such as SPDK), it should explicitly instruct the messenger to use ceph::net::Socket::read(size) instead of ceph::net::Socket::read_exactly(size, alignment), because ceph::net::Socket::read(size) will return an internally fragmented bufferlist as expected. IMO it would be better renamed to ceph::net::Socket::read_fragmented(size).
Also, the current ceph::net::Socket::read(size) is already optimal for both the DPDK stack and the POSIX stack if the OSD side supports fragmented DATA payloads:
- for DPDK: it's zero copy.
- for POSIX: it's zero copy in user-space, and also minimizes syscalls.
If the OSD doesn't support fragmented payloads itself (such as the kernel), ceph::net::Socket::read_exactly(size) still needs to be used to build up big chunks of aligned payload, regardless of whether the messenger is using the Native or the POSIX stack.
My point is that whether to use fragmented or aligned payloads should be decided by the OSD, not by the Seastar framework. It's our (the framework user's) specific requirement.
> Also, the current ceph::net::Socket::read(size) is already optimal for both DPDK stack and POSIX stack if OSD-side supports fragmented DATA payload:
> - for DPDK: it's zero copy.
> - for POSIX: it's zero copy in user-space, and also minimizes syscalls.
I disagree with that. For the POSIX stack the SGL will be terribly fragmented and many syscalls will be issued because of the small, 8 KB-long prefetch buffer. For instance, reading a 4 MB payload requires 4096 KB / 8 KB = 512 fragments and also 512 syscalls.
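The arithmetic above can be checked with a small sketch. `reads_needed` is a hypothetical helper, and it assumes the best case in which every read fills the whole buffer; the 1 MB case is included only for comparison.

```cpp
#include <cassert>
#include <cstddef>

// Minimum number of reads (and resulting fragments) needed to fetch
// `payload` bytes when each read returns at most `buf` bytes.
// This is a ceiling division; `buf` must be non-zero.
constexpr std::size_t reads_needed(std::size_t payload, std::size_t buf) {
    return (payload + buf - 1) / buf;
}

constexpr std::size_t KiB = 1024;

// 4 MB payload through an 8 KB prefetch buffer: 512 reads.
static_assert(reads_needed(4096 * KiB, 8 * KiB) == 512, "4 MB over 8 KB");
// The same payload through a 1 MB buffer: 4 reads.
static_assert(reads_needed(4096 * KiB, 1024 * KiB) == 4, "4 MB over 1 MB");
```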
> because of the small, 8 KB-long prefetch buffer

I think it's a separate issue, and there is already another PR addressing it (#4). My analysis (#4 (comment)) shows that messenger performance is much better with larger chunks (1 MB), as expected. But I still don't know why rados bench disagreed (from kefu).
@@ -325,6 +325,21 @@ posix_data_source_impl::get() {
});
}

future<size_t, temporary_buffer<char>>
posix_data_source_impl::get_direct(char* buf, size_t size) {
if (size > _buf_size / 2) {
Yeah, in the 1 syscall/msg testing, 4096 currently looks like a reasonable threshold for prefetching.
Yep, it is the same strategy implemented in the current async-messenger.
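As a rough sketch of the threshold being discussed: the condition in the hunk is `size > _buf_size / 2`, so with the 8 KB prefetch buffer mentioned earlier in this thread the cut-over sits at 4096 bytes. `read_path` and `choose_path` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Which way a read request would be served in this sketch.
enum class read_path { direct, prefetch };

// Same comparison as in the hunk: requests larger than half the prefetch
// buffer bypass it and read straight into the caller's buffer; smaller
// requests go through the prefetch buffer and pay an extra copy.
constexpr read_path choose_path(std::size_t request, std::size_t prefetch_buf_size) {
    return request > prefetch_buf_size / 2 ? read_path::direct
                                           : read_path::prefetch;
}

static_assert(choose_path(4096, 8192) == read_path::prefetch, "at threshold");
static_assert(choose_path(4097, 8192) == read_path::direct, "above threshold");
```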
src/net/posix-stack.cc
Outdated
posix_data_source_impl::get_direct(char* buf, size_t size) {
if (size > _buf_size / 2) {
// this was a large read, we don't prefetch
return _fd->read_some(buf, size).then([] (auto read_size) {
OK, the POSIX stack would serve large chunks with fewer syscalls and with limited-but-still-present memcpy. Two factors contribute:
- the mechanism is not aware (and really shouldn't be, this is Seastar ;-) of where in the stream the boundaries of a Ceph frame's parts (preamble, segments, epilogue) are. As a consequence, the prefetch buffer may already contain a portion of the data we're interested in. If so, we must memmove it to the single output buffer, as read_exactly2 can return only contiguous memory.
The probability of having the memcpy is quite large, I bet. The impact depends on chunk size. Copying up to 8k isn't a big deal for 4M but can be meaningful for 16k.
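A minimal standalone sketch of that effect, assuming an in-memory stand-in for the socket. `fake_stream` and its members are hypothetical and do not correspond to Seastar types; the second memcpy merely simulates a direct read into the caller's buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical in-memory stand-in for a socket plus its prefetch buffer.
struct fake_stream {
    std::string prefetched;  // bytes already pulled in by an earlier read
    std::string wire;        // bytes still waiting "in the kernel"
    std::size_t copied = 0;  // bytes that needed a user-space copy
    std::size_t direct = 0;  // bytes delivered straight to the caller

    // Fill out[0..n) contiguously and return the number of bytes delivered.
    std::size_t read_exactly(char* out, std::size_t n) {
        // Step 1: bytes sitting in the prefetch buffer must be copied out,
        // because the result has to be one contiguous region.
        const std::size_t from_prefetch = std::min(prefetched.size(), n);
        std::memcpy(out, prefetched.data(), from_prefetch);
        prefetched.erase(0, from_prefetch);
        copied += from_prefetch;

        // Step 2: the remainder can land directly behind the copied prefix.
        // (Simulated with memcpy here; a real read would target `out`.)
        const std::size_t from_wire = std::min(wire.size(), n - from_prefetch);
        std::memcpy(out + from_prefetch, wire.data(), from_wire);
        wire.erase(0, from_wire);
        direct += from_wire;

        return from_prefetch + from_wire;
    }
};
```

In this model the copied portion is bounded by the prefetch buffer size, which matches the observation that the cost matters for 16k reads far more than for 4M reads.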
It's under the assumption that a syscall is slower than memcpy, which is to say, for smaller chunks, prefetch with memcpy is faster than exclusive syscalls.
I'm not 100% sure about this, and that's why I'm working on improving perf_crimson_msgr to get more accurate & informative results.
auto len_needs_copy = std::min(available(), n);
std::copy(_buf.get(), _buf.get() + len_needs_copy, out.get_write());
_buf.trim_front(len_needs_copy);
if (len_needs_copy == n) {
Hmm, maybe we don't need this special if. We might consider unifying with the one in ::read_exactly_part_direct():
if (completed == n) {
return make_ready_future<tmp_buf>(std::move(out));
}
});
} else {
// read with prefetch, but with extra memory copy,
// because we prefer less system calls.
What about io_uring? Currently it's reasonable to do a lot of extra work just to lower the number of syscalls. However, io_uring is intended to lower the costs of communication between kernel and user-space, and thus I would expect it will move the threshold much lower.
For io_uring I believe we need another concrete data_source_impl class, not posix_data_source_impl. And io_uring also needs a new poller, right?
@@ -325,6 +325,21 @@ posix_data_source_impl::get() {
});
}

future<size_t, temporary_buffer<char>>
posix_data_source_impl::get_direct(char* buf, size_t size) {
if (size > _buf_size / 2) {
Is ::get_direct() a good place for such logic? Maybe moving it up is preferred? I'm afraid the name is currently a little bit misleading, as the _direct part is actually conditional.
I think I cannot move it up, because this is a special case to reduce syscalls for POSIX sockets, which does not generalize to other concrete data_source_impl classes such as native_data_source_impl.
include/seastar/core/iostream.hh
Outdated
// can work with user provided buffer pointer with less copy.
return get().then([buf, size] (auto read_buf) mutable {
auto len_needs_copy = std::min(read_buf.size(), size);
std::copy(read_buf.get_write(), read_buf.get_write() + len_needs_copy, buf);
Maybe get() instead of get_write()?
Early evaluation shows this performance is similar with smaller chunks (256B, 4K), but much faster with larger chunks (64K, 1M) because of reduced … I'm working on improving …
For posix-stack: minimize system calls with prefetch, and minimize unnecessary memory copies.
For native-stack: minimize unnecessary memory copies.
TODO: compatible but may not be optimal
- tls_connected_socket_impl
- file_data_source_impl
- loopback_data_source_impl
- packet_data_source
Signed-off-by: Yingxin <[email protected]>
Signed-off-by: Yingxin <[email protected]>
This reverts commit 33406cf. It introduces memory leaks: Direct leak of 24 byte(s) in 1 object(s) allocated from: #0 0x7fb773b389d7 in operator new(unsigned long) (/lib64/libasan.so.5+0x10f9d7) ceph#1 0x108f0d4 in seastar::reactor::poller::~poller() ../src/core/reactor.cc:2879 ceph#2 0x11c1e59 in std::experimental::fundamentals_v1::_Optional_base<seastar::reactor::poller, true>::~_Optional_base() /usr/include/c++/9/experimental/optional:288 ceph#3 0x118f2d7 in std::experimental::fundamentals_v1::optional<seastar::reactor::poller>::~optional() /usr/include/c++/9/experimental/optional:491 ceph#4 0x108c5a5 in seastar::reactor::run() ../src/core/reactor.cc:2587 ceph#5 0xf1a822 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) ../src/core/app-template.cc:199 ceph#6 0xf1885d in seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) ../src/core/app-template.cc:115 ceph#7 0xeb2735 in operator() ../src/testing/test_runner.cc:72 ceph#8 0xebb342 in _M_invoke /usr/include/c++/9/bits/std_function.h:300 ceph#9 0xf3d8b0 in std::function<void ()>::operator()() const /usr/include/c++/9/bits/std_function.h:690 ceph#10 0x1034c72 in seastar::posix_thread::start_routine(void*) ../src/core/posix.cc:52 ceph#11 0x7fb7738804e1 in start_thread /usr/src/debug/glibc-2.30-13-g919af705ee/nptl/pthread_create.c:479 Reported-by: Rafael Avila de Espindola <[email protected]>
…o_with Fixes failures in debug mode: ``` $ build/debug/tests/unit/closeable_test -l all -t deferred_close_test WARNING: debug mode. Not for benchmarking or production random-seed=3064133628 Running 1 test case... Entering test module "../../tests/unit/closeable_test.cc" ../../tests/unit/closeable_test.cc(0): Entering test case "deferred_close_test" ../../src/testing/seastar_test.cc(43): info: check true has passed ==9449==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases! terminate called after throwing an instance of 'seastar::broken_promise' what(): broken promise ==9449==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x7fbf1f49f000; bottom 0x7fbf40971000; size: 0xffffffffdeb2e000 (-558702592) False positive error reports may follow For details see google/sanitizers#189 ================================================================= ==9449==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_thread.cpp:356 "((ptr[0] == kCurrentStackFrameMagic)) != (0)" (0x0, 0x0) #0 0x7fbf45f39d0b (/lib64/libasan.so.6+0xb3d0b) #1 0x7fbf45f57d4e (/lib64/libasan.so.6+0xd1d4e) #2 0x7fbf45f3e724 (/lib64/libasan.so.6+0xb8724) #3 0x7fbf45eb3e5b (/lib64/libasan.so.6+0x2de5b) #4 0x7fbf45eb51e8 (/lib64/libasan.so.6+0x2f1e8) #5 0x7fbf45eb7694 (/lib64/libasan.so.6+0x31694) #6 0x7fbf45f39398 (/lib64/libasan.so.6+0xb3398) #7 0x7fbf45f3a00b in __asan_report_load8 (/lib64/libasan.so.6+0xb400b) #8 0xfe6d52 in bool __gnu_cxx::operator!=<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > >(__gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&, __gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&) /usr/include/c++/10/bits/stl_iterator.h:1116 #9 0xfe615c in dl_iterate_phdr ../../src/core/exception_hacks.cc:121 #10 0x7fbf44bd1810 in _Unwind_Find_FDE 
(/lib64/libgcc_s.so.1+0x13810) #11 0x7fbf44bcd897 (/lib64/libgcc_s.so.1+0xf897) #12 0x7fbf44bcea5f (/lib64/libgcc_s.so.1+0x10a5f) #13 0x7fbf44bcefd8 in _Unwind_RaiseException (/lib64/libgcc_s.so.1+0x10fd8) #14 0xfe6281 in _Unwind_RaiseException ../../src/core/exception_hacks.cc:148 scylladb#15 0x7fbf457364bb in __cxa_throw (/lib64/libstdc++.so.6+0xaa4bb) scylladb#16 0x7fbf45e10a21 (/lib64/libboost_unit_test_framework.so.1.73.0+0x1aa21) scylladb#17 0x7fbf45e20fe0 in boost::execution_monitor::execute(boost::function<int ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2afe0) scylladb#18 0x7fbf45e21094 in boost::execution_monitor::vexecute(boost::function<void ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2b094) scylladb#19 0x7fbf45e43921 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d921) scylladb#20 0x7fbf45e5eae1 (/lib64/libboost_unit_test_framework.so.1.73.0+0x68ae1) scylladb#21 0x7fbf45e5ed31 (/lib64/libboost_unit_test_framework.so.1.73.0+0x68d31) scylladb#22 0x7fbf45e2e547 in boost::unit_test::framework::run(unsigned long, bool) (/lib64/libboost_unit_test_framework.so.1.73.0+0x38547) scylladb#23 0x7fbf45e43618 in boost::unit_test::unit_test_main(bool (*)(), int, char**) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d618) scylladb#24 0x44798d in seastar::testing::entry_point(int, char**) ../../src/testing/entry_point.cc:77 scylladb#25 0x4134b5 in main ../../include/seastar/testing/seastar_test.hh:65 scylladb#26 0x7fbf44a1b1e1 in __libc_start_main (/lib64/libc.so.6+0x281e1) scylladb#27 0x4133dd in _start (/home/bhalevy/dev/seastar/build/debug/tests/unit/closeable_test+0x4133dd) ``` Signed-off-by: Benny Halevy <[email protected]> Message-Id: <[email protected]>
When we enable the sanitizer, we get following error while running iotune: ==86505==ERROR: LeakSanitizer: detected memory leaks Direct leak of 4096 byte(s) in 1 object(s) allocated from: #0 0x5701b8 in aligned_alloc (/home/syuu/seastar.2/build/sanitize/apps/iotune/iotune+0x5701b8) (BuildId: 411f9852d64ed8982d5b33d02489b5932d92b8b7) #1 0x6d0813 in seastar::filesystem_has_good_aio_support(seastar::basic_sstring<char, unsigned int, 15u, true>, bool) /home/syuu/seastar.2/src/core/fsqual.cc:74:16 #2 0x5bcd0d in main::$_0::operator()() const::'lambda'()::operator()() const /home/syuu/seastar.2/apps/iotune/iotune.cc:742:21 #3 0x5bb1f1 in seastar::future<int> seastar::futurize<int>::apply<main::$_0::operator()() const::'lambda'()>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&) /home/syuu/seastar.2/include/seastar/core/future.hh:2118:28 #4 0x5bb039 in seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0::operator()() const::'lambda'()&&)::'lambda'()::operator()() const /home/syuu/seastar.2/include/seastar/core/thread.hh:258:13 #5 0x5bb039 in seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0::operator()() const::'lambda'()&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) /home/syuu/seastar.2/include/seastar/util/noncopyable_function.hh:124:20 #6 0x8e0a77 in seastar::thread_context::main() /home/syuu/seastar.2/src/core/thread.cc:299:9 #7 0x7f30ff8547bf (/lib64/libc.so.6+0x547bf) (BuildId: 85c438f4ff93e21675ff174371c9c583dca00b2c) SUMMARY: AddressSanitizer: 4096 byte(s) leaked in 1 allocation(s). 
This is because we don't free buffer which allocated at filesystem_has_good_aio_support(), we should free it to avoid such error. And this is needed to test Scylla machine image with debug mode binary, since it tries to run iotune with the sanitizer and fails. Closes scylladb#1284
in main(), we creates an instance of `http_server_control` using new, but we never destroy it. this is identified by ASan ``` ==2190125==ERROR: LeakSanitizer: detected memory leaks Direct leak of 8 byte(s) in 1 object(s) allocated from: #0 0x55e21cf487bd in operator new(unsigned long) /home/kefu/dev/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:86:3 #1 0x55e21cf6cf31 in main::$_0::operator()() const::'lambda'()::operator()() const /home/kefu/dev/seastar/apps/httpd/main.cc:121:27 #2 0x55e21cf6b4cc in int std::__invoke_impl<int, main::$_0::operator()() const::'lambda'()>(std::__invoke_other, main::$_0::operator()() const::'lambda'()&&) /usr/lib/gcc/x86_64-redhat-linux/14/../../../../incl ude/c++/14/bits/invoke.h:61:14 #3 0x55e21cf6b46c in std::__invoke_result<main::$_0::operator()() const::'lambda'()>::type std::__invoke<main::$_0::operator()() const::'lambda'()>(main::$_0::operator()() const::'lambda'()&&) /usr/lib/gcc/x86_ 64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:96:14 #4 0x55e21cf6b410 in decltype(auto) std::__apply_impl<main::$_0::operator()() const::'lambda'(), std::tuple<>>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&, std::integer_sequence<unsigned long, . 
..>) /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/tuple:2921:14 #5 0x55e21cf6b3b2 in decltype(auto) std::apply<main::$_0::operator()() const::'lambda'(), std::tuple<>>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&) /usr/lib/gcc/x86_64-redhat-linux/14/../../../ ../include/c++/14/tuple:2936:14 #6 0x55e21cf6b283 in seastar::future<int> seastar::futurize<int>::apply<main::$_0::operator()() const::'lambda'()>(main::$_0::operator()() const::'lambda'()&&, std::tuple<>&&) /home/kefu/dev/seastar/include/sea star/core/future.hh:2005:28 #7 0x55e21cf6b043 in seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0: :operator()() const::'lambda'()&&)::'lambda'()::operator()() const /home/kefu/dev/seastar/include/seastar/core/thread.hh:260:13 #8 0x55e21cf6ae74 in seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<main::$_0::operator()() const::'lambda'()>::type>::type seastar::async<main::$_0::operator()() const::'lambda'()>(seastar::thread_attributes, main::$_0::operator()() const::'lambda'()&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) /home/kefu/dev/seastar/include/seastar/util/noncopyable _function.hh:129:20 #9 0x7f5d757a0fb3 in seastar::noncopyable_function<void ()>::operator()() const /home/kefu/dev/seastar/include/seastar/util/noncopyable_function.hh:215:16 #10 0x7f5d75ef5611 in seastar::thread_context::main() /home/kefu/dev/seastar/src/core/thread.cc:311:9 #11 0x7f5d75ef50eb in seastar::thread_context::s_main(int, int) /home/kefu/dev/seastar/src/core/thread.cc:287:43 #12 0x7f5d72f8a18f (/lib64/libc.so.6+0x5a18f) (BuildId: b098f1c75a76548bb230d8f551eae07a2aeccf06) ``` so, in this change, let's hold it using a smart pointer, so we can destroy it when it leaves the lexical scope. Signed-off-by: Kefu Chai <[email protected]> Closes scylladb#2224
See https://gist.github.com/cyx1231st/57727c8aa6c98ed48a8b06d64b7923d7
This PR introduces a less intrusive way to implement read with alignment.
Considerations:
- For posix_data_source_impl, system-call is assumed to be much more expensive than memory-copy, so the prefetch is disabled only if:
  (This idea is very similar to the current async-msgr, which will always do prefetch if the read is not large enough, and copy the prefetched data to the out buffer p, see https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L235-L271)
- For native_data_source_impl, the current implementation will try its best to verify if the memory alignment is already good, and will trigger user-to-user-space copy only if required.
- For the other *_data_source_impl, they are simply compatible and we currently haven't used them.

The code is functioning now, but still needs further evaluation of performance impacts.
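The native-stack consideration (check the alignment first, copy only if required) can be sketched as follows. This is a standalone illustration rather than the PR's code: `is_aligned` and `ensure_aligned` are hypothetical helpers, and the sketch assumes a C++17 standard library that provides `std::aligned_alloc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// True if p is a multiple of `alignment`, which must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Returns src itself when it is already aligned (no copy). Otherwise copies
// `size` bytes into newly allocated aligned storage and returns that.
// *copied reports which path was taken; the caller must std::free() the
// result only when *copied is true. Returns nullptr if allocation fails.
inline char* ensure_aligned(char* src, std::size_t size,
                            std::size_t alignment, bool* copied) {
    *copied = false;
    if (is_aligned(src, alignment)) {
        return src;  // alignment already good, zero copy
    }
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    std::size_t rounded = (size + alignment - 1) / alignment * alignment;
    if (rounded == 0) {
        rounded = alignment;
    }
    char* dst = static_cast<char*>(std::aligned_alloc(alignment, rounded));
    if (dst == nullptr) {
        return nullptr;
    }
    std::memcpy(dst, src, size);  // the user-to-user-space copy
    *copied = true;
    return dst;
}
```

Returning the original pointer on the aligned path keeps the common case free of copies, which is the behaviour described for native_data_source_impl above.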