
Use FIOFFS on illumos instead of fsync (again) #1199

Open · mkeeter wants to merge 1 commit into main

Conversation

@mkeeter mkeeter commented Mar 11, 2024

This is a cleaned-up version of #1148, which had fallen well behind main (rebasing was awkward, so I just copy-pasted the relevant code).

It splits the per-extent flush operation into pre_flush, flush_inner, and post_flush, then skips flush_inner in favor of the ioctl if we're running with the omicron-build feature.
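The control flow described above can be sketched as follows. This is a minimal model, not the actual Crucible code: `OMICRON_BUILD`, `Extent`, and `flush_region` are illustrative stand-ins, and the illumos FIOFFS ioctl is left as a comment.

```rust
// Minimal model of the three-phase flush; the real Crucible types differ.
// `OMICRON_BUILD` stands in for the cargo feature flag of the same name.
const OMICRON_BUILD: bool = true;

#[derive(Debug, Default)]
struct Extent {
    phases_run: Vec<&'static str>,
}

impl Extent {
    /// Pre-flush bookkeeping (e.g. recording the flush number).
    fn pre_flush(&mut self) {
        self.phases_run.push("pre_flush");
    }

    /// Per-file durability: on stock builds this would fsync(2) the
    /// extent's backing file.
    fn flush_inner(&mut self) {
        self.phases_run.push("flush_inner");
    }

    /// Post-flush bookkeeping (e.g. clearing the dirty flag).
    fn post_flush(&mut self) {
        self.phases_run.push("post_flush");
    }
}

/// Flush every extent. With omicron-build, the per-extent fsync is
/// skipped in favor of a single filesystem-wide ioctl, stubbed out here.
fn flush_region(extents: &mut [Extent]) {
    for e in extents.iter_mut() {
        e.pre_flush();
        if !OMICRON_BUILD {
            e.flush_inner();
        }
        e.post_flush();
    }
    if OMICRON_BUILD {
        // On illumos, the real code would issue something like
        // ioctl(region_fd, FIOFFS) here to flush the whole filesystem
        // in one syscall instead of one fsync per extent.
    }
}

fn main() {
    let mut extents = vec![Extent::default(), Extent::default()];
    flush_region(&mut extents);
    println!("{:?}", extents[0].phases_run);
}
```

The key property the sketch shows: with the feature enabled, each extent still runs its pre/post bookkeeping, but no extent issues its own fsync.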

@jmpesp jmpesp left a comment


🚀

  1. Did you confirm with dtrace that this is doing what we expect? I don't see how it couldn't, but it may be safest to be paranoid.
  2. What about any performance impact? I remember you saying in chat that there wasn't one anymore and I'm curious if you reran any of your tests.

mkeeter commented Mar 13, 2024

Testing using

dtrace -n 'fbt::zil_commit:entry/pid == $target/ { printf("HI %u", arg1) }' -p `pgrep -nf crucible-ds-fioffs`

I see it firing 2x per second with the right value for arg1 (foid):

 62  53249                 zil_commit:entry HI 0
 78  53249                 zil_commit:entry HI 0
103  53249                 zil_commit:entry HI 0
103  53249                 zil_commit:entry HI 0
103  53249                 zil_commit:entry HI 0
103  53249                 zil_commit:entry HI 0
104  53249                 zil_commit:entry HI 0
101  53249                 zil_commit:entry HI 0

To make this easier for future readers, I added a comment in bbb6cfe tracing the execution path. After chatting in person last week, @jclulow is also going to socialize the idea of making this IOCTL committed, so we could rely on it in the future.

The performance question is interesting. Running on Propolis + Crucible main (6f7884ae4)

4M WRITE: bw=547MiB/s (574MB/s), 547MiB/s-547MiB/s (574MB/s-574MB/s), io=32.2GiB (34.6GB), run=60199-60199msec
4K WRITE: bw=28.8MiB/s (30.2MB/s), 28.8MiB/s-28.8MiB/s (30.2MB/s-30.2MB/s), io=1729MiB (1813MB), run=60009-60009msec
1M WRITE: bw=644MiB/s (675MB/s), 644MiB/s-644MiB/s (675MB/s-675MB/s), io=37.8GiB (40.5GB), run=60025-60025msec
4M WRITE: bw=546MiB/s (572MB/s), 546MiB/s-546MiB/s (572MB/s-572MB/s), io=32.1GiB (34.4GB), run=60171-60171msec
4K READ: bw=32.2MiB/s (33.8MB/s), 32.2MiB/s-32.2MiB/s (33.8MB/s-33.8MB/s), io=1931MiB (2025MB), run=60003-60003msec
1M READ: bw=633MiB/s (663MB/s), 633MiB/s-633MiB/s (663MB/s-663MB/s), io=37.1GiB (39.8GB), run=60035-60035msec
4M READ: bw=778MiB/s (816MB/s), 778MiB/s-778MiB/s (816MB/s-816MB/s), io=45.7GiB (49.1GB), run=60134-60134msec

Running on this branch (fb4ca887e) with omicron-build set, I see a speedup:

4M WRITE: bw=704MiB/s (738MB/s), 704MiB/s-704MiB/s (738MB/s-738MB/s), io=41.3GiB (44.3GB), run=60100-60100msec
4K WRITE: bw=26.8MiB/s (28.1MB/s), 26.8MiB/s-26.8MiB/s (28.1MB/s-28.1MB/s), io=1610MiB (1689MB), run=60014-60014msec
1M WRITE: bw=665MiB/s (698MB/s), 665MiB/s-665MiB/s (698MB/s-698MB/s), io=39.0GiB (41.9GB), run=60029-60029msec
4M WRITE: bw=725MiB/s (761MB/s), 725MiB/s-725MiB/s (761MB/s-761MB/s), io=42.6GiB (45.8GB), run=60176-60176msec
4K READ: bw=29.7MiB/s (31.1MB/s), 29.7MiB/s-29.7MiB/s (31.1MB/s-31.1MB/s), io=1780MiB (1867MB), run=60003-60003msec
1M READ: bw=640MiB/s (671MB/s), 640MiB/s-640MiB/s (671MB/s-671MB/s), io=37.5GiB (40.3GB), run=60037-60037msec
4M READ: bw=750MiB/s (786MB/s), 750MiB/s-750MiB/s (786MB/s-786MB/s), io=44.0GiB (47.3GB), run=60129-60129msec

However, with omicron-build not set (i.e. using the same old fsync path), I see the same speedup (???)

4M WRITE: bw=787MiB/s (825MB/s), 787MiB/s-787MiB/s (825MB/s-825MB/s), io=46.2GiB (49.6GB), run=60183-60183msec
4K WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=1698MiB (1780MB), run=60010-60010msec
1M WRITE: bw=797MiB/s (836MB/s), 797MiB/s-797MiB/s (836MB/s-836MB/s), io=46.7GiB (50.2GB), run=60036-60036msec
4M WRITE: bw=809MiB/s (848MB/s), 809MiB/s-809MiB/s (848MB/s-848MB/s), io=47.5GiB (51.0GB), run=60100-60100msec
4K READ: bw=25.8MiB/s (27.1MB/s), 25.8MiB/s-25.8MiB/s (27.1MB/s-27.1MB/s), io=1549MiB (1625MB), run=60004-60004msec
1M READ: bw=640MiB/s (671MB/s), 640MiB/s-640MiB/s (671MB/s-671MB/s), io=37.5GiB (40.3GB), run=60032-60032msec
4M READ: bw=706MiB/s (741MB/s), 706MiB/s-706MiB/s (741MB/s-741MB/s), io=41.5GiB (44.5GB), run=60145-60145msec

This is puzzling; I don't expect the pre_flush / flush_inner / post_flush refactoring to make such a noticeable performance difference.

leftwo commented Mar 14, 2024

This has the automatic splitting of > 1 MiB writes (#1198), so I'm a little surprised the 4M WRITE test is
faster than the 1M WRITE test, but I guess we avoid having to go all the way out to the guest to
get the next 1M chunk, and avoid a bit of that layer's overhead.

Also, for each grouping I see two rows that start with 4M WRITE. Is that the same test twice, or is there
something different between the two rows?

To make it a little easier for me to see the differences, I changed the groupings so you see all
three of the same test next to each other.

 STOCK 4M WRITE: bw=547MiB/s (574MB/s), 547MiB/s-547MiB/s (574MB/s-574MB/s), io=32.2GiB (34.6GB), run=60199-60199msec                                          
FIOFFS 4M WRITE: bw=704MiB/s (738MB/s), 704MiB/s-704MiB/s (738MB/s-738MB/s), io=41.3GiB (44.3GB), run=60100-60100msec                                          
NOCORN 4M WRITE: bw=787MiB/s (825MB/s), 787MiB/s-787MiB/s (825MB/s-825MB/s), io=46.2GiB (49.6GB), run=60183-60183msec                                          

Clear win here, but like you said, omicron-build not set (fsync) is even faster.

 STOCK 4K WRITE: bw=28.8MiB/s (30.2MB/s), 28.8MiB/s-28.8MiB/s (30.2MB/s-30.2MB/s), io=1729MiB (1813MB), run=60009-60009msec                                    
FIOFFS 4K WRITE: bw=26.8MiB/s (28.1MB/s), 26.8MiB/s-26.8MiB/s (28.1MB/s-28.1MB/s), io=1610MiB (1689MB), run=60014-60014msec                                    
NOCORN 4K WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=1698MiB (1780MB), run=60010-60010msec                                    

All the same here really.

 STOCK 1M WRITE: bw=644MiB/s (675MB/s), 644MiB/s-644MiB/s (675MB/s-675MB/s), io=37.8GiB (40.5GB), run=60025-60025msec                                          
FIOFFS 1M WRITE: bw=665MiB/s (698MB/s), 665MiB/s-665MiB/s (698MB/s-698MB/s), io=39.0GiB (41.9GB), run=60029-60029msec                                          
NOCORN 1M WRITE: bw=797MiB/s (836MB/s), 797MiB/s-797MiB/s (836MB/s-836MB/s), io=46.7GiB (50.2GB), run=60036-60036msec                                          

The FIOFFS could be faster, or just in the noise, but NOCORN is indeed a big bump here.

 STOCK 4M WRITE: bw=546MiB/s (572MB/s), 546MiB/s-546MiB/s (572MB/s-572MB/s), io=32.1GiB (34.4GB), run=60171-60171msec                                          
FIOFFS 4M WRITE: bw=725MiB/s (761MB/s), 725MiB/s-725MiB/s (761MB/s-761MB/s), io=42.6GiB (45.8GB), run=60176-60176msec                                          
NOCORN 4M WRITE: bw=809MiB/s (848MB/s), 809MiB/s-809MiB/s (848MB/s-848MB/s), io=47.5GiB (51.0GB), run=60100-60100msec                                          

FIOFFS faster, but NOCORN a step faster still.

 STOCK 4K READ: bw=32.2MiB/s (33.8MB/s), 32.2MiB/s-32.2MiB/s (33.8MB/s-33.8MB/s), io=1931MiB (2025MB), run=60003-60003msec                                     
FIOFFS 4K READ: bw=29.7MiB/s (31.1MB/s), 29.7MiB/s-29.7MiB/s (31.1MB/s-31.1MB/s), io=1780MiB (1867MB), run=60003-60003msec                                     
NOCORN 4K READ: bw=25.8MiB/s (27.1MB/s), 25.8MiB/s-25.8MiB/s (27.1MB/s-27.1MB/s), io=1549MiB (1625MB), run=60004-60004msec                                     

A little slower here, but that could be in the noise as well.

 STOCK 1M READ: bw=633MiB/s (663MB/s), 633MiB/s-633MiB/s (663MB/s-663MB/s), io=37.1GiB (39.8GB), run=60035-60035msec                                           
FIOFFS 1M READ: bw=640MiB/s (671MB/s), 640MiB/s-640MiB/s (671MB/s-671MB/s), io=37.5GiB (40.3GB), run=60037-60037msec                                           
NOCORN 1M READ: bw=640MiB/s (671MB/s), 640MiB/s-640MiB/s (671MB/s-671MB/s), io=37.5GiB (40.3GB), run=60032-60032msec                                           

Faster, but probably in the noise again.

 STOCK 4M READ: bw=778MiB/s (816MB/s), 778MiB/s-778MiB/s (816MB/s-816MB/s), io=45.7GiB (49.1GB), run=60134-60134msec                                           
FIOFFS 4M READ: bw=750MiB/s (786MB/s), 750MiB/s-750MiB/s (786MB/s-786MB/s), io=44.0GiB (47.3GB), run=60129-60129msec                                           
NOCORN 4M READ: bw=706MiB/s (741MB/s), 706MiB/s-706MiB/s (741MB/s-741MB/s), io=41.5GiB (44.5GB), run=60145-60145msec                                           

FIOFFS slower, but maybe that ~4% is just in the noise? NOCORN slower still.

For omicron-build not being set, that's not a setting we will ever run in production, right?

mkeeter commented Mar 14, 2024

Also, for each grouping I see two rows that start with 4M WRITE. Is that the same test twice, or is there
something different between the two rows?

There are two independent 4M runs that happen at different times during the benchmarking (first and last). I was noticing that the initial 4K run was misleadingly fast because it was writing to an empty disk, so I added an initial 4M random write step to make the disk not completely empty.
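The run ordering described above could be expressed as an fio job file along these lines. This is a hypothetical sketch: the job names, target path, and engine are illustrative, not the actual benchmark configuration.

```ini
; Sketch: a 4M random-write warm-up so later jobs don't run against a
; completely empty disk, followed by the 4K test. Path and engine are
; illustrative stand-ins.
[global]
filename=/dev/vdb
direct=1
ioengine=libaio
time_based
runtime=60

; warm-up pass: leaves the disk non-empty
[prefill-4M-randwrite]
rw=randwrite
bs=4M

; stonewall makes fio wait for the previous job before starting this one
[4K-randwrite]
stonewall
rw=randwrite
bs=4K
```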

For omicron-build not being set, that's not a setting we will ever run in production, right?

Correct, it's just a convenient way to separate out the FIOFFS effects versus the rest of the refactoring.


I've been digging into performance mysteries, and the plot continues to thicken.

The new, even-weirder development is that the slowness seems to depend on propolis-standalone (i.e. the Crucible upstairs), despite those files not changing.

  • Crucible built with fioffs-redux + Propolis built with Crucible fioffs-redux is fast
  • Crucible built with fioffs-redux + Propolis built with Crucible main is slow

This is true despite the fact that all of the changes in fioffs-redux are in the crucible-downstairs, which isn't even used in propolis-standalone.

matt@atrium ~/crucible $ git diff main fioffs-redux --stat
 Cargo.toml                            |   2 +-
 downstairs/Cargo.toml                 |   1 +
 downstairs/src/extent.rs              |  81 +++++++++++++++++++++++++++++---
 downstairs/src/extent_inner_raw.rs    |  36 ++++++++-------
 downstairs/src/extent_inner_sqlite.rs |  78 ++++++++++++++++++++++---------
 downstairs/src/lib.rs                 |  36 ++++-----------
 downstairs/src/region.rs              | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------
 7 files changed, 311 insertions(+), 128 deletions(-)

(The Cargo.toml change is adding the ioctl feature to nix; I tested this and it's not the smoking gun)

This is very weird.

I see significant differences in method sizes and implementations in the disassembly. For example, on propolis-standalone built with Crucible main:

00000000012a8b00 <crucible::Buffer::into_bytes>:
 12a8b00:	55                   	push   %rbp
 12a8b01:	48 89 e5             	mov    %rsp,%rbp
 12a8b04:	41 57                	push   %r15
 12a8b06:	41 56                	push   %r14
 12a8b08:	41 54                	push   %r12
 12a8b0a:	53                   	push   %rbx
 12a8b0b:	48 81 ec 80 00 00 00 	sub    $0x80,%rsp
 12a8b12:	49 89 f6             	mov    %rsi,%r14
 12a8b15:	48 89 fb             	mov    %rdi,%rbx
 12a8b18:	48 8b 36             	mov    (%rsi),%rsi
 12a8b1b:	49 8b 56 08          	mov    0x8(%r14),%rdx
 12a8b1f:	4d 8b 7e 18          	mov    0x18(%r14),%r15
 12a8b23:	41 f6 c7 01          	test   $0x1,%r15b
 12a8b27:	75 18                	jne    12a8b41 <crucible::Buffer::into_bytes+0x41>
 12a8b29:	48 89 73 08          	mov    %rsi,0x8(%rbx)
 12a8b2d:	48 89 53 10          	mov    %rdx,0x10(%rbx)
 12a8b31:	4c 89 7b 18          	mov    %r15,0x18(%rbx)
 12a8b35:	48 8b 05 b4 d5 f1 00 	mov    0xf1d5b4(%rip),%rax        # 21c60f0 <_GLOBAL_OFFSET_TABLE_+0x6f0>
 12a8b3c:	48 89 03             	mov    %rax,(%rbx)
 12a8b3f:	eb 55                	jmp    12a8b96 <crucible::Buffer::into_bytes+0x96>
 12a8b41:	49 8b 4e 10          	mov    0x10(%r14),%rcx
 12a8b45:	49 c1 ef 05          	shr    $0x5,%r15
 12a8b49:	4c 8d a5 60 ff ff ff 	lea    -0xa0(%rbp),%r12
 12a8b50:	4c 89 e7             	mov    %r12,%rdi
 12a8b53:	4d 89 f8             	mov    %r15,%r8
 12a8b56:	e8 65 24 94 00       	call   1beafc0 <bytes::bytes_mut::rebuild_vec>
 12a8b5b:	48 8d 7d b0          	lea    -0x50(%rbp),%rdi
 12a8b5f:	4c 89 e6             	mov    %r12,%rsi
 12a8b62:	e8 19 01 94 00       	call   1be8c80 <<bytes::bytes::Bytes as core::convert::From<alloc::vec::Vec<u8>>>::from>
 12a8b67:	4c 89 7d d8          	mov    %r15,-0x28(%rbp)
 12a8b6b:	48 8b 45 c0          	mov    -0x40(%rbp),%rax
 12a8b6f:	48 89 c1             	mov    %rax,%rcx
 12a8b72:	4c 29 f9             	sub    %r15,%rcx
 12a8b75:	72 3e                	jb     12a8bb5 <crucible::Buffer::into_bytes+0xb5>
 12a8b77:	48 89 4d c0          	mov    %rcx,-0x40(%rbp)
 12a8b7b:	4c 01 7d b8          	add    %r15,-0x48(%rbp)
 12a8b7f:	48 8b 45 c0          	mov    -0x40(%rbp),%rax
 12a8b83:	48 89 43 10          	mov    %rax,0x10(%rbx)
 12a8b87:	48 8b 45 c8          	mov    -0x38(%rbp),%rax
 12a8b8b:	48 89 43 18          	mov    %rax,0x18(%rbx)
 12a8b8f:	0f 10 45 b0          	movups -0x50(%rbp),%xmm0
 12a8b93:	0f 11 03             	movups %xmm0,(%rbx)
 12a8b96:	49 83 c6 20          	add    $0x20,%r14
 12a8b9a:	4c 89 f7             	mov    %r14,%rdi
 12a8b9d:	e8 1e 1b 94 00       	call   1bea6c0 <<bytes::bytes_mut::BytesMut as core::ops::drop::Drop>::drop>
 12a8ba2:	48 89 d8             	mov    %rbx,%rax
 12a8ba5:	48 81 c4 80 00 00 00 	add    $0x80,%rsp
 12a8bac:	5b                   	pop    %rbx
 12a8bad:	41 5c                	pop    %r12
 12a8baf:	41 5e                	pop    %r14
 12a8bb1:	41 5f                	pop    %r15
 12a8bb3:	5d                   	pop    %rbp
 12a8bb4:	c3                   	ret
 12a8bb5:	48 89 45 d0          	mov    %rax,-0x30(%rbp)
 12a8bb9:	48 8d 45 d8          	lea    -0x28(%rbp),%rax
 12a8bbd:	48 89 45 90          	mov    %rax,-0x70(%rbp)
 12a8bc1:	48 8d 05 f8 eb fe ff 	lea    -0x11408(%rip),%rax        # 12977c0 <core::fmt::num::<impl core::fmt::Debug for usize>::fmt>
 12a8bc8:	48 89 45 98          	mov    %rax,-0x68(%rbp)
 12a8bcc:	48 8d 4d d0          	lea    -0x30(%rbp),%rcx
 12a8bd0:	48 89 4d a0          	mov    %rcx,-0x60(%rbp)
 12a8bd4:	48 89 45 a8          	mov    %rax,-0x58(%rbp)
 12a8bd8:	48 8d 35 29 2a f6 00 	lea    0xf62a29(%rip),%rsi        # 220b608 <anon.65e99b03176a3e062032dedb025b3946.56.llvm.11851335155975330580+0x138>
 12a8bdf:	48 8d 9d 60 ff ff ff 	lea    -0xa0(%rbp),%rbx
 12a8be6:	48 8d 4d 90          	lea    -0x70(%rbp),%rcx
 12a8bea:	ba 02 00 00 00       	mov    $0x2,%edx
 12a8bef:	41 b8 02 00 00 00    	mov    $0x2,%r8d
 12a8bf5:	48 89 df             	mov    %rbx,%rdi
 12a8bf8:	e8 13 ec fe ff       	call   1297810 <core::fmt::Arguments::new_v1>
 12a8bfd:	48 8d 35 24 2a f6 00 	lea    0xf62a24(%rip),%rsi        # 220b628 <anon.65e99b03176a3e062032dedb025b3946.56.llvm.11851335155975330580+0x158>
 12a8c04:	48 89 df             	mov    %rbx,%rdi
 12a8c07:	e8 84 0b c6 00       	call   1f09790 <core::panicking::panic_fmt>
 12a8c0c:	0f 0b                	ud2
 12a8c0e:	90                   	nop
 12a8c0f:	90                   	nop

In propolis-standalone built against fioffs-redux:

0000000001262c40 <crucible::Buffer::into_bytes>:
 1262c40:	55                   	push   %rbp
 1262c41:	48 89 e5             	mov    %rsp,%rbp
 1262c44:	41 56                	push   %r14
 1262c46:	53                   	push   %rbx
 1262c47:	49 89 f6             	mov    %rsi,%r14
 1262c4a:	48 89 fb             	mov    %rdi,%rbx
 1262c4d:	e8 2e 34 98 00       	call   1be6080 <<bytes::bytes::Bytes as core::convert::From<alloc::vec::Vec<u8>>>::from>
 1262c52:	49 8b 76 20          	mov    0x20(%r14),%rsi
 1262c56:	48 85 f6             	test   %rsi,%rsi
 1262c59:	74 0e                	je     1262c69 <crucible::Buffer::into_bytes+0x29>
 1262c5b:	49 8b 7e 18          	mov    0x18(%r14),%rdi
 1262c5f:	ba 01 00 00 00       	mov    $0x1,%edx
 1262c64:	e8 67 7a c4 ff       	call   eaa6d0 <__rust_dealloc>
 1262c69:	48 89 d8             	mov    %rbx,%rax
 1262c6c:	5b                   	pop    %rbx
 1262c6d:	41 5e                	pop    %r14
 1262c6f:	5d                   	pop    %rbp
 1262c70:	c3                   	ret
 1262c71:	90                   	nop
 1262c72:	90                   	nop
 1262c73:	90                   	nop
 1262c74:	90                   	nop
 1262c75:	90                   	nop
 1262c76:	90                   	nop
 1262c77:	90                   	nop
 1262c78:	90                   	nop
 1262c79:	90                   	nop
 1262c7a:	90                   	nop
 1262c7b:	90                   	nop
 1262c7c:	90                   	nop
 1262c7d:	90                   	nop
 1262c7e:	90                   	nop
 1262c7f:	90                   	nop

I'm extremely puzzled, and will keep digging into this; hopefully, pulling on this thread will clear up some of the non-reproducible performance changes that I've seen in the past.

mkeeter commented Mar 14, 2024

Looking at this assembly, the "fast" propolis-standalone can't be built from commit fb4ca88 (which is what I had in my notes), because the disassembly clearly predates #1200.

When opening this PR, I rebased the fioffs-redux branch onto main at 6f7884a. Previously, it was rooted at commit 662b84e (according to my git reflog history), so I suspect the fast propolis-standalone was built from commit
8133d70.

Sure enough, building propolis-standalone from that commit, I see the "fast" performance.

This indicates that something has changed in Crucible's main branch, where 662b84e is fast and 6f7884a is slow, and the change affects the crucible crate (i.e. the upstairs library).

The changelog is as follows:

commit 6f7884ae4dcf21ee7a92664b83d9b4ae89130223
Author: Alan Hanson <[email protected]>
Date:   Tue Mar 12 17:59:43 2024 -0700

    Added next job ID to dtrace counter probe (#1204)

    * Added next job ID to dtrace counter probe

    Added to the dtrace counter probe the next JobID that the upstairs
    will use.  This allows a caller to determine how many jobs have
    been issued since the last probe.

    ---------

    Co-authored-by: Alan Hanson <[email protected]>

commit 92f460083990fd1dbccf03decedb02267494fce9
Author: Alan Hanson <[email protected]>
Date:   Tue Mar 12 13:08:50 2024 -0700

    Sneak in a work queue dump on the repair API (#1205)

    Added a debug endpoint to the repair API that will dump the current
    downstairs work queue to whatever log the downstairs is using.  This
    allows us to see what the downstairs is doing (or not doing).

    Co-authored-by: Alan Hanson <[email protected]>

commit 9983956e9620340cad9906b448ecb20fba026e00
Author: Matt Keeter <[email protected]>
Date:   Tue Mar 12 11:59:32 2024 -0400

    Don't use debug formatting for skipped jobs (#1202)

    This PR adds an implementation of `std::fmt::Display` to `Message` and
    `WireMessage` which doesn't print out bulk data, then uses that code in the
    "skipped jobs" log message.

commit 4a50958ba680a76400cd876c4bd1ae9ab216d19c
Author: Matt Keeter <[email protected]>
Date:   Mon Mar 11 17:23:46 2024 -0400

    Switch from `Vec` → `BytesMut` in `Buffer` (#1200)

    #1094 has removed the locks from `Buffer`, meaning we can use a `BytesMut`
    instead of a `Vec`.

    This opens up the possibility to automatically split large read operations, à la
    #1198 (which isn't possible with the `Vec`-based `Buffer`).

commit b9b6f8b63fba8eae3835f82262f56a818b007f4b
Author: Matt Keeter <[email protected]>
Date:   Mon Mar 11 17:12:38 2024 -0400

    Fault if too many bytes are in flight (#1192)

    As recommended in RFD 445, this PR adds a fault condition if too many bytes pile
    up in flight for any particular downstairs.

    This adds a nice symmetry to the system: both backpressure and per-downstairs
    faults consider both jobs and bytes in flight.

commit 2e80e4eac1c4b2689b2881d9a1c215956d52b12b
Author: Matt Keeter <[email protected]>
Date:   Mon Mar 11 14:51:43 2024 -0400

    Automatically split > 1 MiB writes (#1198)

    In #1192 , we noticed that arbitrarily-large writes can overwhelm our
    backpressure implementation, because they take longer than our maximum
    backpressure.

    This PR automatically splits large writes into 1 MiB chunks. Propolis already
    does this, so I don't see any performance impacts, but it means that we'll be
    robust against stuff like "crudd tries to send unreasonably large write jobs".

commit e906cbac051eebba29258b0e5094f535e39b92aa
Author: Matt Keeter <[email protected]>
Date:   Fri Mar 8 12:42:54 2024 -0800

    Remove debug prints (#1197)

commit 52970ba122ac84dc43b506b82b9eba2df73f283a
Author: Matt Keeter <[email protected]>
Date:   Fri Mar 8 06:13:12 2024 -0800

    Make ClientRequest pub(crate) instead of pub (#1195)

commit 9bf1bd3d8113f61ff0626723a427cfc84309646b
Author: Matt Keeter <[email protected]>
Date:   Thu Mar 7 13:00:27 2024 -0800

    Fix outdated log message (#1193)

commit b1969f095c53466dfbd8c18ea33f195825b645a6
Author: Alan Hanson <[email protected]>
Date:   Thu Mar 7 10:03:52 2024 -0800

    16 tasks for tokio (#1189)

    Co-authored-by: Alan Hanson <[email protected]>

commit 97294f527969335fc4d250781f7773bb1464755c
Author: Matt Keeter <[email protected]>
Date:   Wed Mar 6 14:35:49 2024 -0800

    Add generic 'message or raw message' code (#1182)

    This PR adds a generic implementation of the 'Message or raw message' serializer
    used for reduced-memcpy serialization (#1087), then updates `crucible-upstairs`
    to use the new implementation.

    It also more generally reduces allocations in the socket write path by reusing
    the same `BytesMut` (for `Message` serialization) or a `Vec` (for raw message
    headers).  This may have a small performance impact; I didn't measure it.

    These changes are a building block for faster serialization of Downstairs
    `ReadResponse`, but I pulled them into a standalone PR for ease of review.

commit 1c1574fb721f98f2df1b23e3fd27d83be018882e
Author: Matt Keeter <[email protected]>
Date:   Fri Mar 1 19:46:46 2024 -0500

    Per client, queue-based backpressure (#1186)

    This PR adds per-client delays to prevent queues from becoming arbitrarily long
    during read-heavy workflows. These delays are based either per-client queue
    lengths and bytes in flight (whichever is worst compared to the slowest client),
    which is analogous to the guest backpressure implementation.

    Compared to main, this keeps the queues relatively bounded, and is at least
    similar in performance (hard to tell, because there is significant run-to-run
    variability).

    Fixes #1167, and is an alternative to #1181

commit fbe6f128d3d6d68cb611b462c305e2181a3340ef
Author: Alan Hanson <[email protected]>
Date:   Fri Mar 1 09:02:11 2024 -0800

    A builder for the Downstairs Downstairs struct. (#1152)

    Updated the downstairs Downstairs struct to have a builder.

    Replaced build_downstairs_for_region() and
    build_downstairs_for_region_with_backend() with the new builder
    pattern.

commit f55f2eb4b952ad04bf4454b2e0a35fa5a605a4f2
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Tue Feb 27 21:39:09 2024 -0800

    Update Rust to v1.76.0 (#1153)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

I suppose the next step is bisecting...

mkeeter commented Mar 15, 2024

See #1208 for the performance explanation; I was accidentally running an older upstairs without a performance regression.

With that fixed, the FIOFFS changes are no longer an obvious performance win.

Here's main with both #1208 and #1206 cherry-picked onto it:

4M WRITE: bw=792MiB/s (830MB/s), 792MiB/s-792MiB/s (830MB/s-830MB/s), io=46.5GiB (49.9GB), run=60127-60127msec
4K WRITE: bw=27.7MiB/s (29.0MB/s), 27.7MiB/s-27.7MiB/s (29.0MB/s-29.0MB/s), io=1661MiB (1742MB), run=60008-60008msec
1M WRITE: bw=870MiB/s (912MB/s), 870MiB/s-870MiB/s (912MB/s-912MB/s), io=51.0GiB (54.8GB), run=60029-60029msec
4M WRITE: bw=767MiB/s (805MB/s), 767MiB/s-767MiB/s (805MB/s-805MB/s), io=45.1GiB (48.4GB), run=60166-60166msec
4K READ: bw=25.6MiB/s (26.8MB/s), 25.6MiB/s-25.6MiB/s (26.8MB/s-26.8MB/s), io=1535MiB (1610MB), run=60008-60008msec
1M READ: bw=696MiB/s (730MB/s), 696MiB/s-696MiB/s (730MB/s-730MB/s), io=40.8GiB (43.8GB), run=60035-60035msec
4M READ: bw=764MiB/s (802MB/s), 764MiB/s-764MiB/s (802MB/s-802MB/s), io=44.9GiB (48.2GB), run=60124-60124msec

Cherry-picking this PR as well:

4M WRITE: bw=768MiB/s (805MB/s), 768MiB/s-768MiB/s (805MB/s-805MB/s), io=45.1GiB (48.4GB), run=60088-60088msec
4K WRITE: bw=27.1MiB/s (28.4MB/s), 27.1MiB/s-27.1MiB/s (28.4MB/s-28.4MB/s), io=1625MiB (1704MB), run=60043-60043msec
1M WRITE: bw=746MiB/s (783MB/s), 746MiB/s-746MiB/s (783MB/s-783MB/s), io=43.8GiB (47.0GB), run=60025-60025msec
4M WRITE: bw=749MiB/s (785MB/s), 749MiB/s-749MiB/s (785MB/s-785MB/s), io=44.0GiB (47.2GB), run=60175-60175msec
4K READ: bw=27.4MiB/s (28.7MB/s), 27.4MiB/s-27.4MiB/s (28.7MB/s-28.7MB/s), io=1643MiB (1723MB), run=60023-60023msec
1M READ: bw=651MiB/s (682MB/s), 651MiB/s-651MiB/s (682MB/s-682MB/s), io=38.1GiB (41.0GB), run=60034-60034msec
4M READ: bw=672MiB/s (704MB/s), 672MiB/s-672MiB/s (704MB/s-704MB/s), io=39.4GiB (42.4GB), run=60147-60147msec

It's hard to tell if this is an actual regression or just normal run-to-run variation; the fact that reads also slowed down makes me suspicious of the latter.

mkeeter commented Jun 10, 2024

Current state: we're hoping to stabilize this interface before merging, e.g. by implementing a syncfs(2)-equivalent syscall. That shouldn't preclude testing of this branch!

mkeeter commented Jul 15, 2024

See https://code.oxide.computer/c/illumos-gate/+/382, which adds syncfs to illumos libc.
