Performance issues with sync=disabled + compress=zstd #16371
My guess would be that most ultrabooks have CPUs with a very low limit on their power usage, so trying to saturate all cores makes them throttle very badly once they exceed whatever threshold they have for bursting over the power limit. I don't think limiting it to fewer cores will help here, because it's probably still going to blow those limits and throttle for a while, if I'm right. (I believe you can ask …) The best suggestion I've got would be to not do that: if I'm right and the problem is spending too long under high CPU load, then the only workaround is limiting how much it does per unit time. You could, one imagines, lower the dirty data limit, which I believe triggers a flush when exceeded no matter what sync setting you've got, but it might be hard to find a good value for it.
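For concreteness, a minimal sketch of lowering the dirty data limit at runtime, assuming a current OpenZFS module; the 1 GiB figure is an arbitrary starting point, not a recommendation:

```sh
# Check the current dirty data cap (bytes).
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# Lower it at runtime; revert or tune further if it hurts throughput.
echo 1073741824 | sudo tee /sys/module/zfs/parameters/zfs_dirty_data_max
```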
Hey @rincebrain, actually the CPU reports a clock speed above the base clock on all 8 cores if I look into atop. Anyway, I don't think my cursor should freeze for minutes just because my CPU speed drops.
Ultrabooks, generally, are designed to try and avoid hitting their thermal limits, because the tiny form factor means you can only do so much so fast after you do. I'd suggest you record what the CPU is doing, exactly, and what the performance counters say about how it's stalling; I suspect you'll find that it's hitting either heat or power limit counters and clocking down, possibly briefly, possibly not. But even if I'm wrong, finding out what it's doing when it stalls out that way seems like the next step. You could also try renicing the write taskqs, if you think it's just that work scheduled in the kernel is winning over userland, but the problem with that is that all the other Linux IO stuff runs at -19, last I checked, so you might get weird priority-inversion outcomes, like userland CPU usage blocking the zstd processing threads. In any case, I've given you three distinct suggestions on things you can investigate/test.
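A rough sketch of how one might follow up on the throttling and renicing suggestions above, assuming an Intel laptop with the usual sysfs layout (turbostat ships with the kernel tools and needs root; the renice caveat from the comment above applies):

```sh
# Watch package power, frequency, and thermal status while the build runs.
sudo turbostat --quiet --interval 5

# Per-core thermal throttle event counters on Intel CPUs.
grep . /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count

# Deprioritize the ZFS write-issue taskq threads (z_wr_iss is the kernel
# thread name); beware of the priority-inversion caveat above.
sudo renice -n 19 -p $(pgrep z_wr_iss)
```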
Could you tell us the exact CPU and laptop model? Some Intel chips have a big+little architecture, and I'm interested in whether that could have this effect. FWIW, on my laptop, a GPD Win Max 2 with an AMD Ryzen 7840U (8 identical Zen 4 cores, 16 threads) and 64 GB RAM, running Debian testing (6.7 kernel with 2.2.4), I use a nearly extreme config: …
And I don't ever have any freezes or stalls in powersave mode. But yes, this particular laptop has a pretty decent heatsink. A year ago I used an older laptop, an HP Aero 13 with a Ryzen 5800U and 16 GB RAM, and the experience was nearly the same. They both can sustain at least a consistent 20 W TDP for the CPU.
Sure. It's not a big+little architecture: an i5-1135G7 @ 2.40GHz in a Samsung Galaxy Book Flex2 5G. I did some compilation yesterday, and I noticed that the ARC basically drops all its memory usage, despite there still being 10 GB of memory free. You can also see that even without sync disabled it runs at a load of 46, which is way too much to keep the system responsive. Most of the CPU time is spent in …. The CPU, however, is still at 2.35 GHz of its 2.4 GHz base clock after an hour of load. So cooling isn't the issue here, I suppose. Btw: how many threads is each of the compression processes using? I mean, zstd is multi-threaded. Maybe each process is using 4 threads or so?
Yeah, but I don't think the idea is that you need 16 threads and 64 GB of memory to run a computer on ZFS without stalling/freezing. Your system might just have enough CPU power and memory not to run into this issue.
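As a side note, a quick way to watch for the ARC collapse described above while a build runs, assuming the standard OpenZFS /proc layout:

```sh
# Print ARC size, target (c), and maximum (c_max) every 10 seconds.
while sleep 10; do
    awk '/^(size|c|c_max) / { printf "%s %.1f GiB  ", $1, $3 / 2^30 } END { print "" }' \
        /proc/spl/kstat/zfs/arcstats
done
```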
Alright. I found a fix. I set these kernel parameters:
load1 stayed around …. The system stayed completely usable, and mind you, I set …. Compilation time went down from ….
Hey @behlendorf, I think you did the taskq scheduling in aa9af22 a couple of years back. Maybe you can take a look at this again. I think there are just too many concurrent compression tasks running, which is bogging down the system. Somehow this leads ZFS to believe that it needs to drop the ARC nearly completely, which increases the IO further, as it's no longer cached. The main difference between 2015 and now is probably that compression tasks are more memory-intensive with zstd, and that we don't use HDDs anymore, so the CPU tasks are the bottleneck now, not the HDD IO, which leads to the CPU tasks piling up. Hope I analyzed that correctly. Btw: how many threads does a single compression task with zstd create? Do they use the common multi-threading settings of …?
For compression ZFS has a number of z_wr_iss threads, covering 80% of CPU cores, controlled by the zio_taskq_batch_pct module parameter. That is usually enough to reach sufficient speed while not blocking the system completely for a few seconds, like you report here. But in the top outputs provided I instead see a bunch of z_rd_int threads, which handle checksumming and decryption on read completion. They should not handle compression/decompression, and it makes me wonder whether, aside from the different compression algorithm, you also changed the checksum algorithm or enabled encryption. A few years ago, before 7457b02 (which was long after the mentioned aa9af22), too many z_rd_int threads combined with a too-expensive checksum and/or encryption could also block the system for a bit, but now they are also limited to 80% of CPU cores, so they can block the system only in combination with something else.
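A small sketch of how to check both points, i.e. which taskq threads exist and whether checksum/encryption differ from the defaults; the pool name rpool is a placeholder:

```sh
# Count the ZFS write-issue and read-interrupt kernel threads.
ps -e -o comm= | grep -E '^z_(wr_iss|rd_int)' | sort | uniq -c

# Show checksum, encryption, and compression across the pool's datasets.
zfs get -r -t filesystem checksum,encryption,compression rpool
```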
So if you think the problem is CPU saturation, then instead of some unmotivated manipulation of the ARC you should try reducing zio_taskq_batch_pct before importing the pool.
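A hedged sketch of how that suggestion would typically be applied; the value 50 is purely illustrative:

```sh
# zio_taskq_batch_pct is read when the zfs module loads, so set it
# persistently and reload the module (or reboot) before the pool is imported.
echo 'options zfs zio_taskq_batch_pct=50' | sudo tee -a /etc/modprobe.d/zfs.conf
```

On a root-on-ZFS system the initramfs usually needs regenerating as well, so the option is already in effect when the module loads at boot.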
@amotin wrote:
I fail to see how it's unmotivated. The ARC usage dropped to just 430 MB, as you can see in the atop screenshot provided, despite there being 10.6 GB of memory available. I first wanted to see how the system behaves once the obvious issue of ZFS not managing the ARC size properly is addressed. But I agree, the different ARC size limits only mask the real issue, which is task saturation from trying to do too much stuff at the same time. @amotin wrote:
Physical cores or logical cores? @amotin wrote:
All my subvolumes use …. @amotin wrote:
Correct me if I'm wrong, but I think these are processes, not threads.
ARC sizing on Linux is a huge can of worms on its own. If you expect strong unexpected memory pressure on your system and need the ARC to cooperate, assuming the kernel versions you use are sane in reporting the pressure (see https://lore.kernel.org/all/[email protected]/T/#u), I'd recommend setting …
Logical.
Thinking again, I may be wrong; decompression may happen in z_rd_int when data is not speculatively prefetched but read on demand. I am just not used to seeing much load there, since lz4 is very fast on decompression, while I am not sure about zstd, and the prefetcher can often handle it, though not always.
A process is typically a group of threads sharing the same address space. Since everything in the kernel shares the kernel address space, it does not matter. Let's call them kernel execution entities, each consuming one CPU core.
I might be wrong about this, but I thought data is always decompressed on demand, as the ARC keeps only the compressed data. At least since 0.7 that should be the case, no?
Rule of thumb: decompression takes about 3 times as many CPU cycles per MB for zstd compared to lz4, and memory usage is also higher. I checked with an 8 MB file (compressed with …).
But without turbo boost I expect the speed to drop to half of that per thread, maybe lower. Still, it should be plenty fast, given that my SSD only delivers 3.4 GB/s sequential, which drops a lot once the blocks get smaller. Going off on a tangent: kinda sad that we can't use dictionaries with zstd; that would massively improve compression and decompression speed, far beyond what lz4 does, while improving the compression ratio even further. :)
Ah okay. I was just wondering if one process might actually be more than one thread, and thus able to use more than one core.
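For anyone wanting to reproduce the zstd-vs-lz4 comparison, both tools ship a built-in benchmark mode; sample.bin is a placeholder for whatever test file you use:

```sh
# Compression and decompression throughput at the given levels.
zstd -b3 sample.bin
lz4 -b1 sample.bin
```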
That #16197 piqued my interest. Is this using the (somewhat) new PSI interface, the one that's also giving out memsome and memfull in atop?
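For reference, the memsome/memfull figures in atop presumably come from the kernel's PSI pressure files, which can also be read directly:

```sh
# "some" = share of time at least one task stalled on memory,
# "full" = share of time all non-idle tasks stalled.
cat /proc/pressure/memory
```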
You are right, data is decompressed on demand. So if the speculative prefetcher issues a read, the data will first land compressed in the ARC, and only some time later may it be decompressed by a demanding user thread, or not at all. But if the read is issued by a user thread, the ARC will get the compressed buffer first, then immediately decompress it into a dbuf, and only then wake up the user thread waiting for the dbuf.
Are we talking about building Chromium or Firefox? Even much smaller build jobs can create a significant I/O burden, esp. if you're compiling with …. I observed something similar years ago on the SSHD I was using back then: setting ….
@RJVB wrote:
Nah, Ladybird is the browser I'm building. @RJVB wrote:
Nope, I'm not:
Full building instructions are here. @RJVB wrote:
I'm not interested in using LZ4. Apart from that, I found that a simple …. I'm using the ….
System information
Describe the problem you're observing
I've dug a bit through the open issues here, but this seems not to be described yet.
I run ArchLinux (in its CachyOS variant) on a ZFS root, on a single pool backed by a fast 1 TB NVMe drive.
The computer itself is an average ultrabook from Samsung with an 11th-gen Intel CPU/GPU and 16 GB of memory.
CachyOS comes pre-configured to run a compressed swap in memory, and overall I don't see any memory pressure in daily operation, meaning I have some GBs in the ARC and still 2-3 GB free.
The system feels extremely snappy and powerful, regardless of what I wanna do.
However, this all changes when I do heavy operations on a subvolume that is configured not to sync its write operations.
The rationale behind using no sync is to run latency-sensitive operations more like on a RAM disk than at NVMe speed.
I currently have e.g. ~/.cache and /tmp configured this way.
So applications can hand their data off directly to memory and return, while the NVMe drive and the compression tasks in ZFS catch up.
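For illustration, the per-dataset setup described above would look roughly like this; the dataset names are placeholders, since the actual pool layout isn't shown here:

```sh
# Disable synchronous semantics and enable zstd on the latency-sensitive datasets.
zfs set sync=disabled compression=zstd rpool/tmp
zfs set sync=disabled compression=zstd rpool/home/user/.cache
```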
However, there seems to be an issue with the amount of work or data ZFS accepts in total, or tries to finish in concurrent tasks:
Once I start compiling a larger program - in this case a browser - all the compression and write operations seem to bog down the system completely. My cursor froze for a minute or two on several occasions, and the system might even run into a kernel panic.
That's quite weird, because if I set all subvolumes back to sync=standard this issue goes away.
My suspicion here is a combination of memory clogging, due to a high number of accepted write operations that still need to be finished, and the fact that ZFS appears to use 8 threads/processes on a processor with 4 physical cores to get rid of the data by compressing it in parallel tasks.
If I'm right, the fix may be to reduce the maximum amount of compression work ZFS can do, by limiting it to just 75% of the physical cores, as well as putting a tighter limit on how much data is accepted for async writes at a time.
I'm not sure how the limit for accepting async reads/writes currently works, but I think a latency target instead of an amount of data would work nicely for a large variety of systems.
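For reference, the current write throttle is expressed in bytes and percentages rather than a latency target; assuming a recent OpenZFS module, the relevant knobs can be inspected like this:

```sh
# Absolute dirty-data cap, its percent-of-RAM variant, and the point at
# which ZFS starts delaying writers.
grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
       /sys/module/zfs/parameters/zfs_dirty_data_max_percent \
       /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
```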
Describe how to reproduce the problem
I hope this is reproducible on other systems as well by just using ZFS as root and doing heavy write operations with sync=disabled and compression=zstd.
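A rough, untested repro sketch along those lines; the dataset name, mountpoint, and sizes are arbitrary placeholders, and the parallel compressible writes merely stand in for the compile job:

```sh
# Throwaway dataset with the same properties as in the report.
sudo zfs create -o sync=disabled -o compression=zstd rpool/scratch
sudo chown "$USER" /rpool/scratch

# One compressible writer per CPU thread (base64 output compresses well
# enough to keep zstd busy, unlike raw /dev/urandom).
for i in $(seq 1 "$(nproc)"); do
    ( base64 /dev/urandom | head -c 2G > /rpool/scratch/blob.$i ) &
done
wait

# Clean up afterwards:
# sudo zfs destroy rpool/scratch
```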
Include any warning/errors/backtraces from the system logs
Sadly, the kernel panics haven't produced any backtraces, as ZFS seems to no longer accept new data and there's no output on the screen.