
Do explicit readahead and don't pull the entire overflow file into the page cache #52

Open
nemanja-boric-sociomantic opened this issue May 28, 2018 · 7 comments


@nemanja-boric-sociomantic

By default, the kernel fills unused RAM by bringing parts of open files into the page cache. This is very problematic for the DMQ node, where the overflow file can grow to several dozen gigabytes: caching the file puts enormous pressure on the system, and it's not clear the file will even be needed in the future (imagine there are no readers and the file just grows). Even when there are readers, they read the file in something close to sequential fashion (multiple channels being multiplexed into a single file complicates this a bit, but not too much), so there's no need to cache the entire file.

Linux provides posix_fadvise, with three flags of interest:

       Under Linux, POSIX_FADV_NORMAL sets the readahead window to the
       default size for the backing device; POSIX_FADV_SEQUENTIAL doubles
       this size, and POSIX_FADV_RANDOM disables file readahead entirely.
       These changes affect the entire file, not just the specified region
       (but other open file handles to the same file are unaffected).

We could leverage POSIX_FADV_DONTNEED, but the issue is that we either mark the entire
file as DONTNEED (which is then the same as POSIX_FADV_RANDOM), or we wait for the users
to read the file so we can drop the first parts, which is redundant since we're already
truncating the file from the beginning; and it also doesn't help us much when there are no readers and there's not much to drop.

What we could do is disable automatic readahead entirely by advising the kernel with POSIX_FADV_RANDOM and then do the readahead manually (issue readahead in windows of some preconfigured size, which should be tweakable via a knob in the config file). We should also make sure the kernel drops the new pages as soon as they are flushed out by the writeback daemon (POSIX_FADV_RANDOM probably won't help there, since it should only affect prefetching; if so, see whether POSIX_FADV_DONTNEED can be applied to the new pages).

This way the reading of the file is still performed by the kernel, not blocking the application, but in a controlled manner.

@federico-cuello-sociomantic

What about using mmap() instead? I think it should be much better for large files.

@nemanja-boric-sociomantic

The reason is that we already have a good, battle-tested streaming implementation of the overflow file, and changing it to array-like access has a relatively large cost. Other than that, according to Linus and backed by my experience (from the limited tests I did), if you're doing sequential access you won't be better off with mmap: it uses exactly the same kernel machinery and just presents the page cache through a different interface, and for purely sequential access you're more likely to be better off with read/write.

@nemanja-boric-sociomantic

You may be right about the difference in how the automatic prefetch is done, but then we'd still need to do a manual prefetch (and this time probably with posix_madvise).

@david-eckardt-sociomantic

according to Linus

Do you have a link to his statement?

@nemanja-boric-sociomantic

I saw it ages ago, let me dig it up.

@nemanja-boric-sociomantic

http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0728.html

Looking at the first upside he lists, it's completely void for us, since: a) we go through the same regions of the file at most a few times; b) there's not much logic to avoid, as we don't do any non-sequential access: pread/write is all we do with the overflow file.

The other upside is exactly what we need (the memory is not prefetched automatically, and pages are dropped when you don't need them), but "playing games with the virtual memory mapping is very expensive" (because "page faulting is expensive. That's how the mapping gets populated, and it's quite slow"), and we definitely need to prefetch in windows: we can't afford page faults blocking the entire DMQ process while it reads the file from disk.

@nemanja-boric-sociomantic
Copy link
Contributor Author

and a follow-up: http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0775.html

       memcpy() (ie "read()" in this case) is always going to be faster in many
       cases, just because it avoids all the extra complexity. While mmap() is
       going to be faster in other cases.
