
Improve chainbase mapped and heap behavior #1691

Merged: 23 commits into main, Oct 5, 2023
Conversation

@greg7mdp (Contributor) commented Sep 28, 2023

Resolves #1650.

Includes the not-yet-merged chainbase PR #21.

  • Issue with heap mode: at Leap startup and exit, we perform a memcpy between two mappings of the full database size. This causes significant pressure on the file cache when physical RAM is less than 2x the database size.

    => Solution: use a series of smaller mappings for the copy, reducing RAM contention.

  • Issue with mapped mode: while Leap is running, Linux's default dirty-writeback algorithm will grind the disk into dust, and it's a perf killer as well.

    => Solution: map the file with MAP_PRIVATE (so changes are not written back to the file), and on exit copy just the modified pages using a series of RW mappings (information about which pages are dirty is gathered using the pagemap interface). Should we still set the dirty bit in the db file at startup?

    This new implementation of mapped mode, like the existing heap and locked modes, does not allow sharing a database opened in RW mode with other instances in RO mode (a capability not used in Leap, afaik). To avoid removing the sharing functionality entirely (a test for it exists in chainbase's test.cpp), a new mapped_shared mode is introduced, which behaves the same as the old mapped mode.
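The "series of smaller mappings" copy can be sketched as below. This is a minimal illustration of the technique, not the actual chainbase code; the function name, default chunk size, and read-only direction are assumptions:

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstring>
#include <stdexcept>
#include <vector>

// Copy a file into a heap buffer using a series of small mmap() windows
// instead of one mapping of the whole file, so only one window's worth of
// pages is resident in the file cache at a time.
// `chunk` must be a multiple of the page size.
std::vector<char> chunked_copy(const char* path, size_t chunk = 1u << 20) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        throw std::runtime_error("open failed");
    off_t size = ::lseek(fd, 0, SEEK_END);
    std::vector<char> out(static_cast<size_t>(size));
    for (off_t off = 0; off < size; off += static_cast<off_t>(chunk)) {
        size_t len = static_cast<size_t>(
            std::min<off_t>(static_cast<off_t>(chunk), size - off));
        void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            ::close(fd);
            throw std::runtime_error("mmap failed");
        }
        std::memcpy(out.data() + static_cast<size_t>(off), p, len);
        ::munmap(p, len); // release this window before mapping the next one
    }
    ::close(fd);
    return out;
}
```

Because each window is unmapped before the next is created, at most `chunk` bytes of the file occupy the page cache on behalf of the copy at any moment, instead of the full database size.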

Also:

  • update chainbase's minimum C++ standard requirement to C++20
  • update chainbase's CI/CD platform from debian:buster to debian:bullseye to get C++20 support (std::span)

Performance test results (Github Actions)

Running:

./tests/PerformanceHarnessScenarioRunner.py findMax testBpOpMode --max-tps-to-test 80000 --test-iteration-min-step 10 --test-iteration-duration-sec 10 --final-iterations-duration-sec 10 --calc-chain-threads lmax overrideBasicTestConfig -v --tps-limit-per-generator 5000
              main branch   gh_1650 branch
resultAvgTps  13,760        14,104

Performance test results (Run locally)

Running:

./tests/PerformanceHarnessScenarioRunner.py findMax testBpOpMode --max-tps-to-test 50000 --test-iteration-min-step 10 --test-iteration-duration-sec 10 --final-iterations-duration-sec 10 --calc-chain-threads lmax overrideBasicTestConfig -v --tps-limit-per-generator 5000 > /tmp/log 2>&1 
              main branch   gh_1650 branch
resultAvgTps  29,271        30,091

@greg7mdp greg7mdp marked this pull request as draft September 28, 2023 15:20
@greg7mdp greg7mdp marked this pull request as ready for review September 29, 2023 16:52
@spoonincode (Member) left a comment:

I haven't done an in-depth review yet, just some initial comments.

We should mix in the new mode as part of db-modes-test; edit: just realized we aren't exposing it as an option... so that might not work

New mapped mode is only available on Linux: it really shouldn't be available as an option on other platforms, and certainly shouldn't be the default on other platforms.

But wait, there's more to the above.. it seems not all CPU platforms support this soft-dirty feature on Linux. Shockingly, as far as I can tell, ARM8 does not because I do not see the CONFIG_HAVE_ARCH_SOFT_DIRTY set. ARM8 is major enough that we shouldn't break it, so if we can't find a run time way to detect a working default, maybe we need to ifdef this just to x86 for now.

This new mapped mode will largely interfere with users running state on tmpfs (from the standpoint of needlessly burning more memory). I am not sure what we should do to prevent this. Prevent starting in mapped if state is discovered on tmpfs? Silently treat mapped as mapped_shared if state is discovered on tmpfs?

// when loading a snapshot, all the state will be modified, so use the `shared` mode instead
// of `copy_on_write` to lower memory requirements
if (snapshot_path && chain_config->db_map_mode == pinnable_mapped_file::mapped)
   chain_config->db_map_mode = pinnable_mapped_file::mapped_shared;
A Member commented:

It would be nice to maintain the new mapped mode for loading snapshots, because I think it would be a huge perf boost (I've seen comments from some users that it takes 30 minutes to load a WAX snapshot, but it takes me less than 5 minutes in heap mode... I'm pretty sure it's the disk grinding for those users).

If we're worried about leaving 100% dirty pages after loading a snapshot, maybe one option is that chainbase could expose a flush() call that (synchronously) performs the write-out so that all the pages are clean, and nodeos calls that after loading the snapshot but before continuing.
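A synchronous flush of a shared mapping, as suggested, could look roughly like this (a sketch assuming a MAP_SHARED mapping; `flush_mapping` is a hypothetical name, not an actual chainbase API):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <stdexcept>

// For a MAP_SHARED mapping, msync(MS_SYNC) synchronously writes the dirty
// pages in [base, base+len) back to the backing file; once it returns, the
// kernel has no dirty pages left for that range.
void flush_mapping(void* base, std::size_t len) {
    if (::msync(base, len, MS_SYNC) != 0)
        throw std::runtime_error("msync failed");
}
```

Calling something like this once after the snapshot finishes loading would leave the mapping clean before normal block processing resumes.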

@greg7mdp (Contributor, Author) commented Sep 30, 2023:

> It would be nice to maintain the new mapped mode for loading snapshots because I think it would be a huge perf boost (I've seen comments from some users it takes 30 minutes to load a WAX snapshot, but it takes me less than 5 minutes in heap mode... I'm pretty sure it's disk grinding for those users).

This would be useful when you have just enough RAM to hold all the state, but not quite enough for heap mode. I'm a little concerned that we may get more crashes this way. Maybe I can detect the available RAM and, based on the configured db size, decide whether it makes sense to use the new mapped mode (for example, go for it if RAM_size > 1.1 x chain-state-db-size-mb).
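The RAM_size > 1.1 x chain-state-db-size-mb heuristic could be sketched like this (the helper name and the exact factor are illustrative assumptions, not nodeos code):

```cpp
#include <unistd.h>
#include <cstdint>

// Return true if physical RAM exceeds ~1.1x the configured chainbase size,
// i.e. the heuristic under which the copy_on_write mapped mode might be
// considered safe for snapshot loading.
bool enough_ram_for_mapped_mode(std::uint64_t db_size_bytes) {
    std::uint64_t phys_ram =
        static_cast<std::uint64_t>(sysconf(_SC_PHYS_PAGES)) *
        static_cast<std::uint64_t>(sysconf(_SC_PAGESIZE));
    // db_size + db_size/10 == 1.1 * db_size, using integer arithmetic
    return phys_ram > db_size_bytes + db_size_bytes / 10;
}
```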

> If we're worried about leaving 100% dirty pages after loading a snapshot, maybe one option is chainbase could expose a flush() call that (synchronously) performs the write out so that all the pages are clean, and nodeos calls that after loading the snapshot but before continuing.

Yes, that's a great idea. We can still use the soft-dirty tracking as well: if the state-db-size is configured much greater than the actual db size used, it will make the flush faster.

A Member commented:

Should add an ilog or maybe even a wlog that the mode was changed from specified.

@greg7mdp (Contributor, Author) replied:

Upon reflection, I think the best compromise when loading a snapshot is:

  • Load the snapshot in mapped_shared mode. Yes, it is probably slower, but it minimizes the odds of running out of memory. With the new mapped mode, we would need memory for both the data currently read from the snapshot and the full chainbase db.
  • When the snapshot is done loading, use a new API, as suggested by Matt, which flushes the still-dirty pages to disk and restarts with a new copy_on_write mapping.

Do you guys agree?

@greg7mdp (Contributor, Author) replied:

I just checked in the implementation of the above.

@greg7mdp (Contributor, Author) replied:

@heifner I updated the change to mapped_shared to be temporary (just while loading the snapshot), so I don't think an ilog is necessary, but I can add it if you still think it might be useful.

A Member commented:

No, not needed if it honors the configuration after snapshot load.

@greg7mdp (Contributor, Author) commented Sep 30, 2023:

> We should mix in the new mode as part of db-modes-test; edit: just realized we aren't exposing it as an option... so that might not work

Yes, makes sense, will do. Actually the new mapped mode is tested, but I should add mapped_shared, which is the old mapped mode.

[update] I misunderstood. I added the mapped_shared mode in chainbase's grow_shrink.cpp test. I'll look into db_modes_test.sh.

> New mapped mode is only available on Linux: it really shouldn't be available as an option on other platforms, and certainly shouldn't be the default on other platforms. ... maybe we need to ifdef this just to x86 for now.

Yes, my bad, I forgot that. Thanks for catching it.
I'm not sure whether it would be best to use the shared mode (with disk grinding) on platforms that don't support the pagemap/soft-dirty interface, or to still use the copy_on_write mode with the full save at exit (which will be slow, like the heap mode, of course).

> This new mapped mode will largely interfere with users running state on tmpfs (from the standpoint of needlessly burning more memory). I am not sure what we should do to prevent this. Prevent starting in mapped if state is discovered on tmpfs? Silently treat mapped as mapped_shared if state is discovered on tmpfs?

We could do one of your suggestions, preferably refuse to start and suggest not putting the state on tmpfs.
I'm not sure I understand tmpfs fully, but I would think having the state on tmpfs is not very attractive anymore, as it would require the whole specified chain-state-db-size-mb to be in RAM + swap, not just the modified state pages.
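For reference, the pagemap/soft-dirty interface discussed above works by reading 64-bit entries from /proc/self/pagemap, where bit 63 indicates the page is present and bit 55 is the soft-dirty flag (cleared by writing "4" to /proc/self/clear_refs). A minimal Linux-only sketch, with illustrative function names:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <stdexcept>

// Read the /proc/self/pagemap entry for a virtual address. Each entry is
// 64 bits; see Documentation/admin-guide/mm/pagemap.rst and soft-dirty.rst.
uint64_t pagemap_entry(const void* addr) {
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = ::open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        throw std::runtime_error("open pagemap failed");
    // one 8-byte entry per virtual page, indexed by page number
    off_t off = static_cast<off_t>(
        reinterpret_cast<uintptr_t>(addr) / page_size * sizeof(uint64_t));
    uint64_t entry = 0;
    if (::pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) {
        ::close(fd);
        throw std::runtime_error("pread pagemap failed");
    }
    ::close(fd);
    return entry;
}

bool page_present(const void* addr)    { return (pagemap_entry(addr) >> 63) & 1; }
bool page_soft_dirty(const void* addr) { return (pagemap_entry(addr) >> 55) & 1; }
```

Scanning these entries at exit is what lets copy_on_write mapped mode write back only the modified pages.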

- test both `mapped` and `mapped_shared` modes
- don't try to use the `pagemap` feature on platforms where it is not available
@heifner (Member) commented Oct 2, 2023:

> Should we still set the dirty bit in the db file at startup?

I think so. If state is not flushed then it will not match the blocklog, fork-database, code-cache, and any future state we may have.

@greg7mdp (Contributor, Author) commented Oct 2, 2023:

> Should we still set the dirty bit in the db file at startup?

> I think so. If state is not flushed then it will not match the blocklog, fork-database, code-cache, and any future state we may have.

Yes, that's also what I thought, and what the code currently does.

@greg7mdp (Contributor, Author) commented Oct 2, 2023:

> This new mapped mode will largely interfere with users running state on tmpfs (from the standpoint of needlessly burning more memory).

@spoonincode I think this is true for heap and locked mode as well, so I'll also prevent starting in these modes when the file is on a tmpfs file system.
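Detecting that the state file lives on tmpfs can be done with statfs(); a minimal sketch (the helper name is illustrative, and TMPFS_MAGIC is hardcoded from linux/magic.h to keep the example self-contained):

```cpp
#include <sys/vfs.h>
#include <stdexcept>

// Filesystem magic number for tmpfs, as defined in <linux/magic.h>.
constexpr long kTmpfsMagic = 0x01021994;

// Return true if `path` resides on a tmpfs filesystem; a node could use
// such a check to refuse to start in mapped/heap/locked mode.
bool is_on_tmpfs(const char* path) {
    struct statfs sfs;
    if (::statfs(path, &sfs) != 0)
        throw std::runtime_error("statfs failed");
    return sfs.f_type == kTmpfsMagic;
}
```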

const auto guard = my->conf.state_guard_size;
EOS_ASSERT(free >= guard, database_guard_exception, "database free: ${f}, guard size: ${g}", ("f", free)("g",guard));

// give chainbase a chance to write some pages to disk if memory becomes scarce.
if (auto flushed_pages = mutable_db().check_memory_and_flush_if_needed()) {
A Member commented:

I think this call is only safe during the write window (at least, if it's actually clearing dirty bits). Are we sure validate_db_available_size() is only called during the write window? This one I'm not sure about due to ROtrx.

transaction_trace_ptr controller::push_transaction( const transaction_metadata_ptr& trx,
fc::time_point block_deadline, fc::microseconds max_transaction_time,
uint32_t billed_cpu_time_us, bool explicit_billed_cpu_time,
int64_t subjective_cpu_bill_us ) {
validate_db_available_size();
EOS_ASSERT( get_read_mode() != db_read_mode::IRREVERSIBLE, transaction_type_exception, "push transaction not allowed in irreversible mode" );
EOS_ASSERT( trx && !trx->implicit() && !trx->scheduled(), transaction_type_exception, "Implicit/Scheduled transaction not allowed" );
return my->push_transaction(trx, block_deadline, max_transaction_time, billed_cpu_time_us, explicit_billed_cpu_time, subjective_cpu_bill_us );
}

A Member commented:

Good catch. This is called from multiple read-only threads for read-only trx execution.

@greg7mdp (Contributor, Author) replied:

Fixed.

@@ -80,6 +80,8 @@ int snapshot_actions::run_subcommand() {
cfg.state_size = opt->db_size * 1024 * 1024;
cfg.state_guard_size = opt->guard_size * 1024 * 1024;
cfg.eosvmoc_tierup = wasm_interface::vm_oc_enable::oc_none; // wasm not used, no use to fire up oc

cfg.db_map_mode = pinnable_mapped_file::map_mode::mapped;
A Member commented:

I don't think we need this line any more.

@greg7mdp (Contributor, Author) replied:

Right, that's the default. Removed.

@spoonincode (Member) commented:

We should document this new mode in --help:

("database-map-mode", bpo::value<chainbase::pinnable_mapped_file::map_mode>()->default_value(chainbase::pinnable_mapped_file::map_mode::mapped),
"Database map mode (\"mapped\", \"heap\", or \"locked\").\n"
"In \"mapped\" mode database is memory mapped as a file.\n"
#ifndef _WIN32
"In \"heap\" mode database is preloaded in to swappable memory and will use huge pages if available.\n"
"In \"locked\" mode database is preloaded, locked in to memory, and will use huge pages if available.\n"
#endif
)

@greg7mdp (Contributor, Author) commented Oct 4, 2023:

> We should document this new mode in --help.

Sure will do after my current meeting.

@greg7mdp (Contributor, Author) commented Oct 4, 2023:

> We should document this new mode in --help.

> Sure will do after my current meeting.

Done.

if (is_write_window()) {
if (auto flushed_pages = mutable_db().check_memory_and_flush_if_needed()) {
ilog("CHAINBASE: flushed ${p} pages to disk to decrease memory pressure", ("p", flushed_pages));
}
A Member commented:

I'm not really sure why we'd leave this in, since it's `#if 0`ed on the other side. Kind of deceptive.

@greg7mdp (Contributor, Author) replied:

It will do something in the next PR, so the deception will not be long-lasting.

@greg7mdp merged commit 2a0e9c9 into main on Oct 5, 2023; 21 checks passed.
@greg7mdp deleted the gh_1650 branch October 5, 2023 04:24.