[Feature] Unify cache instance between DLA and cloud native datacache. #52793

Merged 5 commits on Nov 18, 2024


@GavinMar GavinMar commented Nov 11, 2024

Why I'm doing:

Currently, the DLA datacache and the cloud native datacache are both built on the underlying starcache library, but they run as two separate instances. This leads to:

  1. Increased configuration complexity. Users must reserve hardware resources such as disks for two caches and configure the parameters of both cache instances separately before they can use them properly.
  2. Low resource utilization. Resources such as disks cannot be scheduled globally, which easily leads to waste.
  3. Redundant external configuration items and metrics.
  4. A lot of duplicated development logic.

What I'm doing:

  • Unify the cache instance between DLA and cloud native datacache, including the cache instance itself, its configurations, and its metrics.
  • Support the SLRU cache eviction policy.
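For readers unfamiliar with SLRU (segmented LRU), the following is a minimal, generic sketch of the policy. It is not the starcache implementation, which this PR does not show; the class name `SlruCache`, the two capacity parameters, and the string/int key and value types are all assumptions made for illustration. New keys enter a probationary segment, and a hit promotes the key to a protected segment, so one-off scans cannot push out frequently used entries.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified SLRU cache. Each segment is an LRU list with the
// most recently used entry at the front.
class SlruCache {
public:
    SlruCache(std::size_t probation_cap, std::size_t protected_cap)
            : _probation_cap(probation_cap), _protected_cap(protected_cap) {
        assert(probation_cap > 0 && protected_cap > 0);
    }

    // Insert a new key into the probationary segment, or update it in place.
    void put(const std::string& key, int value) {
        auto it = _index.find(key);
        if (it != _index.end()) {
            it->second.pos->second = value;
            return;
        }
        _probation.emplace_front(key, value);
        _index.emplace(key, Slot{false, _probation.begin()});
        evict_probation_overflow();
    }

    // A hit in the probationary segment promotes the entry to the protected
    // segment; a hit in the protected segment refreshes its recency.
    std::optional<int> get(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) return std::nullopt;
        Slot& slot = it->second;
        List& from = slot.is_protected ? _protected : _probation;
        _protected.splice(_protected.begin(), from, slot.pos);
        slot.is_protected = true;
        if (_protected.size() > _protected_cap) {
            // Demote the least recently used protected entry to probation.
            auto victim = std::prev(_protected.end());
            _index.at(victim->first).is_protected = false;
            _probation.splice(_probation.begin(), _protected, victim);
            evict_probation_overflow();
        }
        return slot.pos->second;
    }

    bool contains(const std::string& key) const { return _index.count(key) > 0; }

private:
    using Entry = std::pair<std::string, int>;
    using List = std::list<Entry>;
    struct Slot {
        bool is_protected;
        List::iterator pos;
    };

    // Only the probationary segment ever drops entries from the cache.
    void evict_probation_overflow() {
        while (_probation.size() > _probation_cap) {
            _index.erase(_probation.back().first);
            _probation.pop_back();
        }
    }

    std::size_t _probation_cap;
    std::size_t _protected_cap;
    List _probation;
    List _protected;
    std::unordered_map<std::string, Slot> _index;
};
```

In this sketch, an entry that has been read at least once survives a burst of new insertions, because eviction only removes entries from the tail of the probationary segment.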

Fixes #52940

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

}
LOG(INFO) << process_name << " start step " << start_step++ << ": staros worker init successfully";
#endif

// set up thrift client before providing any service to the external
// because these services may use thrift client, for example, stream
// load will send thrift rpc to FE after http server is started

The riskiest bug in this code is:
A failure of std::filesystem::remove_all() may be handled improperly because ec is not checked immediately after the call. If the directory was not removed successfully, the subsequent rename can fail and cause unexpected behavior.

You can modify the code like this:

#ifdef USE_STAROS
std::filesystem::path starlet_cache_path(root_path.path + "/starlet_cache");
if (std::filesystem::exists(starlet_cache_path)) {
    if (DiskInfo::disk_id(starlet_cache_path.c_str()) != DiskInfo::disk_id(datacache_path.c_str())) {
        LOG(ERROR) << "The datacache directory and the old starlet_cache directory cannot be located on different disks. "
                   << "Please manually mount the datacache to the same disk as starlet_cache and then restart again";
        return Status::InternalError("The datacache directory is different with old starlet_cache directory");
    }
    std::error_code ec;
    std::filesystem::remove_all(datacache_path, ec);
    if (ec) {
        LOG(ERROR) << "Fail to remove existing datacache directory: " << ec.message();
        return Status::InternalError("Fail to remove existing datacache directory");
    }
    std::filesystem::rename(starlet_cache_path, datacache_path, ec);
    if (ec) {
        LOG(ERROR) << "Fail to rename old starlet_cache directory to datacache, reason: " << ec.message();
        return Status::InternalError("Fail to handle the old starlet_cache data");
    }
}
#endif

// to optimize the io performance and reduce disk waste.
// Set the parameter to `0` will turn off this optimization.
CONF_Int64(datacache_inline_item_count_limit, "130172");


// The following configurations will be deprecated, and we use the `datacache` prefix instead.
// But it is temporarily necessary to keep them for a period of time to be compatible with

The riskiest change in this code is:
Changing CONF_mBool(datacache_auto_adjust_enable, "false") to CONF_mBool(datacache_auto_adjust_enable, "true") without ensuring system stability.

You can modify the code like this:

- CONF_mBool(datacache_auto_adjust_enable, "true");
+ // Ensure system can handle auto adjustment before enabling it
+ bool isSystemStable = checkSystemStability(); // Pseudo function for illustration
+ CONF_mBool(datacache_auto_adjust_enable, isSystemStable ? "true" : "false");

Enabling automatic cache adjustments without verifying system conditions might risk destabilizing systems that are not prepared for dynamic changes in resource allocation.

std::filesystem::rename(starlet_cache_path, datacache_path, ec);
}
if (ec) {
LOG(ERROR) << "Fail to rename old starlet_cache directory to datacache, reason: " << ec.message();
Contributor

Print the specific source and destination directories in this log message.

Contributor Author

> print the specific directory for src and dst

done
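As a rough illustration of the two points raised in this thread (check `ec` after every filesystem call, and name both directories in the error), here is a standalone sketch. The helper name `replace_dir` and its string return convention are assumptions for this example only; the PR itself uses `LOG(ERROR)` and `Status::InternalError`.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: removes `dst`, then renames `src` to `dst`.
// Returns an empty string on success, otherwise a message that names the
// directories involved so the failure can be diagnosed from the log alone.
std::string replace_dir(const fs::path& src, const fs::path& dst) {
    assert(!src.empty() && !dst.empty());
    std::error_code ec;
    fs::remove_all(dst, ec);  // a missing dst is not an error
    if (ec) {
        return "Fail to remove existing directory " + dst.string() + ", reason: " + ec.message();
    }
    fs::rename(src, dst, ec);
    if (ec) {
        return "Fail to rename " + src.string() + " to " + dst.string() + ", reason: " + ec.message();
    }
    return {};
}
```

Note that the removal is not rolled back if the rename fails, which matches the order of operations in the code under review.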

if (DiskInfo::disk_id(starlet_cache_path.c_str()) != DiskInfo::disk_id(datacache_path.c_str())) {
LOG(ERROR) << "The datacache directory and the old starlet_cache directory cannot be located on different disks. "
<< "Please manually mount the datacache to the same disk as starlet_cache and then restart again";
return Status::InternalError("The datacache directory is different with old starlet_cache directory");
Contributor

Maybe do not return an error here and instead set datacache_unified_instance to false, so that the user can still start the BE?

Contributor Author

@GavinMar GavinMar Nov 15, 2024

> may be do not return error and set datacache_unified_instance to false, so that user still can start be?

This configuration item is mainly intended as a fallback for when the unified instance misbehaves: users can set it to start separate instances for fault tolerance. In this situation, it is usually better for users to be informed and make the choice themselves. Automatically switching to independent mode may cause unexpected problems because of resource usage and other factors.
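Some background on why the same-disk check matters: POSIX `rename` fails with `EXDEV` when the source and destination are on different mounted filesystems, so the old directory cannot simply be renamed across disks. The sketch below shows one way to compare the devices that hold two paths, using `stat` and `st_dev`. It is not the implementation of `DiskInfo::disk_id`, which is a StarRocks helper not shown in this PR; the function names `device_of` and `on_same_device` are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <sys/stat.h>

// Returns the id of the device that holds `path`, or nullopt if stat() fails
// (for example, because the path does not exist).
std::optional<dev_t> device_of(const std::string& path) {
    struct stat st {};
    if (::stat(path.c_str(), &st) != 0) {
        return std::nullopt;
    }
    return st.st_dev;
}

// True only if both paths can be inspected and live on the same device.
bool on_same_device(const std::string& a, const std::string& b) {
    assert(!a.empty() && !b.empty());
    std::optional<dev_t> da = device_of(a);
    std::optional<dev_t> db = device_of(b);
    return da.has_value() && db.has_value() && *da == *db;
}
```

Treating an unreadable path as "not on the same device" is a conservative choice for this sketch: it makes the caller stop and report rather than attempt a rename that may fail.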

starlet_cache_percent, -1);
disk_size = std::max(disk_size, starlet_cache_size);
}
#endif
Contributor

Print a log here so that we know what percentage we are actually using.

Contributor Author

> print a log here so that we know how much percent we are actually using

The actual disk size is printed when the block cache is initialized.
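To make the exchange above concrete, here is a small sketch of the kind of calculation being discussed: take the larger of the configured datacache size and the size implied by the old starlet cache percentage, then report the effective percentage. The diff fragment is truncated, so the function names, parameters, and the percent-to-bytes conversion below are assumptions and not the PR's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical: the effective quota is the larger of the configured datacache
// size and the share of the disk previously given to the starlet cache.
int64_t effective_cache_bytes(int64_t disk_capacity, int64_t datacache_bytes, int starlet_percent) {
    assert(disk_capacity > 0 && starlet_percent >= 0 && starlet_percent <= 100);
    int64_t starlet_bytes = disk_capacity / 100 * starlet_percent;
    return std::max(datacache_bytes, starlet_bytes);
}

// Builds a log line that states the quota in bytes and as a whole-number
// percentage of the disk, rounded down.
std::string describe_quota(int64_t disk_capacity, int64_t quota) {
    assert(disk_capacity > 0 && quota >= 0);
    int64_t percent = quota / (disk_capacity / 100 > 0 ? disk_capacity / 100 : 1);
    return "datacache disk quota: " + std::to_string(quota) + " bytes (" + std::to_string(percent) +
           "% of disk)";
}
```

Logging the resolved value, rather than the two inputs, answers the reviewer's question directly, whichever place in the code ends up emitting the line.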

@GavinMar GavinMar force-pushed the unified_cache_intance branch 2 times, most recently from 19d3c2b to ae7944a, on November 14, 2024 04:34
@GavinMar GavinMar changed the title from "[WIP][Feature] Unified cache instance between DLA and cloud native datacache." to "[Feature] Unified cache instance between DLA and cloud native datacache." on Nov 14, 2024
@andyziye andyziye linked an issue Nov 15, 2024 that may be closed by this pull request
@GavinMar GavinMar changed the title from "[Feature] Unified cache instance between DLA and cloud native datacache." to "[Feature] Unify cache instance between DLA and cloud native datacache." on Nov 15, 2024
kevincai previously approved these changes Nov 15, 2024

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 78 / 91 (85.71%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 be/src/common/daemon.cpp | 0 | 2 | 00.00% | [149, 150] |
| 🔵 be/src/block_cache/disk_space_monitor.cpp | 3 | 4 | 75.00% | [259] |
| 🔵 be/src/service/service_be/starrocks_be.cpp | 23 | 30 | 76.67% | [110, 111, 121, 122, 123, 131, 230] |
| 🔵 be/src/block_cache/datacache_utils.cpp | 15 | 18 | 83.33% | [165, 166, 167] |
| 🔵 be/src/block_cache/block_cache.cpp | 6 | 6 | 100.00% | [] |
| 🔵 be/src/service/staros_worker.cpp | 3 | 3 | 100.00% | [] |
| 🔵 be/src/io/cache_input_stream.cpp | 1 | 1 | 100.00% | [] |
| 🔵 be/src/block_cache/starcache_wrapper.cpp | 14 | 14 | 100.00% | [] |
| 🔵 be/src/block_cache/starcache_wrapper.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/block_cache/block_cache.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/io/cache_input_stream.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/http/action/update_config_action.cpp | 9 | 9 | 100.00% | [] |
| 🔵 be/src/io/cache_select_input_stream.hpp | 1 | 1 | 100.00% | [] |

@luohaha luohaha merged commit b987c8c into StarRocks:main Nov 18, 2024
43 of 44 checks passed
Smith-Cruise pushed a commit to Smith-Cruise/starrocks that referenced this pull request Nov 26, 2024
Successfully merging this pull request may close these issues.

Unify cache instance between DLA and cloud native datacache
5 participants