feat: set THP_DISABLE=true in shim, and restore it before starting runc #195

Merged 1 commit on Feb 20, 2024

zzzzzzzzzy9
Contributor

If /sys/kernel/mm/transparent_hugepage/enabled is set to always, the shim process uses transparent huge pages (THP), which noticeably increases its memory usage.

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
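
A minimal sketch of reading that setting programmatically, assuming only the sysfs format shown above, where the active mode is the bracketed word. The function name `active_thp_mode` is hypothetical and not part of this PR.

```rust
/// Returns the active mode from the contents of
/// /sys/kernel/mm/transparent_hugepage/enabled, e.g. "[always] madvise never".
/// The active mode is the word wrapped in square brackets.
fn active_thp_mode(contents: &str) -> Option<&str> {
    contents
        .split_whitespace()
        .find_map(|word| word.strip_prefix('[')?.strip_suffix(']'))
}
```

The input could come from `std::fs::read_to_string("/sys/kernel/mm/transparent_hugepage/enabled")`; for the output above the function yields `Some("always")`.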

With THP set to always, the shim looks like this:

ps -efo pid,rss,comm | grep shim
    PID   RSS COMMAND
   2614  7464 containerd-shim

cat /proc/2614/smaps | grep -i hugepages
AnonHugePages:      2048 kB
...
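
The same check can be done in code. This is a sketch that assumes the smaps text format shown above, with one `AnonHugePages:` line per mapping; the function name `anon_huge_pages_kb` is hypothetical.

```rust
/// Sums every "AnonHugePages: <n> kB" line found in the text of
/// /proc/<pid>/smaps and returns the total in kB.
fn anon_huge_pages_kb(smaps: &str) -> u64 {
    smaps
        .lines()
        // Keep only the AnonHugePages lines, dropping the field name.
        .filter_map(|line| line.strip_prefix("AnonHugePages:"))
        // The first token after the field name is the size in kB.
        .filter_map(|rest| rest.split_whitespace().next()?.parse::<u64>().ok())
        .sum()
}
```

A non-zero total for a shim PID means part of its anonymous memory is currently backed by huge pages.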

I don't think the shim needs huge pages. Turning THP off for it saves about 2 MB of RSS per shim in the measurements here (7464 kB before, 5444 kB after).

After we set THP_DISABLE=true:

ps -efo pid,rss,comm 
    PID   RSS COMMAND
   2470  5444 containerd-shim

cat /proc/2470/smaps | grep -i hugepages
AnonHugePages:         0 kB
...
containerd
    |
    |--shim1   --start
        |
        |--shim2    (this shim keeps running on the host)
            |
            |--runc create (when containerd sends a create request over ttrpc)
                |
                |--runc init (this is PID 1 in the container)
We should set thp_disabled=1 in shim1 --start. If we set it in shim2
instead, huge pages have already been allocated by the time main() runs,
and setting thp_disabled afterwards does not undo those allocations.
So we set thp_disabled=1 in shim1: shim2 inherits the setting from its
parent process and starts with huge pages already turned off.

For runc processes, we need to restore thp_disabled to its previous value
(the value from before shim1 changed it) in shim2, after fork() and before
execve(). We use cmd.pre_exec to do this.
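
The flow described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code merged in this PR: it declares `prctl(2)` directly instead of using the `prctl` crate, it is Linux-only, and the helper names `disable_thp_remembering_previous` and `restore_thp_before_exec` are hypothetical.

```rust
use std::io;
use std::os::raw::{c_int, c_ulong};
use std::os::unix::process::CommandExt;
use std::process::Command;

// Option numbers from <linux/prctl.h>.
const PR_SET_THP_DISABLE: c_int = 41;
const PR_GET_THP_DISABLE: c_int = 42;

unsafe extern "C" {
    // prctl(2), provided by the C library that std links against on Linux.
    fn prctl(option: c_int, ...) -> c_int;
}

/// Reads the per-process "THP disabled" flag.
fn get_thp_disable() -> io::Result<bool> {
    let zero: c_ulong = 0;
    // SAFETY: this option takes no pointers; unused arguments must be 0.
    let ret = unsafe { prctl(PR_GET_THP_DISABLE, zero, zero, zero, zero) };
    if ret < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(ret != 0)
    }
}

/// Sets or clears the per-process "THP disabled" flag.
fn set_thp_disable(disable: bool) -> io::Result<()> {
    let zero: c_ulong = 0;
    // SAFETY: this option takes no pointers; unused arguments must be 0.
    let ret = unsafe { prctl(PR_SET_THP_DISABLE, disable as c_ulong, zero, zero, zero) };
    if ret < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(())
    }
}

/// Hypothetical shim1 step: remember the old value, then disable THP.
/// Returns None (and changes nothing) when the old value cannot be read.
fn disable_thp_remembering_previous() -> Option<bool> {
    let previous = get_thp_disable().ok()?;
    // Best effort: a failure here must not stop the shim.
    let _ = set_thp_disable(true);
    Some(previous)
}

/// Hypothetical shim2 step: the child restores `previous` after fork()
/// and before execve(), so runc does not inherit the shim's setting.
fn restore_thp_before_exec(cmd: &mut Command, previous: bool) {
    // SAFETY: the closure only issues a prctl system call.
    unsafe {
        cmd.pre_exec(move || {
            // Best effort: a failure here must not prevent the exec.
            let _ = set_thp_disable(previous);
            Ok(())
        });
    }
}
```

The flag set with PR_SET_THP_DISABLE is inherited across fork() and preserved across execve(), which is why the restore has to happen explicitly in the child before runc is executed.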

@github-actions github-actions bot added C-runc-shim Runc shim C-runc runc helper labels Sep 11, 2023
@Burning1020
Member

I think this depends on the deployment scenario.

  • If node resources are tight, neither the shim nor other management processes need transparent huge pages. Only the service processes do, and they can enable them individually. In that case there is no need to set transparent_hugepage/enabled=always.
  • If node resources are abundant, does the small amount of memory used by the shim process matter?

@zzzzzzzzzy9
Contributor Author

Some processes in our environment do not explicitly request huge pages but still need them for performance, so the system-wide setting has to stay at always.
The shim process has no particular real-time performance requirements, and turning THP off saves at least 2 MB per shim, which keeps its memory footprint at about 5 MB. In my observations, a shim with THP enabled has at least 2 MB backed by huge pages, and in many cases 10 MB or more. Turning it off therefore saves at least about 30% (2 MB of 7 MB), which means more containers can be started.
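
As a quick check of that estimate, here is a small sketch using the two RSS values from the PR description (7464 kB with THP, 5444 kB without). The function name `rss_saving_percent` is hypothetical.

```rust
/// Percentage of RSS saved, given the RSS in kB before and after
/// disabling THP. Returns 0.0 when `before_kb` is zero.
fn rss_saving_percent(before_kb: u64, after_kb: u64) -> f64 {
    if before_kb == 0 {
        return 0.0;
    }
    let saved_kb = before_kb.saturating_sub(after_kb) as f64;
    100.0 * saved_kb / before_kb as f64
}
```

For the measured pair, `rss_saving_percent(7464, 5444)` comes out at roughly 27%, in line with the "about 30%" figure quoted above.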

@Burning1020
Member

Burning1020 commented Sep 14, 2023

I feel it's more like a custom-made thing. /cc @mxpv @fuweid

crates/runc-shim/src/service.rs (outdated, resolved)
@@ -366,6 +366,7 @@ pub trait Spawner: Debug {
/// and some other utilities.
#[cfg(feature = "async")]
impl Runc {
#[cfg(not(target_os = "linux"))]
Member

runc does not exist on non-Linux platforms. That said, the same suggestion as above also applies here to avoid duplicating the function.

crates/runc-shim/src/service.rs (outdated, resolved)
@mxpv
Member

mxpv commented Sep 14, 2023

> I feel it's more like a custom-made thing. /cc @mxpv @fuweid

I'm generally ok with allowing more precise configuration of the runtime.
@fuweid might have more thoughts on this though.

@zzzzzzzzzy9
Contributor Author

Thanks for your suggestions. I have applied them.

@@ -368,6 +368,22 @@ pub trait Spawner: Debug {
impl Runc {
async fn launch(&self, cmd: Command, combined_output: bool) -> Result<Response> {
Contributor Author

async fn launch(&self, cmd: Command, combined_output: bool) -> Result<Response> {

There is a problem here: the cmd variable needs to be mutable on Linux, but it cannot be mutable on non-Linux targets. Without a separate function, the code looks ugly because it has to rebind cmd as mut inside a #[cfg] block. Is there a better way?

let mut vars: Vec<(&str, &str)> = Vec::new();
#[cfg(target_os = "linux")]
let mut thp_disabled = String::new();
#[cfg(not(target_os = "linux"))]
Member

@mxpv mxpv Sep 17, 2023

        let disabled = if cfg!(target_os = "linux") {
            // Query whether THP is disabled.
            if let Ok(x) = prctl::get_thp_disable() {
                let _ = prctl::set_thp_disable(true);
                true
            } else {
                false
            }
        } else {
            false
        };

        let vars = vec![("THP_DISABLED", &disabled.to_string())]

?

Contributor

Since it sounds like #[cfg(target_os = "linux")] is needed, it still might be clearer to do something like:

let thp_disabled = false;
#[cfg(target_os = "linux")]
let thp_disabled = match prctl::get_thp_disable() {
    Ok(x) => {
        let _ = prctl::set_thp_disable(true);
        true
    }
    Err(_) => false,
};
// Bind the String first so the &str stored in `vars` outlives this statement.
let thp_disabled = thp_disabled.to_string();
let vars: Vec<(&str, &str)> = vec![("THP_DISABLED", thp_disabled.as_str())];

Contributor Author

@zzzzzzzzzy9 zzzzzzzzzy9 Sep 19, 2023

We need the value returned by prctl::get_thp_disable here, and it must be assigned to the variable thp_disabled. That is, the x in Ok(x) matters, because x may be true or false. Checking only whether the result is Ok or Err is not enough. The value records the state before set_thp_disable is called, and is later passed to set_thp_disable to restore that state before runc starts.

Contributor

> We need the value returned by prctl::get_thp_disable here, and it must be assigned to the variable thp_disabled. That is, the x in Ok(x) matters, because x may be true or false. Checking only whether the result is Ok or Err is not enough.

get_thp_disable returns a Result<bool, i32>, so we could still return x here instead of converting it to a string. Using the bool is clearer in the code than x.to_string() and String::new(), since it is semantically unclear what String::new() means in this case.

Contributor Author

@zzzzzzzzzy9 zzzzzzzzzy9 Sep 28, 2023

There are actually three states here: true, false, and error. When get_thp_disable returns an error, the THP setting could not be read. If set_thp_disable were executed anyway, runc would have no value to restore thp_disable to. So the thp_disabled parameter needs three states (true, false, error), and because of the variable's lifetime a String is used to carry it.
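
One way to make those three states explicit is an enum that is converted to and from the environment-variable string. This is only a sketch: the type `PreviousThp`, its methods, and the choice of an empty string for the error state are assumptions for illustration, not the code in this PR.

```rust
/// The THP state observed before the shim changed it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PreviousThp {
    /// get_thp_disable returned Ok(true).
    Disabled,
    /// get_thp_disable returned Ok(false).
    Enabled,
    /// get_thp_disable returned Err(_): nothing to restore.
    Unknown,
}

impl PreviousThp {
    /// Builds the state from a Result<bool, i32> query result.
    fn from_query(result: Result<bool, i32>) -> Self {
        match result {
            Ok(true) => PreviousThp::Disabled,
            Ok(false) => PreviousThp::Enabled,
            Err(_) => PreviousThp::Unknown,
        }
    }

    /// Encodes the state as an environment-variable value.
    /// The empty string for Unknown is an assumption of this sketch.
    fn to_env(self) -> &'static str {
        match self {
            PreviousThp::Disabled => "true",
            PreviousThp::Enabled => "false",
            PreviousThp::Unknown => "",
        }
    }

    /// Decodes the value; anything that is not a bool maps to Unknown.
    fn from_env(value: &str) -> Self {
        match value.parse::<bool>() {
            Ok(true) => PreviousThp::Disabled,
            Ok(false) => PreviousThp::Enabled,
            Err(_) => PreviousThp::Unknown,
        }
    }

    /// The value to pass to set_thp_disable before exec, if any.
    fn restore_value(self) -> Option<bool> {
        match self {
            PreviousThp::Disabled => Some(true),
            PreviousThp::Enabled => Some(false),
            PreviousThp::Unknown => None,
        }
    }
}
```

With this shape, the exec side only calls set_thp_disable when `restore_value()` returns `Some`, which matches the point that nothing should be restored when the original value could not be read.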

@@ -368,6 +368,22 @@ pub trait Spawner: Debug {
impl Runc {
async fn launch(&self, cmd: Command, combined_output: bool) -> Result<Response> {
debug!("Execute command {:?}", cmd);
#[cfg(target_os = "linux")]
let mut cmd = cmd;
Member

Since runc is Linux only, we can rewrite this to something like:

    async fn launch(&self, mut cmd: Command, combined_output: bool) -> Result<Response> {
        debug!("Execute command {:?}", cmd);

        if let Ok(thp) = std::env::var("THP_DISABLED") {
            if let Ok(thp_disabled) = thp.parse::<bool>() {
                unsafe {
                    cmd.pre_exec(move || {
                        #[cfg(target_os = "linux")]
                        if let Err(e) = prctl::set_thp_disable(thp_disabled) {
                            log::debug!("set_thp_disable err: {}", e);
                        };

                        Ok(())
                    });
                }
            }
        }

@zzzzzzzzzy9
Contributor Author

zzzzzzzzzy9 commented Sep 18, 2023

Thanks for your suggestions.
We have to use conditional compilation with #[cfg]; otherwise the build fails, because the prctl crate does not provide prctl::set_thp_disable and prctl::get_thp_disable on non-Linux targets.
Done.
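
The distinction matters because `cfg!(...)` only expands to a boolean: both branches of an `if cfg!(...)` are still compiled, so a call to a function that does not exist on the target is an error. The `#[cfg(...)]` attribute removes the item before it is compiled. A minimal sketch, with the hypothetical function `thp_control_compiled_in` standing in for a Linux-only API:

```rust
/// Compiled only on Linux, standing in for a Linux-only API.
#[cfg(target_os = "linux")]
fn thp_control_compiled_in() -> bool {
    true
}

/// Fallback compiled on every other target, where the Linux-only
/// variant above does not exist at all.
#[cfg(not(target_os = "linux"))]
fn thp_control_compiled_in() -> bool {
    false
}
```

Callers use `thp_control_compiled_in()` unconditionally, and exactly one definition exists on any given target.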

Comment on lines +86 to +95
let _ = prctl::set_thp_disable(true);
x.to_string()
Contributor

Is it possible that you get Ok(false), then set it to true, and this still returns false? The docs weren't clear (https://docs.rs/prctl/latest/prctl/fn.get_thp_disable.html)

Contributor Author

Our goal is to set thp_disabled to true in the shim and then restore the previous value before starting runc.
So we only need the return value of get_thp_disable, which is Result<bool, i32>.
set_thp_disable returns Result<(), i32>, and we do not care whether the call succeeded: even if it fails, the shim process should not exit, so its return value can be ignored.

Contributor

> set_thp_disable returns Result<(), i32>, and we do not care whether the call succeeded: even if it fails, the shim process should not exit, so its return value can be ignored.

Could you add a comment in the code that says this? I worry about someone maintaining this long term and wondering why the return value and the failure case are ignored.

Contributor Author

Thanks for your suggestion, done.

@jsturtevant
Contributor

> I'm generally ok with allowing more precise configuration of the runtime.
> @fuweid might have more thoughts on this though.

@zzyyzte thanks for your patience. I might be missing something, but I don't see how the current change is configurable. It seems to always run let _ = prctl::set_thp_disable(true); unless it can't get thp_disable. Should a node operator have control over whether this value is set?

@zzzzzzzzzy9
Contributor Author

Yes, we need to explicitly set thp_disabled to true in the shim, and then restore the previous value for runc after fork and before exec. I don't think the shim actually needs THP, and turning it off noticeably reduces its memory consumption.

@jsturtevant
Contributor

Oh, I read "I'm generally ok with allowing more precise configuration of the runtime" as adding some switch or configuration option that would allow turning this on or off if needed.

@mxpv could you clarify, otherwise looks good.

@mxpv mxpv enabled auto-merge February 15, 2024 20:08
@mxpv
Member

mxpv commented Feb 15, 2024

@zzzzzzzzzy9 could you pls rebase your PR to pick up latest CI changes?

@zzzzzzzzzy9
Contributor Author

Done. @mxpv

@mxpv mxpv added this pull request to the merge queue Feb 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Feb 19, 2024
@zzzzzzzzzy9
Contributor Author

@mxpv May need to merge again?

@fuweid fuweid added this pull request to the merge queue Feb 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Feb 19, 2024
@fuweid
Member

fuweid commented Feb 19, 2024

@zzzzzzzzzy9 would you please remove that merge commit by rebasing? There is a conflict right now. If you don't mind, I can help handle this.

@mxpv mxpv added this pull request to the merge queue Feb 19, 2024
@mxpv mxpv removed this pull request from the merge queue due to a manual request Feb 19, 2024
@github-actions github-actions bot added the T-CI Changes in project's CI label Feb 20, 2024
@codecov-commenter

codecov-commenter commented Feb 20, 2024

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (1b2a74a) 37.98% compared to head (25341db) 37.89%.

Files                              Patch %   Lines
crates/runc-shim/src/service.rs    0.00%     12 Missing ⚠️
crates/runc/src/lib.rs             38.46%    8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
- Coverage   37.98%   37.89%   -0.10%     
==========================================
  Files          55       55              
  Lines        5060     5083      +23     
==========================================
+ Hits         1922     1926       +4     
- Misses       3138     3157      +19     
Flag Coverage Δ
unittests 37.89% <20.00%> (-0.10%) ⬇️


If /sys/kernel/mm/transparent_hugepage/enabled=always, the shim process
will use huge pages, which will consume a lot of memory.

Just like this:
ps -efo pid,rss,comm | grep shim
    PID   RSS COMMAND
   2614  7464 containerd-shim

I don't think shim needs to use huge pages, and if we turn off the huge
pages option, we can save a lot of memory resources.

After we set THP_DISABLE=true:
ps -efo pid,comm,rss
    PID COMMAND           RSS
1629841 containerd-shim  5648

containerd
    |
    |--shim1   --start
        |
        |--shim2    (this shim keeps running on the host)
            |
            |--runc create (when containerd sends a create request over ttrpc)
                |
                |--runc init (this is PID 1 in the container)

    We should set thp_disabled=1 in shim1 --start. If we set it in shim2
    instead, huge pages have already been allocated by the time main()
    runs, and setting thp_disabled afterwards does not undo those
    allocations. So we set thp_disabled=1 in shim1: shim2 inherits the
    setting from its parent process and starts with huge pages already
    turned off.

    For runc processes, we need to restore thp_disabled to its previous
    value (the value from before shim1 changed it) in shim2, after fork()
    and before execve(). We use cmd.pre_exec to do this.
@fuweid fuweid added this pull request to the merge queue Feb 20, 2024
Merged via the queue into containerd:main with commit 3a7b9ce Feb 20, 2024
18 checks passed
Labels
C-runc runc helper C-runc-shim Runc shim T-CI Changes in project's CI
6 participants