feat: set THP_DISABLE=true in shim, and restore it before starting runc #195

zzzzzzzzzy9 · 2023-09-11T11:39:09Z

If /sys/kernel/mm/transparent_hugepage/enabled=always, the shim process will use huge pages, which will consume a lot of memory.

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Just like this:

ps -efo pid,rss,comm | grep shim
    PID   RSS COMMAND
   2614  7464 containerd-shim

cat /proc/2614/smaps | grep -i hugepages
AnonHugePages:      2048 kB
...

I don't think shim needs to use huge pages, and if we turn off the huge pages option, we can save a lot of memory resources.

After we set THP_DISABLE=true:

ps -efo pid,rss,comm 
    PID   RSS COMMAND
   2470  5444 containerd-shim

cat /proc/2470/smaps | grep -i hugepages
AnonHugePages:         0 kB
...

containerd
    |
    |--shim1   --start
        |
        |--shim2    (this shim keeps running on the host)
            |
            |--runc create (when containerd sends a create request over ttrpc)
                |
                |--runc init (this is pid 1 in the container)

We should set thp_disabled=1 in shim1 --start. If we set it in shim2,
huge pages have already been allocated by the time main() runs, and
setting thp_disabled afterwards does not change pages that are already huge.
So we need to set thp_disabled=1 in shim1; shim2 then inherits the
setting from its parent process shim1 and starts with huge pages
already turned off.

For runc processes, we need to restore thp_disabled to its previous value
in shim2, after fork() and before execve(). We use cmd.pre_exec to do this.
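The save, disable, and restore sequence described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `disable_thp_remembering` and `restore_thp` are hypothetical names, and the `get`/`set` closures stand in for `prctl::get_thp_disable` and `prctl::set_thp_disable` so that the sketch runs without touching the real process state.

```rust
/// Disables THP for the current process and returns the previous setting.
///
/// `get` and `set` are stand-ins (assumptions of this sketch) for
/// `prctl::get_thp_disable` and `prctl::set_thp_disable`.
fn disable_thp_remembering<G, S>(get: G, mut set: S) -> Option<bool>
where
    G: Fn() -> Result<bool, i32>,
    S: FnMut(bool) -> Result<(), i32>,
{
    match get() {
        Ok(previous) => {
            // A failed set is ignored on purpose: the shim should keep
            // running even if THP cannot be disabled.
            let _ = set(true);
            Some(previous)
        }
        // Without a known previous value nothing is changed, because the
        // original setting could not be restored for runc later.
        Err(_) => None,
    }
}

/// Restores the remembered setting. In the real flow this would run in the
/// child, after fork() and before execve().
fn restore_thp<S>(previous: Option<bool>, mut set: S)
where
    S: FnMut(bool) -> Result<(), i32>,
{
    if let Some(value) = previous {
        let _ = set(value);
    }
}
```

In this model the first function corresponds to what shim1 does at startup and the second to what shim2 does for each runc invocation.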

Burning1020 · 2023-09-11T12:46:03Z

I think this problem is related to specific application scenarios.

If node resources are insufficient, there is no need to enable transparent huge pages for the shim process or for other management processes. Only the service processes that need them should enable them individually. In that case, you do not need to set transparent_hugepage/enabled=always.
If node resources are abundant, does the small memory usage of the shim process really matter?

zzzzzzzzzy9 · 2023-09-12T02:11:24Z

Because some processes in the environment do not explicitly request huge pages, but still need them to improve performance.
For the shim process, real-time performance is not particularly needed, and turning off huge pages saves at least 2 MB per shim, which keeps the shim's memory footprint at about 5 MB. In my observations, a shim that leaves huge pages on has at least 2 MB backed by huge pages, and in many cases 10 MB or more. Turning them off saves at least about 30% of memory (2 MB / 7 MB), which means more containers can be started.
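As a quick check of that estimate, the RSS figures from the `ps` output in the PR description (7464 kB before, 5444 kB after) can be plugged into a small helper. `rss_saved_fraction` is a hypothetical name used only for this sketch.

```rust
/// Fraction of RSS saved, given before/after values in kB as printed by `ps`.
/// `before_kb` is assumed to be non-zero.
fn rss_saved_fraction(before_kb: u64, after_kb: u64) -> f64 {
    before_kb.saturating_sub(after_kb) as f64 / before_kb as f64
}
```

With the sample values, `rss_saved_fraction(7464, 5444)` is 2020 / 7464, a little over 27%, which is in line with the rough 30% figure.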

Burning1020 · 2023-09-14T07:22:38Z

I feel it's more like a custom-made thing. /cc @mxpv @fuweid

crates/runc-shim/src/service.rs

mxpv · 2023-09-14T19:41:08Z

crates/runc/src/lib.rs

@@ -366,6 +366,7 @@ pub trait Spawner: Debug {
 /// and some other utilities.
 #[cfg(feature = "async")]
 impl Runc {
+    #[cfg(not(target_os = "linux"))]


We don't have runc on non-Linux environments. You could also apply the same suggestion as above to avoid duplicating the function.

crates/runc-shim/src/service.rs

mxpv · 2023-09-14T19:44:27Z

I feel it's more like a custom-made thing. /cc @mxpv @fuweid

I'm generally ok with allowing more precise configuration of the runtime.
@fuweid might have more thoughts on this though.

zzzzzzzzzy9 · 2023-09-16T03:02:54Z

Thanks for your suggestions. I have applied them.

zzzzzzzzzy9 · 2023-09-16T03:37:05Z

crates/runc/src/lib.rs

@@ -368,6 +368,22 @@ pub trait Spawner: Debug {
 impl Runc {
    async fn launch(&self, cmd: Command, combined_output: bool) -> Result<Response> {


There is a problem here: the cmd variable needs to be mutable on Linux, but on non-Linux environments it must not be declared mutable. Without a separate function, the code looks ugly, because it has to rebind cmd as mut inside a #[cfg] block. Is there a better way?

mxpv · 2023-09-17T01:12:25Z

crates/runc-shim/src/service.rs

+        let mut vars: Vec<(&str, &str)> = Vec::new();
+        #[cfg(target_os = "linux")]
+        let mut thp_disabled = String::new();
+        #[cfg(not(target_os = "linux"))]


let disabled = if cfg!(target_os = "linux") {
    // Query whether THP is disabled.
    if let Ok(x) = prctl::get_thp_disable() {
        let _ = prctl::set_thp_disable(true);
        true
    } else {
        false
    }
} else {
    false
};
let vars = vec![("THP_DISABLED", &disabled.to_string())]

?

since it sounds like #[cfg(target_os = "linux")] is needed, it still might be clearer to do something like:

let thp_disabled = false;
#[cfg(target_os = "linux")]
let thp_disabled = match prctl::get_thp_disable() {
    Ok(x) => {
        let _ = prctl::set_thp_disable(true);
        true
    }
    Err(_) => false,
};
let vars: Vec<(&str, &str)> = vec![("THP_DISABLED", &thp_disabled.to_string())];

Here we need the return value of prctl::get_thp_disable and must assign it to the variable thp_disabled. That is, the x in Ok(x) is needed, and x may be true or false, so we should not just check whether the result is Ok() or Err(). This value is the state before set_thp_disable is called, and it is used later to restore the setting before starting runc.

Here we need the return value of prctl::get_thp_disable and must assign it to the variable thp_disabled. That is, the x in Ok(x) is needed, and x may be true or false, so we should not just check whether the result is Ok() or Err().

get_thp_disable returns a Result<bool, i32>, so we could still return x instead of converting it to a string here. Using the bool is clearer in the code than x.to_string() and String::new(), where it is semantically unclear what String::new() means in this case.

There are actually three states here: true, false, and error. When get_thp_disable returns an error, the THP setting cannot be read; if set_thp_disable were executed at that point, runc could not recover the original value of thp_disable. So the thp_disabled parameter needs three states (true, false, error), and because of the variable's lifetime, a string is used to carry it.
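The three states can be made explicit with a small encode/decode pair. This is only a sketch under assumptions: `encode_thp_state` and `decode_thp_state` are hypothetical names, the empty string for the error case mirrors the `String::new()` discussed in this thread, and the parse mirrors the `thp.parse::<bool>()` on the runc side.

```rust
/// Encodes the result of querying THP into the string stored in THP_DISABLED.
fn encode_thp_state(queried: Result<bool, i32>) -> String {
    match queried {
        // "true" or "false": the state to restore before starting runc.
        Ok(value) => value.to_string(),
        // Empty string: the state is unknown, so nothing should be restored.
        Err(_) => String::new(),
    }
}

/// Decodes the variable again; only "true" and "false" yield a value.
fn decode_thp_state(raw: &str) -> Option<bool> {
    raw.parse::<bool>().ok()
}
```

An `Option<bool>` expresses the same three states inside one process; the string form is only needed because the value crosses a process boundary through the environment.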

mxpv · 2023-09-17T01:16:59Z

crates/runc/src/lib.rs

@@ -368,6 +368,22 @@ pub trait Spawner: Debug {
 impl Runc {
    async fn launch(&self, cmd: Command, combined_output: bool) -> Result<Response> {
        debug!("Execute command {:?}", cmd);
+        #[cfg(target_os = "linux")]
+        let mut cmd = cmd;


Since runc is Linux only, we can rewrite this to something like:

async fn launch(&self, mut cmd: Command, combined_output: bool) -> Result<Response> {
    debug!("Execute command {:?}", cmd);
    if let Ok(thp) = std::env::var("THP_DISABLED") {
        if let Ok(thp_disabled) = thp.parse::<bool>() {
            unsafe {
                cmd.pre_exec(move || {
                    #[cfg(target_os = "linux")]
                    if let Err(e) = prctl::set_thp_disable(thp_disabled) {
                        log::debug!("set_thp_disable err: {}", e);
                    };
                    Ok(())
                });
            }
        }
    }
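For readers who want to try the pre_exec pattern in isolation, here is a self-contained sketch that uses only the standard library. `install_thp_restore` and its `restore` parameter are hypothetical; in the shim, the closure body would call `prctl::set_thp_disable`. It is an illustration of the idea, not the code merged in this PR.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::Command;

/// Hypothetical helper: parses the raw THP_DISABLED value and, if it is a
/// valid bool, registers a hook that runs in the child after fork() and
/// before execve(). Returns whether a hook was installed.
fn install_thp_restore<F>(cmd: &mut Command, raw: Option<&str>, restore: F) -> bool
where
    F: Fn(bool) -> io::Result<()> + Send + Sync + 'static,
{
    // An unset, empty, or malformed value means the previous state is unknown.
    let value = match raw.and_then(|s| s.parse::<bool>().ok()) {
        Some(v) => v,
        None => return false,
    };
    // SAFETY: the hook runs in the forked child, so it should stay minimal
    // and avoid allocation or locks; it only forwards one bool to `restore`.
    unsafe {
        cmd.pre_exec(move || {
            // A failed restore is not treated as fatal for starting the child.
            let _ = restore(value);
            Ok(())
        });
    }
    true
}
```

A caller would pass `std::env::var("THP_DISABLED").ok().as_deref()` as `raw`, so that a missing variable simply installs no hook.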

zzzzzzzzzy9 · 2023-09-18T02:32:47Z

Thanks for your suggestions.
We must use conditional compilation (#[cfg]); otherwise the build fails, because the prctl crate does not provide prctl::set_thp_disable and prctl::get_thp_disable on other platforms.
Done.

jsturtevant · 2023-09-18T17:02:45Z

crates/runc-shim/src/service.rs

+                let _ = prctl::set_thp_disable(true);
+                x.to_string()


Is it possible that you get Ok(false), then set it to true, and this still returns false? The docs weren't clear (https://docs.rs/prctl/latest/prctl/fn.get_thp_disable.html)

Our goal is to set thp_disable = true on the shim side and then restore the original thp_disable value before starting runc.
So we only need to focus on the return value of get_thp_disable, which is Result<bool, i32>.
The return value of set_thp_disable is Result<(), i32>. We don't care whether the setting succeeds, because even if it fails we should not exit the shim process; therefore there is no need to check the return value of set_thp_disable.

The return value of set_thp_disable is Result<(), i32>. We don't care whether the setting succeeds, because even if it fails we should not exit the shim process; therefore there is no need to check the return value of set_thp_disable.

Could you add a comment in the code that indicates this? I worry about someone doing maintenance long term and wondering why the return value and failure case are ignored.

Thanks for your suggestion, done.

jsturtevant · 2023-10-04T16:41:32Z

I'm generally ok with allowing more precise configuration of the runtime.
@fuweid might have more thoughts on this though.

@zzyyzte thanks for your patience. I might be missing something, but I don't see how the current change is configurable. It seems to always call let _ = prctl::set_thp_disable(true); unless it can't get thp_disable. Should this be something that a node operator has control over (whether or not to set this value)?

zzzzzzzzzy9 · 2023-10-09T01:35:06Z

Yes, we need to explicitly set thp_disable to true in the shim, and then restore the original value for runc after fork() and before exec. I don't think the shim actually needs THP, and turning it off reduces its memory consumption considerably.

jsturtevant · 2023-10-09T16:22:21Z

Oh, I read "I'm generally ok with allowing more precise configuration of the runtime" as adding some switch or configuration that would allow turning this on or off if needed.

@mxpv could you clarify, otherwise looks good.

mxpv · 2024-02-15T20:08:33Z

@zzzzzzzzzy9 could you pls rebase your PR to pick up latest CI changes?

zzzzzzzzzy9 · 2024-02-19T08:51:33Z

Done. @mxpv

zzzzzzzzzy9 · 2024-02-19T09:04:12Z

@mxpv May need to merge again?

fuweid · 2024-02-19T12:21:40Z

@zzzzzzzzzy9 would you please remove that merge commit by rebasing? The branch has conflicts right now. If you don't mind, I can help handle this.

codecov-commenter · 2024-02-20T02:22:08Z

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (1b2a74a) 37.98% compared to head (25341db) 37.89%.

Files	Patch %	Lines
crates/runc-shim/src/service.rs	0.00%	12 Missing ⚠️
crates/runc/src/lib.rs	38.46%	8 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
- Coverage   37.98%   37.89%   -0.10%     
==========================================
  Files          55       55              
  Lines        5060     5083      +23     
==========================================
+ Hits         1922     1926       +4     
- Misses       3138     3157      +19

Flag	Coverage Δ
unittests	`37.89% <20.00%> (-0.10%)`	⬇️


github-actions bot added C-runc-shim Runc shim C-runc runc helper labels Sep 11, 2023

zzzzzzzzzy9 force-pushed the main branch 4 times, most recently from 30351c5 to 0e4dae5 Compare September 12, 2023 03:33

mxpv reviewed Sep 14, 2023

fuweid self-requested a review September 15, 2023 07:06

zzzzzzzzzy9 force-pushed the main branch 2 times, most recently from 5ac6db6 to 80ee51f Compare September 16, 2023 02:59

zzzzzzzzzy9 force-pushed the main branch 3 times, most recently from 654bf0f to 7c6d2dd Compare September 16, 2023 03:31

zzzzzzzzzy9 commented Sep 16, 2023

zzzzzzzzzy9 force-pushed the main branch from 7c6d2dd to 51946cf Compare September 16, 2023 03:39

mxpv reviewed Sep 17, 2023

zzzzzzzzzy9 force-pushed the main branch from 51946cf to 21ea421 Compare September 18, 2023 02:32

zzzzzzzzzy9 force-pushed the main branch 4 times, most recently from 3e833d4 to e42fb05 Compare September 18, 2023 06:56

jsturtevant reviewed Sep 18, 2023

zzzzzzzzzy9 force-pushed the main branch 2 times, most recently from 78f505e to 50b1963 Compare September 28, 2023 07:27

zzzzzzzzzy9 force-pushed the main branch from 50b1963 to 588398c Compare September 28, 2023 07:36

zzzzzzzzzy9 force-pushed the dev branch from 1e214b5 to 16cbe55 Compare January 22, 2024 11:05

mxpv approved these changes Feb 15, 2024

mxpv enabled auto-merge February 15, 2024 20:08

mxpv added this pull request to the merge queue Feb 19, 2024

github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Feb 19, 2024

fuweid approved these changes Feb 19, 2024

fuweid added this pull request to the merge queue Feb 19, 2024

github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Feb 19, 2024

mxpv added this pull request to the merge queue Feb 19, 2024

mxpv removed this pull request from the merge queue due to a manual request Feb 19, 2024

zzzzzzzzzy9 force-pushed the dev branch from 0f15e6e to 2269aab Compare February 20, 2024 02:17

github-actions bot added the T-CI Changes in project's CI label Feb 20, 2024

zzzzzzzzzy9 force-pushed the dev branch from 2269aab to 25341db Compare February 20, 2024 02:34

fuweid added this pull request to the merge queue Feb 20, 2024

Merged via the queue into containerd:main with commit 3a7b9ce Feb 20, 2024
18 checks passed
