Update cgroup resolver to try all possible base paths #1981

ddelnano · 2024-08-13T19:42:29Z

Summary: Update cgroup resolver to try all possible base paths

For certain environments, like OVH cloud's managed kubernetes service, the auto discovery mechanism fails due to a confusion between cgroup v1 and v2. This is fixed by attempting all cgroup base paths rather than trying the first to match and failing if it doesn't work. This is the second iteration of this change (#1978 was the earlier attempt).

Relevant Issues: N/A

Type of change: /kind bug

Test Plan: Community user deployed this change and verified their OVH MKS cluster is working now

…group2_mode default to prevent test only behavior from normal operation Signed-off-by: Dom Del Nano <[email protected]>

Signed-off-by: Dom Del Nano <[email protected]>

oazizi000 · 2024-08-15T21:31:56Z

src/shared/metadata/cgroup_path_resolver.cc

@@ -27,12 +27,13 @@
 #include "src/common/fs/fs_wrapper.h"
 #include "src/shared/metadata/cgroup_path_resolver.h"

-DEFINE_bool(force_cgroup2_mode, true, "Flag to force assume cgroup2 fs for testing purposes");
+DEFINE_bool(force_cgroup2_mode, false, "Flag to force assume cgroup2 fs for testing purposes");


I would add add test_only somewhere in the name of this flag.

Additionally, now that we have a more robust cgroup resolver, can we get rid of this flag, and just rely on different testdata to make sure we cover {cgroup1 only, cgroup2 only, cgroup1+cgroup2 w/ cgroup1 failing, cgroup1+cgroup2 w/ cgroup1 succeeding}.

Here are the complications with covering those cases:

CGroupPathResolver (newer method) is template based -- meaning it doesn't read the filesystem at all. This prevents us from testing against a directory of "mocked" cgroup data and simulating the failure seen in the end user's cluster.

The end user's issue happened when the cgroup v1 base path existed, but the PEM was running within a v2 cgroup. This is challenging to simulate with the template based CGroupPathResolver approach because we don't have test coverage for the function containing the bug.

LegacyCGroupPathResolver operates with real directories under test. This complicates testing the cgroup v2 logic because we check the filesystem type, which is always false for a testdata directory.

In order to achieve the ideal test coverage, the CGroupPathResolver needs test coverage for AutoDiscoverCGroupTemplate and for the legacy path resolver needs to be able to pass the cgroup v2 filesystem check. The latter is what I was trying to accomplish with an overlayfs (unfortunately that doesn't work).

I'm going to spend more time thinking about how to address those problems.

After thinking about this more, I don't think there is an easy way to test this without rearchitecting the code quite a bit. My most recent research was around testing the AutoDiscoverCGroupTemplate function. I was hoping I could recreate the bug with fake cgroup v1 files and a bind mount to /sys/fs/cgroup (or some subdirectory) in order to have that function fail. Bazel unfortunately does not like a cgroup filesystem bind mount, so this isn't viable.

I'm currently thinking that the test investment right now might not be worth holding up this fix. What do you think about putting a TODO here explaining the test coverage we wish we'd have?

oazizi000 · 2024-08-15T21:37:32Z

src/shared/metadata/cgroup_path_resolver_test.cc

+//
+// sysfsv2 and sysfsv3 model cgroupv1 and cgroupv2 respectively.
+TEST(LeagcyCGroupPathResolverTest, Cgroup2With1) {
+ PX_SET_FOR_SCOPE(FLAGS_force_cgroup2_mode, true);


Why do need this in this case?

The reason for this is that the LegacyCGroupPathResolver doesn't operate on a real cgroupv2 filesystem when under test. The newer CGroupPathResolver uses /sys/fs/cgroup directly (rather than a testdata filesystem mock) meaning this cgroupv2 base path can be exercised without the flag. The newer CGroupPathResolver has its template method, which avoids filesystem reads and is why using a /sys/fs/cgroup path works.

I've removed this test case after more deeply thinking about what test cases we would ideally cover. I've added a TODO explaining those cases in its place.

We have to keep the FLAGS_test_only_force_cgroup2_mode flag because that's the only way we can test the LegacyCgroupPathResolver since it actually reads the file filesystem (rather than the template based approach with the CGroupPathResolver).

Yeah, a different PR for that refactor is best anyways.

Signed-off-by: Dom Del Nano <[email protected]>

…cenarios that should be tested Signed-off-by: Dom Del Nano <[email protected]>

Signed-off-by: Dom Del Nano <[email protected]>

oazizi000 · 2024-08-19T22:34:05Z

src/shared/metadata/cgroup_path_resolver.cc

+ if (base_paths.empty()) {
+ return error::NotFound("Could not find CGroup base path");
+ }
+ return base_paths;


minor: Could consider returning the empty list and letting the caller deal with this, now that this returns a list of candidates.

Summary: Update cgroup resolver to try all possible base paths For certain environments, like OVH cloud's managed kubernetes service, the auto discovery mechanism fails due to a confusion between cgroup v1 and v2. This is fixed by attempting all cgroup base paths rather than trying the first to match and failing if it doesn't work. This is the second iteration of this change (pixie-io#1978 was the earlier attempt). Relevant Issues: N/A Type of change: /kind bug Test Plan: Community user deployed this change and verified their OVH MKS cluster is working now --------- Signed-off-by: Dom Del Nano <[email protected]> GitOrigin-RevId: ff83032

ddelnano added 2 commits August 13, 2024 17:23

Update cgroup resolver to try all possible base paths. Update force_c…

76a7da6

…group2_mode default to prevent test only behavior from normal operation Signed-off-by: Dom Del Nano <[email protected]>

Add test that simulates merged cgroup v1 and v2 directories

6593cb2

Signed-off-by: Dom Del Nano <[email protected]>

ddelnano requested a review from a team as a code owner August 13, 2024 19:42

ddelnano requested a review from oazizi000 August 13, 2024 19:43

oazizi000 reviewed Aug 15, 2024

View reviewed changes

Rename --force_cgroup_mode to --test_only_force_cgroup_mode

15a2a33

Signed-off-by: Dom Del Nano <[email protected]>

ddelnano mentioned this pull request Aug 15, 2024

Attempt to use cgroup v2 prior to cgroup v1 in path resolver #1978

Closed

ddelnano added 3 commits August 16, 2024 14:54

Remove test that provides minimal coverage. Add TODO explaining the s…

948cc8f

…cenarios that should be tested Signed-off-by: Dom Del Nano <[email protected]>

Fix misspelling in test name leagcy -> legacy

fc4f5b2

Signed-off-by: Dom Del Nano <[email protected]>

Fix linting

ea2a767

Signed-off-by: Dom Del Nano <[email protected]>

oazizi000 approved these changes Aug 19, 2024

View reviewed changes

ddelnano merged commit ff83032 into pixie-io:main Aug 19, 2024
32 checks passed

ddelnano deleted the ddelnano/try-additional-cgroup-base-paths branch August 19, 2024 23:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update cgroup resolver to try all possible base paths #1981

Update cgroup resolver to try all possible base paths #1981

ddelnano commented Aug 13, 2024 •

edited

Loading

oazizi000 Aug 15, 2024

oazizi000 Aug 15, 2024

ddelnano Aug 15, 2024

ddelnano Aug 16, 2024

oazizi000 Aug 15, 2024

ddelnano Aug 15, 2024

ddelnano Aug 16, 2024

oazizi000 Aug 19, 2024

oazizi000 Aug 19, 2024

Update cgroup resolver to try all possible base paths #1981

Update cgroup resolver to try all possible base paths #1981

Conversation

ddelnano commented Aug 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ddelnano commented Aug 13, 2024 •

edited

Loading