DO NOT MERGE: Pr/mtardy/user stacktrace hangfix #2286

Closed
wants to merge 10 commits into from


mtardy
Member

@mtardy mtardy commented Apr 2, 2024

No description provided.

kkourt and others added 10 commits April 2, 2024 17:12
policystatemetrics needs a reference to the sensor manager so that it
can collect metrics. Currently, this reference is passed using
observer.GetSensorManager() at initialization time.

In observer tests, we currently do not restart the metrics (see [1]),
which means that if we create a new observer, the metrics will still
reference the old sensor manager.

Fix this by having policystatemetrics call
observer.GetSensorManager() to get the latest version of the sensor
manager.

[1] https://github.com/cilium/tetragon/blob/22eb995b19207ac0ced2dd83950ec8e8aedd122d/pkg/observer/observertesthelper/observer_test_helper.go#L272-L276

Signed-off-by: Kornilios Kourtis <[email protected]>
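
The stale-reference problem described in the commit message above can be illustrated with a minimal, self-contained Go sketch. It is not the Tetragon code: `manager`, `setManager`, `getManager`, `staleCollector`, and `freshCollector` are hypothetical names that only mimic the pattern of a collector that captures a process-wide manager once at construction versus one that looks it up whenever it collects.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the sensor manager.
type manager struct{ policies int }

var (
	mu      sync.Mutex
	current *manager
)

// setManager replaces the process-wide manager (hypothetical accessor).
func setManager(m *manager) {
	mu.Lock()
	defer mu.Unlock()
	current = m
}

// getManager returns the manager that is installed right now.
func getManager() *manager {
	mu.Lock()
	defer mu.Unlock()
	return current
}

// staleCollector captures the manager once, when it is constructed.
type staleCollector struct{ m *manager }

func newStaleCollector() *staleCollector {
	return &staleCollector{m: getManager()}
}

func (c *staleCollector) collect() int { return c.m.policies }

// freshCollector looks the manager up on every collection.
type freshCollector struct{}

func (freshCollector) collect() int {
	m := getManager()
	if m == nil {
		return 0
	}
	return m.policies
}

func main() {
	setManager(&manager{policies: 1})
	stale := newStaleCollector()
	fresh := freshCollector{}

	// Simulate a test that creates a new observer and so replaces the manager.
	setManager(&manager{policies: 3})

	fmt.Println("stale:", stale.collect()) // still reads the old manager
	fmt.Println("fresh:", fresh.collect()) // reads the replacement
}
```

In this sketch the stale collector keeps reporting the value of the manager it captured, while the fresh collector follows the replacement, which is the behaviour the commit message asks for.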
We should also do the same in the other operations, but we leave that as
a followup.

Signed-off-by: Kornilios Kourtis <[email protected]>
This patch adds a timeout for ListTracingPolicies. The sensor manager
may be stuck or misbehaving. This patch (combined with the previous
one) ensures that metrics collection continues after the timeout
expires.

Tested manually using:

```diff
diff --git a/pkg/metrics/policystatemetrics/policystatemetrics_test.go b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
index 227306b65..fd581392b 100644
--- a/pkg/metrics/policystatemetrics/policystatemetrics_test.go
+++ b/pkg/metrics/policystatemetrics/policystatemetrics_test.go
@@ -9,6 +9,7 @@ import (
 	"io"
 	"strings"
 	"testing"
+	"time"

 	"github.com/cilium/tetragon/pkg/observer"
 	tus "github.com/cilium/tetragon/pkg/testutils/sensors"
@@ -57,3 +58,22 @@ tetragon_tracingpolicy_loaded{state="load_error"} %d
 	err = testutil.CollectAndCompare(collector, expectedMetrics(1, 0, 0, 0))
 	assert.NoError(t, err)
 }
+
+func TestTimeout(t *testing.T) {
+	reg := prometheus.NewRegistry()
+
+	manager := tus.GetTestSensorManager(context.TODO(), t).Manager
+	observer.SetSensorManager(manager)
+	t.Cleanup(observer.ResetSensorManager)
+
+	collector := newPolicyStateCollector()
+	reg.Register(collector)
+
+	go func() {
+		err := manager.SleepForTesting(context.TODO(), t, 1*time.Second)
+		assert.NoError(t, err)
+	}()
+
+	err := testutil.CollectAndCompare(collector, strings.NewReader(""))
+	assert.NoError(t, err)
+}
diff --git a/pkg/sensors/manager.go b/pkg/sensors/manager.go
index eaf908340..291a58c8f 100644
--- a/pkg/sensors/manager.go
+++ b/pkg/sensors/manager.go
@@ -8,6 +8,8 @@ import (
 	"errors"
 	"fmt"
 	"strings"
+	"testing"
+	"time"

 	"github.com/cilium/tetragon/api/v1/tetragon"
 	"github.com/cilium/tetragon/pkg/k8s/apis/cilium.io/v1alpha1"
@@ -96,6 +98,13 @@ func startSensorManager(
 				logger.GetLogger().Debugf("stopping sensor controller...")
 				done = true
 				err = nil
+
+			// NB(kkourt): for testing
+			case *sensorManagerSleep:
+				time.Sleep(op.d)
+				err = nil
+
 			default:
 				err = fmt.Errorf("unknown sensorOp: %v", op)
 			}
@@ -421,6 +430,13 @@ type sensorCtlStop struct {
 	retChan chan error
 }

+// sensorManagerSleep just sleeps. Intended only for testing.
+type sensorManagerSleep struct {
+	ctx     context.Context
+	retChan chan error
+	d       time.Duration
+}
+
 type LoadArg struct{}
 type UnloadArg = LoadArg

@@ -436,5 +452,18 @@ func (s *sensorEnable) sensorOpDone(e error)         { s.retChan <- e }
 func (s *sensorDisable) sensorOpDone(e error)        { s.retChan <- e }
 func (s *sensorList) sensorOpDone(e error)           { s.retChan <- e }
 func (s *sensorCtlStop) sensorOpDone(e error)        { s.retChan <- e }
+func (s *sensorManagerSleep) sensorOpDone(e error)   { s.retChan <- e }

 type sensorCtlHandle = chan<- sensorOp
+
+func (h *Manager) SleepForTesting(ctx context.Context, t *testing.T, d time.Duration) error {
+	retc := make(chan error)
+	op := &sensorManagerSleep{
+		ctx:     ctx,
+		retChan: retc,
+		d:       d,
+	}
+
+	h.sensorCtl <- op
+	return <-retc
+}
```

Signed-off-by: Kornilios Kourtis <[email protected]>
Signed-off-by: Andrey Fedotov <[email protected]>
Signed-off-by: Mahe Tardy <[email protected]>
@mtardy mtardy added the release-note/misc label ("This PR makes changes that have no direct user impact.") Apr 2, 2024

netlify bot commented Apr 2, 2024

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 90bce77
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/660c3966372ad10008968583
😎 Deploy Preview https://deploy-preview-2286--tetragon.netlify.app

@mtardy mtardy closed this Apr 2, 2024
@mtardy mtardy deleted the pr/mtardy/user-stacktrace-hangfix branch April 2, 2024 17:59
3 participants