Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus iptables rules metric only counts programmed rules. #9374

Merged

Conversation

aaaaaaaalex
Copy link
Contributor

@aaaaaaaalex aaaaaaaalex commented Oct 22, 2024

Description

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

The `felix_iptables_rules` Prometheus metric now only counts rules within referenced Iptables chains, no longer counts candidate rules.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@aaaaaaaalex aaaaaaaalex requested a review from a team as a code owner October 22, 2024 10:57
@marvin-tigera marvin-tigera added this to the Calico v3.30.0 milestone Oct 22, 2024
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Oct 22, 2024
@aaaaaaaalex aaaaaaaalex changed the title only amend iptables-rules prom gauge when a chain is inc/dec-ref'd Prometheus iptables rules metric only counts programmed rules. Nov 7, 2024
Copy link
Member

@fasaxc fasaxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ref counting looks correct 🎉 some comments on the test.

Thinking about it, you could have tested this at UT level pretty effectively if there's a way to get the prom gauge. Then you'd be in control of the exact number of rules and wouldn't need approximates. Not bad to have a bit more coverage of the real gauges though so I'm OK with FV too.

@@ -200,6 +206,87 @@ var _ = infrastructure.DatastoreDescribe("pre-dnat with initialized Felix, 2 wor
})

Context("with pre-DNAT policy to open pinhole to 8055", func() {
// Exec iptables-save, or iptables-save -t [table], if table is not "".
dumpIPTables := func(table string) []string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have utility methods for getting chains on the Felix object.

For example felix.IPTablesChains(table) ...

// The number of cali rules in the dumped tables should increase by the same delta as the metrics
Eventually(dumpMangleTable).Should(HaveLen(len(baselineMangleTableIptablesSave) + mangleRulesMetricDelta))
Eventually(dumpFilterTable).Should(HaveLen(len(baselineFilterTableIptablesSave) + filterRulesMetricDelta))
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend doing some more steps after this:

  • Change the type of the policy from pre-DNAT to normal. That's the most fiddly change to handle since it goes from referenced in one table to referenced in the other.
  • Delete the policy, should go back to the count with the hep.
  • Delete the hep, should get back to the base count.

// Returns nil when the value of f() hasn't changed for the duration settledPeriod.
// Returns error if f() never settles up to the timeout.
// Immediately returns error if timeout is lower than settledPeriod.
waitToSettle := func(f func() int, settledPeriod, timeout time.Duration) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on slack, would prefer not to rely on settling, if there's something explicit that we can wait for. It's not a bad approach, it's just a bit slow (and if it flakes, you have to bump the settled timeout so it gets slower...)

If we do go down this path, I suggest making it a general utility method. Kubernetes has a wait package with similar utility funcs in it.

return errors.New("Timed out waiting for IPTables to settle.")
default:
last = cur
time.Sleep(1 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I typically reach for 100ms for a sleep loop. We don't want to tight loop but probing 1x per second isn't normally a big deal.

settledTimer = nil
} else {
select {
case <-settledTimer.C:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the way you're using the settledTimer, you're not actually using it as a timer at all! You only select on it in a non-blocking way. I think this code would be clearer if you just recorded the time since the last change and did

last := f()
lastChange := time.Now()
for {
  cur := f()
  if cur != last {
    lastChange := time.Now()
  }
  last = cur
  if time.Since(lastChange) > settledPeriod {
    // we're done
  }
  ...
}

}
}
select {
case <-timeoutTimer.C:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a sleep and a timeout channel is slightly mixing paradigms. While you''re asleep you're not checking the timeout channel, which could be a bug (or not matter at all). In this case, it means that, if you just miss a timeout, it'll be >1s before you check the timeout again and you'll do an extra poll of f(); not clear if that was intended (it probably doesn't matter for this use case).

Best to jump one way or the other to make your intent clear. Either:

  • Use Sleep() and record the start time and do if time.Since(startTime) > timeoutPeriod { ... timeout ...}
  • Use channels for the sleep. case <-time.After(delay): ... so that the sleep can be interrupted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you also wanted the settled timer to interrupt the sleep (not clear that you do, since you probably want to run f() one more time before declaring it settled), then... A powerful technique is to mask channels in your main select block. To mask a channel, set it to nil; that prevents its case from firing. So you could do:

var settledC chan <-time.Time

for {
  select {
  case <- time.After(time.Second):
    cur := f()
    if cur == last {
      if settledC == nil { 
        settledC = time.After(...)
      }
    } else {
      settledC  = nil
    }
  case <- settledC:
    // done!
  case <-timeoutC:
    // failed
  }
}

@aaaaaaaalex
Copy link
Contributor Author

@fasaxc done away with the "settled" business now, instead checking for a particular IPTables chain to determine when the dataplane got programmed at start-of-day, and then using "Eventually" calls to wait until additional changes occur.

appliedGNP, err = client.GlobalNetworkPolicies().Create(utils.Ctx, policy, utils.NoOptions)
Expect(err).NotTo(HaveOccurred(), "Couldn't create pre-DNAT GNP")

Eventually(mangleRulesMetric.Int, "5s").ShouldNot(BeEquivalentTo(curMangleRulesMetric), "Mangle rules metric never changed following change of GNP")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Eventually(mangleRulesMetric.Int, "5s").ShouldNot(BeEquivalentTo(curMangleRulesMetric), "Mangle rules metric never changed following change of GNP")
Eventually(mangleRulesMetric.Int, "5s").Should(BeNumerically(">", curMangleRulesMetric), "Mangle rules metric never changed following change of GNP")

Could use >

_, err = client.HostEndpoints().Create(utils.Ctx, hep, utils.NoOptions)
Expect(err).NotTo(HaveOccurred(), "Failed to create HEP")

appliedGNP, err = client.GlobalNetworkPolicies().Create(utils.Ctx, policy, utils.NoOptions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should create the GNP first. Felix won't program the GNP unless there's a HEP but it'll program the HEP without a GNP. So, if you do HEP then GNP, you might see an intermediate step, whereas it'll be atomic the other way.

@aaaaaaaalex aaaaaaaalex merged commit 2ea7981 into projectcalico:master Nov 22, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-pr-required Change is not yet documented release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants