Fix multi-alarm overlapping bug #466

alevy · 2024-09-05T23:38:24Z

This PR fixes a subtle bug in the userspace virtual alarm library. If alarms are created and expire close to each other, the within_range check can misjudge their relative order and place the earlier alarm after the later one. If the later alarm is repeating, this can starve the earlier alarm entirely.

This bug was introduced when trying to correctly account for potentially overflowing alarms, and the "obvious" fix (just compare absolute expiration values) doesn't work.

This PR adds two tests, one that demonstrates the starved alarm issue and one that exercises the overflow issue, and fixes the bug in such a way that both pass. In particular, the fix is simply to explicitly check whether alarms overflow when inserting them into the virtual alarm queue and to use that for the comparison between alarms.
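A minimal sketch of that idea, not the code in this PR: the struct and function names below are hypothetical (only the `reference` and `dt` field names come from the diff), and it assumes both alarms were set within the same pass of the 32-bit clock. It contrasts the "obvious" comparison of absolute expirations with one that first checks which alarms wrap.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical, simplified alarm record. `reference` is the tick count when
// the alarm was set and `dt` is the delay in ticks.
typedef struct {
  uint32_t reference;
  uint32_t dt;
} sketch_alarm_t;

// True if reference + dt wraps past UINT32_MAX.
static bool sketch_overflows(const sketch_alarm_t* a) {
  return a->reference > UINT32_MAX - a->dt;
}

// The "obvious" ordering: compare the wrapped absolute expirations only.
static bool naive_expires_after(const sketch_alarm_t* cur,
                                const sketch_alarm_t* new_alarm) {
  uint32_t cur_expiration = cur->reference + cur->dt;
  uint32_t new_expiration = new_alarm->reference + new_alarm->dt;
  return cur_expiration > new_expiration;
}

// Overflow-aware ordering (assumes both references belong to the same pass
// of the clock): an alarm that wraps expires after one that does not;
// otherwise the wrapped expirations are directly comparable.
static bool expires_after(const sketch_alarm_t* cur,
                          const sketch_alarm_t* new_alarm) {
  bool cur_overflows = sketch_overflows(cur);
  bool new_overflows = sketch_overflows(new_alarm);
  if (cur_overflows != new_overflows) {
    return cur_overflows;
  }
  uint32_t cur_expiration = cur->reference + cur->dt;
  uint32_t new_expiration = new_alarm->reference + new_alarm->dt;
  return cur_expiration > new_expiration;
}
```

With a wrapping alarm (`reference = 1000`, `dt = UINT32_MAX`) and a short one (`reference = 1000`, `dt = 32768`), `naive_expires_after` reports that the wrapping alarm does not expire after the short one, while `expires_after` orders them correctly.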

Test 1: Multi-Alarm Simple

This is a simple test for multiple alarms that sets exactly two repeating alarms that should overlap frequently (one has twice the other's frequency). The alarms repeat every 500ms and 1s respectively, and each prints the alarm index (1 or 2), the current tick, and the alarm expiration to the console. When successful, both alarms fire, and alarm 1 fires twice as often as alarm 2.

Today (on the NRF52840 at least, though I don't expect other chips to be different), this test fails. It should output (the first column in each line is the timer index):

1 6316032 6314240
2 10512384 10510592
1 10512384 10511104
1 14722560 14720768
2 18903552 18901760
1 18919680 18917632
1 23116544 23114752
2 27294720 27292928
...

but, due to a bug in the alarm library, actually outputs:

1 7254784 7252992
2 11451136 11449344
2 19842304 19840512
2 28233472 28231680
2 36624640 36622848
2 45015808 45014016
2 53406976 53405184
...

(note timer 1 never appears after the first instance)

Test 2: Multi-Alarm Overflow

This is a test for multiple alarms where one overflows the clock (UINT32_MAX ticks), resulting in a low absolute value expiration, while the other is "normal" (1 second). When successful, both alarms should fire, the second after about 1 second, and the first after the clock overflows (around 7 minutes on 32kHz clocks).

When the logic for inserting alarms doesn't account correctly for overflows, the second alarm fires after the first one, and both fire only after the clock overflows.
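To make the "low absolute value expiration" concrete, here is a small worked sketch in plain C. The helper names are hypothetical and the tick values are illustrative, not taken from the test. It shows that the wrapped expiration of the overflowing alarm is numerically small even though nearly a full clock period remains before it should fire.

```c
#include <stdint.h>

// Absolute expiration as a 32-bit tick value; unsigned arithmetic wraps
// modulo 2^32, so a large dt can yield a small result.
static uint32_t wrapped_expiration(uint32_t reference, uint32_t dt) {
  return reference + dt;
}

// Ticks left before expiry as seen at `now`. The subtraction also wraps, so
// this stays correct across a clock overflow as long as fewer than 2^32
// ticks have passed since `reference`.
static uint32_t ticks_remaining(uint32_t now, uint32_t reference, uint32_t dt) {
  uint32_t elapsed = now - reference;
  return elapsed >= dt ? 0 : dt - elapsed;
}
```

For example, an alarm set at tick 16 with `dt = UINT32_MAX` has a wrapped expiration of 15, which is smaller than the expiration of an alarm set at the same tick with `dt = 32768` (32784), yet it still has `UINT32_MAX` ticks remaining.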

tyler-potyondy

This is an interesting bug and perhaps explains some of the edge case bugs we've seen with the OpenThread library.

I'm curious to see what the root cause ends up being.

examples/tests/multi_alarm_simple_test/main.c

brghena · 2024-09-06T15:18:29Z

This is the kind of test we should be running frequently on Treadmill. Highly important to get right, easy enough to have some expected results (although not exact results every time, so we'll have to pattern match in some way), and historically easy to break accidentally.

lschuermann · 2024-09-08T14:43:24Z

Good catch, and the test seems fine to me.

However, this does raise a more general issue to me, and one that relates to our releases, automated tests, and the timer subsystem in particular: how to assert whether the test ran correctly, or not.

In particular, for timer tests we often just produce some raw, unprocessed output. When it comes to running these tests as part of a release (esp. with our release cadence), we then need to re-learn what it means for these tests to pass or fail. Similarly, with automated tests, esp. with non-deterministic output, it can be hard to encode a check function that determines the test result.

We may want to take this opportunity to figure out what we can add to the user space tests and/or a test framework to make this easier.
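As one possible shape for such a check, here is a hypothetical sketch (not part of this PR or of Treadmill) of a pass/fail function for the Multi-Alarm Simple output. It assumes each output line has the form `index tick expiration`, ignores anything else, and accepts the run when both alarms fired and alarm 1 fired about twice as often as alarm 2. The slack of one firing is an assumption about where the capture window starts and ends.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical checker: counts firings per alarm index in a captured log and
// compares the ratio, rather than matching exact tick values.
static bool check_simple_output(const char* log) {
  int fired1 = 0;
  int fired2 = 0;
  while (*log != '\0') {
    size_t len = strcspn(log, "\n");
    char line[64];
    if (len < sizeof(line)) {
      // Copy one line so sscanf cannot read across the newline.
      memcpy(line, log, len);
      line[len] = '\0';
      int idx;
      unsigned long tick, expiration;
      if (sscanf(line, "%d %lu %lu", &idx, &tick, &expiration) == 3) {
        if (idx == 1) fired1++;
        if (idx == 2) fired2++;
      }
    }
    log += len;
    if (*log == '\n') log++;
  }
  // Both alarms must fire, and alarm 1 roughly twice as often as alarm 2.
  return fired1 > 0 && fired2 > 0 &&
         fired1 >= 2 * fired2 - 1 && fired1 <= 2 * fired2 + 1;
}
```

Applied to the two logs in the PR description, the expected output has 5 firings of alarm 1 and 3 of alarm 2 and passes, while the buggy output has 1 and 6 and fails.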

lschuermann · 2024-09-12T09:46:06Z

FWIW, @charles37 is working on an automated test for this, to be integrated into Treadmill.

Fixes #466. When inserting an alarm into the virtual alarm queue, comparing two alarms needs to account for alarms that may wrap beyond the clock overflow. Simply comparing computed expirations against the new alarm's reference isn't sufficient, as the current alarm may overflow, resulting in an expiration that is earlier by absolute value, but should still expire later.

alevy · 2024-09-14T04:34:23Z

A significant update that adds another test (for clock overflow) and (I believe) fixes the underlying issue. See new PR description.

An intentionally simple test for multiple alarms that only sets exactly two repeating alarms that should overlap frequently (one is twice the other's frequency).

Co-authored-by: Pat Pannuto <[email protected]>

alevy · 2024-09-18T18:49:03Z

@brghena @ppannuto a nudge that this is actually quite important to fix and worth reviewing/merging ASAP.

@tyler-potyondy potentially relevant to the issues you've been having with OpenThread

brghena · 2024-09-19T04:11:11Z

I will look over this tomorrow. Thanks for the nudge.

tyler-potyondy

The logic for the timer placement seems to check out and is an improvement over the somewhat opaque within_range function. I have one small comment for improving the readability of the logic, but overall the changes look good to me.

tyler-potyondy · 2024-09-19T17:58:34Z

libtock/services/alarm.c

+    // This alarm happens after the new alarm if:
+    // - neither expirations overflow and this expiration value is larger than the new expiration
+    // - both overflow and this expiration value is larger than the new expiration
+    // - or, this alarm overflows but the new one doesn't
+    //
+    // If the new alarm overflows and this alarm doesn't, this alarm
+    // happens _before_ the new alarm.
+    if (!(!cur_overflows && new_overflows) && ((cur_overflows && !new_overflows) || cur_expiration > new_expiration)) {


The double negations and nested operations make this somewhat tricky to parse. Perhaps altering the conditional's logic to match the 4 statements enumerated in the comment more closely would improve readability. Something to the effect of:

(case1 || case2 || case3 || case4)

I strongly agree. I really can't parse line 110.

Based on my reading of the comment, the logic would be if ((cur_overflows && !new_overflows) || (cur_expiration > new_expiration)). I don't understand why the first part of the condition is there.

I couldn't quite help myself from being a tiny bit clever still, but I believe the new version of both the comment and boolean logic are clear now.
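For readers following the thread, here is a sketch that puts the three formulations discussed above side by side. It is not the version that was merged; the function names are hypothetical. `original_condition` is the conditional from the hunk, `simplified_condition` is the shorter form proposed in the reply, and `by_cases` spells out the three cases listed in the code comment.

```c
#include <stdbool.h>
#include <stdint.h>

// The conditional as written in the hunk under review.
static bool original_condition(bool cur_overflows, bool new_overflows,
                               uint32_t cur_expiration, uint32_t new_expiration) {
  return !(!cur_overflows && new_overflows) &&
         ((cur_overflows && !new_overflows) || cur_expiration > new_expiration);
}

// The shorter form proposed in review, without the leading guard.
static bool simplified_condition(bool cur_overflows, bool new_overflows,
                                 uint32_t cur_expiration, uint32_t new_expiration) {
  return (cur_overflows && !new_overflows) || (cur_expiration > new_expiration);
}

// One term per case from the code comment.
static bool by_cases(bool cur_overflows, bool new_overflows,
                     uint32_t cur_expiration, uint32_t new_expiration) {
  bool neither_overflows_and_later =
      !cur_overflows && !new_overflows && cur_expiration > new_expiration;
  bool both_overflow_and_later =
      cur_overflows && new_overflows && cur_expiration > new_expiration;
  bool only_cur_overflows = cur_overflows && !new_overflows;
  return neither_overflows_and_later || both_overflow_and_later || only_cur_overflows;
}
```

`original_condition` and `by_cases` agree on every input. `simplified_condition` differs in exactly one situation: only the new alarm overflows and `cur_expiration > new_expiration`. There it returns true, although the comment says the current alarm happens before the new one, which is what the leading guard in the original conditional rules out.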

brghena

This is a good catch and the fix looks good to me. I have a bunch of small things to clean up though.

libtock/services/alarm.c

brghena · 2024-09-19T21:02:19Z

libtock/services/alarm.c

+    // expires.
+    bool cur_overflows = (*cur)->reference > UINT32_MAX - (*cur)->dt;
+
+    // This alarm happens after the new alarm if:


What is "this alarm" and what is "the new alarm"? I think it's cur and alarm respectively? But I might have that backwards. In my opinion, this entire comment block should use the actual names for clarity.

brghena · 2024-09-19T21:10:31Z

libtock/services/alarm.c

+    // This alarm happens after the new alarm if:
+    // - neither expirations overflow and this expiration value is larger than the new expiration
+    // - both overflow and this expiration value is larger than the new expiration
+    // - or, this alarm overflows but the new one doesn't
+    //
+    // If the new alarm overflows and this alarm doesn't, this alarm
+    // happens _before_ the new alarm.
+    if (!(!cur_overflows && new_overflows) && ((cur_overflows && !new_overflows) || cur_expiration > new_expiration)) {


I strongly agree. I really can't parse line 110.

Based on my reading of the comment, the logic would be if ((cur_overflows && !new_overflows) || (cur_expiration > new_expiration)). I don't understand why the first part of the condition is there.

brghena · 2024-09-19T21:11:49Z

examples/tests/multi_alarm_simple_overflow_test/README.md

+# Test Multiple Alarms (With Overflow)
+
+This tests the virtual alarms available to userspace. It sets two
+alarms, first one that overflows the alarm, such that it's expiration


Suggested change

-alarms, first one that overflows the alarm, such that it's expiration
+alarms, first one that overflows the alarm, such that its expiration

brghena · 2024-09-19T21:12:17Z

examples/tests/multi_alarm_simple_overflow_test/README.md

+
+This tests the virtual alarms available to userspace. It sets two
+alarms, first one that overflows the alarm, such that it's expiration
+is small in absolute value (but should shouldn't fire until after the


Should or shouldn't? Pick one please

ppannuto

Review of everything except the timer logic so far, sorry.

examples/tests/multi_alarm_simple_overflow_test/README.md

examples/tests/multi_alarm_simple_test/README.md

libtock/services/alarm.h

Co-authored-by: Pat Pannuto <[email protected]> Co-authored-by: Branden Ghena <[email protected]>

alevy · 2024-09-20T18:24:50Z

@ppannuto @tyler-potyondy @brghena all comments addressed

alevy force-pushed the multi_alarm branch from af4c927 to 9c814ec Compare September 5, 2024 23:39

tyler-potyondy previously approved these changes Sep 6, 2024

ppannuto reviewed Sep 6, 2024

examples/tests/multi_alarm_simple_test/main.c Outdated

alevy dismissed tyler-potyondy’s stale review via 6bb6e36 September 6, 2024 07:34

brghena previously approved these changes Sep 6, 2024

alevy dismissed brghena’s stale review via b919c80 September 14, 2024 04:23

alevy changed the title ~~test: simple multi alarm test~~ Fix multi-alarm overlapping bug Sep 14, 2024

alevy and others added 3 commits September 18, 2024 11:46

test: simple multi alarm test

c0c0072

An intentionally simple test for multiple alarms that only sets exactly two repeating alarms that should overlap frequently (one is twice the other's frequency).

Update examples/tests/multi_alarm_simple_test/main.c

f30488c

Co-authored-by: Pat Pannuto <[email protected]>

alevy force-pushed the multi_alarm branch from b919c80 to 3237e14 Compare September 18, 2024 18:46

tyler-potyondy requested changes Sep 19, 2024

brghena requested changes Sep 19, 2024

ppannuto requested changes Sep 20, 2024

alevy and others added 5 commits September 20, 2024 11:02

Apply suggestions from code review

7ecf030

Co-authored-by: Pat Pannuto <[email protected]> Co-authored-by: Branden Ghena <[email protected]>

Typo

3b11c22

Add example output to multi_alarm_simple_test README

3a9c33a

alarm.c: clarify overflow logic

ffc7f8b

Add example output to multi_alarm_simple_overflow README

099ae64

brghena previously approved these changes Sep 20, 2024

tyler-potyondy previously approved these changes Sep 20, 2024

alarm: parentheses are cheap, precedence is hard

702b4c1

ppannuto dismissed stale reviews from tyler-potyondy and brghena via 702b4c1 September 23, 2024 22:31

ppannuto approved these changes Sep 23, 2024

ppannuto enabled auto-merge September 23, 2024 22:31

ppannuto added this pull request to the merge queue Sep 23, 2024

Merged via the queue into master with commit 928ba0b Sep 23, 2024
4 checks passed

ppannuto deleted the multi_alarm branch September 23, 2024 22:47

alevy mentioned this pull request Oct 15, 2024

alarm: rewrite alarm virtualization with better comments and simpler logic, add tests #468

Merged
