Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multi-alarm overlapping bug #466

Merged
merged 9 commits into from
Sep 23, 2024
Merged

Fix multi-alarm overlapping bug #466

merged 9 commits into from
Sep 23, 2024

Conversation

alevy
Copy link
Member

@alevy alevy commented Sep 5, 2024

This PR fixes a subtle bug in the userspace virtual alarm library. If alarms are created and expire pretty close to each other, the within_range check can incorrectly return the relative relationship between the two, place the earlier alarm after the later alarm, which is incorrect, and can end up totally starving the earlier alarm if the later alarm is repeating.

This bug was introduced when trying to correctly account for potentially overflowing alarms, and the "obvious" fix (just compare absolute expiration values) doesn't work.

This PR adds two tests, one that demonstrates the starved alarm issue and one that exercises the overflow issue, and fixes the bug in such a way that both pass. In particular, the fix is simply to explicitly check if alarms overflow when insert them to the virtual alarm queue and use that for the comparison between alarms.

Test 1: Multi-Alarm Simple

This is a simple test for multiple alarms that only sets exactly two repeating alarms that should overlap frequently (one is twice the other's frequency). It sets two repeating alarms, at 500ms and 1s respectively, each prints the alarm index (1 or 2), current tick and alarm expiration to the console. When successful, both alarms should fire and alarm 1 should fire twice as often as alarm 2.

Today (on the NRF52840 at least, though I don't expect other chips to be different), this test fails. It should output (the first column in each line is the timer index):

1 6316032 6314240
2 10512384 10510592
1 10512384 10511104
1 14722560 14720768
2 18903552 18901760
1 18919680 18917632
1 23116544 23114752
2 27294720 27292928
...

but, due to a bug in the alarm library, actually outputs:

1 7254784 7252992
2 11451136 11449344
2 19842304 19840512
2 28233472 28231680
2 36624640 36622848
2 45015808 45014016
2 53406976 53405184
...

(note timer 1 never appears after the first instance)

Test 2: Multi-Alarm Overflow

This is a for multiple alarms where one overflows the clock (UINT32_MAX ticks), resulting in a low absolute value expiration, while the other is "normal" (1 second). When successful, both alarms should fire, the second after about 1 second, and the first after the clock overflows (around 7 minutes on 32kHz clocks).

When the logic for inserting clocks doesn't account correctly for overflows, the second alarm fires after the first one, both after the clock overflows.

tyler-potyondy
tyler-potyondy previously approved these changes Sep 6, 2024
Copy link
Contributor

@tyler-potyondy tyler-potyondy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting bug and perhaps explains some of the edge case bugs we've seen with the OpenThread library.

I'm curious to see what the root cause ends up being.

brghena
brghena previously approved these changes Sep 6, 2024
@brghena
Copy link
Contributor

brghena commented Sep 6, 2024

This is the kind of test we should be running frequently on Treadmill. Highly important to get right, easy enough to have some expected results (although not exact results every time, so we'll have to pattern match in some way), and historically easy to break accidentally.

@lschuermann
Copy link
Member

Good catch, and the test seems fine to me.

However, this does raise a more general issue to me, and one that relates to our releases, automated tests, and the timer subsystem in particular: how to assert whether the test ran correctly, or not.

In particular, for timer tests we often just produce some raw, unprocessed output. When it comes to running these tests as part of a release (esp. with our release cadence), we then need to re-learn what it means for these tests to pass or fail. Similarly, with automated tests, esp. with non-deterministic output, it can be hard to encode a check function that determines the test result.

We may want to take this opportunity to figure out what we can add to the user space tests and/or a test framework to make this easier.

@lschuermann
Copy link
Member

FWIW, @charles37 is working on an automated test for this, to be integrated into Treadmill.

alevy added a commit that referenced this pull request Sep 14, 2024
Fixes #466

When inserting an alarm into the virtual alarm queue, comparing two
alarms needs to account for alarms that may wrap beyond the clock
overflow. Simply comparing computed expirations against the new
alarm's reference isn't sufficient, as the current alarm may overflow,
resulting in an expiration that is earlies by absolute value, but
should still expire later.
@alevy alevy changed the title test: simple multi alarm test Fix multi-alarm overlapping bug Sep 14, 2024
@alevy
Copy link
Member Author

alevy commented Sep 14, 2024

A significant update that adds another test (for clock overflow) and (I believe) fixes the underlying issue. See new PR description.

alevy and others added 3 commits September 18, 2024 11:46
An intentionally simple test for multiple alarms that only sets exactly
two repeating alarms that should overlap frequently (one is twice the
other's frequency).
Fixes #466

When inserting an alarm into the virtual alarm queue, comparing two
alarms needs to account for alarms that may wrap beyond the clock
overflow. Simply comparing computed expirations against the new
alarm's reference isn't sufficient, as the current alarm may overflow,
resulting in an expiration that is earlies by absolute value, but
should still expire later.
@alevy
Copy link
Member Author

alevy commented Sep 18, 2024

@brghena @ppannuto a nudge that this is actually quite important to fix and worth reviewing/merging ASAP.

@tyler-potyondy potentially relevant to the issues you've been having with OpenThread

@brghena
Copy link
Contributor

brghena commented Sep 19, 2024

I will look over this tomorrow. Thanks for the nudge.

Copy link
Contributor

@tyler-potyondy tyler-potyondy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for the timer placement seems to check out and is an improvement over the somewhat opaque within range function. I have one small comment for improving the readability of the logic, but overall the changes look good to me.

Comment on lines 103 to 110
// This alarm happens after the new alarm if:
// - neither expirations overflow and this expiration value is larger than the new expiration
// - both overflow and this expiration value is larger than the new expiration
// - or, this alarm overflows but the new one doesn't
//
// If the new alarm overflows and this alarm doesn't, this alarm
// happens _before_ the new alarm.
if (!(!cur_overflows && new_overflows) && ((cur_overflows && !new_overflows) || cur_expiration > new_expiration)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The double negations and nested operations make this somewhat tricky to parse. Perhaps altering the conditional's logic to match the 4 statements enumerated in the comment more closely would improve readability. Something to the effect of:

( case1 || case2 || case3 || case4)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly agree. I really can't parse line 110.

Based on my reading of the comment, the logic would be if ((cur_overflows && !new_overflows) || (cur_expiration > new_expiration)). I don't understand why the first part of the condition is there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't quite help myself from being a tiny bit clever still, but I believe the new version of both the comment and boolean logic are clear now.

Copy link
Contributor

@brghena brghena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch and the fix looks good to me. I have a bunch of small things to clean up though.

libtock/services/alarm.c Outdated Show resolved Hide resolved
libtock/services/alarm.c Outdated Show resolved Hide resolved
// expires.
bool cur_overflows = (*cur)->reference > UINT32_MAX - (*cur)->dt;

// This alarm happens after the new alarm if:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "this alarm" and what is "the new alarm"? I think it's cur and alarm respectively? But I might have that backwards. In my opinion, this entire comment block should use the actual names for clarity.

Comment on lines 103 to 110
// This alarm happens after the new alarm if:
// - neither expirations overflow and this expiration value is larger than the new expiration
// - both overflow and this expiration value is larger than the new expiration
// - or, this alarm overflows but the new one doesn't
//
// If the new alarm overflows and this alarm doesn't, this alarm
// happens _before_ the new alarm.
if (!(!cur_overflows && new_overflows) && ((cur_overflows && !new_overflows) || cur_expiration > new_expiration)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly agree. I really can't parse line 110.

Based on my reading of the comment, the logic would be if ((cur_overflows && !new_overflows) || (cur_expiration > new_expiration)). I don't understand why the first part of the condition is there.

# Test Multiple Alarms (With Overflow)

This tests the virtual alarms available to userspace. It sets two
alarms, first one that overflows the alarm, such that it's expiration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
alarms, first one that overflows the alarm, such that it's expiration
alarms, first one that overflows the alarm, such that its expiration


This tests the virtual alarms available to userspace. It sets two
alarms, first one that overflows the alarm, such that it's expiration
is small in absolute value (but should shouldn't fire until after the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should or shouldn't? Pick one please

Copy link
Member

@ppannuto ppannuto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of everything except timer logic so far sorry

examples/tests/multi_alarm_simple_overflow_test/README.md Outdated Show resolved Hide resolved
examples/tests/multi_alarm_simple_overflow_test/README.md Outdated Show resolved Hide resolved
libtock/services/alarm.h Outdated Show resolved Hide resolved
@alevy
Copy link
Member Author

alevy commented Sep 20, 2024

@ppannuto @tyler-potyondy @brghena all comments addressed

brghena
brghena previously approved these changes Sep 20, 2024
tyler-potyondy
tyler-potyondy previously approved these changes Sep 20, 2024
@ppannuto ppannuto added this pull request to the merge queue Sep 23, 2024
Merged via the queue into master with commit 928ba0b Sep 23, 2024
4 checks passed
@ppannuto ppannuto deleted the multi_alarm branch September 23, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants