Block until mutex can be locked #318

Merged: 2 commits merged into esp-rs:main on Oct 30, 2023
bugadani
Contributor

@bugadani bugadani commented Oct 29, 2023

In esp-idf, the corresponding function is implemented as:

static int32_t IRAM_ATTR mutex_lock_wrapper(void *mutex)
{
    return (int32_t)xSemaphoreTakeRecursive(mutex, portMAX_DELAY);
}

This function suspends the task until the mutex can be locked. This PR attempts to replicate this behaviour.

It is common to ignore the return value of such mutex locks, as they are assumed to always succeed. In these cases the old code may have entered regions from multiple places simultaneously.

The second commit moves some code out of a critical section to prevent infinitely looping when the mutex can't be locked immediately.
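As a rough illustration of the behaviour described above, here is a minimal sketch of a recursive mutex whose `lock` only returns once the lock is held. It is not the esp-wifi implementation: `RecursiveMutex`, the numeric task ids (which must be non-zero) and the use of `std::thread::yield_now` as a stand-in for `yield_task` are all assumptions made for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical recursive mutex. `owner == 0` means "not locked",
/// so task ids passed to the methods below must be non-zero.
pub struct RecursiveMutex {
    owner: AtomicUsize, // id of the task holding the lock, 0 if free
    count: AtomicUsize, // recursion depth, only modified by the owner
}

impl RecursiveMutex {
    pub const fn new() -> Self {
        Self {
            owner: AtomicUsize::new(0),
            count: AtomicUsize::new(0),
        }
    }

    /// One non-blocking attempt. Returns true if `task` holds the lock afterwards.
    pub fn try_lock(&self, task: usize) -> bool {
        if self.owner.load(Ordering::Acquire) == task {
            // Recursive acquisition by the current owner.
            self.count.fetch_add(1, Ordering::Relaxed);
            return true;
        }
        let claimed = self
            .owner
            .compare_exchange(0, task, Ordering::Acquire, Ordering::Relaxed)
            .is_ok();
        if claimed {
            self.count.store(1, Ordering::Relaxed);
        }
        claimed
    }

    /// Blocks until the lock is held, so callers that ignore the result are
    /// still safe. `yield_now` stands in for the scheduler's yield call.
    pub fn lock(&self, task: usize) {
        while !self.try_lock(task) {
            thread::yield_now();
        }
    }

    /// Releases one level of recursion. Returns false if `task` is not the owner.
    pub fn unlock(&self, task: usize) -> bool {
        if self.owner.load(Ordering::Acquire) != task {
            return false;
        }
        if self.count.fetch_sub(1, Ordering::Relaxed) == 1 {
            // Last level released: hand the mutex back.
            self.owner.store(0, Ordering::Release);
        }
        true
    }
}
```

With a single-attempt `try_lock` alone, a caller that ignores the `false` result would continue into the protected region without owning the mutex; the looping `lock` rules that out at the cost of blocking.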

@bugadani
Contributor Author

bugadani commented Oct 29, 2023

This is 100% my fault (or maybe 50-50 since @MabezDev didn't notice my mistake :) ) as the change I introduced in #276 is absolutely, completely broken. It's also very probably the cause of all my issues in #315 - and the intermittent failure experienced during testing, maybe?

BUT! I am not sure that the original code was working as intended either, or that my previous transformation was wrong in the sense of doing something logically different. I still see no code path in the old code where lock_mutex would have looped. Unless there's some subtlety I'm skipping over, the branch (false, false) should be unreachable. Why am I wrong?

@bugadani bugadani changed the title Block until mutex is locked Block until mutex can be locked Oct 29, 2023
@bugadani
Contributor Author

bugadani commented Oct 30, 2023

Haha, I managed to run into this blocking forever 😭 But it made me realize that the software interrupt that would switch the tasks is masked for some reason. Code is now looping in lock_mutex, yield_task fires the interrupt request that never gets handled, which obviously doesn't let the code make any progress.
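The failure mode described here can be reproduced in a toy, single-threaded model. This is only a sketch under stated assumptions: `Core`, its fields and both `lock_*` methods are hypothetical, and the model assumes the task holding the mutex releases it as soon as the scheduler gets to run.

```rust
/// Toy model of one core with a single software interrupt line.
pub struct Core {
    masked: bool,              // true while inside a critical section
    pending: bool,             // interrupt requested but not yet handled
    mutex_held_by_other: bool, // another task currently owns the mutex
}

impl Core {
    pub fn new() -> Self {
        Core { masked: false, pending: false, mutex_held_by_other: true }
    }

    /// Models yield_task: it only raises the interrupt request.
    fn yield_task(&mut self) {
        self.pending = true;
        self.service_interrupts();
    }

    /// The scheduler runs only while interrupts are unmasked. In this model
    /// the other task releases the mutex as soon as it is scheduled.
    fn service_interrupts(&mut self) {
        if self.pending && !self.masked {
            self.pending = false;
            self.mutex_held_by_other = false;
        }
    }

    fn try_lock(&self) -> bool {
        !self.mutex_held_by_other
    }

    /// Spins inside the critical section. The yield request is never
    /// serviced, so no attempt can succeed. Bounded so the demo terminates.
    pub fn lock_inside_cs(&mut self, max_spins: u32) -> Option<u32> {
        self.masked = true;
        let mut result = None;
        for attempt in 1..=max_spins {
            if self.try_lock() {
                result = Some(attempt);
                break;
            }
            self.yield_task();
        }
        self.masked = false;
        self.service_interrupts();
        result
    }

    /// Keeps only the lock attempt inside the critical section and yields
    /// outside of it, so the pending interrupt is handled between attempts.
    pub fn lock_outside_cs(&mut self, max_spins: u32) -> Option<u32> {
        for attempt in 1..=max_spins {
            self.masked = true;
            let got = self.try_lock();
            self.masked = false;
            self.service_interrupts();
            if got {
                return Some(attempt);
            }
            self.yield_task();
        }
        None
    }
}
```

In this model `Core::new().lock_inside_cs(1000)` gives up after all 1000 attempts and returns `None`, while `Core::new().lock_outside_cs(1000)` returns `Some(2)`: the first attempt fails, the yield is serviced, and the second attempt succeeds.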

@bjoernQ
Contributor

bjoernQ commented Oct 30, 2023

Seems you are right: while the loop definitely shows the intent to loop there until the mutex is available, in reality the old code never got there.

The thing about yield is a bit weird since it used to work - I think there were some changes to interrupt-on / off or something which might cause that now. We have more places where we call yield 🤔

I will compare the behavior to some older revision to see if the Software-Interrupt gets handled there correctly. I assume you are using S3, right?

@bugadani
Contributor Author

bugadani commented Oct 30, 2023

I'm using the S3, yes. I don't think the yield mechanism is incorrect; nothing would work if that was the case. I traced task switching yesterday and it was cycling between the three tasks correctly.

What I'm thinking is that maybe we're not unmasking some interrupts correctly in one of the functions we provide to the driver. I remember messing with those, I just don't remember if my changes made it in or not (i.e. if this is something I've caused). I'm planning to spend my time on this today, though, so you don't have to if you have other stuff.

It's very annoying to debug because, as with most concurrency issues, it's only occurring rarely.

@bjoernQ
Contributor

bjoernQ commented Oct 30, 2023

Yeah, it probably doesn't make sense to look into this in parallel. Definitely won't be fun.

@bugadani
Contributor Author

This is the call trace at the place of the infinite loop:
[image: screenshot of the call trace]

I hope we don't have more of these but I don't know. If a freeze happens again, we can connect a debugger, read the stack trace and update accordingly.

@bugadani bugadani force-pushed the mutex branch 2 times, most recently from 03ebc5d to 6a76d6f Compare October 30, 2023 09:46
Review thread on esp-wifi/src/wifi/mod.rs (outdated, resolved)
@bugadani bugadani force-pushed the mutex branch 2 times, most recently from 145bad5 to b749a5f Compare October 30, 2023 13:07
Member

@MabezDev MabezDev left a comment

yield_task being blocked by a CS is quite the footgun; we'll have to watch out for that :D. I agree that we should avoid CS with our "freertos" primitives where possible to avoid these situations in the future.

LGTM

@MabezDev MabezDev merged commit d6bc265 into esp-rs:main Oct 30, 2023
7 checks passed
@bugadani bugadani deleted the mutex branch October 30, 2023 14:37
3 participants