Losing connection while downloading large file may result in unrecoverable state #315

Closed
bugadani opened this issue Oct 27, 2023 · 3 comments

Comments

@bugadani
Contributor

The "while downloading large file" part may not be necessary, but it's how I'm able to reproduce this issue most reliably. It looks like the TX queue fills up and never gets cleared afterwards.

@bjoernQ
Contributor

bjoernQ commented Oct 27, 2023

Might be a new thing?

How large is the file? During my tests to improve the throughput I think I used files >2M. Also in the very beginning I had an MQTT temperature logger running for more than one week without error recovery. (But with the non-async API... There was no async back then)

@bugadani
Contributor Author

bugadani commented Oct 27, 2023

I'm downloading and discarding a firmware update, around 1MB, as part of a throughput test. The key is losing connection for more than a couple of seconds. The indication that this happens is that the download speed drops to 0 and stays at 0. 1-2 seconds of this is survivable, but in this state, as soon as the TX queue fills up with frames, the driver can't seem to recover.

I'll test this more, I've done 2 days of hair pulling due to this. TX/RX buffer configuration matters a lot in terms of what happens if the issue hits (if the wifi stack exhausts its heap the app can completely die), but other than that I didn't make that much progress.


Update on 28th:

This seems to be a lot messier than just losing connection. I can't seem to reproduce this issue today, but I'm building on a different computer, which introduces more variables. This time I'm seeing the TX callback receiving a false status. Either this is because I'm building with 1.73.0.0, Intel vs AMD, general bad luck, or there's something totally different going on and I drew the wrong conclusions. At this point anything goes.


I've managed a repro. This test run was made with bugadani@a447c5a which is slightly modified, but the problem is not specific to my fork. Logs are here though they are only indicative, not very informative. You can see [DEBUG] - tx inflight (consume): 10 in line 5919, this is the last time the network stack consumes a TX token. After this point, no TX tokens are returned. The last esp_wifi_tx_done_cb call happened in line 5321.

The test I'm running timed out in line 5692, earlier than the last TX token consume. The test timeout drops the HTTP client, the HTTP connection, and the TCP socket. I am able to restart the test, around line 5827, but it just eats up the remaining TX tokens (which I think smoltcp would do anyway), and then nothing happens except for timers arming, disarming and sometimes firing.
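
To illustrate the token accounting described above, here is a minimal host-side sketch. It is not the driver's actual code: `TxTokens`, `try_consume`, `release` and `MAX_TX_INFLIGHT` are hypothetical names, and the limit of 10 is only an assumption taken from the `tx inflight (consume): 10` log line. The idea is that the network stack takes a token for each frame it hands to the radio, and the TX-done callback gives one back.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical cap on frames handed to the radio but not yet completed.
/// (Assumption: chosen to match the "tx inflight (consume): 10" log line.)
pub const MAX_TX_INFLIGHT: usize = 10;

/// Hypothetical in-flight frame counter shared between the network stack
/// (consumer side) and the TX-done callback (release side).
pub struct TxTokens {
    inflight: AtomicUsize,
}

impl TxTokens {
    pub const fn new() -> Self {
        Self { inflight: AtomicUsize::new(0) }
    }

    /// Called when the stack wants to send a frame.
    /// Returns false when every token is already in flight.
    pub fn try_consume(&self) -> bool {
        self.inflight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < MAX_TX_INFLIGHT { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Called from the TX-done callback; never underflows.
    pub fn release(&self) {
        let _ = self
            .inflight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1));
    }

    /// Number of frames currently considered in flight.
    pub fn inflight(&self) -> usize {
        self.inflight.load(Ordering::Acquire)
    }
}
```

In this model, if `release` is never called again after the limit is reached, every later `try_consume` fails and the sender stays stuck no matter how many times the socket is recreated, which is the same shape as the behaviour in the logs.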


What I've been able to determine so far:

  • it's not the event queue filling up silently
  • task switching doesn't stop, though I didn't verify that there aren't stack overflows or similar errors
  • the interrupt handlers are only set once and the INTERRUPT_HANDLER_1 pointers most likely aren't corrupted (though I should check this)
  • just walking out of range isn't sufficient to reproduce this error

What I want to check:

  • are the interrupt handlers firing after the error has occurred? is the right handler function called?
  • is it possible that the interrupt handlers don't have time to handle some interrupt, so it gets lost when the same event happens again?
  • does power management wake up the modem correctly? does the error reproduce with ps-min-modem, as it started happening after enabling max?
  • am I bumping up against sta's max beacon interval timeout?
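
One way the "are the interrupt handlers firing" question could be checked is with a call counter that is bumped inside the handler and sampled from the main task. This is only a sketch under assumptions: `HandlerCounter`, `record`, `snapshot` and `calls_between` are hypothetical helpers, not part of the driver, and on the target the counter would have to live in a `static` that the real handler can reach.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical counter incremented once per interrupt handler invocation.
pub struct HandlerCounter {
    calls: AtomicU32,
}

impl HandlerCounter {
    pub const fn new() -> Self {
        Self { calls: AtomicU32::new(0) }
    }

    /// Call this at the top of the interrupt handler under test.
    pub fn record(&self) {
        self.calls.fetch_add(1, Ordering::Relaxed);
    }

    /// Read the current call count from the main task.
    pub fn snapshot(&self) -> u32 {
        self.calls.load(Ordering::Relaxed)
    }
}

/// Number of handler calls between two snapshots.
/// Uses wrapping arithmetic so a counter overflow does not hide activity.
pub fn calls_between(before: u32, after: u32) -> u32 {
    after.wrapping_sub(before)
}
```

Taking one snapshot when the download speed drops to 0 and another a few seconds later would show whether the handler is still being entered at all: a difference of 0 points at the interrupt side, while a non-zero difference points at what happens after the handler runs.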

@bugadani
Contributor Author

My app has been running for 3.5 hours in a loop without the issue, so I assume #318 has fixed it.

@github-project-automation github-project-automation bot moved this from Todo to Done in esp-rs Oct 31, 2023