Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TMC2209 with SKR 2.0 using software serial now appears broken #136

Open
fitch22 opened this issue Aug 29, 2023 · 18 comments
Open

TMC2209 with SKR 2.0 using software serial now appears broken #136

fitch22 opened this issue Aug 29, 2023 · 18 comments

Comments

@fitch22
Copy link
Contributor

fitch22 commented Aug 29, 2023

I have finally returned to my CNC project and built latest code for BTT SKR 2.0 with TMC2209. It appears the software serial code no longer works, or I need to configure something else that I am not aware of. The last build I made was 1/24/2022 and that works.

I took a quick look at the new tmc_uart.c file and compared it to the code from btt_skr_2.0.c where it used to be. I found a couple things that looked to be issues but I have been unsuccessful fixing them. I am trying to avoid pulling out my scope to look at the comms:-).

The status in iosender does not show a TMC in NEWOPT the way it did last year. But it does show Trinamic as a PLUGIN. My old firmware shows TMC=7. I cant do a M122 or M122I since the comm doesnt work.

I am happy to help resolve this, but need a starting point. There are a lot of changes to the code that I was not a part of.

Thanks!

@terjeio
Copy link
Contributor

terjeio commented Aug 30, 2023

But it does show Trinamic as a PLUGIN. My old firmware shows TMC=7.

TMC=x is reported after succesfully initializing the driver(s).

I just hooked up a TMC2209 driver to a F446 Nucleo board and it initializes ok. This with the SKR 2 map and UART X routed to an available pin.

There are a lot of changes to the code that I was not a part of.

Changes were for bringing the code in line with other drivers and available for more STM32F4xx boards.
I have no good explanation for why it does not work for the SKR 2 board anymore. One change is that baud rate has been significantly increased from 19200 to 100000, it may be worthwhile to reduce that?

If you decide to pull out the scope then the rcv() function has two commented out digital_out() calls that I used to check the the read timing. These calls needs an aux output defined in the map file, or you may write to an output with a STM library call. It should go high in the middle of a received pulse.

@fitch22
Copy link
Contributor Author

fitch22 commented Aug 30, 2023

@terjeio
I did a couple things.

  1. I looked at the scope timing of last year's firmware and a fresh build. I even started with a brand new git clone of everything just to make sure I had the latest. The timing looked the same, although neither is as good as it should be.
  2. I compared the current software serial code in tmc_uart.c to what what had been in btt_skr_2.0.c. All the changes looked very good to me. I noticed the use of START_DELAY was removed at the cost of an extra 10us bit time, but the timing is no longer dependent on a particular cpu and clock freq, so I like that.

Then I started looking for what is different. There is a #define in my_machine.h for ESTOP_ENABLE in the latest code but was not present in the code I used last year. The default is commented out. When I leave it commented out, I cannot communicate with the 2209s. But if I uncomment the line, so ESTOP_ENABLE is 0, then I can communicate with the 2209s and all is as it was before. I do not understand why this would change anything. When I install new firmware, with it commented out I had to do a $14=71 to get rid of the ESTOP error. When I uncomment the line, I only need $14=7 to get everything going.

Perhaps you know why this would affect the comms to the 2209s. If so can you explain to me what just happened?

Regarding the uart timing: I noticed that the response baud rate coming back from the 2209 is actually faster than the baud rate sent to the 2209. At the very beginning of a read, the mcu is sampling the bit at approximately the center of the bit period, but after many bits, the sampling period slowly walks to later positions in time and eventually gets within ns of the bit transition from the 2209. This is not new behavior, it has existed since day 1, I just never noticed it before. I think the 2209 may not be measuring the bit width very accurately and with a baud rate of 100000. When it tries to reply with the same baud rate as we used to make the request, this is as close as it gets. Problem is we may be just on the edge of things working. I will look into the 2209 data sheet to see if there might be a baud rate that it can duplicate more precisely. Maybe a more traditional rate like 115200 would actually be better. If I make any improvements in this area I will let you know.

Thanks,
Bill

@fitch22
Copy link
Contributor Author

fitch22 commented Aug 30, 2023

@terjeio
Well I guess I spoke too soon. New build on one board initializes all the drivers. Same build on different board can only talk to X. Old firmware works fine with all 3 axis. There must be something else going on other than the serial stuff. I need a bit of time but will report back.

Bill

@fitch22
Copy link
Contributor Author

fitch22 commented Aug 31, 2023

@terjeio
I have good news and I have bad news. The good news is I know exactly where the problem occurs. The bad news is that it is due to the checkin on 2/9/2022 5:26:04 AM (might be my local time which is PDT), and the comment is "Refactored soft UART code...". However, this particular version scrogs my USB, and using version on 2/9/2022 5:41:33 AM clearly exhibits the problem.

What I have found is that things sort of work, but only if there is a single 2209 in the system. When I boot from this code, and I change the stepper driver selection to only include X, then I can indeed get status from that driver. Resetting the board then shows status that includes TMC=1. But at that point it does not like my Y or Z drivers.

In contrast, the version on 2/5/2022 12:41:09 AM works just fine.

Since this is related to 1 vs multiple 2209s, it may have been missed unless tested with multiple drivers. My quick comparison of the code from before and after did not find anything, but I was not looking to see how the X, Y, and Z calls worked, I only looked at the algorithm for the software serial which I thought was OK.

The only changes in this checkin were related to the UART code. You may see an issue quicker than I can since I have to figure out how the multi-channel stuff works, but I will see if I can find it. Appreciate your help if you can.

Thanks,
Bill

@terjeio
Copy link
Contributor

terjeio commented Sep 1, 2023

Some initial findings:

There is a bug in trinamic.c that will cause a hardfault if the X driver is not enabled. Change this line to
if(report.sg_status_enable && stepper[report.sg_status_motor]) {
to fix that. It could be this that breaks the USB communication.

After setting some breakpoints and reiniting the drivers multiple times I've found that the crashes seems to related to transmit, not receive. This is with a single physical driver connected to a single UART pin that is mapped to all logical drivers.
In the init function I compare received commands count with the count before and after sending seven commands, this fails more more less randomly. I have yet to see any receive CRC errors, but there may be other frame errors present that I have not trapped yet. Reducing the baud rate by setting SWS_BAUDRATE to 50000 seems to help.

@terjeio
Copy link
Contributor

terjeio commented Sep 1, 2023

Here is the end of a read at 38400 baud, sampling looks near perfect to me:

uart

I wonder if an interrupt delaying the next timer interrupt during transmission of the sync bits can affect driver baud rate detection? It is a pretty small window to hit though, and the timer interrupt has the highest priority. I am using UART to connect to the controller, USB might have longer latencies and thus be problematic?

@fitch22
Copy link
Contributor Author

fitch22 commented Sep 1, 2023

I took the latest code from a fresh clone. I deleted tmc_uart.c and replaced btt_skr_2.0.c and btt_skr_2.0_map.h with those from 2/5/2022. There was a warning and an error due to changes in enumerate_pins, but those are easily fixed. Basically, I wanted to see if going back to the old code works with everything else up to date. And it works just fine.

I think I really need to dig out the debugger and just step through things, but being lazy I have started swapping old code for new code to see if I can quickly determine what the culprit is. I expect it will turn out to be something very subtle. I will let you know once I find it.

Bill

@fitch22
Copy link
Contributor Author

fitch22 commented Sep 1, 2023

If we do delay one of the bits from another interrupt, as long as the delay is less than about 1/2 the baud rate we should be OK. My theory is that on a read we are trying to sample in the middle of the bit. We could be delayed all the way until the end of the bit and still get the same sample value. On a write, I am thinking the UART inside the 2209 is more sophisticated and could tolerate a comparable amount of jitter.

My observation of the sample point walking toward the end of the bit cycle worries me. This is caused by 2209 behavior as the SKR uses the same timer for transmit as for receive. The 2209 has to measure the baud rate of the SKR during the request, and attempt to output its data using the same baud rate, but I think it is only able to get close, not perfect. With other interrupts causing delays, I can see the SKR getting a bad sample. I think once I understand why the new uart code does not work I will see if there is something that can be done to make this a bit more robust as well.

@terjeio
Copy link
Contributor

terjeio commented Sep 2, 2023

Try this with baud rate set to 100000:

static void write_n (uint8_t data[], uint32_t length)
{
    uint_fast8_t i = 0;

    setTX();
    while(tx_buf.busy);                             // should not be anything pending but...

    tx_buf.data = (data[i++] << 1) | STOP_BIT;      // form first word with START and STOP bits
    tx_buf.busy = true;
    length--;

    TMC_UART_TIMER->CNT = period_div_2 * 3;                        // initialize counter and
    TMC_UART_TIMER->CR1 &= ~TIM_CR1_UDIS;           // enable interrupt

    send();

    while(tx_buf.busy);                             // wait for 1st byte to finish

    do {
        tx_buf.data = (data[i++] << 1) | STOP_BIT;  // form next word
        tx_buf.busy = true;
        while(tx_buf.busy);                         // wait for byte to finish
    } while(--length);

    TMC_UART_TIMER->CR1 |= TIM_CR1_UDIS;            // disable interrupt
}

It sends a longer initial start bit - and seems to fix the issue. Bedtime is long overdue so I quit for today....

@terjeio
Copy link
Contributor

terjeio commented Sep 2, 2023

Disregard the previous comment, too tired when I wrote it so forgot a change elsewhere.

The solution seems to be to resync the timer on each byte (wait for the start bit), the following version seems to be robust at 200 Kbaud, I have not tested any faster. FYI I use 230400 baud for hardware UART, I have not seen any problems with that so the drivers can handle it.

tmc_uart.zip

Relying on no delay between bytes from the driver was a tad too optimistic?

@fitch22
Copy link
Contributor Author

fitch22 commented Sep 3, 2023

Terje,
I see something that might explain why in some cases only my X driver works. I set a breakpoint at the entry to both tmc_uart_write() and tmc_uart_read(). The very first breakpoint is hit with a read. When I have all three axis enabled, the first read is for Z. Using old code that works, the first read request is 0x05 0x02 0x01 0xc1, and the 2209 responds with data. Using fresh code that does not work, the first read request is 0x05 0x00 0x01 0x9a, and since the slave address = 0, the Z driver does not respond.

Did you eliminate the slave addressing at some point?

@fitch22
Copy link
Contributor Author

fitch22 commented Sep 3, 2023

Terje,
I just pulled the jumpers out from under Y and Z and now all axis work. It looks like you did remove the slave addressing (when did you do that?). I have a vague recollection that I suffered the opposite problem a while back:-).

Now the mystery is solved for me.

Now that I have my debug environment all set up, I will take a look at your latest code you sent me.

By the way, regarding re-syncing for each byte: The 2209 data transfer is not really asynchronous. When it sends data back, it sends it as a block, not as a bunch of individual bytes, so it should not be necessary to re-sync per byte. However, I don't see any down side to it other than a bit more work per byte to locate the START bit. If that means being able to run faster, that is probably a good thing, too.

@terjeio
Copy link
Contributor

terjeio commented Sep 3, 2023

Did you eliminate the slave addressing at some point?

When using separate UARTs for each driver slave addressing does not make sense, and some boards does not even allow it. So it is disabled in the shared code. I missed that when I switched to the shared code for the SKR2, sorry about that.

By the way, regarding re-syncing for each byte: The 2209 data transfer is not really asynchronous.

I am lazy to look it up myself - is that stated in the datasheet? Anyway, drifting of the sample point that I see is not always present and only randomly causes read failures - sometimes very seldom. Either the initial sync bits sent from the controller is sligthly out of timing (due to another interrupt interfering?) or the response is. Anyway, resyncing helps so whatever triggers the drifting does not matter. From the datasheet: "The bit time is calculated by measuring the time from the beginning of start bit (1 to 0 transition) to the end of the sync frame (1 to 0 transition from bit 2 to bit 3)."

@fitch22
Copy link
Contributor Author

fitch22 commented Sep 3, 2023

Terje,
The 2209 data sheet does not specifically state that the data is returned as a block on a read, it was just my observation of its behavior. The data sheet uses the term UART, so it is probably safer to assume it is asynchronous.

Now regarding your code resampling for START every byte: I like it! The only change I would make is you have called stop_listening() in a few places where SetTx() had been called. In all cases, either a very long timeout has occurred or the mandatory 12 bits times have already elapsed, so waiting another 5 bit times I think is unnecessary. And since START_DELAY is no longer used, I would just pull it out of the code to eliminate any confusion.

If a bit error occurs, there should be a CRC error. Does the driver detect this and try again?

@terjeio
Copy link
Contributor

terjeio commented Sep 3, 2023

The only change I would make is you have called stop_listening() in a few places where SetTx() had been called.

I have reverted this already, I made the change when experimenting and having trouble with the timer IRQ enabled state.
I'll remove the START_DELAY definition in the next commit.

If a bit error occurs, there should be a CRC error. Does the driver detect this and try again?

It is detected and the command silently fails. Retries can be added but should be done with raising an alarm if times out?
Proper error handling is sometimes hard...

@fitch22
Copy link
Contributor Author

fitch22 commented Sep 4, 2023

Terje,
What do I have to do to use digital_out()? I put some declarations in my map file like:
#define AUXOUTPUT0_PORT GPIOE
#define AUXOUTPUT0_PIN 10
But could not get anything to work. I must be missing something. Any instructions anywhere?

thanks

@terjeio
Copy link
Contributor

terjeio commented Sep 4, 2023

You have to add #define HAS_IOPORTS to the map file. Use $pins to list the assigned pins, aux pins are shown with a P<number> if not claimed, is the "port" to use with the hal.port.digital_out() call.

Any instructions anywhere?

Not yet, a programmers reference manual (PRM) should be written - but I have no idea about when I am able to do that. I have started to add markup to the source though, this in order to generate some of the documentation. See here for both the HAL API and some core helper functions.

@fitch22
Copy link
Contributor Author

fitch22 commented Sep 6, 2023

Terje,
OK, got the hal.port.digital_out() working, and that's nice.

I never noticed before, but I apparently did not have 100% communication with my 2209s. I could reset the board and watch the traffic on my scope and maybe 1 out of 10 or 20 times, things would croak on a driver. I could get the Y driver to fail pretty consistently in the same place, and it was always a read where it just did not respond. I figured it must be getting a CRC error on the read request. This would be a good argument to have error checking on the reads and writes. So that gave me an idea.

I added some code to TMC2209_WriteRegister() to check the ifcnt value after each write, and retry the write until the count incremented. I added some code the TMC2209_ReadRegister() to retry if it timed out or if it got a CRC error. With these changes in place, I am unable to see any failures during initialization, but things take a bit more time.

Then I switched to the new driver that re-syncs on every received byte, and again could not see any problems. I removed my error checking and still cannot see any problems, but I don't really have any formal test.

What are your thoughts regarding some error checking when using software serial? I see that during driver init that the code checks to make sure the writes were good, but after that the return status of _WriteRegister an _ReadRegister is ignored. I think retrying when an error occurs could be beneficial, especially if the return codes are ignored. Is there a way to raise an alarm if some maximum number of retries was exceeded? Not sure what else makes sense to do.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants