Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle CAN disconnect (detach device from bus) on J1939 #443

Open
in-text opened this issue Jul 17, 2023 · 19 comments
Open

Handle CAN disconnect (detach device from bus) on J1939 #443

in-text opened this issue Jul 17, 2023 · 19 comments

Comments

@in-text
Copy link

in-text commented Jul 17, 2023

Hello!

How can I determine situation, when device has been detached from CAN bus (in userspace) physically?

@olerem
Copy link
Contributor

olerem commented Jul 17, 2023

If no device receiver on the CAN bus ACKed a CAN frame, the linux CAN framework will not send local echo for particular packet (SKB). The linux J1939 stack is using this "echo" to control order of J1939 packets on the bus and at same time to communicate packet errors and progress to the user space over the socket error queue.
Here is one of example on how to subscribe to the error queue:
https://github.com/olerem/can-utils/blob/isobusfs-2023.05.15/isobusfs/isobusfs_srv.c#L278

And here is one example on how to process an error queue message:
https://github.com/olerem/can-utils/blob/isobusfs-2023.05.15/isobusfs/isobusfs_srv.c#L222

Similar examples can be found here:
https://github.com/linux-can/can-utils/blob/master/j1939cat.c

@in-text
Copy link
Author

in-text commented Jul 17, 2023

If no device receiver on the CAN bus ACKed a CAN frame, the linux CAN framework will not send local echo for particular packet (SKB). The linux J1939 stack is using this "echo" to control order of J1939 packets on the bus and at same time to communicate packet errors and progress to the user space over the socket error queue. Here is one of example on how to subscribe to the error queue: https://github.com/olerem/can-utils/blob/isobusfs-2023.05.15/isobusfs/isobusfs_srv.c#L278

And here is one example on how to process an error queue message: https://github.com/olerem/can-utils/blob/isobusfs-2023.05.15/isobusfs/isobusfs_srv.c#L222

Similar examples can be found here: https://github.com/linux-can/can-utils/blob/master/j1939cat.c

Thanks! As I understand, by recvmsg (and error queue) I can determine that CAN-Bus was detached. How I may determine CAN-Bus was reconnected?

By my experience with J1939 driver, if delay between bus detach and attach more than ~10 secs, after bus attach I retrieve error on sendto with errno=ENOTRECOVERABLE.

If bus was detached (at middle of normal work of program), i retrieve error on sendto and errno=EHOSTUNREACH.

@olerem
Copy link
Contributor

olerem commented Jul 17, 2023

Hm, looks like a bug in the driver. We have two places where socket is set to an error state, but no place to recover it:
https://elixir.bootlin.com/linux/latest/source/net/can/j1939/socket.c#L1096
https://elixir.bootlin.com/linux/latest/source/net/can/j1939/socket.c#L1276

@marckleinebudde, @hartkopp, @yope, any ideas to better handle it?

@marckleinebudde
Copy link
Member

@olerem
Copy link
Contributor

olerem commented Jul 17, 2023

As counter part for the j1939_sk_netdev_event_netdown - yes.

For j1939_sk_send_loop_abort it is a bit more difficult. It will not set the socket in to error state if error queue is activated. But there is still problem how to clear an socket error if no error queue is activated? Using ioctl?

@olerem
Copy link
Contributor

olerem commented Jul 23, 2023

@in-text are you able to detect re-attached bus by using error queue?

@in-text
Copy link
Author

in-text commented Jul 25, 2023

@in-text are you able to detect re-attached bus by using error queue?

Okay if I will use j1939cat app for bus attach/detach event? Or I need write some specific code for this test?

@olerem
Copy link
Contributor

olerem commented Jul 25, 2023

j1939cat is designed to abort on errors. You will need to write own code. Please note, there is no attach/detach "event". Just send errors.

@xile273
Copy link

xile273 commented May 24, 2024

I also have a similar problem with the J1939 stack, when using PDU2 format and segmentation.

I'm building a Linux application which simulates J1939 ECUs of a real vehicle and one of those ECUs sends cyclic messages using the PDU2 format. Those messages are longer than 8 byte so they need to be segmented. If there's another device on the bus this works without problem.

However if there's no other device, and the cyclic messages continue to be sent, then after a short time the socket seems to enter some kind of erroneous state. Should then another device be switched on, the socket will be unable to send other
messages and when trying to close the socket the close syscall seems to hang.

Heres how it looks when I send data with another device active, everything works as expected:

Dev     PGN Src Dst p      time      dir data
CAN 1 0EC00  27  FF 6     591.469138 R   20220005 FFCAFE00 
CAN 1 0EB00  27  FF 6     591.519228 R   01103F46 000222AB 
CAN 1 0EB00  27  FF 6     591.569398 R   02000203 10F7EC03 
CAN 1 0EB00  27  FF 6     591.619498 R   036F0012 7E4BF7F2 
...
...

Here the received CAN data when I switch on my receiving device after the simulation has been running
without another device being active on the bus:

Dev     PGN Src Dst p      time      dir data
CAN 1 0EC00  27  FF 6      25.665528 R   20220005 FFCAFE00 
CAN 1 0EC00  27  FF 6      25.665868 R   20220005 FFCAFE00 
CAN 1 0EC00  27  FF 6      25.666128 R   20220005 FFCAFE00 
CAN 1 0EC00  27  FF 6      25.666468 R   20220005 FFCAFE00 

After switching on the other device I would expect to see the same results as in the first case but the socket becomes
unable to send any data.

Here the corresponding dmesg output:

[ 8430.439346] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000b26e46bd: rx timeout, send abort
[ 8430.939545] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000b26e46bd: abort rx timeout. Force session deactivation
[ 8431.202947] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000ba54bde7: rx timeout, send abort
[ 8431.703136] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000ba54bde7: abort rx timeout. Force session deactivation
[ 8431.963466] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000db6c9c27: rx timeout, send abort
[ 8432.463679] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000db6c9c27: abort rx timeout. Force session deactivation
[ 8432.733038] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000c80627e5: rx timeout, send abort
[ 8433.233166] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000c80627e5: abort rx timeout. Force session deactivation
[ 8433.493515] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x0000000096728eb7: rx timeout, send abort
[ 8433.993681] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x0000000096728eb7: abort rx timeout. Force session deactivation
[ 8434.267963] kvaser_usb 1-1.1:1.0 can0: j1939_tp_rxtimer: 0x00000000fb00bcc6: rx timeout, send abort
[ 8434.733613] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000fb00bcc6: connection exists (27 ff). last cmd: 20
[ 8434.733636] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000949e7acb: connection exists (27 ff). last cmd: 20
[ 8434.733772] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000fb00bcc6: connection exists (27 ff). last cmd: 20
[ 8434.733786] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000949e7acb: connection exists (27 ff). last cmd: 20
[ 8434.734061] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000fb00bcc6: connection exists (27 ff). last cmd: 20
[ 8434.734076] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000949e7acb: connection exists (27 ff). last cmd: 20
[ 8434.734530] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000fb00bcc6: connection exists (27 ff). last cmd: 20
[ 8434.734544] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000949e7acb: connection exists (27 ff). last cmd: 20
[ 8434.734763] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000fb00bcc6: connection exists (27 ff). last cmd: 20
[ 8434.734777] kvaser_usb 1-1.1:1.0 can0: j1939_xtp_rx_rts_session_active: 0x00000000949e7acb: connection exists (27 ff). last cmd: 20

I don't really know how to work around this problem as I don't see any way how to make sure that there is another device on
the bus, apart from sending those messages...

I've observed this behavior on kernel 6.8.0 as well as 5.10.72.

@olerem
Copy link
Contributor

olerem commented May 24, 2024

Hi @xile273 , did you tried to use error queue? It will send you all packet notifications and prevent from blocking the socket.

@xile273
Copy link

xile273 commented May 24, 2024

Hi and thank you for your help!
Yes I did so as well.
I get INFO_TX_ABORT errors when sending without another device.
But when I switch on my receiver the behavior with regards to sending data is the same.
What is interesting though is, as soon as I switch on the receiver I stop getting messages from the error queue.

Edit:
Sorry I got confused. When using the error queue I can close the socket, but it is not possible to send
messages after I've switched on another device on the bus. And as stated above as soon as another device
becomes active I stop receiving messages from the error queue.

@olerem
Copy link
Contributor

olerem commented May 24, 2024

Hm.. I tried to use isobusfs client and server, with following modification but was not able to reproduce issue with stalled socket:


@@ -288,8 +288,6 @@ static int isobusfs_cli_handle_events(struct isobusfs_priv *priv, unsigned int n
                                };
 
                                ret = isobusfs_recv_err(priv->sock_ccm, &emsg);
-                               if (ret && ret != -EINTR)
-                                       return ret;
                        }
                } else if (ev->data.fd == STDIN_FILENO) {
                        if (!priv->interactive) {
diff --git a/isobusfs/isobusfs_srv.c b/isobusfs/isobusfs_srv.c
index d89cc79..22cc6d6 100644
--- a/isobusfs/isobusfs_srv.c
+++ b/isobusfs/isobusfs_srv.c
@@ -221,8 +221,6 @@ static int isobusfs_srv_handle_events(struct isobusfs_srv_priv *priv, unsigned i
                                };
 
                                ret = isobusfs_recv_err(priv->sock_fss, &emsg);
-                               if (ret && ret != -EINTR)
-                                       return ret;
                        }
                }
 

I did following test:
on one side:
./isobusfs-srv -a 0x80 -v vol1:/tmp/ -i can0
on other side:
./isobusfs-cli -a 0x90 -r 0x80 -i can1

The kernel is provided by ubuntu:
6.5.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 26 11:23:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

After server and client started, i detached both interfaces from each other. The server side started printing messages:
[2024-05-24 17:46:09.206] [WARNING]: previous message was not acked
[2024-05-24 17:46:11.209] [WARNING]: previous message was not acked

After both interfaces was reattached, the server continued to work as expected.

Can you please test if you can reproduce same issue with cangen? Will it be able to continue sending packets after after you detach and reattach the remote MCU?

@xile273
Copy link

xile273 commented May 24, 2024

Thank you for your quick responses!

I'm out of office and don't have access to my equipment but I will try as soon as possible.
I'll also send my reproducer then.

I'm sorry I also probably wasn't that clear about the error. I've only seen it happen when using segmented
broadcast messages, so only the J1939 BAM transport seems to be affected.
The steps that I take to reproduces are the following:

  1. Create a J1939 socket with SO_BROADCAST enabled
  2. With no other interface attached send multiple messages (something like 20 suffices) to a PDU2 format PGN e.g. 0x00FECA. The message has to > 8 bytes so it doesn't fit into one frame.
  3. Attach the other interface
  4. Try to send another message on the same socket

Like that I can 100% reliably reproduce the error.

I'll test your code and send you my reproducer as soon as possible though.

@olerem olerem mentioned this issue May 25, 2024
@olerem
Copy link
Contributor

olerem commented May 25, 2024

@xile273 , there is one more application for testing, if you like:
#524

@xile273
Copy link

xile273 commented May 25, 2024

Hi, I've done some more testing and the modified isobusfs behaves the same on my system.
I cannot reproduce the issue with cangen.
I haven't tested the J1939 TimeDate services yet though.

But I think I now know what causes the behavior. It only happens when sending those segmented broadcast messages but with an interval which is too short to actually send the whole data before the next send is triggered, and with no other attached can device on the bus.

This means I'm probably also to blame for misusing the j1939 stack...

See this little test program:
https://github.com/xile273/j1939_bam_error/blob/8069b5a0c651661ce8d77248373c7e7a6ed56273/j1939_bam_no_errorqueue.c

It sends a 34 byte message in an interval of 50 milliseconds and on my systems I get this output:

Make sure there is no other can device on the bus
Press any button to continue

Sending a couple of messages
Sending sn 0
Sending sn 1
Sending sn 2
Sending sn 3
Sending sn 4
Sending sn 5
Got error: No route to host. continuing...
Sending sn 6
....
Sending sn 21
Got error: No route to host. continuing...
Sending sn 22
....
Sending sn 35
Sending sn 36
Got error: No route to host. continuing...
Sending sn 37
....
Sending sn 48
Sending sn 49
Switch on another device on the bus
Press any button

Sending a couple of messages
No route to host
Now closing socket....

After the call to close, it hangs and no further messages are observed.
On the receiving side with candump I get this output after re-attaching the interface:

$ candump can0 -td
 (000.000000)  can0  18ECFF27   [8]  20 22 00 05 FF CA FE 00
 (000.000167)  can0  18ECFF27   [8]  20 22 00 05 FF CA FE 00
 (000.000257)  can0  18ECFF27   [8]  20 22 00 05 FF CA FE 00
 (000.000433)  can0  18ECFF27   [8]  20 22 00 05 FF CA FE 00
 (000.000196)  can0  18ECFF27   [8]  20 22 00 05 FF CA FE 00
 (000.000412)  can0  18ECFF27   [8]  20 22 00 05 FF CA FE 00

When running this program with a receiver attached from the start, it works as expected and the socket closes after
sending all queued frames.

But realistically speaking this program can probably be considered as erroneous code on the userspace side...

I've also written a version which prints out received errorqueue messages but it behaves very similar.
https://github.com/xile273/j1939_bam_error/blob/8069b5a0c651661ce8d77248373c7e7a6ed56273/j1939_bam_errorqueue.c

I've done my tests with Kernel 6.5.0 and 6.8.0 with a Kvaser USBcan and a Kvaser Leaf Light v2 CAN interface.
bitrate was always set to 500k.

@olerem
Copy link
Contributor

olerem commented May 26, 2024

Ok, thx. Now i can reproduce it. By the way, it is possible to reproduce it with following commands too:

dd if=/dev/zero of=test bs=1 count=10
j1939cat -i test  can0:0x80 :0xFF,0x12300 -B -R 100000

The only problem is to find some time for debugging :( ...

@olerem
Copy link
Contributor

olerem commented May 26, 2024

@xile273 , can you please test following patch:
olerem/linux-2.6@a32c569
If it works for you, can you please give me your Reported-by and Tested-by

@xile273
Copy link

xile273 commented May 27, 2024

Thank you, I hope you stil had some time to enjoy your weekend...

dd if=/dev/zero of=test bs=1 count=10
j1939cat -i test  can0:0x80 :0xFF,0x12300 -B -R 100000

I did not think of that, that would have been easier...

@xile273 , can you please test following patch: olerem/linux-2.6@a32c569 If it works for you, can you please give me your Reported-by and Tested-by

Your patch works for me and fixes the issue, thank you!
I've updated my github profile with mail and name, tell me if you need anything else.

@olerem
Copy link
Contributor

olerem commented May 28, 2024

@xile273 , thank you!
Can you please tell more about what you are doing with the kernel j1939 stack. It is interesting for me how it is used and if it solves the problems... :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants