Multiple nic (communication) card causes disconnections for gateway type devices #267

Open
gdziuba opened this issue May 30, 2024 · 12 comments
Assignees
Labels
customer request requested by customer needs-triage Needs looking at to decide what to do sales request requested by a sales lead

Comments

@gdziuba
Contributor

gdziuba commented May 30, 2024

Current Behavior

When a device with multiple NICs runs the device agent, it appears to cause network issues on the device, and the device then starts trying to switch between network connections.

Device issues can be alleviated when one of the communication cards is disabled.

We have been working with Robustel and have access to a device on which we can reproduce the issue. Please reach out to @gdziuba for details on how to connect to the device.

Expected Behavior

The device agent maintains its connection to the platform.

Steps To Reproduce

In this case we use an EG5120 from Robustel. Running the device agent with both a SIM card and Wi-Fi active causes the device to start disconnecting and reconnecting.

Environment

  • FlowFuse version:
  • Node.js version:
  • npm version:
  • Platform/OS:
  • Browser:

Linked Customers

  • Customer name and/or link to HubSpot contact

This customer seems to be having similar issues with a different product:

@gdziuba gdziuba added the needs-triage Needs looking at to decide what to do label May 30, 2024
@gdziuba
Contributor Author

gdziuba commented Jun 14, 2024

@joepavitt, @knolleary: this is the issue I raised a concern about in the #dept-engineering channel.

@knolleary
Member

@gdziuba It would be useful to have more context on this - we don't have much to go on here. It feels like this is going to be a very device-specific thing and we haven't had any other reports of this type of behaviour.

The EG5120 was mentioned in a support ticket earlier this week regarding IPv6 connectivity. I wonder if that is related at all here.

@joepavitt joepavitt added customer request requested by customer sales request requested by a sales lead labels Jun 14, 2024
@gdziuba
Contributor Author

gdziuba commented Jun 14, 2024

@knolleary That is a fair statement.

The customer linked above was incorrect and will be updated: https://app-eu1.hubspot.com/contacts/26586079/record/0-2/7790198479

Customer notes:

Compulab IoT gateway
Debian
IOT-GATE-IMX8PLUS - Industrial IoT Gateway
Cellular Communication
Bandwidth issues
Agent Version is 2.3.2 (Updated to 2.4.1 and still had issues)
Had to restart the service
Not Docker
TCP and UDP issues on device agents

Steps to reproduce the issue:

  1. Clean device
  2. Install device agent
  3. Start
  4. Pulls Node-RED down
  5. Deploy a flow
  6. Try to reopen editor and it doesn't connect
  7. Restart device agent service
  8. Typically, opening editor is fine until another deploy.

Resolution for EG5120

  1. Connect to Ethernet
  2. Disable other communication cards (wifi, cellular)

@hardillb
Contributor

We will need to know how the network is configured on both interfaces, and which routes are configured with which weights.

Also, how else does the network change when the interface changes? For example, does the Ethernet network need a proxy, whereas the cellular network does not?

@gdziuba
Contributor Author

gdziuba commented Jun 14, 2024

I have a device, provided by a partner, that we can use to troubleshoot the issue.

@gdziuba
Contributor Author

gdziuba commented Jun 14, 2024

The devices we can test are OEM devices; my understanding is that each is a clean installation with only one network card in active use.

@hardillb hardillb self-assigned this Jul 11, 2024
@hardillb
Contributor

@gdziuba Can you let me know where we are with getting a device that we can investigate this on, please?

For this sort of problem it will be useful to have access to the device while it changes network, so if it has two network interfaces, having a serial console attached would help.

Give me a shout when you have time for a chat about this.

@gdziuba
Contributor Author

gdziuba commented Jul 11, 2024

@hardillb I sent you an email. It is in the thread with Russell.

@gdziuba
Contributor Author

gdziuba commented Aug 13, 2024

We were able to replicate the issue with a customer. Here is a video: https://flowforgeworkspace.slack.com/archives/C032Q63FGG1/p1723493785229809

An update on this: there is only one NIC on these devices. He was switching from a local physical connection to a SIM card connection and then back. When he tried to go back to the physical connection, it could never recover, and the service had to be restarted. Do we have any kind of self-recovery mechanism, i.e. on a network configuration change, try to reconnect?

The last error it had was this:
/etc/systemd/system/flowfuse-device-agent.service:3: Assignment outside of section. Ignoring.
Though I think it is unrelated.
I suspect the other issue I am seeing from people who previously tested FlowFuse is the same: when they made network changes, the device could never reconnect to FlowFuse.
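
On the self-recovery question, here is a minimal sketch of the "network configuration change, try to reconnect" idea. It is not how the device agent behaves today. It polls Node's `os.networkInterfaces()` and invokes a callback when the set of non-internal addresses changes. The names `fingerprint` and `watchNetwork`, the 5 second interval, and the `onChange` callback are all assumptions made for illustration.

```javascript
const os = require("node:os");

// Reduce the interface table to a stable string built from the name,
// family and address of every non-internal address.
function fingerprint(interfaces) {
  const rows = [];
  for (const [name, addrs] of Object.entries(interfaces)) {
    for (const a of addrs || []) {
      if (!a.internal) rows.push(`${name}|${a.family}|${a.address}`);
    }
  }
  return rows.sort().join(",");
}

// Poll the interface table and call onChange(newFingerprint) when it differs
// from the previous poll. Returns a function that stops the polling.
// `read` is injectable so the logic can be exercised with fake data.
function watchNetwork(onChange, { intervalMs = 5000, read = os.networkInterfaces } = {}) {
  let last = fingerprint(read());
  const timer = setInterval(() => {
    const now = fingerprint(read());
    if (now !== last) {
      last = now;
      onChange(now); // hypothetical hook: tear down and re-establish the platform connection here
    }
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for polling
  return () => clearInterval(timer);
}
```

Polling is the simplest option that needs nothing beyond the Node.js standard library; whether it is the right trigger depends on how the failover actually changes the interfaces and routes on the device.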

@hardillb
Contributor

hardillb commented Aug 13, 2024

@gdziuba We need to know how the network failover is triggered, and exactly how the network is configured.

e.g.

  1. When the cellular connection is active, is the ethernet interface still up?
  2. How are they moving back to the ethernet connection? Does it drop the cellular interface?
  3. How does the cellular interface present itself? Is it PPP or a USB Ethernet device?

Also, for the "Assignment outside of section. Ignoring." error, we need a copy of their flowfuse-device-agent.service file.
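
For context on that warning: systemd logs "Assignment outside of section. Ignoring." when it meets a non-comment line before any `[Section]` header, and the `:3` in the message is the line number within the unit file. Until the real file is available, a rough check like the one below could point at the offending line. It is a simplified sketch (it ignores line continuations), `linesOutsideSection` is a hypothetical helper, and the sample unit text is made up for illustration; it is not the customer's file.

```javascript
// Return the 1-based numbers of non-blank, non-comment lines that appear
// before the first [Section] header in a systemd unit file.
function linesOutsideSection(unitText) {
  const hits = [];
  const lines = unitText.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line.startsWith("[")) break; // first section header reached
    if (line === "" || line.startsWith("#") || line.startsWith(";")) continue;
    hits.push(i + 1);
  }
  return hits;
}

// Made-up example: an assignment on line 3, above the first section header.
const sample = [
  "# FlowFuse Device Agent (illustrative only)",
  "",
  "Description=FlowFuse Device Agent",
  "[Unit]",
  "After=network.target",
].join("\n");

console.log(linesOutsideSection(sample)); // prints [ 3 ]
```

Against the real unit it would be run on the contents of /etc/systemd/system/flowfuse-device-agent.service, for example read with `fs.readFileSync(path, "utf8")`.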

@hardillb
Contributor

Can we get the following commands run as root:

  1. ip l
  2. ip a
  3. ip r

All three need to be run three times:

  1. When all is working correctly with ethernet connected
  2. When the cellular link is up and ethernet is down
  3. When things are broken after the ethernet is brought back up

I suspect the best way to do this will be via a serial console attached to the device, so that an SSH session doesn't need to be maintained to the device for each step.
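
To make the three captures easy to compare, something like the following could be run on the device at each stage, with the output appended to a file on local storage so that nothing depends on the network being up. This is only a sketch: `captureStage` and the stage labels are hypothetical, and it assumes Node.js and the iproute2 `ip` command are both available on the device.

```javascript
const { execFileSync } = require("node:child_process");

// The three commands requested above.
const COMMANDS = [["ip", ["l"]], ["ip", ["a"]], ["ip", ["r"]]];

// Run every command and return one labelled text block for the given stage.
// `run` is injectable so the formatting can be exercised without the ip binary.
function captureStage(stage, run = execFileSync) {
  const sections = COMMANDS.map(([cmd, args]) => {
    let output;
    try {
      output = run(cmd, args, { encoding: "utf8" });
    } catch (err) {
      // Keep going so one failing command does not lose the others.
      output = `FAILED: ${err.message}\n`;
    }
    return `### ${cmd} ${args.join(" ")}\n${output}`;
  });
  return `## stage: ${stage} (${new Date().toISOString()})\n${sections.join("\n")}\n`;
}

// Hypothetical usage, run as root once per stage:
//   require("node:fs").appendFileSync("netstate.log", captureStage("1-ethernet-ok"));
//   require("node:fs").appendFileSync("netstate.log", captureStage("2-cellular-only"));
//   require("node:fs").appendFileSync("netstate.log", captureStage("3-ethernet-restored-broken"));
```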

@hardillb
Contributor

I may be able to reproduce something similar (not confirmed until I see the network config).

Upgrading to the latest mqtt.js library allows connections to drop cleanly (after the keepalive timeout) when the current interface is brought down, and a new connection is then established.

The new mqtt.js library requires at least Node.js v16, so this will only be available as part of #263.

@hardillb hardillb mentioned this issue Aug 13, 2024
10 tasks
Projects
Status: Scheduled
Status: In Progress
Development

No branches or pull requests

4 participants