[Bug] Integration causing constant flash writes on queried device #389

Open
pullasuti opened this issue Dec 4, 2024 · 20 comments
Labels
bug Something isn't working

Comments

@pullasuti

I don't know if this counts as a bug or an unfortunate feature. When enabled, the Mikrotik integration causes flash writes on my CRS switch on every query. If I set the update interval to 30 seconds, the switch writes to flash every 30 seconds; if I set it to 5 seconds, it writes every 5 seconds.

I tried disabling every sensor, one by one, but the writes still keep happening until I disable the entire integration.

Steps to reproduce the behavior:

  1. Enable Mikrotik integration
  2. Monitor router resources
  3. See sector writes counter increase on integration update

Since the router SNMP data can be queried with minimal flash writes, I assume this would be possible with the integration?

Nearly all the writes in the screenshot are from the integration, which has been installed for a few weeks. The last reboot, and the writes since then, came after the update to RouterOS 7.16.2 (the version upgrade had no effect, but you can see how the writes keep accumulating).

[image]

Some version info:

  • Home Assistant version: HA 2024.11.3
  • Mikrotik Router integration version: 2.1.4 (latest)
  • Mikrotik Hardware: CRS328-24P-4S+
  • RouterOS version: 7.16.2

I noticed the issue a little while back and initially thought it was caused by the SNMP queries on the router. However, after disabling SNMP completely, disabling all logging, and checking every possible service that runs regularly, nothing seemed to affect the writes... until I remembered I had the HA integration running.

@pullasuti pullasuti added the bug Something isn't working label Dec 4, 2024
Owner

tomaae commented Dec 4, 2024

There is no way the integration can do any writes on a Mikrotik device. It could be that you are logging something extensive.

@pullasuti
Author

pullasuti commented Dec 4, 2024

> There is no way the integration can do any writes on a Mikrotik device. It could be that you are logging something extensive.

I understand, and this is how I would assume the device should work in most cases. Nevertheless, there have been previous similar issues (apparently confirmed by Mikrotik). The following post on Reddit first got me thinking the reason for the writes is SNMP (link to conversation). This, however, turned out not to be the case after disabling the SNMP service completely and still seeing the issue.

I'm not suggesting the integration itself is causing the writes, but rather something related to the queries over the API. I've had a total of about 60 writes in the 6 hours since I disabled the integration, which seems reasonable. As soon as I enable it, I start getting 2 writes on every query, like clockwork (depending on the update interval currently set in the integration; at 30 seconds this would be 240 per hour or 40000+ per week, non-stop).
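The arithmetic behind those figures, as a quick sketch. The only inputs are the two values reported above (2 writes per query, 30-second interval); the function name is made up for illustration.

```python
def sector_writes(writes_per_query: int, interval_s: float, period_s: float) -> float:
    """Sector writes accumulated over period_s at a fixed polling interval."""
    return writes_per_query * (period_s / interval_s)


HOUR = 3600
WEEK = 7 * 24 * HOUR

per_hour = sector_writes(2, 30, HOUR)  # 240.0
per_week = sector_writes(2, 30, WEEK)  # 40320.0
print(per_hour, per_week)
```

At a 5-second interval the same function gives six times these numbers, which is why the polling rate matters so much here.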

As I mentioned, I've disabled all logging, graphing etc. in RouterOS, so those are not causing the writes.

I don't know. Just thought I'd bring the issue forward (easy to check if it's just me... set the update interval to 1 second and open resources in Winbox 😄 ). But I suppose if there are issues, they are more in the direction of librouteros and Mikrotik.

Owner

tomaae commented Dec 5, 2024

These are just API queries, so it should not happen.
How do you check the number of writes? I will see how it looks on my device.

@pullasuti
Author

> These are just API queries, so it should not happen. How do you check the number of writes? I will see how it looks on my device.

You can check the writes with Mikrotik's Winbox or through the device's Webfig pages. They can be found under System/Resources (or on the CLI with /system/resource/print).
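To compare the counter across polling cycles, take two readings and subtract. This is only a sketch: it assumes the resource record exposes the counter under the name `write-sect-since-reboot` (my understanding of the field name in /system/resource/print; verify it on your device), and the two sample dicts are hypothetical stand-ins for real readings. No connection to a router is made.

```python
def writes_between(before: dict, after: dict, key: str = "write-sect-since-reboot") -> int:
    """Return how many sector writes happened between two resource readings."""
    delta = int(after[key]) - int(before[key])
    if delta < 0:
        # The counter went backwards, so the device rebooted between readings.
        raise ValueError("counter reset between readings (reboot?)")
    return delta


# Hypothetical readings taken one update interval apart.
before = {"write-sect-since-reboot": "1200"}
after = {"write-sect-since-reboot": "1202"}
print(writes_between(before, after))
```

With the integration disabled the delta over one interval should usually be 0; with it enabled, the reports in this thread would correspond to a delta of 2.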

@Extrapilot1

Guys, it seems the API behaves on the MT side as an SNMP query would. There is a known problem with MT's implementation where, for whatever reason, they cache 'high complexity' calculations to flash. People have found that running the query more frequently (under roughly 30 seconds) gets around this; apparently a new request resets the logic and the calculations are redone.

Anyway, it's known, and as of a year ago MT had put it in the 'fix' queue. No ETA reported. There are people with very high write values (mine shows 9 million sectors on a HAP AC2) who show no bad sectors, so the flash may outlive the useful life of the router/switch, depending on the flash config. There is a thread about this SNMP issue on Reddit under r/mikrotik. That thread is specifically about SNMP, but another one I read implied that the API simply acts as an SNMP emulator on reads.

Other things contribute too (DHCP lease time, logging level, etc.), but when you see the sector write count increment by 2 every 30 seconds (where the integration on mine is set to read every 30 seconds), that is the bulk of the writes for my system.

@Extrapilot1

Hi. I wanted to add some information here. Using logging, I captured the actual read commands the client sends to the Mikrotik via the API. If those commands are run locally (Winbox), none of the ones from my setup cause a sector write. So it isn't the read itself that causes this in RouterOS. I also checked the core MT logs, and while I see an initial login from the integration/API, I don't see subsequent entries (i.e. the integration isn't dropping and reconnecting on each query cycle). So that isn't contributing.

There may be some internal logging where just the process of sending a request, any request, via the API yields a sector write (2 in my case). It may also be that the integration makes some precursor calls in each read cycle that are not logged in HA but do trigger the sector writes. I have asked MT support for help in finding a way to identify what is going on, whether through some special script or a logging config that would catch it.

There are reports on the MT forum that even proper SNMP OID reads trigger sector writes. So this may be, as suggested above, just a RouterOS bug where a new read 'session' on an open/established connection yields a sector write.

Does the native integration also cause this? I haven't checked.

@Lieta2

Lieta2 commented Dec 20, 2024

It is due to the /system/package/update check-for-updates call.

@Extrapilot1

Confirmed- Lieta2's find on the check-for-updates, if commented out, results in no sector writes on my HAP AC2 for my config. Great catch!

@pascalsavigny

Thanks for the tip, but could you please elaborate?
What should be commented out, and where?

@Extrapilot1

One route is to edit the coordinator.py file in the install folder for your integration. There is a section marked with the comment get_firmware_update. You can just replace the logic in that function with a 'return', so the system bounces straight back out of it with no firmware data read from the MT. There may be a different way to read the firmware version, or to see whether an update is pending, that would not trigger a flash sector write. I posted this question to MT support, and it seems they are actually looking at it... Maybe it is a bug in RouterOS, or maybe there is a change to the query that would get around it...

@pascalsavigny

Thanks, I found the function and, if I understand correctly, it should be modified this way:

# ---------------------------
#   get_firmware_update
# ---------------------------
def get_firmware_update(self) -> None:
    """Do nothing because of the constant flash writes on the queried device."""
    return
# ---------------------------

All the other code is removed.
Am I correct?

@Extrapilot1

Hi,
This is how I did it: I left the function in place so I didn't have to worry about the different places it may be called from, and simply returned. As best I can tell, everything is functioning in terms of data parsing, sensors, etc., but it is possible that errors are being thrown that I haven't found in the logs. Since I don't see anything significant, versus 9 million flash writes on my router this year, I'm good with it. Maybe the code author will break this out as a separate process that could be enabled or disabled via the config, or perhaps limit the check to just one call per day (say, at midnight), since 2 sector writes on the Mikrotik per day is nothing. Even the HAP AC2 (what I'm running), with a whole 16MB of flash, has tolerated 9 million sector writes to date... Better hardware has much more flash, so it probably has a better expected life via wear leveling.
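The once-per-day idea could look something like the sketch below. It is not the integration's code: `Throttle`, `poll_cycle`, and `check_enabled` are hypothetical names, and the config switch is just a boolean here.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Allow an action at most once per min_interval_s seconds."""

    def __init__(
        self, min_interval_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_run: Optional[float] = None

    def ready(self) -> bool:
        """Return True (and record the time) if the action may run now."""
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._min_interval_s:
            self._last_run = now
            return True
        return False


ONE_DAY_S = 24 * 60 * 60
firmware_check = Throttle(ONE_DAY_S)


def poll_cycle(check_enabled: bool = True) -> bool:
    """Return True if this cycle should send check-for-updates (hypothetical hook)."""
    # Short-circuits: a disabled check never consumes the daily slot.
    return check_enabled and firmware_check.ready()
```

Called on every 30-second update, `poll_cycle()` would return True on the first cycle and then not again for 24 hours. Note that the timer lives in memory, so a restart of HA would allow one extra check.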

Being able to query every 30 seconds or so has been a real help to me, since part of the benefit in my application is identifying unusual traffic from IoT VLANs. As best I can tell, RouterOS reports the TX/RX rate as the total TX/RX divided by the number of seconds since the last read, so it is an average. With a 30-minute sample time, you would never notice a 500kb burst of data that should not be there on a VLAN that moves 10MB on average per 30-minute window.
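A rough illustration of that averaging effect, using the figures above. It assumes the rate really is a counter delta divided by elapsed time (as described, not verified), that the baseline traffic is spread evenly, that the burst lands entirely inside one sample, and that "500kb" means 500,000 bytes. All names are made up for the example.

```python
WINDOW_S = 30 * 60           # the 30-minute window from the example
BASELINE_BYTES = 10_000_000  # 10 MB of normal traffic per window
BURST_BYTES = 500_000        # the unexpected burst


def average_rate(total_bytes: float, seconds: float) -> float:
    """Average rate in bytes/s: counter delta divided by elapsed time."""
    return total_bytes / seconds


def burst_visibility(sample_s: int) -> float:
    """Ratio of the burst sample's average rate to a normal sample's rate."""
    baseline_in_sample = BASELINE_BYTES * sample_s / WINDOW_S
    normal = average_rate(baseline_in_sample, sample_s)
    with_burst = average_rate(baseline_in_sample + BURST_BYTES, sample_s)
    return with_burst / normal


print(round(burst_visibility(30 * 60), 2))  # 30-minute sampling
print(round(burst_visibility(30), 2))       # 30-second sampling
```

Under these assumptions the burst raises a 30-minute sample by about 5%, which is easy to miss, but makes the affected 30-second sample roughly four times the normal rate.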

@pascalsavigny

Thanks (again). I just had a bad time seeing my Chateau 18ax with 2 000 000+ writes for "nothing"...
Now the integration is back in business.

Now "Sector Writes Since Reboot" is very stable.

The code has been altered this way:

# ---------------------------
#   get_firmware_update
# ---------------------------
def get_firmware_update(self) -> None:
    """Do nothing because of the constant flash writes on the queried device."""
    return

    # Original body ("Check for firmware update on Mikrotik"), disabled by
    # wrapping it in a raw string so the regex backslash stays valid.
    r"""
    if (
        "write" not in self.ds["access"]
        or "policy" not in self.ds["access"]
        or "reboot" not in self.ds["access"]
    ):
        return

    self.execute(
        "/system/package/update", "check-for-updates", None, None, {"duration": 10}
    )
    self.ds["fw-update"] = parse_api(
        data=self.ds["fw-update"],
        source=self.api.query("/system/package/update"),
        vals=[
            {"name": "status"},
            {"name": "channel", "default": "unknown"},
            {"name": "installed-version", "default": "unknown"},
            {"name": "latest-version", "default": "unknown"},
        ],
    )

    if "status" in self.ds["fw-update"]:
        self.ds["fw-update"]["available"] = (
            self.ds["fw-update"]["status"] == "New version is available"
        )

    else:
        self.ds["fw-update"]["available"] = False

    if self.ds["fw-update"]["installed-version"] != "unknown":
        try:
            full_version = self.ds["fw-update"].get("installed-version")
            split_end = min(len(full_version),4)
            version = re.sub("[^0-9\.]", "", full_version[0:split_end])
            self.major_fw_version = int(version.split(".")[0])
            self.minor_fw_version = int(version.split(".")[1])
            _LOGGER.debug(
                "Mikrotik %s FW version major=%s minor=%s (%s)",
                self.host,
                self.major_fw_version,
                self.minor_fw_version,
                full_version
            )
        except Exception:
            _LOGGER.error(
                "Mikrotik %s unable to determine major FW version (%s).",
                self.host,
                full_version,
            )
    """


github-actions bot commented Jan 7, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale label Jan 7, 2025
@Lieta2

Lieta2 commented Jan 7, 2025

Not stale.

@tomaae
Owner

tomaae commented Jan 8, 2025

Are you sure it's get_firmware_update? That should be executed only once every 4 hours or after a device reboot.

@Extrapilot1

This is the one item I commented out, and the flash writes effectively stopped. There are always going to be some writes, but I'm seeing something like 100 per day now vs 10,000/day before I made this one change to the code. I also think it would make sense to break out the firmware check as an optional process set up in the config, so that if the user doesn't want notifications on firmware bumps, the read request is never sent, rather than being sent and having its response ignored.

@github-actions github-actions bot removed the stale label Jan 8, 2025
Owner

tomaae commented Jan 8, 2025

It is possible that some bug triggers that check more often, or on every query.
It would also be worth checking how the counter changes when you check for updates manually with the integration disabled, so it cannot interfere.
The check for new firmware should stay, as it's pretty much standard in HA now, but it should probably run only about once per day.

@Extrapilot1

It is certainly correlated 1:1 with the update frequency. It is easy to validate just by watching the Resources window in the MT Winbox while you set the update rate to, say, 30 seconds and then 60 seconds. At least for my config, I see 2 writes accumulate for each read made by the plugin. It is possible that it's not strictly the firmware query but, say, the size of the query against some buffer, or who knows what. The net result is that removing that query stops the writes that are correlated in time with the API queries.

@Extrapilot1

I just received a note from MT Support saying that they have implemented a bug fix on their end for this flash write issue on the API call. However, they won't say when it will be rolled out. I assume it would be in the next dev release, but they are close to final for this cycle and may not want to add new fixes until the next minor revision.
