Add storage counters related to errors #1091
No major YANG version changes in commit 3983f2b
Could you elaborate on precisely what an error counter like this would map to in an underlying OS (e.g. Linux) implementation?
This is also attempting to categorize per mount point vs. per block device.
Addressed comments. This is now ready for review.
This was reviewed at the Apr 9, 2024 OC Operators meeting without objection. Addressed latest comments and now placing this on last-call for merge on May 21, 2024.
Is there a reference that can confirm that statement? My understanding is that "discard i/o" is just another type of i/o operation, which is often used with SSD drives (see also the fstrim man page), and not an error.
Here's the reference I found: https://www.kernel.org/doc/Documentation/block/stat.txt
I checked that link, yes, but there's no indication that discard i/o is related to errors in any way. If anything, it confirms my understanding, since they describe the discard operations in the same way as read/write operations.
Another ref: blkdiscard
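To illustrate the point about discard i/o, here is a minimal Python sketch that parses a block device `stat` line using the field order given in the kernel's `Documentation/block/stat.txt`. The discard fields sit alongside the read and write fields as plain operation counters; none of the fields is an error counter. The sample line and the helper names (`parse_block_stat`, `read_block_stat`) are made up for illustration and do not come from this PR.

```python
# Field order as documented in Documentation/block/stat.txt.
# Older kernels expose only the first 11 or 15 fields.
STAT_FIELDS = [
    "read_ios", "read_merges", "read_sectors", "read_ticks",
    "write_ios", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
    "discard_ios", "discard_merges", "discard_sectors", "discard_ticks",
    "flush_ios", "flush_ticks",
]


def parse_block_stat(line):
    """Map the whitespace-separated values of a stat line to field names."""
    values = [int(v) for v in line.split()]
    # zip() stops at the shorter sequence, so short lines from older
    # kernels simply yield fewer fields.
    return dict(zip(STAT_FIELDS, values))


def read_block_stat(device):
    """Read /sys/block/<device>/stat on a Linux host (e.g. device='sda')."""
    with open(f"/sys/block/{device}/stat") as f:
        return parse_block_stat(f.read())


# Hypothetical sample line, not captured from a real device.
SAMPLE_STAT_LINE = (
    "4520 12 361232 1830 9120 340 720400 6100 0 5200 7930 "
    "310 0 2097152 45 88 120"
)

stats = parse_block_stat(SAMPLE_STAT_LINE)
print("discard requests completed:", stats["discard_ios"])
print("fields mentioning errors:", [k for k in stats if "err" in k])
```

A non-zero `discard_ios` here only means discard requests were processed, for example after `fstrim` or `blkdiscard` ran.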
For the driving use-case (trying to understand when storage media is having issues), maybe it's best to narrow this down to SMART data vs. what is exposed in sysfs. @dplore - maybe best to align w/ what precisely is being monitored in your compute environment for such a case?
I agree with the suggestion to use SMART.
Ironically, the current description is aligned with at least one use case for what is monitored in one of our network environments. I do agree that SMART is a better data set to base this on and will refactor for that.
Updated this PR to use a select few SMART counters. I appreciate any feedback on this approach. Note, I used https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes as a reference for picking counters which are related to storage failures. There are 10 attributes noted for predicting/measuring failures. I modeled 4 so far. If we like this approach, I could add all 10 or a subset based on feedback.
Overall I think your latest patch is a better approach, but it should be noted that implementation is up to the drive manufacturer, so it is going to vary within platforms of the same vendor and across vendors. For example, just grabbing one variant of SSD we ship today, 2 of the 4 attributes listed here are supported. The model should note that these leaf nodes should be supported if the underlying hardware supports them, and are otherwise optional/excluded.
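A rough Python sketch of the "support if the hardware supports it" behaviour: only SMART attributes that the drive actually reports are turned into leaves, and the rest are omitted rather than reported as zero. The attribute IDs are taken from the Wikipedia table referenced above; the four chosen here and the leaf names are assumptions for illustration and are not necessarily the four modeled in this PR.

```python
# SMART attribute ID -> hypothetical leaf name (illustrative only).
SMART_ID_TO_LEAF = {
    5: "reallocated-sectors",
    13: "soft-read-error-rate",
    184: "end-to-end-error",
    198: "offline-uncorrectable-sectors-count",
}


def build_counters(raw_smart):
    """Return only the leaves backed by an attribute the drive reports.

    raw_smart maps SMART attribute ID -> raw value, as collected by
    whatever tool the platform uses to read the drive.
    """
    return {
        leaf: raw_smart[attr_id]
        for attr_id, leaf in SMART_ID_TO_LEAF.items()
        if attr_id in raw_smart
    }


# Hypothetical drive that reports 2 of the 4 attributes, plus an
# unrelated one (9, power-on hours) that is not mapped.
drive_report = {5: 3, 198: 0, 9: 12345}
print(build_counters(drive_report))
```

Omitting an unsupported leaf keeps a collector from mistaking "not reported by this drive" for a healthy zero.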
This was reviewed in the OC Community meeting on Sep 12, 2024 without objection. Setting last-call to Sep 19 to allow a little more time for public review. @s19nal do you have any comments? Can you approve?
LGTM - all comments addressed
Change Scope
Add /components/component/storage/state/counters/ as a container to represent storage device counters.
Operational use case
While rare, in a large population of devices, storage errors have led to a device becoming unhealthy, unable to accept software updates, or unable to boot due to non-volatile media (flash, SSD) errors. This change adds counters that measure the accumulation of storage errors as a statistic for storage component health.
Note, /components/component/healthz/state/status is also a useful data point, but as a boolean-only value it is very coarse. Storage counters can be used to predict that a storage device will fail in the future.
Tree View
Platform Implementations
MEDIA
errors