Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] failed disk while its not failed #690

Open
Hr46ph opened this issue Sep 14, 2024 · 2 comments
Open

[BUG] failed disk while its not failed #690

Hr46ph opened this issue Sep 14, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Hr46ph
Copy link

Hr46ph commented Sep 14, 2024

Describe the bug
All 3 NVMe disks show as failed and I have no idea why. For one, I might have a clue but not for the 2 others.

The only place it shows failed is on the dashboard and when I click a disk, the label at 'status'.

Expected behavior
Healthy disks because there is nothing wrong with them.

Screenshots
image

image

Log Files
I have 2 of these:

# smartctl -a /dev/nvme0 
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.10.9-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Lexar SSD NM620 2TB
Serial Number:                     
Firmware Version:                   9846
PCI Vendor/Subsystem ID:            0x1e4b
IEEE OUI Identifier:                0xcaf25b
Total NVM Capacity:                 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            caf25b 0430001008
Local Time is:                      Sat Sep 14 13:33:51 2024 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.50W       -        -    0  0  0  0        0       0
 1 +     5.80W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.7460W       -        -    3  3  3  3     5000   10000
 4 -   0.7260W       -        -    4  4  4  4     8000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    38,469,207 [19.6 TB]
Data Units Written:                 24,861,201 [12.7 TB]
Host Read Commands:                 274,542,336
Host Write Commands:                495,212,791
Controller Busy Time:               543
Power Cycles:                       85
Power On Hours:                     6,610
Unsafe Shutdowns:                   60
Media and Data Integrity Errors:    0
Error Information Log Entries:      3
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               42 Celsius
Thermal Temp. 1 Transition Count:   61
Thermal Temp. 1 Total Time:         74

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

And this is number 3:

# smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.49-1-lts] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB256HAHQ-00000
Serial Number:                     
Firmware Version:                   EXD7201Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 256,060,514,304 [256 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Utilization:            79,673,511,936 [79.6 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8a81b268d0
Local Time is:                      Sat Sep 14 13:38:17 2024 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    36%
Data Units Read:                    33,151,051 [16.9 TB]
Data Units Written:                 92,050,787 [47.1 TB]
Host Read Commands:                 1,311,000,162
Host Write Commands:                2,155,486,718
Controller Busy Time:               6,461
Power Cycles:                       378
Power On Hours:                     5,271
Unsafe Shutdowns:                   159
Media and Data Integrity Errors:    0
Error Information Log Entries:      20,405
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               30 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0      20405     0  0x0010  0x4004      -            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
@Hr46ph Hr46ph added the bug Something isn't working label Sep 14, 2024
@Hr46ph
Copy link
Author

Hr46ph commented Sep 14, 2024

Not sure if any of the other logging is relevant, I figured I'd wait the resonse before supplying more info. Let me know what you need.

Thanks!

@chris114782
Copy link

Getting exactly the same issue:

dashboard

detail

# sudo smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-45-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000T700SSD5
Serial Number:                      **redacted**
Firmware Version:                   PACR5101
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Controller ID:                      0
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 **redacted**
Local Time is:                      Fri Oct  4 20:09:17 2024 BST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x3e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg Log0_FISE_MI
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     87 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    11.50W       -        -    0  0  0  0      800    1000
 1 +     8.00W       -        -    0  0  0  0      800    1000
 2 +     6.00W       -        -    0  0  0  0      800    1000
 3 -   0.1440W       -        -    0  0  0  0     3000    3000
 4 -   0.1440W       -        -    0  0  0  0     3000    3000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    4%
Data Units Read:                    225,608,936 [115 TB]
Data Units Written:                 51,358,910 [26.2 TB]
Host Read Commands:                 6,634,833,183
Host Write Commands:                613,483,504
Controller Busy Time:               5,734
Power Cycles:                       32
Power On Hours:                     8,830
Unsafe Shutdowns:                   4
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 16 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

Running master#57dc547

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants