Scrub Your Disks Periodically!

2024-04-06 11:19 (updated 2024-04-18 19:39)

Detlev Zundel

The smartmontools subsystem of my Debian GNU/Linux system started sending me mails saying that one of my hard disks required attention. The mail warned that 2 offline uncorrectable sectors had been detected on the drive. That is when I decided it was time for a maintenance interrupt to look after the "mildly sick" disk.

My usual tool of choice is the built-in self-test of the hard disk itself. smartctl can be used to start such a test:

dzu@krikkit:/$ sudo smartctl -t long /dev/sda

While the test is running, its status can also be queried with smartctl:

dzu@krikkit:/$ sudo smartctl -a /dev/sda

In the "SMART Self-test log" area, you can see the progress of the running test. The test stops when it detects an error and reports the location of the first error on the disk. To cure the sickness, we now need to rewrite the affected blocks so that the disk can remap them to a healthy spare part of the medium. Mapping the LBA (logical block address) of the failing area back to files in a filesystem is basically "black magic" and very error-prone.

Unfortunately I did not save the output of the very run that reported the errors, but as the output includes log entries, we can still see the time stamps (in power-on hours) of the failed runs later. Somehow I triggered the self-test twice after receiving the notification, so we see two failing runs.

Real Errors

The report at that time showed that the self-test could not be completed because of errors only 10% "away" from the beginning of the disk. Healing such errors requires rewriting all of the failing blocks so that the drive's firmware can remap them. As a self-test run only gives us a single failing position, we somehow have to find the other blocks ourselves by attempting to read all the data.
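A brute-force version of such a read pass could look like the following minimal Python sketch. It is not what the drive or btrfs does internally; the function name `find_unreadable_chunks` and the 1 MiB chunk size are assumptions made purely for illustration. It reads a file or block device chunk by chunk and records the offsets at which the kernel reports an I/O error. Reads served from the page cache never reach the disk, so a real check would also have to bypass or drop the cache.

```python
import os


def find_unreadable_chunks(path, chunk_size=1024 * 1024):
    """Read `path` chunk by chunk and return the offsets that raise an I/O error.

    Hypothetical helper for illustration; works on regular files and,
    with sufficient privileges, on block devices such as /dev/sda.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end yields the size for files and block devices alike.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                os.pread(fd, length, offset)
            except OSError:
                # The kernel could not deliver this chunk: remember where.
                bad_offsets.append(offset)
            offset += length
    finally:
        os.close(fd)
    return bad_offsets
```

A failing chunk only narrows the problem down to `chunk_size` bytes; finding the exact sectors would need a second pass with a smaller chunk size over the reported offsets.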

Getting assistance from our file system would be extra cool. And indeed, btrfs was built with this in mind and allows us to scrub a filesystem in its entirety while leveraging its redundant design. Scrubbing "brushes" over all the data in a filesystem by attempting to read all of it. When errors are encountered, the filesystem can use its own redundancy to fix correctable errors by rewriting the data. This process is required to maintain the health of both spinning magnetic disks and NAND-based flash devices. Both have a "firmware controller" that takes full control of where logical blocks are stored on the underlying medium away from the operating system.

Scrubbing a Btrfs disk

We can start the scrubbing process with the btrfs scrub subcommand:

dzu@krikkit:/$ sudo btrfs scrub start /home
dzu@krikkit:/$

Progress can be queried while the process is running or at the end. Here is the final status of the scrubbing process initiated after seeing the errors in the SMART data of the device.

dzu@krikkit:/$ sudo btrfs scrub status /home
[sudo] password for dzu: 
UUID:             1280206b-67ec-461c-b7c9-c09498035e75
Scrub started:    Wed Mar 27 13:51:14 2024
Status:           finished
Duration:         2:11:45
Total to scrub:   718.94GiB
Rate:             93.10MiB/s (limit 100.00MiB/s)
Error summary:    read=864
  Corrected:      696
  Uncorrectable:  168
  Unverified:     0
You have new mail in /home/dzu/Maildir/.
dzu@krikkit:/$
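For anyone who wants to check such a report from a script, here is a minimal Python sketch that pulls the error counters out of the status text shown above. The function names are made up for this example, and the "Key: value" layout is assumed from this one report; other btrfs-progs versions may print it differently.

```python
SAMPLE = """\
UUID:             1280206b-67ec-461c-b7c9-c09498035e75
Scrub started:    Wed Mar 27 13:51:14 2024
Status:           finished
Duration:         2:11:45
Total to scrub:   718.94GiB
Rate:             93.10MiB/s (limit 100.00MiB/s)
Error summary:    read=864
  Corrected:      696
  Uncorrectable:  168
  Unverified:     0
"""


def parse_scrub_status(text):
    """Return the 'Key: value' lines of a scrub status report as a dict."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so values like 2:11:45 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def scrub_counts(text):
    """Return the three error counters; a missing counter counts as 0."""
    fields = parse_scrub_status(text)
    names = ("Corrected", "Uncorrectable", "Unverified")
    return {name: int(fields.get(name, "0")) for name in names}


counts = scrub_counts(SAMPLE)
print(counts)
```

In this report the corrected and uncorrectable counters (696 + 168) add up to the 864 read errors of the summary line.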

Redo The Self-Test

So now that btrfs has finished scrubbing its data, let's restart the self test of the drive:

dzu@krikkit:/$ sudo smartctl -t long /dev/sda

The Healing Was Done!

After the completion of the self-test, let's assess the result by inspecting the SMART data of the drive:

dzu@krikkit:/$ sudo smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba P300 (CMR)
Device Model:     TOSHIBA HDWD110
Serial Number:    58R8N0VNS
LU WWN Device Id: 5 000039 fd6e01d28
Firmware Version: MS2OA8J0
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Mar 27 22:25:00 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection: 		( 7264) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 121) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   141   141   054    Pre-fail  Offline      -       72
  3 Spin_Up_Time            0x0007   125   125   024    Pre-fail  Always       -       184 (Average 183)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2168
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       11
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   115   115   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       26627
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2162
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       2170
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       2170
194 Temperature_Celsius     0x0002   133   133   000    Old_age   Always       -       45 (Min/Max 20/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 57 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 57 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 f8 97 66 04  Error: WP at LBA = 0x046697f8 = 73832440

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 70 60 67 d6 40 08      01:29:24.514  WRITE FPDMA QUEUED
  60 08 88 68 98 66 40 08      01:29:22.545  READ FPDMA QUEUED
  60 08 00 c0 98 66 40 08      01:29:22.545  READ FPDMA QUEUED
  60 08 80 60 98 66 40 08      01:29:22.545  READ FPDMA QUEUED
  60 08 68 00 96 86 40 08      01:29:22.521  READ FPDMA QUEUED

Error 56 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 38 96 66 04  Error: UNC at LBA = 0x04669638 = 73831992

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 e0 50 98 66 40 08      01:29:18.453  READ FPDMA QUEUED
  60 08 10 28 99 66 40 08      01:29:18.448  READ FPDMA QUEUED
  60 08 f8 20 99 66 40 08      01:29:18.445  READ FPDMA QUEUED
  60 08 a0 38 96 66 40 08      01:29:18.422  READ FPDMA QUEUED
  61 08 08 80 08 00 40 08      01:29:15.140  WRITE FPDMA QUEUED

Error 55 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 f0 97 66 04  Error: WP at LBA = 0x046697f0 = 73832432

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 f0 a8 38 76 40 08      01:29:07.806  WRITE FPDMA QUEUED
  61 00 e8 58 44 71 40 08      01:29:07.806  WRITE FPDMA QUEUED
  61 00 d8 08 de d1 40 08      01:29:07.806  WRITE FPDMA QUEUED
  60 08 d0 b0 99 66 40 08      01:29:07.806  READ FPDMA QUEUED
  60 08 c8 90 98 66 40 08      01:29:07.806  READ FPDMA QUEUED

Error 54 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 e8 97 66 04  Error: WP at LBA = 0x046697e8 = 73832424

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 60 c0 a6 6e 40 08      01:29:06.801  WRITE FPDMA QUEUED
  61 00 58 60 9d 74 40 08      01:29:06.801  WRITE FPDMA QUEUED
  61 00 50 08 49 74 40 08      01:29:06.800  WRITE FPDMA QUEUED
  61 00 48 f0 fd 73 40 08      01:29:06.799  WRITE FPDMA QUEUED
  61 00 28 10 8c 73 40 08      01:29:06.798  WRITE FPDMA QUEUED

Error 53 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 20 96 66 04  Error: UNC at LBA = 0x04669620 = 73831968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 d8 48 98 66 40 08      01:28:59.804  READ FPDMA QUEUED
  60 08 c8 40 98 66 40 08      01:28:59.796  READ FPDMA QUEUED
  60 08 d0 a8 99 66 40 08      01:28:59.796  READ FPDMA QUEUED
  60 08 70 20 96 66 40 08      01:28:59.787  READ FPDMA QUEUED
  60 08 68 38 98 66 40 08      01:28:59.787  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     26624         -
# 2  Extended offline    Completed: read failure       90%     26618         1436960
# 3  Extended offline    Completed: read failure       90%     26618         1436960
# 4  Extended offline    Completed without error       00%     25692         -
# 5  Extended offline    Completed without error       00%     21566         -
2 of 2 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

dzu@krikkit:/$

In the self-test log we can clearly see the two failed self-tests that I invoked manually after receiving e-mails from smartd:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     26624         -
# 2  Extended offline    Completed: read failure       90%     26618         1436960
# 3  Extended offline    Completed: read failure       90%     26618         1436960
# 4  Extended offline    Completed without error       00%     25692         -
# 5  Extended offline    Completed without error       00%     21566         -
2 of 2 failed self-tests are outdated by newer successful extended offline self-test # 1

But we can also clearly see that the device returned to a "Completed without error" state. My previous maintenance runs are also clearly visible. The errors showed up 926 power-on hours (~39 days) after the last successful self-test (26618 - 25692 = 926). So it seems my maintenance strategy was spot on. The device had been functioning perfectly for 2.5 years when I started to trigger manual self-tests. This went fine twice, but after 3 years (of uptime), the errors started to appear. Luckily I was alerted to the problem quickly and reacted by scheduling the self-test. The subsequent scrub of the btrfs filesystem gave the disk the chance to write new data, after which the self-test completed without error again.
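The arithmetic behind the gap between the last good and the first failed self-test can be reproduced from the log rows above. The following Python sketch is only an illustration: the regular expression is an assumption fitted to the column layout of this particular log, and the function names are invented for the example.

```python
import re

SELFTEST_LOG = """\
# 1  Extended offline    Completed without error       00%     26624         -
# 2  Extended offline    Completed: read failure       90%     26618         1436960
# 3  Extended offline    Completed: read failure       90%     26618         1436960
# 4  Extended offline    Completed without error       00%     25692         -
# 5  Extended offline    Completed without error       00%     21566         -
"""

# Columns: number, description, status, remaining %, lifetime hours, first LBA.
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_log(text):
    """Return one dict per log row with the status and power-on hours."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            _num, _desc, status, _remaining, hours, _lba = match.groups()
            rows.append({"status": status, "hours": int(hours)})
    return rows


def hours_from_last_good_to_first_failure(text):
    """Power-on hours between the last passing test and the first failure."""
    rows = parse_selftest_log(text)
    failed = [r["hours"] for r in rows if "failure" in r["status"]]
    if not failed:
        return None
    first_failure = min(failed)
    good_before = [
        r["hours"]
        for r in rows
        if "without error" in r["status"] and r["hours"] < first_failure
    ]
    return first_failure - max(good_before) if good_before else None


gap = hours_from_last_good_to_first_failure(SELFTEST_LOG)
print(gap, round(gap / 24, 1))  # prints: 926 38.6
```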

In the output we can clearly see real errors in the "SMART Error Log" section (abbreviated):

SMART Error Log Version: 1
ATA Error Count: 57 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 57 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 f8 97 66 04  Error: WP at LBA = 0x046697f8 = 73832440

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 70 60 67 d6 40 08      01:29:24.514  WRITE FPDMA QUEUED
  60 08 88 68 98 66 40 08      01:29:22.545  READ FPDMA QUEUED
  60 08 00 c0 98 66 40 08      01:29:22.545  READ FPDMA QUEUED
  60 08 80 60 98 66 40 08      01:29:22.545  READ FPDMA QUEUED
  60 08 68 00 96 86 40 08      01:29:22.521  READ FPDMA QUEUED

Error 56 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 38 96 66 04  Error: UNC at LBA = 0x04669638 = 73831992

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 e0 50 98 66 40 08      01:29:18.453  READ FPDMA QUEUED
  60 08 10 28 99 66 40 08      01:29:18.448  READ FPDMA QUEUED
  60 08 f8 20 99 66 40 08      01:29:18.445  READ FPDMA QUEUED
  60 08 a0 38 96 66 40 08      01:29:18.422  READ FPDMA QUEUED
  61 08 08 80 08 00 40 08      01:29:15.140  WRITE FPDMA QUEUED

By (re-)writing this data, the filesystem gave the firmware the chance to reallocate potentially failing blocks to spare sectors elsewhere on the physical medium and thus to "reset" those errors. The redundancy of the filesystem ensured its integrity even though the disk reported uncorrectable read errors. Rewriting the failing data allowed the hard disk to do its "magic remapping" to spare areas of the drive.
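The LBA values in the error log excerpt above can be cross-checked against the register dumps printed next to them. The sketch below assumes the classic 28-bit ATA layout, in which the low nibble of DH carries the top four LBA bits, followed by CH, CL and SN. That assumption matches the values smartctl printed here; it is not a general decoder, and the function names are invented for the example.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the register bytes into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def decode_register_line(line):
    """Decode a line such as '40 51 08 f8 97 66 04  Error: ...'.

    The first seven tokens are ER ST SC SN CL CH DH in hexadecimal.
    """
    _er, _st, _sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return lba28_from_registers(sn, cl, ch, dh)


line = "40 51 08 f8 97 66 04  Error: WP at LBA = 0x046697f8 = 73832440"
lba = decode_register_line(line)
print(hex(lba), lba)  # prints: 0x46697f8 73832440
```

With 512-byte logical sectors, as reported for this drive, multiplying the LBA by 512 gives the byte offset of the failing sector on the device.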

In the SMART attributes section, we can also see that the "Current_Pending_Sector" and "Offline_Uncorrectable" metrics have been reset to 0:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   141   141   054    Pre-fail  Offline      -       72
  3 Spin_Up_Time            0x0007   125   125   024    Pre-fail  Always       -       184 (Average 183)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2168
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       11
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   115   115   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       26627
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2162
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       2170
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       2170
194 Temperature_Celsius     0x0002   133   133   000    Old_age   Always       -       45 (Min/Max 20/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

This means the disk has undergone a magic rejuvenation and will hopefully be good to go for another 3 to 5 years.
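Checking the two counters by eye works, but the same check can be scripted. Here is a minimal Python sketch that extracts the RAW_VALUE column from attribute rows like the ones above and reports any watched attribute that is not zero. The watch list and the function names are assumptions for this example, and the parsing relies on the ten-column layout of this particular table.

```python
ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       11
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

# Attributes watched in this example (an assumption, not an official list).
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable")


def raw_values(table):
    """Map attribute name to its RAW_VALUE string for every attribute row."""
    values = {}
    for line in table.splitlines():
        # Ten columns; the last one may contain spaces, e.g. "184 (Average 183)".
        cols = line.split(None, 9)
        if len(cols) == 10 and cols[0].isdigit():
            values[cols[1]] = cols[9].strip()
    return values


def pending_problems(table, watched=WATCHED):
    """Return the watched attributes whose raw value is not zero."""
    problems = {}
    for name, raw in raw_values(table).items():
        count = int(raw.split()[0])
        if name in watched and count != 0:
            problems[name] = count
    return problems


print(pending_problems(ATTRIBUTES))  # prints: {}
```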

Lessons Learned!

Other important storage devices are my (encrypted) btrfs backup disks that I use for my (manual) monthly backup routine. I will dedicate another blog post to describing this absolutely basic backup system, built entirely with standard Linux tools, but for now it should be mentioned that I learned to apply the scrubbing lessons to those disks as well. Here is the progress 80 minutes after starting the process:

dzu@krikkit:/$ sudo btrfs scrub status /media/dzu/seagate-bckup/
UUID:             2fbf9fef-dc5d-474c-a6f6-d1da2c101e77
Scrub started:    Wed Mar 27 22:06:01 2024
Status:           running
Duration:         1:20:05
Time left:        1:21:31
ETA:              Thu Mar 28 00:47:41 2024
Total to scrub:   372.98GiB
Bytes scrubbed:   184.82GiB  (49.55%)
Rate:             39.39MiB/s
Error summary:    no errors found
dzu@krikkit:/$

It seems the device is in very good health, but querying SMART turns out to be "difficult". Running smartctl without a device type option gives us immediate errors:

dzu@krikkit:~$ sudo smartctl -a /dev/sdg
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: scsi error unsupported field in scsi command

If this is a USB connected device, look at the various --device=TYPE variants
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
dzu@krikkit:~$

Let's see what options we can use for the device:

dzu@krikkit:~$ sudo smartctl -a /dev/sdg --device=TYPE
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/sdg: Unknown device type 'TYPE'
=======> VALID ARGUMENTS ARE: ata, scsi[+TYPE], nvme[,NSID], sat[,auto][,N][+TYPE], usbasm1352r,N, usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, sntasmedia, sntjmicron[,NSID], sntrealtek, jmb39x[-q],N[,sLBA][,force][+TYPE], jms56x,N[,sLBA][,force][+TYPE], areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, aacraid,H,L,ID, sssraid,E,S, cciss,N, auto, test <=======

Use smartctl -h to get a usage summary

dzu@krikkit:~$

Hm. Can't smartctl tell us what to do here?

dzu@krikkit:~$ sudo smartctl -a /dev/sdg --device=test
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/sdg [SAT]: Device of type 'sat' [ATA] detected
/dev/sdg [SAT]: Device of type 'sat' [ATA] opened
dzu@krikkit:~$

Aha! So let's retry with that option:

dzu@krikkit:~$ sudo smartctl -a /dev/sdg --device=sat,auto
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Seagate
Product:              Expansion
Revision:             0708
Compliance:           SPC-4
User Capacity:        1.000.204.885.504 bytes [1,00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Logical Unit id:      0x3e41385035595353
Serial number:        NA8P5YSS
Device type:          disk
Local Time is:        Wed Mar 27 23:32:25 2024 CET
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

No Self-tests have been logged

dzu@krikkit:~$

Hrmpf. Better, but still no real result. For this backup disk I have to accept the fact that I am "SMART blind", but as long as the file system is happy, I am optimistic. From now on I will schedule such scrubs every 3 months or so to keep the disk happy.
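A reminder for such a schedule can be derived from the "Scrub started" line of the last status report. The following Python sketch is one possible way to do it, under some assumptions: the 90-day interval stands in for "every 3 months or so", the function names are invented, and the time stamp is parsed with English day and month names as printed in the reports above.

```python
from datetime import datetime, timedelta


def scrub_started(status_text):
    """Return the start time from a scrub status report, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Scrub started:"):
            stamp = line.split(":", 1)[1].strip()
            # Example: "Wed Mar 27 22:06:01 2024"
            return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return None


def scrub_is_due(status_text, now, interval=timedelta(days=90)):
    """True if no scrub is on record or the last one is older than `interval`."""
    started = scrub_started(status_text)
    return started is None or now - started >= interval


status = "Scrub started:    Wed Mar 27 22:06:01 2024\nStatus:           running"
print(scrub_is_due(status, datetime(2024, 4, 18)))  # prints: False
print(scrub_is_due(status, datetime(2024, 6, 26)))  # prints: True
```

Treating a missing "Scrub started" line as "due" is a deliberate choice here: a disk that has never been scrubbed should be first in line.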

Summary

Keeping our disks happy requires a non-negligible effort on our part! Be aware of the health of the "strategic" media of your digital life. And remember, untested backups are worth nothing. Test(!) restoring (parts of) a backup to see if the process works. You would not be the first to find out only after testing that the backups were created just fine, but are impossible to use.

2024-04-18 Postscript

Next Error

The next error mail from the smartd daemon:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 720 Currently unreadable (pending) sectors

Scrub

First run:

dzu@krikkit:~$ sudo btrfs scrub status /home
UUID:             1280206b-67ec-461c-b7c9-c09498035e75
Scrub started:    Wed Apr 17 23:27:02 2024
Status:           finished
Duration:         2:21:59
Total to scrub:   727.54GiB
Rate:             87.45MiB/s (limit 100.00MiB/s)
Error summary:    read=1104
  Corrected:      548
  Uncorrectable:  556
  Unverified:     0
dzu@krikkit:~$

Second run, the day after:

dzu@krikkit:~$ sudo btrfs scrub status /home
UUID:             1280206b-67ec-461c-b7c9-c09498035e75
Scrub started:    Thu Apr 18 18:34:43 2024
Status:           finished
Duration:         2:17:44
Total to scrub:   727.56GiB
Rate:             90.15MiB/s (limit 100.00MiB/s)
Error summary:    read=1024
  Corrected:      492
  Uncorrectable:  532
  Unverified:     0
dzu@krikkit:~$

SMART

After running a -t long test:

dzu@krikkit:~$ sudo smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba P300 (CMR)
Device Model:     TOSHIBA HDWD110
Serial Number:    58R8N0VNS
LU WWN Device Id: 5 000039 fd6e01d28
Firmware Version: MS2OA8J0
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 18 23:37:39 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  38)	The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline 
data collection: 		( 7264) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 121) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   141   141   054    Pre-fail  Offline      -       73
  3 Spin_Up_Time            0x0007   125   125   024    Pre-fail  Always       -       184 (Average 183)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2181
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       21
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   115   115   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       27016
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2175
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       2183
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       2183
194 Temperature_Celsius     0x0002   133   133   000    Old_age   Always       -       45 (Min/Max 20/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       26
197 Current_Pending_Sector  0x0022   070   070   000    Old_age   Always       -       720
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 268 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 268 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 78 ae 34 00  Error: UNC at LBA = 0x0034ae78 = 3452536

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 c0 f8 af 34 40 08      01:15:43.019  READ FPDMA QUEUED
  60 80 b0 80 bc 34 40 08      01:15:42.549  READ FPDMA QUEUED
  60 00 a8 00 fd 34 40 08      01:15:42.546  READ FPDMA QUEUED
  60 80 a0 00 fc 34 40 08      01:15:42.545  READ FPDMA QUEUED
  60 80 98 80 fa 34 40 08      01:15:42.545  READ FPDMA QUEUED

Error 267 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 70 ae 34 00  Error: UNC at LBA = 0x0034ae70 = 3452528

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 e8 00 fd 34 40 08      01:15:41.687  READ FPDMA QUEUED
  60 80 e0 00 fc 34 40 08      01:15:41.687  READ FPDMA QUEUED
  60 80 d8 80 fa 34 40 08      01:15:41.687  READ FPDMA QUEUED
  60 00 d0 80 f8 34 40 08      01:15:41.687  READ FPDMA QUEUED
  60 80 c8 00 f5 34 40 08      01:15:41.687  READ FPDMA QUEUED

Error 266 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 70 ae 34 00  Error: WP at LBA = 0x0034ae70 = 3452528

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 c0 80 08 00 40 08      01:15:31.811  WRITE FPDMA QUEUED
  60 00 b8 80 ac 34 40 08      01:15:31.811  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 08      01:15:31.719  FLUSH CACHE EXT
  61 20 70 00 42 2d 40 08      01:15:31.719  WRITE FPDMA QUEUED
  61 20 68 80 41 2d 40 08      01:15:31.719  WRITE FPDMA QUEUED

Error 265 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 60 76 34 00  Error: WP at LBA = 0x00347660 = 3438176

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 60 80 08 00 40 08      01:15:26.688  WRITE FPDMA QUEUED
  61 00 58 70 c6 59 40 08      01:15:26.688  WRITE FPDMA QUEUED
  61 00 50 30 76 59 40 08      01:15:26.688  WRITE FPDMA QUEUED
  61 00 48 10 1b c8 40 08      01:15:26.688  WRITE FPDMA QUEUED
  61 00 40 c8 c0 c5 40 08      01:15:26.688  WRITE FPDMA QUEUED

Error 264 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 18 8c 34 00  Error: UNC at LBA = 0x00348c18 = 3443736

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 a8 c0 8b 34 40 08      01:15:22.677  READ FPDMA QUEUED
  60 08 a0 18 8c 34 40 08      01:15:22.677  READ FPDMA QUEUED
  61 00 98 60 3a ae 40 08      01:15:22.677  WRITE FPDMA QUEUED
  60 08 90 08 8d 34 40 08      01:15:22.677  READ FPDMA QUEUED
  60 08 88 98 8c 34 40 08      01:15:22.677  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      60%     27014         -
# 2  Extended offline    Completed without error       00%     26624         -
# 3  Extended offline    Completed: read failure       90%     26618         1436960
# 4  Extended offline    Completed: read failure       90%     26618         1436960
# 5  Extended offline    Completed without error       00%     25692         -
# 6  Extended offline    Completed without error       00%     21566         -
2 of 2 failed self-tests are outdated by newer successful extended offline self-test # 2

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

dzu@krikkit:~$
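
The LBA values that smartctl prints in the error log can be cross-checked against the register dump in the same entry. The following is a minimal sketch, assuming the classic 28-bit layout (low nibble of DH, then CH, CL, SN), which matches the five entries shown above. The function names are made up for illustration, and the 512-byte logical sector size is an assumption not confirmed by this output.

```python
# Sketch: recompute the LBA from the "ER ST SC SN CL CH DH" register row
# of a legacy SMART error log entry. Helper names are hypothetical.

SECTOR_SIZE = 512  # ASSUMPTION: 512-byte logical sectors; check your drive


def parse_register_row(row):
    """Parse the first seven hex bytes of a register row into a dict."""
    names = ("ER", "ST", "SC", "SN", "CL", "CH", "DH")
    values = (int(tok, 16) for tok in row.split()[:7])
    return dict(zip(names, values))


def lba28_from_registers(regs):
    """Combine DH (low nibble), CH, CL and SN into a 28-bit LBA."""
    return (
        ((regs["DH"] & 0x0F) << 24)
        | (regs["CH"] << 16)
        | (regs["CL"] << 8)
        | regs["SN"]
    )


if __name__ == "__main__":
    # Register row of error 268, copied from the log above.
    row = "40 51 08 78 ae 34 00  Error: UNC at LBA = 0x0034ae78 = 3452536"
    lba = lba28_from_registers(parse_register_row(row))
    print(f"LBA = 0x{lba:08x} = {lba}")
    print(f"byte offset (assuming {SECTOR_SIZE}-byte sectors) = {lba * SECTOR_SIZE}")

    # The "power-on lifetime" line is plain integer division by 24.
    days, hours = divmod(27013, 24)
    print(f"27013 hours = {days} days + {hours} hours")
```

All five logged errors fall within a narrow LBA range (3438176 to 3452536), which suggests a localized bad area rather than damage spread across the whole disk.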

For reasons I could not determine, this self-test run did not finish. The drive reports that it was interrupted by a reset from the host:

Self-test execution status:      (  38)	The self-test routine was interrupted
                                        by the host with a hard or soft reset.

The substantial difference, of course, is in these attributes:

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       26
197 Current_Pending_Sector  0x0022   070   070   000    Old_age   Always       -       720
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

So the drive now reports a "Current_Pending_Sector" count of 720. Running a few more scrubs and another long SMART self-test eventually brought that number back to zero.
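
To watch these counters across several scrub and self-test rounds without reading the whole report each time, the attribute table can be parsed mechanically. This is only a sketch: the regular expression is written against the legacy table layout shown above, the helper names are invented for illustration, and in real use the text would presumably come from something like `smartctl -A /dev/sda` rather than the embedded sample.

```python
import re

# Attributes discussed in this article, keyed by SMART attribute ID.
WATCHED = {
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# One table row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (only its leading integer is captured).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Sample rows copied from the report above.
SAMPLE = """\
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       26
197 Current_Pending_Sector  0x0022   070   070   000    Old_age   Always       -       720
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {attribute_id: {"name", "value", "raw"}} for each table row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = {
                "name": m.group(2),
                "value": int(m.group(3)),
                "raw": int(m.group(6)),
            }
    return attrs


def nonzero_watched(attrs):
    """Return the watched attribute IDs whose raw value is not zero."""
    return [i for i in sorted(WATCHED) if i in attrs and attrs[i]["raw"] != 0]


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    for attr_id in nonzero_watched(attrs):
        print(f"{attr_id} {attrs[attr_id]['name']}: raw={attrs[attr_id]['raw']}")
```

Running the same parse after each scrub makes it easy to see whether the pending count is actually shrinking.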