Scrub Your Disks Periodically!
The smartmontools subsystem of my Debian GNU/Linux system started
sending me mails saying that one of my hard disks required attention.
The mail warned that 2 Offline uncorrectable sectors had been detected
on the drive. So I decided it was time for a maintenance break to look
after the "mildly sick" disk.
My usual tool of choice is the built-in self-test of the hard disk
itself. smartctl can be used to start such a test:
dzu@krikkit:/$ sudo smartctl -t long /dev/sda
While the test is running, its status can also be queried with smartctl:
dzu@krikkit:/$ sudo smartctl -a /dev/sda
In the "SMART Self-test log" area, you can see the progress of the running test. The test stops at the first error it detects and reports the location of that error on the disk. To cure the sickness, we now need to rewrite the affected blocks so that the disk can remap them to healthy spare sectors of the medium. Mapping the LBA (logical block address) of the failing area to files in a filesystem by hand is basically "black magic" and very error prone.
Unfortunately I did not save the output of the very run that reported the errors, but as the output includes the self-test log, we can still see the failed runs and their power-on hours in later output. Somehow I triggered the self-test twice after receiving the notification, so we see two failing runs.
Real Errors
But the report at that time showed that the self-test could not be completed because of read errors only about 10% into the test, with 90% still remaining. Healing such errors requires rewriting all of the failing blocks so that the drive's firmware can remap them. As we only get a single failing position out of a self-test run, we somehow have to find the other blocks ourselves by attempting to read all the data.
Getting assistance from our file system would be extra cool. And
indeed btrfs was built with this in mind and allows us to scrub a
filesystem in its entirety while leveraging its redundant design. This
scrubbing "brushes" over all the data in a filesystem by attempting to
read all of it and verifying it against its checksums. When errors are
encountered, the filesystem can use its own redundancy to fix
correctable errors by rewriting the data from a good copy. This
process is required to maintain proper health of spinning magnetic
disks and NAND-based flash devices. Both have a "firmware controller"
that removes the operating system from full control of where logical
blocks are stored on the underlying medium.
Scrubbing a Btrfs disk
We can start the scrubbing process with the btrfs scrub subcommand:
dzu@krikkit:/$ sudo btrfs scrub start /home
dzu@krikkit:/$
Progress can be queried while the process is running or at the end. Here is the final status of the scrubbing process initiated after seeing the errors in the SMART data of the device.
dzu@krikkit:/$ sudo btrfs scrub status /home
[sudo] password for dzu:
UUID: 1280206b-67ec-461c-b7c9-c09498035e75
Scrub started: Wed Mar 27 13:51:14 2024
Status: finished
Duration: 2:11:45
Total to scrub: 718.94GiB
Rate: 93.10MiB/s (limit 100.00MiB/s)
Error summary: read=864
Corrected: 696
Uncorrectable: 168
Unverified: 0
You have new mail in /home/dzu/Maildir/.
dzu@krikkit:/$
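The counters in such a report can also be checked by a script. What follows is a minimal Python sketch, assuming the field labels look exactly like the ones printed above. The function name parse_scrub_status and the SAMPLE text are made up for this illustration (the sample is copied from the run above) and are not part of btrfs.

```python
import re

# Excerpt of the `btrfs scrub status /home` report shown above.
SAMPLE = """\
Status: finished
Duration: 2:11:45
Total to scrub: 718.94GiB
Rate: 93.10MiB/s (limit 100.00MiB/s)
Error summary: read=864
Corrected: 696
Uncorrectable: 168
Unverified: 0
"""


def parse_scrub_status(text):
    """Collect the status and error counters from scrub status text.

    Hypothetical helper: it only relies on the "Label: value" layout
    seen in the report above.
    """
    result = {"errors": {}}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Status":
            result["status"] = value
        elif key == "Error summary":
            # "read=864" style pairs; "no errors found" yields no pairs.
            pairs = re.findall(r"(\w+)=(\d+)", value)
            result["errors"] = {name: int(count) for name, count in pairs}
        elif key in ("Corrected", "Uncorrectable", "Unverified"):
            result[key.lower()] = int(value)
    return result


report = parse_scrub_status(SAMPLE)
total = sum(report["errors"].values())
unrepaired = report.get("uncorrectable", 0)
print(f"{report['status']}: {unrepaired} of {total} errors were not repaired")
```

For this run the corrected and uncorrectable counters add up to the number of read errors, so every read error was either repaired or reported as uncorrectable.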
Redo The Self-Test
So now that btrfs has finished scrubbing its data, let's restart the
self-test of the drive:
dzu@krikkit:/$ sudo smartctl -t long /dev/sda
The Healing Was Done!
After the completion of the self-test, let's assess the result by inspecting the SMART data of the drive:
dzu@krikkit:/$ sudo smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Toshiba P300 (CMR)
Device Model: TOSHIBA HDWD110
Serial Number: 58R8N0VNS
LU WWN Device Id: 5 000039 fd6e01d28
Firmware Version: MS2OA8J0
User Capacity: 1.000.204.886.016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Mar 27 22:25:00 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 7264) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 121) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 141 141 054 Pre-fail Offline - 72
3 Spin_Up_Time 0x0007 125 125 024 Pre-fail Always - 184 (Average 183)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 2168
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 11
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 115 115 020 Pre-fail Offline - 34
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 26627
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2162
192 Power-Off_Retract_Count 0x0032 099 099 000 Old_age Always - 2170
193 Load_Cycle_Count 0x0012 099 099 000 Old_age Always - 2170
194 Temperature_Celsius 0x0002 133 133 000 Old_age Always - 45 (Min/Max 20/55)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 12
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 57 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 57 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 f8 97 66 04 Error: WP at LBA = 0x046697f8 = 73832440
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 10 70 60 67 d6 40 08 01:29:24.514 WRITE FPDMA QUEUED
60 08 88 68 98 66 40 08 01:29:22.545 READ FPDMA QUEUED
60 08 00 c0 98 66 40 08 01:29:22.545 READ FPDMA QUEUED
60 08 80 60 98 66 40 08 01:29:22.545 READ FPDMA QUEUED
60 08 68 00 96 86 40 08 01:29:22.521 READ FPDMA QUEUED
Error 56 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 38 96 66 04 Error: UNC at LBA = 0x04669638 = 73831992
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 e0 50 98 66 40 08 01:29:18.453 READ FPDMA QUEUED
60 08 10 28 99 66 40 08 01:29:18.448 READ FPDMA QUEUED
60 08 f8 20 99 66 40 08 01:29:18.445 READ FPDMA QUEUED
60 08 a0 38 96 66 40 08 01:29:18.422 READ FPDMA QUEUED
61 08 08 80 08 00 40 08 01:29:15.140 WRITE FPDMA QUEUED
Error 55 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 f0 97 66 04 Error: WP at LBA = 0x046697f0 = 73832432
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 f0 a8 38 76 40 08 01:29:07.806 WRITE FPDMA QUEUED
61 00 e8 58 44 71 40 08 01:29:07.806 WRITE FPDMA QUEUED
61 00 d8 08 de d1 40 08 01:29:07.806 WRITE FPDMA QUEUED
60 08 d0 b0 99 66 40 08 01:29:07.806 READ FPDMA QUEUED
60 08 c8 90 98 66 40 08 01:29:07.806 READ FPDMA QUEUED
Error 54 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 e8 97 66 04 Error: WP at LBA = 0x046697e8 = 73832424
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 60 c0 a6 6e 40 08 01:29:06.801 WRITE FPDMA QUEUED
61 00 58 60 9d 74 40 08 01:29:06.801 WRITE FPDMA QUEUED
61 00 50 08 49 74 40 08 01:29:06.800 WRITE FPDMA QUEUED
61 00 48 f0 fd 73 40 08 01:29:06.799 WRITE FPDMA QUEUED
61 00 28 10 8c 73 40 08 01:29:06.798 WRITE FPDMA QUEUED
Error 53 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 20 96 66 04 Error: UNC at LBA = 0x04669620 = 73831968
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 d8 48 98 66 40 08 01:28:59.804 READ FPDMA QUEUED
60 08 c8 40 98 66 40 08 01:28:59.796 READ FPDMA QUEUED
60 08 d0 a8 99 66 40 08 01:28:59.796 READ FPDMA QUEUED
60 08 70 20 96 66 40 08 01:28:59.787 READ FPDMA QUEUED
60 08 68 38 98 66 40 08 01:28:59.787 READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 26624 -
# 2 Extended offline Completed: read failure 90% 26618 1436960
# 3 Extended offline Completed: read failure 90% 26618 1436960
# 4 Extended offline Completed without error 00% 25692 -
# 5 Extended offline Completed without error 00% 21566 -
2 of 2 failed self-tests are outdated by newer successful extended offline self-test # 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
dzu@krikkit:/$
In the self-test log we can clearly see the two failed self-tests
that I invoked manually after receiving e-mails from smartd:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 26624 -
# 2 Extended offline Completed: read failure 90% 26618 1436960
# 3 Extended offline Completed: read failure 90% 26618 1436960
# 4 Extended offline Completed without error 00% 25692 -
# 5 Extended offline Completed without error 00% 21566 -
2 of 2 failed self-tests are outdated by newer successful extended offline self-test # 1
But we can also clearly see that the device returned to a "Completed without error" state. My previous maintenance runs are also clearly visible. The log also shows that the errors appeared 926 power-on hours (~39 days) after the last successful self-test. So it seems my maintenance strategy was spot on. The device had been functioning perfectly for about 2.5 years of power-on time when I started to trigger manual self-tests. This worked fine twice, but after about 3 years of power-on time the errors started to appear. Luckily I was alerted to the problem quickly and reacted by scheduling the self-test. The subsequent scrub of the btrfs filesystem then gave the disk the chance to write new data, after which the self-test passed again.
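The gap between the last clean self-test and the first failing one can be read straight off the log. A small Python sketch of that arithmetic follows; the helper names and the regular expression are assumptions for this illustration and only target the row layout shown in the log above.

```python
import re

# Rows copied from the self-test log above.
LOG = """\
# 1 Extended offline Completed without error 00% 26624 -
# 2 Extended offline Completed: read failure 90% 26618 1436960
# 3 Extended offline Completed: read failure 90% 26618 1436960
# 4 Extended offline Completed without error 00% 25692 -
# 5 Extended offline Completed without error 00% 21566 -
"""

# number, description plus status, remaining percent, power-on hours, LBA
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)$")


def parse_selftest_log(text):
    """Turn self-test log rows into dictionaries (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        num, what, remaining, hours, lba = match.groups()
        rows.append({
            "num": int(num),
            "text": what,
            "remaining": int(remaining),
            "hours": int(hours),
            "lba": None if lba == "-" else int(lba),
        })
    return rows


def hours_from_last_pass_to_first_failure(rows):
    """Power-on hours between the last clean test and the first failure."""
    failures = [r["hours"] for r in rows if "read failure" in r["text"]]
    if not failures:
        return None
    first_failure = min(failures)
    passes = [r["hours"] for r in rows
              if "without error" in r["text"] and r["hours"] < first_failure]
    if not passes:
        return None
    return first_failure - max(passes)


gap = hours_from_last_pass_to_first_failure(parse_selftest_log(LOG))
print(f"{gap} power-on hours, about {gap / 24:.0f} days")
```

Note that the LifeTime column counts power-on hours, so the number of days only matches calendar time if the machine runs around the clock.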
In the output we can clearly see real errors in the "SMART Error Log" section (abbreviated):
SMART Error Log Version: 1
ATA Error Count: 57 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 57 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 f8 97 66 04 Error: WP at LBA = 0x046697f8 = 73832440
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 10 70 60 67 d6 40 08 01:29:24.514 WRITE FPDMA QUEUED
60 08 88 68 98 66 40 08 01:29:22.545 READ FPDMA QUEUED
60 08 00 c0 98 66 40 08 01:29:22.545 READ FPDMA QUEUED
60 08 80 60 98 66 40 08 01:29:22.545 READ FPDMA QUEUED
60 08 68 00 96 86 40 08 01:29:22.521 READ FPDMA QUEUED
Error 56 occurred at disk power-on lifetime: 26619 hours (1109 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 38 96 66 04 Error: UNC at LBA = 0x04669638 = 73831992
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 e0 50 98 66 40 08 01:29:18.453 READ FPDMA QUEUED
60 08 10 28 99 66 40 08 01:29:18.448 READ FPDMA QUEUED
60 08 f8 20 99 66 40 08 01:29:18.445 READ FPDMA QUEUED
60 08 a0 38 96 66 40 08 01:29:18.422 READ FPDMA QUEUED
61 08 08 80 08 00 40 08 01:29:15.140 WRITE FPDMA QUEUED
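The LBA printed next to each error can be reconstructed from the register dump in the same line. The Python sketch below assumes the classic 28-bit layout (low nibble of DH, then CH, CL and SN); this assumption matches the values printed in the log above, and the function names are made up for the illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the ATA registers into a 28-bit LBA.

    Assumed layout: bits 27-24 come from the low nibble of DH,
    followed by CH, CL and SN.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def decode_error_line(line):
    """Decode a line such as '40 51 08 f8 97 66 04 Error: WP at LBA = ...'."""
    fields = line.split()
    # Column order in the log: ER ST SC SN CL CH DH
    er, st, sc, sn, cl, ch, dh = (int(field, 16) for field in fields[:7])
    return lba28_from_registers(sn, cl, ch, dh)


def byte_offset(lba, logical_sector_size=512):
    """Byte offset of an LBA; 512 bytes is this drive's logical sector size."""
    return lba * logical_sector_size


lba = decode_error_line("40 51 08 f8 97 66 04 Error: WP at LBA = 0x046697f8 = 73832440")
print(hex(lba), lba, byte_offset(lba))
```

The two most recent errors decode to 73832440 and 73831992, which is what smartctl prints, and they lie only a few hundred sectors apart.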
By (re-)writing this data, the scrub gave the drive's firmware the chance to do its "magic remapping": failing sectors are reallocated to spare sectors elsewhere on the physical medium, which "resets" those errors. The redundancy of the filesystem preserved the data that the scrub reported as corrected, even though unreadable sectors were found on the disk.
In the SMART attributes section, we can also see that the "Current_Pending_Sector" and "Offline_Uncorrectable" metrics have been reset to 0:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 141 141 054 Pre-fail Offline - 72
3 Spin_Up_Time 0x0007 125 125 024 Pre-fail Always - 184 (Average 183)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 2168
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 11
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 115 115 020 Pre-fail Offline - 34
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 26627
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2162
192 Power-Off_Retract_Count 0x0032 099 099 000 Old_age Always - 2170
193 Load_Cycle_Count 0x0012 099 099 000 Old_age Always - 2170
194 Temperature_Celsius 0x0002 133 133 000 Old_age Always - 45 (Min/Max 20/55)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 12
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
This means the disk has undergone a magic rejuvenation and will hopefully be good to go for another 3 to 5 years.
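The same check can be scripted. The following Python sketch assumes the ten-column attribute layout shown above and a raw value that starts with a plain integer, which holds for this drive but not necessarily for others. The helper names are hypothetical.

```python
# A few rows copied from the attribute table above.
TABLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 11
194 Temperature_Celsius 0x0002 133 133 000 Old_age Always - 45 (Min/Max 20/55)
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
"""


def parse_attributes(text):
    """Map attribute names to the leading integer of their RAW_VALUE."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        try:
            attributes[fields[1]] = int(fields[9].split()[0])
        except ValueError:
            # Raw values in other formats are skipped in this sketch.
            continue
    return attributes


def unreadable_sectors(attributes):
    """Return the non-zero pending or offline uncorrectable counters."""
    watched = ("Current_Pending_Sector", "Offline_Uncorrectable")
    return {name: attributes[name] for name in watched if attributes.get(name, 0) > 0}


attrs = parse_attributes(TABLE)
print(unreadable_sectors(attrs) or "no pending or uncorrectable sectors")
print("reallocated sectors so far:", attrs["Reallocated_Sector_Ct"])
```

The reallocated sector count is printed separately because it does not go back to zero: it records how many sectors the drive has already remapped.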
Lessons Learned!
Other important storage devices are my (encrypted) btrfs backup disks that I use for my (manual) monthly backup routine. I will spend another blog post on describing this absolutely basic backup system, all done with standard Linux tools, but for now it should be mentioned that I learned to apply those scrubbing lessons to these disks as well. Here is the progress 80 minutes after starting the process:
dzu@krikkit:/$ sudo btrfs scrub status /media/dzu/seagate-bckup/
UUID: 2fbf9fef-dc5d-474c-a6f6-d1da2c101e77
Scrub started: Wed Mar 27 22:06:01 2024
Status: running
Duration: 1:20:05
Time left: 1:21:31
ETA: Thu Mar 28 00:47:41 2024
Total to scrub: 372.98GiB
Bytes scrubbed: 184.82GiB (49.55%)
Rate: 39.39MiB/s
Error summary: no errors found
dzu@krikkit:/$
It seems the device is in very good health, but querying SMART turns
out to be "difficult". Running smartctl without a --device option
gives us immediate errors:
dzu@krikkit:~$ sudo smartctl -a /dev/sdg
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
Read Device Identity failed: scsi error unsupported field in scsi command
If this is a USB connected device, look at the various --device=TYPE variants
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
dzu@krikkit:~$
Let's see what options we can use for the device:
dzu@krikkit:~$ sudo smartctl -a /dev/sdg --device=TYPE
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/sdg: Unknown device type 'TYPE'
=======> VALID ARGUMENTS ARE: ata, scsi[+TYPE], nvme[,NSID], sat[,auto][,N][+TYPE], usbasm1352r,N, usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, sntasmedia, sntjmicron[,NSID], sntrealtek, jmb39x[-q],N[,sLBA][,force][+TYPE], jms56x,N[,sLBA][,force][+TYPE], areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, aacraid,H,L,ID, sssraid,E,S, cciss,N, auto, test <=======
Use smartctl -h to get a usage summary
dzu@krikkit:~$
Hm. Can't smartctl tell us what to do here?
dzu@krikkit:~$ sudo smartctl -a /dev/sdg --device=test
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/sdg [SAT]: Device of type 'sat' [ATA] detected
/dev/sdg [SAT]: Device of type 'sat' [ATA] opened
dzu@krikkit:~$
Aha! So let's retry with that option:
dzu@krikkit:~$ sudo smartctl -a /dev/sdg --device=sat,auto
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: Seagate
Product: Expansion
Revision: 0708
Compliance: SPC-4
User Capacity: 1.000.204.885.504 bytes [1,00 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is fully provisioned
Logical Unit id: 0x3e41385035595353
Serial number: NA8P5YSS
Device type: disk
Local Time is: Wed Mar 27 23:32:25 2024 CET
SMART support is: Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C
Error Counter logging not supported
No Self-tests have been logged
dzu@krikkit:~$
Hrmpf. Better, but still no real result. For this backup disk I have to accept the fact that I am "SMART blind", but as long as the file system is happy, I am optimistic. From now on I will schedule such scrubs every 3 months or so to keep the disk happy.
Summary
Keeping our disks happy requires a non-negligible effort on our side! Be aware of the health of the "strategic" media of your digital life. And remember, untested backups are worth nothing. Test(!) restoring (parts of) a backup to see if the process works. You would not be the first to find out, only upon testing, that the backups were created just fine but are impossible to use.
2024-04-18 Postscript
Next Error
The next error mail from the smartd daemon:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], 720 Currently unreadable (pending) sectors
Scrub
First run:
dzu@krikkit:~$ sudo btrfs scrub status /home
UUID: 1280206b-67ec-461c-b7c9-c09498035e75
Scrub started: Wed Apr 17 23:27:02 2024
Status: finished
Duration: 2:21:59
Total to scrub: 727.54GiB
Rate: 87.45MiB/s (limit 100.00MiB/s)
Error summary: read=1104
Corrected: 548
Uncorrectable: 556
Unverified: 0
dzu@krikkit:~$
Second run, the day after:
dzu@krikkit:~$ sudo btrfs scrub status /home
UUID: 1280206b-67ec-461c-b7c9-c09498035e75
Scrub started: Thu Apr 18 18:34:43 2024
Status: finished
Duration: 2:17:44
Total to scrub: 727.56GiB
Rate: 90.15MiB/s (limit 100.00MiB/s)
Error summary: read=1024
Corrected: 492
Uncorrectable: 532
Unverified: 0
dzu@krikkit:~$
SMART
After running a -t long test:
dzu@krikkit:~$ sudo smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Toshiba P300 (CMR)
Device Model: TOSHIBA HDWD110
Serial Number: 58R8N0VNS
LU WWN Device Id: 5 000039 fd6e01d28
Firmware Version: MS2OA8J0
User Capacity: 1.000.204.886.016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Apr 18 23:37:39 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 38) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: ( 7264) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 121) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 141 141 054 Pre-fail Offline - 73
3 Spin_Up_Time 0x0007 125 125 024 Pre-fail Always - 184 (Average 183)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 2181
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 21
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 115 115 020 Pre-fail Offline - 34
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 27016
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2175
192 Power-Off_Retract_Count 0x0032 099 099 000 Old_age Always - 2183
193 Load_Cycle_Count 0x0012 099 099 000 Old_age Always - 2183
194 Temperature_Celsius 0x0002 133 133 000 Old_age Always - 45 (Min/Max 20/55)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 26
197 Current_Pending_Sector 0x0022 070 070 000 Old_age Always - 720
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 268 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 268 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 78 ae 34 00 Error: UNC at LBA = 0x0034ae78 = 3452536
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 c0 f8 af 34 40 08 01:15:43.019 READ FPDMA QUEUED
60 80 b0 80 bc 34 40 08 01:15:42.549 READ FPDMA QUEUED
60 00 a8 00 fd 34 40 08 01:15:42.546 READ FPDMA QUEUED
60 80 a0 00 fc 34 40 08 01:15:42.545 READ FPDMA QUEUED
60 80 98 80 fa 34 40 08 01:15:42.545 READ FPDMA QUEUED
Error 267 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 70 ae 34 00 Error: UNC at LBA = 0x0034ae70 = 3452528
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 e8 00 fd 34 40 08 01:15:41.687 READ FPDMA QUEUED
60 80 e0 00 fc 34 40 08 01:15:41.687 READ FPDMA QUEUED
60 80 d8 80 fa 34 40 08 01:15:41.687 READ FPDMA QUEUED
60 00 d0 80 f8 34 40 08 01:15:41.687 READ FPDMA QUEUED
60 80 c8 00 f5 34 40 08 01:15:41.687 READ FPDMA QUEUED
Error 266 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 10 70 ae 34 00 Error: WP at LBA = 0x0034ae70 = 3452528
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 c0 80 08 00 40 08 01:15:31.811 WRITE FPDMA QUEUED
60 00 b8 80 ac 34 40 08 01:15:31.811 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 08 01:15:31.719 FLUSH CACHE EXT
61 20 70 00 42 2d 40 08 01:15:31.719 WRITE FPDMA QUEUED
61 20 68 80 41 2d 40 08 01:15:31.719 WRITE FPDMA QUEUED
Error 265 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 60 76 34 00 Error: WP at LBA = 0x00347660 = 3438176
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 60 80 08 00 40 08 01:15:26.688 WRITE FPDMA QUEUED
61 00 58 70 c6 59 40 08 01:15:26.688 WRITE FPDMA QUEUED
61 00 50 30 76 59 40 08 01:15:26.688 WRITE FPDMA QUEUED
61 00 48 10 1b c8 40 08 01:15:26.688 WRITE FPDMA QUEUED
61 00 40 c8 c0 c5 40 08 01:15:26.688 WRITE FPDMA QUEUED
Error 264 occurred at disk power-on lifetime: 27013 hours (1125 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 18 8c 34 00 Error: UNC at LBA = 0x00348c18 = 3443736
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 a8 c0 8b 34 40 08 01:15:22.677 READ FPDMA QUEUED
60 08 a0 18 8c 34 40 08 01:15:22.677 READ FPDMA QUEUED
61 00 98 60 3a ae 40 08 01:15:22.677 WRITE FPDMA QUEUED
60 08 90 08 8d 34 40 08 01:15:22.677 READ FPDMA QUEUED
60 08 88 98 8c 34 40 08 01:15:22.677 READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 60% 27014 -
# 2 Extended offline Completed without error 00% 26624 -
# 3 Extended offline Completed: read failure 90% 26618 1436960
# 4 Extended offline Completed: read failure 90% 26618 1436960
# 5 Extended offline Completed without error 00% 25692 -
# 6 Extended offline Completed without error 00% 21566 -
2 of 2 failed self-tests are outdated by newer successful extended offline self-test # 2
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
dzu@krikkit:~$
For unknown reasons, the test run was not able to finish:
Self-test execution status: ( 38) The self-test routine was interrupted
by the host with a hard or soft reset.
The substantial difference of course is this:
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 26
197 Current_Pending_Sector 0x0022 070 070 000 Old_age Always - 720
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
So "Current_Pending_Sector" now reports 720 pending sectors. Doing some more scrubs and another long SMART self-test eventually brought this number to zero again.
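To see what changed between the March report and this one, the raw values can be compared side by side. This is a minimal Python sketch; the two dictionaries are typed in by hand from the two smartctl outputs in this post, and the function name is made up for the illustration.

```python
# Raw values copied from the two reports (27 March and 18 April 2024).
MARCH = {
    "Reallocated_Sector_Ct": 11,
    "Power_On_Hours": 26627,
    "Reallocated_Event_Count": 12,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 0,
}
APRIL = {
    "Reallocated_Sector_Ct": 21,
    "Power_On_Hours": 27016,
    "Reallocated_Event_Count": 26,
    "Current_Pending_Sector": 720,
    "Offline_Uncorrectable": 0,
}


def changed(before, after):
    """Return the difference for every attribute whose raw value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


for name, delta in changed(MARCH, APRIL).items():
    print(f"{name}: {delta:+d}")
```

Besides the 720 pending sectors, the comparison shows that the drive reallocated 10 more sectors within 389 power-on hours.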