[colug-432] Interpret smartctl

Bill Schwanitz bilsch at bilsch.org
Tue Jan 29 13:37:51 EST 2013


On Jan 28, 2013, at 7:37 PM, Thomas W. cranston <thomas.w.cranston at gmail.com> wrote:

> I looked at dmesg last night, finding no error messages.
> 
> I next ran smartctl, but need help deciphering the readings.
> 
> For example:
> 
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 
> Can somebody explain the significance of the values under these headers?
> 
> I See nothing under WHEN_FAILED. That means nothing wrong with the HDD? 
> No bad sectors?

These are the lines you should probably look at (copied and pasted from your email):

  1 Raw_Read_Error_Rate     0x000f   100   095   046    Pre-fail  Always -       151324
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always -       573972
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always -       20244550189059
198 Offline_Uncorrectable   0x0032   100   100   000    Old_age   Always -       126979380084739

The Current_Pending_Sector and Offline_Uncorrectable raw values seem implausibly high to me, so I'm not sure I trust them.
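To make the columns a bit more concrete, here is a minimal Python sketch that parses the four lines above. It assumes the usual smartctl convention that an attribute counts as failed when its normalized VALUE is at or below THRESH, and it assumes RAW_VALUE is a plain integer (some drives print extra text there). The function names are made up for illustration; none of this is part of smartmontools.

```python
# The four attribute lines quoted above, pasted verbatim.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   100   095   046    Pre-fail  Always -       151324
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always -       573972
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always -       20244550189059
198 Offline_Uncorrectable   0x0032   100   100   000    Old_age   Always -       126979380084739
"""


def parse_attributes(text):
    """Split each attribute line into its ten whitespace-separated columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that does not start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),    # normalized, higher is better
            "worst": int(fields[4]),    # lowest normalized value seen so far
            "thresh": int(fields[5]),   # vendor failure threshold
            "when_failed": fields[8],   # "-" means it never failed
            "raw": int(fields[9]),      # assumption: plain integer
        })
    return rows


def below_threshold(rows):
    """Names of attributes whose normalized VALUE is at or below THRESH."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


def nonzero_sector_counts(rows, watch=(196, 197, 198)):
    """Raw values of the sector-related attributes that are not zero."""
    return {r["name"]: r["raw"] for r in rows
            if r["id"] in watch and r["raw"] != 0}


rows = parse_attributes(SAMPLE)
print(below_threshold(rows))        # attributes the drive itself calls failed
print(nonzero_sector_counts(rows))  # sector counters worth a closer look
```

On these lines nothing is below its threshold, which is why WHEN_FAILED is empty, yet all three sector counters are nonzero. That is the reason an empty WHEN_FAILED column is not enough on its own to call the drive healthy.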

I've wrapped the important part of the error log below in ***.

Error 113 occurred at disk power-on lifetime: 9989 hours (416 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 bf 91 3e e0  Error: ***UNC 8 sectors at LBA = 0x003e91bf = 4100543***

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 87 91 3e e0 00      01:27:16.122  READ DMA EXT
  25 00 20 47 8f 3e e0 00      01:27:16.114  READ DMA EXT
  25 00 20 7f 8b 3e e0 00      01:27:16.107  READ DMA EXT
  25 00 20 e7 89 3e e0 00      01:27:16.105  READ DMA EXT
  25 00 08 7f 89 3e e0 00      01:27:16.104  READ DMA EXT

So, what that means is that the drive returned an uncorrectable read error (UNC) while trying to read 8 sectors at LBA 0x003e91bf (a logical block address, i.e. a sector number on the drive).
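If you are curious where that LBA comes from, it can be rebuilt from the register dump. This is a small Python sketch assuming the 28-bit layout, where SN, CL and CH hold the low three bytes and the low nibble of DH holds the top four bits. The function name is just for illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the ATA registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the error entry: SN=bf CL=91 CH=3e DH=e0
lba = lba28_from_registers(0xBF, 0x91, 0x3E, 0xE0)
print(hex(lba), lba)  # prints: 0x3e91bf 4100543
```

The SC register (08) is the sector count, which matches the "8 sectors" in the error text.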

These are your early warnings that the drive is going south. What I would recommend you do is this:

sudo smartctl -t long /dev/sda

The response should be something like this:

$ sudo smartctl -t long /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 48 minutes for test to complete.
Test will complete after Tue Jan 29 14:18:05 2013

Use smartctl -X to abort test.

So, my drive will take 48 minutes to run; yours will likely take longer (this is from a 60 GB SSD). Note that you need to leave the system running during this period, so don't suspend it or power it off.

When the time has passed, run smartctl -a /dev/sda and see what it shows:

(Note: this is from a different drive, and it is all clear.)
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     27542         -
# 2  Extended offline    Completed without error       00%     27373         -

If you see anything other than - in the LBA_of_first_error column, that indicates your drive has a failed sector.
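If you would rather check that column with a script than by eye, something like the following Python sketch could work. It assumes the self-test log layout shown above and that the LBA is printed in decimal. The failing line in BAD_EXAMPLE is a made-up illustration, not output from any real drive, and the function name is hypothetical.

```python
import re

# One log entry: "# N  description  status  NN%  hours  lba-or-dash"
ENTRY = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test entry; the LBA is None when the column is '-'."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue  # header lines and anything else are skipped
        num, desc, status, remaining, hours, lba = m.groups()
        entries.append({
            "num": int(num),
            "test": desc,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            # Assumption: smartctl prints the LBA in decimal here.
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return entries


# The clean log quoted above.
GOOD = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     27542         -
# 2  Extended offline    Completed without error       00%     27373         -
"""

# Hypothetical failing entry, invented purely for illustration.
BAD_EXAMPLE = "# 1  Extended offline    Completed: read failure       90%      9990         4100543"

clean = parse_selftest_log(GOOD)
failed = [e for e in parse_selftest_log(BAD_EXAMPLE)
          if e["first_error_lba"] is not None]
print(all(e["first_error_lba"] is None for e in clean))
print(failed)
```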

If you are doing this against a fresh drive, it is possible you simply got a bad drive; it happens. Also note that if you want to RMA the drive or something, it's highly likely that all of the above will need to be repeated using the vendor's tool of choice (typically Windows- or DOS-based, in my experience).

I also highly recommend you run smartd, which is a daemonized monitor. Run man smartd.conf to look over the options. Set smartd to start at boot. You should configure smartd to run full tests of your drives at least monthly, or more aggressively if you want.

(Note: smartd can send an email when it finds errors, via the -m directive in smartd.conf, which is why I like and recommend it.)

hope this helps!

Bill

Here are the relevant lines from smartd.conf on my file server at home:
/dev/sda -d ata -s L/../../3/18
/dev/sdb -d ata -s L/../../3/18
/dev/sdc -d ata -s L/../../3/18
/dev/sdd -d ata -s L/../../3/18
/dev/sde -d ata -s L/../../3/18

read the man page to see what that means ;)
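For the impatient: as I read the smartd.conf man page, the -s argument is a regular expression matched against a string of the form T/MM/DD/d/HH, where T is the test type (L for long), MM the month, DD the day of the month, d the day of the week (1 = Monday through 7 = Sunday) and HH the hour. The Python sketch below builds that string for a given time and checks it against the pattern from the lines above. The function name is made up, and the use of a full-string match is my assumption about how smartd applies the regex.

```python
import re
from datetime import datetime


def schedule_matches(schedule_regex, when, test_type="L"):
    """Check whether a smartd-style -s regex would fire at the given hour."""
    # Build T/MM/DD/d/HH; isoweekday() gives 1 = Monday ... 7 = Sunday.
    stamp = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
    # Assumption: the whole string has to match the pattern.
    return re.fullmatch(schedule_regex, stamp) is not None


SCHEDULE = "L/../../3/18"

# Wednesday 30 Jan 2013 at 18:00 matches; Tuesday at the same hour does not.
print(schedule_matches(SCHEDULE, datetime(2013, 1, 30, 18, 0)))  # True
print(schedule_matches(SCHEDULE, datetime(2013, 1, 29, 18, 0)))  # False
```

So L/../../3/18 works out to a long self-test every Wednesday at 18:00, in any month and on any day of the month.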

