[colug-432] Solid State Drives (was: SMART HDD Status)

Thu Feb 23 18:01:56 EST 2012

On Thu, Feb 23, 2012 at 09:51:08AM -0500, Joshua Kramer wrote:
> Last night I
> procured a new enclosure for my HD, one that gives no errors reading or
> writing NTFS volumes (and supports SMART status).  Indeed, I noticed that
> the values for Hardware ECC Recovered, Seek Error Rate, Read Error Rate for
> my Seagate Momentus were way off track - in the millions.  Scary, right?

A related query:  Does anyone have experience with the new solid-state
disk drives?  In particular, I wonder if their failure mode(s) are far
different from conventional drives.  (Well, duh, I suppose some would say
that that's obvious!)  But, with conventional drives, we typically either
see catastrophic failure, or sector errors for which a retry may or
may not succeed.  It's pretty straightforward, if not easy, to resolve:
If you can get the data off the failing drive, do so -- you can generally
trust it.  If not, load a new drive from a backup and continue.

I have a SheevaPlug ARM-based computer where the root partition lives on
an SD-card -- the same kind of NAND flash memory found in solid state
drives.  I was seeing occasional file system corruption on this system at
boot -- stuff that required me to fsck the root file system by hand --
although the little server appeared to run fine.  Finally, I decided to
tear into it and find out what was going on.  It turns out that the SD-card
was going bad, failing in such a way that it would occasionally introduce
a bit error -- usually it was just one bit or a small cluster of bits wrong
in a given sector when this occurred.  I replaced the card and everything
seemed fine.
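
For anyone wanting to check a card the same way, here is a minimal Python
sketch of how one might count flipped bits per sector between a known-good
image and a suspect one.  It is not the procedure I actually used; the
function name and the 512-byte sector size are assumptions made for
illustration.

```python
from typing import Dict


def bit_errors_by_sector(good: bytes, suspect: bytes,
                         sector_size: int = 512) -> Dict[int, int]:
    """Map sector index -> number of flipped bits, for sectors that differ.

    sector_size=512 is an assumption; adjust it to the device in question.
    """
    if len(good) != len(suspect):
        raise ValueError("images must be the same length")
    errors = {}
    for offset in range(0, len(good), sector_size):
        a = good[offset:offset + sector_size]
        b = suspect[offset:offset + sector_size]
        if a != b:
            # XOR leaves a 1 in every bit position where the two bytes differ.
            flipped = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
            errors[offset // sector_size] = flipped
    return errors
```

A result with only a handful of sectors, each showing one or a few flipped
bits, would match the pattern described above.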

But then I began to wonder:  Since the unit seemed to run fine even with
the failing card, and no bad blocks were being reported, what was happening
with my backups?  It turns out I had perhaps half a dozen on hand, so I
un-cpio-ed
them (yes, I cut my teeth on cpio and it still seems more natural to me
than tar) and proceeded to cmp the files in one set of backups against the
others.  To my horror, I started finding discrepancies in what should be
static files, like those in /var/cache, and they became more and more
prevalent with each succeeding backup.  (For some reason, these errors
seemed to cluster in /var/cache.  I found no errors at all in /bin, /usr/bin,
/sbin, /usr/sbin, or /etc files.)  Yet these bit errors were propagating
through the backups with no sign of ever being logged or reported.  So,
now I'm questioning whether I should just take a day and completely reload
the system.  On the one hand, it would be a good chance to pull it up to
the latest Debian load, and it's probably time.  On the other hand, the
little box continues to do everything I ask of it, and quite competently,
with nary a hiccup.
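
The cmp pass over the extracted backups could be automated along these
lines.  This is only a sketch in Python, not the commands I actually ran:
it assumes two backups have already been unpacked into separate
directories, and the function name and paths are hypothetical.

```python
import filecmp
import os


def changed_files(old_root, new_root):
    """Return sorted relative paths of regular files present in both
    trees whose contents differ byte for byte."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            # Skip symlinks and anything missing from the newer backup.
            if os.path.islink(old_path) or os.path.islink(new_path):
                continue
            if not os.path.isfile(new_path):
                continue
            # shallow=False forces a content comparison, like cmp, rather
            # than trusting size and mtime.
            if not filecmp.cmp(old_path, new_path, shallow=False):
                changed.append(rel)
    return sorted(changed)
```

Calling changed_files("/restore/backup1", "/restore/backup2") (made-up
paths) lists every file that differs between the two sets.  Files that
legitimately change between backups will show up too, so the output still
needs a human eye to pick out the ones that should have been static.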

Bottom line:  Are solid state drives any better than SD-cards at detecting
and reporting such errors?  Do they support SMART, too, and if so, to what
extent?  Does anyone have experience in this area?

Rob