The unthinkable (or at least the unwanted) happened!

A system drive failed (and the S.M.A.R.T. system gave no warning!), and it wasn’t a simple drive. It was a RAID 0 set [that is a zero, though it looks like a small ‘o’]. Serious lesson learned: no system drives on RAID 0 again!

How it happened: I was getting ready to go on a trip to Prague and Vienna and wanted to load some recorded TiVo shows from TiVo-to-go onto a couple of SD cards. I wanted to test how some of them had transferred to the PC (Vista Premium), so I started playing one (which opened Media Player). Things started to stutter and freeze, and the mouse cursor stopped working. Then came the dreaded Blue Screen of Death. It mentioned something about the video driver, along with the typical ‘try to restart after removing offending hardware’. That is not possible on my system, as I do not have VGA built into the motherboard; the video card was an ATI 4550. Rebooting did not work: it reported a missing ‘operating system’. The only time this had happened before was when I was using one of my many live Linux disks and the BIOS switched the SATA channel boot order. I went into the BIOS, and that was not it. Apparently the stress of playing that video toasted a portion of one of my twin 500G Seagate drives. The AMD RAID hardware driver/BIOS is so good that it noticed the missing chunk and would not load the striped set. The second drive always showed up as ‘offline’. I wish it were more forgiving!
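
For anyone wondering why one damaged drive takes down the whole set, here is a minimal sketch of how RAID 0 spreads data across two drives. The 64 KiB stripe size is an assumption for illustration (I am not claiming it is what my AMD controller used), and the function names are made up for this example.

```python
# Sketch of RAID 0 address mapping across two drives.
# STRIPE_SIZE is an assumed value, not the setting of the array in this post.
STRIPE_SIZE = 64 * 1024  # bytes per stripe unit (assumption)
N_DRIVES = 2


def locate(logical_offset, stripe_size=STRIPE_SIZE, n_drives=N_DRIVES):
    """Return (drive_index, offset_on_drive) for a byte offset in the array."""
    stripe_index, within = divmod(logical_offset, stripe_size)
    drive = stripe_index % n_drives            # stripes alternate between drives
    drive_offset = (stripe_index // n_drives) * stripe_size + within
    return drive, drive_offset


def drives_touched(start, length, stripe_size=STRIPE_SIZE, n_drives=N_DRIVES):
    """Set of drives holding any part of the byte range [start, start + length)."""
    first = start // stripe_size
    last = (start + length - 1) // stripe_size
    return {s % n_drives for s in range(first, last + 1)}


if __name__ == "__main__":
    for offset in (0, STRIPE_SIZE, 2 * STRIPE_SIZE):
        print(offset, locate(offset))
    # Any file bigger than one stripe lives on both drives.
    print(drives_touched(0, 1024 * 1024))
```

With this layout, any file larger than a single stripe has pieces on both drives, so there is no redundancy to fall back on when one drive drops out.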

I went into panic mode because my backup drive is only 600G and the data I had on the 1 terabyte RAID pair was up to about 750G. Since the data had not fit on that drive for months, I kept telling myself “I’ll get another, bigger backup drive and then back up”. I actually said that 3 weeks ago, when my wife got a nasty redirect virus that rewrote many of her files (and she had Avira anti-virus installed). Since I thought I was going to get another drive, I had in the meantime used that portable drive for some work-related data. Whoops! No backup. Even though I am fanatical about backups at my regular job as an IT manager, I let the home front slip. I even have an offsite backup of our office data, which backs up remotely with rsync to a terabyte drive on my Ubuntu Server at home.

I tell this tale so that maybe you can avoid such stress and panic. And if it does happen to you, maybe my experience can help you get all (or some) of your data back.

The first thing I did was use the AMD RAIDXpert application to see what was up and to try to coax the failed drive back online. I found that if I unplugged the power to the second drive and plugged it back in, I could get it to respin, and the tab that said ‘activate’ would light up. I clicked on that tab and the drive showed ‘functional’ for a few seconds, then dropped out again. This wasn’t working. So I rebooted, went into the BIOS and configured SATA to emulate IDE. Before I go on, I should say that I still had a regular PATA drive with my old XP and data hooked up. I had to re-install XP Pro (good thing the CD was still in my software drawer), and it installed back over that PATA drive. Then I could do any recovery operations from that XP system. I also have many live Linux discs to choose from. So now the troubleshooting and tediousness could begin.

Many steps: The XP system could ‘see’ the first RAID drive (now as IDE), but could not see the second. I fired up Seagate’s SeaTools and found it was not there either. Then I pulled the power plug on the second drive again; it spun back up and popped up in the SeaTools window. Hey, this is encouraging! I ran a Short Drive Test and it failed. I ran a Long Drive Test and it got more than 50% through before failing. It recommended SeaTools for DOS, so I downloaded the ISO and created the bootable CD. After the reboot, I had to pull the power plug on the drive yet again to get it recognized. Then I ran a long test with the DOS version, which is supposed to try to repair any bad spots by replacing them with a hash (#?) or something, without moving the data bits. Anyway, that stalled at about 62% and never finished, even after many hours. So at least I knew where the problem was on the drive!

OK, I had to think of something else. I started searching online. I was beginning to consider a data recovery company with a certified clean room. I saw some of the prices and almost fell out of my chair: $1500 up to $9000 in some cases. Yikes! Since I just bought a new car this year, I didn’t want to shell out thousands more dollars. I then found QueTekConsulting in Texas. They had a software app called ‘Recoup’, which supposedly extracts images of broken drives, retrying over and over and skipping past bad data spots. This was worth the money for me to try.

I also downloaded a trial of ‘Raid Recovery for Windows’ from another recovery site. When I was able to coax the drive into showing up with the power plug trick (which was getting harder each time), the Raid Recovery app saw the two drives and I chose ‘RAID 0’. The next screen showed some file and folder names with a $ in front, all completely unreadable, and I couldn’t find any file that I knew was there. It said the total number of files was 17! OK, so this app didn’t work.

Once I got the Recoup app registered, I started the image copy process. I first tried a 500G USB drive as the destination, and Recoup said there was not enough room (even though the drive was completely empty and formatted as NTFS). So I went out and got a 750G USB drive and a replacement 500G internal SATA drive. I know I should have gotten the exact same model as the failed drive, but it was no longer available. I ended up with a WD with the same size cache, and crossed my fingers. My plan:

  • make an image of the failed drive
  • replace failed drive with new blank drive
  • move/copy image over to the new drive
  • switch AMD SATA back to RAID in the BIOS
  • reboot and hope to see original drives that were RAID

My original RAID set consisted of 300G for Windows and programs, and 600G for data and extra stuff.  This may have been what saved my ass.

I ran Recoup to copy the failed drive over to the new 750G USB drive. This took 14 hours. The good thing about this app is that if you stop it or lose power, it picks up where it left off. It keeps a detailed log in a separate folder as it goes along, and that extra space is why I could not make the image on a drive the same size as the failed drive. I lost no power, so it ran straight through for 14 hours. I was thinking of getting Acronis True Image to write the image back, because my Norton Ghost did not see the SATA drive (even when set as IDE). I downloaded and tried HDClone and WinImage, and neither could read the ‘.dsk’ image created by Recoup. I even tried renaming the .dsk to .img. Then I thought about ‘dd’ in Linux. I use it all the time to create images for ARM boards (Beagleboard, PC104, etc). I searched online to see if others had used it for this. A few forum and blog posts mentioned trying it for RAID 0 recovery, but I could not find a single one where it successfully restored a drive. They ended with “…I’ll give that a try”, and then the authors never returned to fill us in.
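
To show the general idea of a copy that can resume and that skips unreadable spots, here is a rough sketch. It is not how Recoup works internally (I have no idea what it does under the hood); the function name, the chunk size, the one-number log format and the `reader` hook are all made up for illustration. It also uses the file size to know where to stop, which works for ordinary files but not for raw block devices.

```python
import os

CHUNK = 1024 * 1024  # bytes per read; an assumed value


def resumable_copy(src, dst, log, chunk=CHUNK, reader=None):
    """Copy src to dst chunk by chunk, recording progress in a log file.

    If the log already exists, copying resumes at the recorded offset.
    Chunks that raise OSError on read are replaced with zeros so that all
    later data keeps its original position. Returns the offsets of the
    chunks that could not be read. `reader` is a hypothetical hook
    (file object, size) -> bytes, used here only to simulate bad spots.
    """
    total = os.path.getsize(src)  # regular files only; block devices report 0
    done = 0
    if os.path.exists(log):
        with open(log) as f:
            text = f.read().strip()
            done = int(text) if text else 0
    bad = []
    mode = "r+b" if os.path.exists(dst) else "wb"
    with open(src, "rb") as fin, open(dst, mode) as fout:
        while done < total:
            size = min(chunk, total - done)
            try:
                fin.seek(done)
                data = fin.read(size) if reader is None else reader(fin, size)
            except OSError:
                data = b""
                bad.append(done)
            fout.seek(done)
            fout.write(data.ljust(size, b"\x00"))  # zero-fill whatever is missing
            fout.flush()
            done += size
            with open(log, "w") as f:  # remember how far we got
                f.write(str(done))
    return bad
```

The log lives next to the image, so the destination needs room for both, which fits what I saw when the empty 500G drive was rejected.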

I fired up Puppy Linux Live and it recognized the AMD SATA drives (CentOS did not). I made sure the first RAID drive was unplugged, then checked the location of the second drive. It showed up as /dev/sda. The USB drive with the image mounted automatically, so I knew where that was too (sca1). So, from a terminal window, I used this command:

   [root user#]  dd if=/dev/sca1/raid0-2.dsk of=/dev/sda bs=1k conv=sync,noerror
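
I did not verify the copy before rebooting, but it would have been a cheap sanity check. Here is a small sketch of how that could be done by hashing the image and the same number of bytes from the start of the target. The function names are made up for this example, and reading a raw device such as /dev/sda would need root.

```python
import hashlib
import os


def sha256_prefix(path, length, chunk=1024 * 1024):
    """SHA-256 of the first `length` bytes of a file or device."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            data = f.read(min(chunk, remaining))
            if not data:  # target is shorter than expected
                break
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()


def image_matches_target(image, target):
    """True if the target begins with exactly the bytes of the image file."""
    size = os.path.getsize(image)
    return sha256_prefix(image, size) == sha256_prefix(target, size)
```

Only the first image-sized stretch of the target is compared, so any zero padding that conv=sync adds to a short final block does not affect the result.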

Crossing my fingers: When the copying was finished (about 1 hour), I shut down the live Linux. I booted up and changed the BIOS back to RAID and SATA for all channels. When XP came up again, the first thing I did was load RAIDXpert, and I saw that it recognized the two physical drives. That is when I finally felt relieved: the first logical drive was online! Before, even when I had coaxed the RAID set into being recognized (with the bad drive), both logical drives were always offline. The good news for me was that the system partition was this first logical drive. It was all there! I don’t know if I’ll ever get the 600G partition back, but that held mostly misc data and video. All my music is on a 750G drive attached to the living Ubuntu server, so that was never in danger, although I think I also had the music backed up on this 600G partition (my music probably tops out at 400-450G in storage space).

So, what am I doing now? Making a fresh (non-compressed) backup of the system partition I temporarily lost, onto that 500G USB drive I was going to replace. I am using RichCopy with about 7 threads to multitask the backup (I have a 4 core processor for it to use). Then, when I am sure I can get no more out of that image file, I will wipe the 750G drive clean, use it for my Vista image backup and never put anything else on it. I am going to reconfigure my drives (no RAID for the system partition!) and bump up to Windows 7. But I will do that after coming back from abroad.
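
For the curious, the idea of a multi-threaded file copy can be sketched in a few lines. This is not RichCopy or how it works inside, just the general approach; the function name is made up, and the 7 workers simply mirror the thread count I picked.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_tree_parallel(src_root, dst_root, workers=7):
    """Copy every file under src_root to dst_root using a pool of threads.

    Returns the number of files copied.
    """
    jobs = []
    for folder, _dirs, files in os.walk(src_root):
        rel = os.path.relpath(folder, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)  # recreate the folder layout first
        for name in files:
            jobs.append((os.path.join(folder, name),
                         os.path.join(target_dir, name)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every copy and re-raises the first error, if any
        list(pool.map(lambda job: shutil.copy2(*job), jobs))
    return len(jobs)
```

Several files are in flight at once, which mostly helps when there are lots of small files; the drives themselves are still the bottleneck for big ones.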

Conclusion: I think having the system on its own 300G partition is what let the rebuilt RAID set keep that partition’s stripes intact. I shudder to think what would have happened if I had one big terabyte “C:” drive. I bet I would have lost it all. It is also interesting to note that using a different brand of drive in the RAID recovery process did not ruin my chances of recovery.
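
Here is some back-of-the-envelope arithmetic behind that hunch. It assumes things I cannot confirm: that the two logical drives sit one after the other from the start of each disk, that each is split evenly across both disks, and that the 62% where the DOS long test stalled maps straight onto a position on the drive.

```python
DRIVE_GB = 500           # size of each physical drive
N_DRIVES = 2
LOGICAL_GB = [300, 600]  # system volume first, then the data volume
FAILED_AT = 0.62         # where the SeaTools for DOS long test stalled


def per_drive_ranges(logical_sizes, n_drives):
    """(start, end) in GB that each logical drive occupies on one physical drive."""
    ranges = []
    start = 0.0
    for size in logical_sizes:
        share = size / n_drives  # even split across the stripe set (assumption)
        ranges.append((start, start + share))
        start += share
    return ranges


def hit_volume(bad_gb, ranges):
    """Index of the logical drive whose range contains bad_gb, or None."""
    for index, (low, high) in enumerate(ranges):
        if low <= bad_gb < high:
            return index
    return None


if __name__ == "__main__":
    ranges = per_drive_ranges(LOGICAL_GB, N_DRIVES)
    bad_gb = FAILED_AT * DRIVE_GB
    print(ranges, bad_gb, hit_volume(bad_gb, ranges))
```

Under those assumptions the system volume uses the first 150G of each disk and the data volume the next 300G, so a bad spot around the 310G mark would land in the 600G data volume and leave the system volume untouched.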