Unthinkable (or at least Unwanted) happened!

A system drive failed (and the S.M.A.R.T. system gave no warning!), and it wasn’t a simple drive.  It was a RAID 0  [that is a zero, but looks like small ‘o’] set.  Serious lesson learned – no system drives on RAID 0 again!

How it happened:  I was getting ready to go on a trip to Prague/Vienna and wanted to load some recorded TiVo shows from TiVo-to-go onto a couple of SD cards.  I wanted to test how some of them had transferred to the PC (Vista Premium) and started playing one (which opened up Media Player).  Things started to stutter and freeze, and the mouse cursor no longer worked.  Then the dreaded Blue Screen of Death.  It mentioned something about the video driver and the typical ‘try to restart after removing offending hardware’.  Not possible on my system as I do not have VGA built in to the motherboard.  It was an ATI 4550 video card.  Rebooting did not work as it said ‘missing operating system’.  The only time this happened before was when I was using one of the many live Linux disks I have and the BIOS switched the SATA channel boot order.  I went into the BIOS and that was not it.  Apparently the stress of playing that video toasted a portion of one of my twin 500G Seagate drives.  The AMD RAID hardware driver/BIOS is so good that it noticed the missing chunk and would not load the striped set.  The second drive always showed up as ‘offline’.  I wish it was more forgiving!

I went into panic mode because my backup drive is only 600G and the data I had on the 1 terabyte RAID pair was up to about 750G.  Since I hadn’t been able to fit the data on the drive for months, I said to myself “I’ll get another bigger backup drive and then back up”.  I actually said that 3 weeks ago when my wife got a nasty redirect virus that rewrote many of her files (and she had Avira anti-virus installed).  Since I thought I was going to get another drive, I had since used that portable drive to put some work-related data on.  Whoops!  No backup.  Even though I am fanatical about backup at my regular job as an IT manager, I let the home front slip.  I even have an offsite backup of our office data remotely backing up to a terabyte drive on my Ubuntu Server at home using rsync.

I tell this tale so maybe you can avoid such stress and panic.  Also, if it happens to you, maybe my experience can help you get all (or some) of your data back.

The first thing I did was use the AMD RAIDXpert application to see if I could work out what was up and coax the failed drive back online.  I found that if I unplugged the power to the second drive, I could get it to respin and the tab that said ‘activate’ would light up.  I clicked on that tab and the drive showed ‘functional’ for a few seconds.  This wasn’t working.  So then I rebooted, went into the BIOS and configured SATA to emulate IDE.  Now before I go on, I should say that I still had a regular PATA drive with my old XP and data still hooked up.  I had to re-install XP Pro (good thing the CD was still in my software drawer) and it installed back over that PATA drive.  Then I could do any recovery operations from that XP system.  I also have many Live Linux discs to choose from.  So now the troubleshooting and tediousness could begin.

Many steps:  The XP system could ‘see’ the first RAID drive (now as IDE), but could not see the second.  I fired up Seagate’s SeaTools and found it was not there either.  Then I pulled the power plug on the 2nd drive again and it spun back up and popped up in the SeaTools window.  Hey, this is encouraging!  I performed a Short Drive Test and it failed.  I performed a Long Drive Test and it got more than 50% through before failing.  It recommended SeaTools for DOS, so I downloaded the ISO and created the bootup CD.  After the reboot, I had to pull the power plug on the drive again to have it be recognized.  Then I performed a long test with the DOS version, which is supposed to try and repair any bad spots by replacing them with a hash (#?) or something, without moving the data bits.  Anyway, that failed at about 62% through – never finished even after many hours.  So at least I knew where the problem was on the drive!

OK, I had to think of something else.  I started searching online.  I was beginning to consider a data recovery company with a certified clean room.  I saw some of the prices and almost fell out of my chair – $1500 up to $9000 in some cases.  Yikes!  Since I just bought a new car this year, I didn’t want to shell out many more thousands of dollars.  I then found QueTekConsulting in Texas.  They had a software app called ‘Recoup’ which supposedly extracts images of broken drives, retrying repeatedly and skipping over bad data spots.  This was worth the money for me to try.

I also downloaded a trial of ‘Raid Recovery for Windows’ from another recovery site.  When I was able to coax the drive to show up with the power plug trick (which was getting harder each time), the Raid Recovery app saw the two drives and I chose ‘RAID 0’.  The next screen showed some filenames and folders with a $ in front which were completely unreadable, and I couldn’t search for any file that I knew was there.  It said the total number of files was 17!  OK, so this app didn’t work.

Once I got the Recoup app registered, I started the image copy process.  I first tried on a 500G USB drive and it said there was not enough room (and it was completely empty and formatted in NTFS).  So I went out and got a 750G USB drive and a replacement 500G internal SATA drive.  I know I should get the exact same model of the failed drive, but it was no longer available.  I ended up with a WD with the same size cache, and I was going to cross my fingers.  My plan:

  • make an image of the failed drive
  • replace failed drive with new blank drive
  • move/copy image over to the new drive
  • switch AMD SATA back to RAID in BIOS
  • reboot and hope to see original drives that were RAID

My original RAID set consisted of 300G for Windows and programs, and 600G for data and extra stuff.  This may have been what saved my ass.

I ran Recoup to copy over to the new 750G USB drive.  This took 14 hours.  The good thing about this app is that if you stop or lose power, it picks up where it left off.  It keeps a detailed log as it goes along in a spare folder.  That is why I could not make an image to a drive of the same size as the failed drive.  I lost no power so it continued straight for 14 hours.  I was thinking of getting Acronis True Image because my Norton Ghost did not see the SATA drive (even when set as IDE).  I downloaded and tried HDClone and WinImage, and neither could read the ‘.dsk’ image created by Recoup.  I even tried renaming the dsk to img.  Then I thought about ‘dd’ in Linux.  I use it all the time to create images for ARM boards (Beagleboard, PC104, etc).  I searched online to see if others had used it for that.  A few forums and blogs mentioned trying it for RAID 0 recovery, but I could not find a single reference where it successfully restored a drive.  They ended with “…I’ll give that a try” and then they did not return to fill us in.

I fired up Puppy Linux Live and it recognized the AMD SATA drives (CentOS did not).  I made sure the first RAID drive was unplugged, then checked on the location of the second drive.  It showed up as /dev/sda.  The USB drive with the image mounted automatically, so I knew that information (sca1).  So, from a terminal window, I used this command:

   [root user#]  dd if=/dev/sca1/raid0-2.dsk of=/dev/sda bs=1k conv=sync,noerror
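A hedged sketch of the same dd step, rehearsed on ordinary files so nothing can be overwritten by accident.  The function name restore_image and the temp-file paths are my own placeholders, not part of the original session; on the real system the input would be the image file on the mounted USB drive and the output the blank replacement disk.

```shell
#!/bin/sh
# Sketch only: restore_image and all paths here are hypothetical placeholders.
# The demo runs against temp files; point it at a real disk only after
# confirming the device name (for example with fdisk -l).

restore_image() {
    image=$1
    target=$2
    # noerror: keep going after a read error
    # sync: pad any short input block with zeros so offsets stay aligned
    dd if="$image" of="$target" bs=1k conv=noerror,sync 2>/dev/null || return 1
    # Compare only the image's own length; a real disk may be larger.
    size=$(( $(wc -c < "$image") ))
    cmp -n "$size" "$image" "$target"
}

workdir=$(mktemp -d)
head -c 8192 /dev/urandom > "$workdir/raid0-2.dsk"   # stand-in for the image
: > "$workdir/target.bin"                            # stand-in for the new drive

if restore_image "$workdir/raid0-2.dsk" "$workdir/target.bin"; then
    echo "restore verified"
fi
```

A 1k block size is slow on a whole disk.  A larger one is faster, but with noerror,sync a single bad read then zero-fills a larger chunk.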

Crossing my fingers:  When copying was finished (about 1 hour), I shut down the Live Linux.  I booted up and changed the BIOS back to RAID and SATA for all channels.  When XP came up again, the first thing I did was load RAIDXpert, and it saw the two physical drives.  Then came the moment of relief – the first logical drive was online!  Before, even when the RAID set was recognized with coaxing (with the bad drive), both logical drives were always offline.  The good news for me was that the system partition was this first logical drive.  It was all there!  I don’t know if I’ll ever get the 600G partition back, but that had misc data and video mostly.  All my music is on a 750G drive attached to the living Ubuntu server, so that was never in danger.  Although I think I had the music backed up on this 600G partition (my music probably tops out at 400-450G in storage space).

So, what am I doing now?  Making a fresh backup (non-compressed) of the system partition I temporarily lost, onto that 500G USB drive I was going to replace.  I am using RichCopy with about 7 threads to multitask the backup (I have a 4 core processor for it to use).  Then, when I am sure I can get no more out of that image file – I will wipe the 750G clean and use that as my Vista image backup and never replace it with anything else.  I am going to reconfigure my drives (no RAID for the system partition!) and bump up to Windows 7.  But I will do that after coming back from abroad.

Conclusion:  I think having the 300G partition as the system drive allowed the RAID rebuild to keep the stripes intact.  I shudder to think what would have happened if I had one big terabyte “C:” drive.  I bet I would have lost it all.  It is interesting to note that using a different brand drive in the RAID recovery process did not ruin my chances of recovery either.
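A rough back-of-the-envelope check of that hunch, written as a small shell calculation.  It assumes two things I did not verify: that the RAID BIOS lays the logical drives out one after the other across both members, and that the SeaTools percentage tracks position on the disk.

```shell
#!/bin/sh
# All figures come from the post; the layout assumptions are mine.
member_gb=500      # size of each drive in the stripe set
members=2          # RAID 0 across two drives
system_gb=300      # first logical drive (Windows and programs)
fail_pct=62        # where the long SeaTools for DOS test gave up

# RAID 0 splits each logical drive evenly across the members.
system_per_member_gb=$(( system_gb / members ))
system_end_pct=$(( system_per_member_gb * 100 / member_gb ))

echo "system drive uses the first ${system_per_member_gb}G (${system_end_pct}%) of each member"
if [ "$fail_pct" -gt "$system_end_pct" ]; then
    echo "bad area at ~${fail_pct}% falls outside the system drive"
else
    echo "bad area at ~${fail_pct}% falls inside the system drive"
fi
```

Under those assumptions the damaged region sits well past the system drive's share of each disk, which would fit with the first logical drive coming back while the 600G one stayed offline.  With a single 1 terabyte volume, the same bad area would have landed inside the only logical drive.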

Posted under Management

WebDAV on Ubuntu (10.04)

Had an interesting go-around with WebDAV on a non-production test server with Apache2, and the errors seemed almost random.  There were a number of factors involved.  First, I am testing in preparation for having multiple authors update a website.  WebDAV was the logical choice.  Installing WebDAV was simple enough:

  • load or enable modules (a2enmod)  dav, dav_fs
  • load up authentication module(s) [I started with auth_digest]
  • insert directives like turning DAV on
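The steps above might look roughly like this on a Debian/Ubuntu Apache2 layout.  The helper missing_mods is a made-up name for illustration; it only checks for the .load links that a2enmod creates under mods-enabled, and the a2enmod lines themselves are left as comments because they need root.

```shell
#!/bin/sh
# Hypothetical helper: print every module that has no .load link in the
# given mods-enabled directory.
missing_mods() {
    dir=$1
    shift
    for mod in "$@"; do
        [ -e "$dir/$mod.load" ] || echo "$mod"
    done
}

# On the real server (assumed default paths):
#   sudo a2enmod dav
#   sudo a2enmod dav_fs
#   sudo a2enmod auth_digest
#   sudo /etc/init.d/apache2 restart
#   missing_mods /etc/apache2/mods-enabled dav dav_fs auth_digest

# Self-contained demo against a scratch directory.
demo=$(mktemp -d)
touch "$demo/dav.load" "$demo/dav_fs.load"
missing_mods "$demo" dav dav_fs auth_digest
```

In the demo only auth_digest is reported, because its .load file was never created.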

Well, it seemed to work with default settings.  Since I was enabling this for authors, it was not used for a separate directory or folder, but the root folder.  The initial test using ‘cadaver’ went well.  It all went downhill once I started testing with Dreamweaver.  Apparently, no matter what I did, Dreamweaver would not use ‘auth digest’ mode.  Supposedly if you add PROPFIND to the <LimitExcept> directive, Dreamweaver should find and use Auth Digest.  Nope, didn’t work.  Since Apache thought the client was using ‘Auth Basic’, I decided to go ahead and use ‘Basic’.  So that meant I had to create a new password file with ‘htpasswd’, which I did.

htpasswd -c /var/run/apache2/.passwordfile username

Well, after restarting Apache, I tested with cadaver and I had new errors:  could not find ‘user’.  One suggestion I found on the web said that you should name your user with the server name before it (much like the domain\user credentials in the Windows world) like this:  servername\\username [the second slash is there because it ignores the first – whatever].  This did not work either.  Others mentioned it was because some Ubuntu and Debian builds did not link the ‘auth_basic’ module properly.  I had this listed in my ‘mods-enabled’ folder, so it was loaded.

One novel suggestion (and I used this method) was to place all your DAV directives in the ‘dav_fs.conf’ file in the ‘mods-enabled’ folder, thereby automatically removing the directives when the DAV mods are unloaded.  ‘cadaver’ still didn’t work after all this.  I knew if I could get ‘cadaver’ to see the user this time, I would be good to go, as Dreamweaver works with Auth Basic.

After all that messing around, it turned out to be pretty simple.  Originally, I had checked to make sure the Digest password file had proper permissions (group had to be www-data), so that was why that worked with ‘cadaver’.  When creating the new password file with htpasswd, it defaulted to a group of root.  I changed that file to the www-data group and everything worked.  This is how my directives in dav_fs.conf look:

DAVLockDB /var/run/apache2/lock/davlock
DAVMinTimeout 600

<Location /webdav/>
      DAV On
      AuthType basic
      AuthName username
      AuthUserFile /var/run/apache2/.davpasswordfile
      <Limitexcept GET HEAD OPTIONS PROPFIND>
             Require valid-user
      </Limitexcept>
</Location>

My ‘alias’ directive is still located in the default virtual host file.  Good news is it works now!
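Since the whole problem came down to the group on the password file, a quick check like the following might have saved some time.  This is only a sketch: check_pwfile_group is a name I made up, it relies on GNU stat as shipped with Ubuntu, and the real path and group are shown in comments only.

```shell
#!/bin/sh
# Hypothetical helper: confirm a password file belongs to the expected group.
# Uses GNU stat (-c), as found on Ubuntu.
check_pwfile_group() {
    file=$1
    want=$2
    have=$(stat -c '%G' "$file") || return 2
    if [ "$have" = "$want" ]; then
        echo "ok: group is $have"
    else
        echo "mismatch: group is $have, expected $want"
        return 1
    fi
}

# On the real server (path taken from the directives above):
#   check_pwfile_group /var/run/apache2/.davpasswordfile www-data \
#       || sudo chgrp www-data /var/run/apache2/.davpasswordfile

# Self-contained demo on a scratch file.
scratch=$(mktemp)
current=$(stat -c '%G' "$scratch")
check_pwfile_group "$scratch" "$current"
```

The function only reports; the chgrp is kept separate so the fix stays an explicit step.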

Posted under Applications

Ubuntu 11 (Natty) on Beagleboard

I have no idea what messed up the microSD that had Ubuntu 10.10.  I loaded up a new SD with the latest 11.04 image (zcat and dd) and that went pretty smoothly.  Meanwhile, I decided to put Angstrom back on that corrupted SD card (worst case it wouldn’t work) with my development PC (running Linux Mint at present).  It worked fine, but went blank after firing up gdm.  I got the image from the Narcissus site and custom made it with gnome and other extras.  I probably should have stuck with the console version.  It fired up fine when I redid it with just the bare-bones console version.  Then I made the mistake of adding the gdm package.  I wasn’t sure at the time if that was the issue.  It was.  I can’t remember if the SD card that came with the board ever got to the gnome desktop.  I’m betting it has something to do with the fact it is using HDMI, but I’m not that concerned as I have Ubuntu working like a champ.

I was hoping with version 11 that the clock speed issue with the xM board would be solved.  Apparently the new kernel also chokes with the 1GHz Beagleboard.  I had to go into the boot.scr in the /boot folder to change the boot parameters from 1000 to 800 for the CPU.  After doing that the speed came back (it was appalling before the change, as it literally took 5 or 6 seconds for the prompt to come up after opening the terminal).  Before I made the boot script change, the USB cam was hardly working – it was extra grainy and barely moved fast enough to have a frame rate.  Focus was affected too.  After the fix, the cam worked as it should.
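For anyone wanting to script that change, here is a minimal sketch.  The post does not name the parameter; mpurate is my assumption, since that is what BeagleBoard kernels of that period commonly used for the CPU clock.  The function name and the sample bootargs line are illustrative only.

```shell
#!/bin/sh
# Assumption: the clock is set by a bootarg named mpurate. Check your own
# boot script before relying on that name.
set_mpurate() {
    # Reads boot-script text on stdin, writes it back with the new rate.
    sed "s/mpurate=[0-9][0-9]*/mpurate=$1/"
}

# Illustrative bootargs line, not copied from a real image.
sample='setenv bootargs console=ttyO2,115200n8 mpurate=1000 root=/dev/mmcblk0p2 rw'
printf '%s\n' "$sample" | set_mpurate 800
```

If the board boots from a compiled boot.scr, the text source usually has to be rebuilt with U-Boot's mkimage after the edit; that step is not shown here.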

One thing that must be done if you are trying to interface with a serial port with Ubuntu 11 – change the parameters for the FTDI device you use.  For some reason, the probe of the USB serial device (in this case an Arduino Nano) came out wrong.  I could get an Arduino sketch uploaded once after choosing /dev/ttyUSB0, but then it stopped working.  It repeatedly said there was no such device.  Well there wasn’t – nothing resembling that in the /dev folder.  What had to be done is a modprobe of the driver with the proper hex numbers.  If you do an ‘lsusb’, you will get all your devices listed on the USB bus.  You need to pay attention to the vendor ID and product ID.  Most likely the vendor ID is correct.  The product ID for me was off.  I had a 6001 for my product ID and I think dmesg reported a different number.  Regardless, it is a good idea to run modprobe again with the correct numbers.  Mine was:

sudo modprobe ftdi_sio vendor=0x0403 product=0x6001

After that, I could communicate with the Arduino and there was a ttyUSB0 entry in the /dev folder.  Now I just have to work on my flowchart and coding before I start piecing things together.
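To avoid typing the hex numbers by hand, the modprobe line can be built from lsusb output.  A minimal sketch follows: ftdi_modprobe_cmd is a hypothetical helper, the sample lsusb line is illustrative rather than captured from my board, and it assumes the kernel's ftdi_sio module still accepts the vendor= and product= parameters used above.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not run) the modprobe command from one
# line of lsusb output.
ftdi_modprobe_cmd() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "ID") {
                split($(i + 1), id, ":")
                printf "sudo modprobe ftdi_sio vendor=0x%s product=0x%s\n", id[1], id[2]
                exit
            }
        }
    }'
}

# Illustrative lsusb line for an FTDI adapter; yours will differ.
sample='Bus 001 Device 004: ID 0403:6001 Future Technology Devices International, Ltd FT232 Serial (UART) IC'
ftdi_modprobe_cmd "$sample"

# On the real board, something like:
#   ftdi_modprobe_cmd "$(lsusb -d 0403:)"
```

The helper only prints the command, so the numbers can be compared with what dmesg reports before anything is loaded.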

Posted under Operating Systems,Robotics

Beagleboard Ubuntu corrupted

I don’t know how it happened, but the Ubuntu 10.10 image and partitions I had on the Beagleboard got corrupted.  This used to work flawlessly.  The little micro-SD card has been sitting in the slot of the board for about 9 months without being used.  I wonder if static from the cables still attached to the Beagleboard caused the corruption.  I booted and got all kinds of I/O errors.

I am now putting an Ubuntu 11.04 netbook image on another SD card and will try that. Perhaps during non-use, I should pull the flash cards out!

Posted under Uncategorized