Images Missing | Day 2
Here's an update on yesterday's hardware failure. The short answer is that we're still in the middle of recovering the images, and it's taking longer than expected. New dot images are displaying properly, but older dots are still missing their thumbnails.
Two lessons for us: a) you can't have too many backup strategies, and b) you should test your backup processes to make sure they work.
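To make point b) concrete, here is a minimal sketch of one way to test a file backup: walk the live directory and confirm every file exists in the backup with identical contents. The function names (`file_digest`, `verify_backup`) and the directory layout are hypothetical, chosen for illustration only, and the sketch assumes both copies are reachable as local paths.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Compare every file under source_dir with its copy under backup_dir.

    Returns (missing, mismatched): relative paths that are absent from the
    backup, and relative paths whose contents differ. Hypothetical helper.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    missing, mismatched = [], []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            mismatched.append(rel.as_posix())
    return missing, mismatched
```

Run on a schedule, a check like this reports when a backup has gone stale or silently stopped working. It only looks for files that the source has and the backup lacks, so extra files in the backup are ignored.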
The machine that failed in our data center was the master for our thumbnail images. This includes user profile images as well. We used to have a partial backup of these images, but we lost that machine in a separate failure earlier this year.
While we have extensive backup systems in place for user and dot databases, we rationalized that a) this machine had dual drives (a RAID 1 disk array) and b) the images were recoverable from the Internet (since this is "just a thumbnail cache"). What we failed to consider is that the one and only copy of our user profile pictures was also stored on this same machine. It's a fairly large dataset of about 1 million images (30+ GB).
We did not have a current backup for this data. Since the machine died, we've been wrestling with migrating the disks to another machine. Part of our trouble is our inexperience with the vagaries of our hardware RAID controller. The effort has also involved some high-speed runs to Fry's Electronics for needed parts.
We're currently combining several recovery strategies: restoring data from an image backup we made in January, re-thumbnailing images directly from their original Internet locations, and using a data recovery tool to recover files from our RAID 1 drives. We also finally got hold of AMCC (manufacturer of our 3ware RAID controller) today and now better understand how to migrate the drives. (BTW, I rate their product support as excellent: their support engineer was very knowledgeable and helpful.)
I'll update you again tomorrow on our progress. Thanks for your patience, and we apologize for any inconvenience this partial site outage is causing you.