My Linux-based VarLinux.org experienced a disk crash recently. Prior to this, I had equipped my server with two IBM model DTLA 46GB ATA100 hard drives with SMART (Self-Monitoring Analysis and Reporting Technology); I had planned to use one for the live server and the other as a bootable backup of the system, which I had intended to update nightly using scheduled Linux shell scripts. Unfortunately, my brain is equipped with a procrastination feature called DUMB (Disasters Usually Motivate Backups), so I never wrote those scripts.
I know many people who practically worship these very same IBM drives, but I've experienced three failures so far with this model. By the time the second drive acted up, I had found out I could use Linux to create a bootable floppy with an IBM software utility called DFT (Drive Fitness Test), which restored the ailing drive well enough that it passed even the most strenuous tests. Everything had run smoothly for months when suddenly the main drive started reporting so many bad sectors on one partition that it refused to let me mount that partition. Luckily, the partition contained only the PHP code for the Web site, not the actual articles or user data.
I had been under the impression that most IDE drives automatically remap data as bad sectors develop. Unfortunately, SMART is only smart about monitoring and reporting problems. Even the DFT is too stupid to repair the drive without wiping out all your data.
I could use the utility to wipe and fix the drive, but I wanted to recover some data first. I couldn't mount the partition, so the prospect of recovering anything looked dim. I looked around for a Reiserfs utility that might help me recover something from that partition without having to mount it. Unfortunately, as good as Reiserfs may be, the maintenance utilities are terribly incomplete and poorly documented. Fortunately, Reiserfs is open source, which is what saved me. I scanned through the source code for the debugreiserfs program and was able to find some undocumented features in it that allowed me to recover some of the data from the faulty partition without having to mount it.
Once I had enough of a server to bring it back online, I created a mirror of it on the second drive. I swapped the master/slave jumpers on the drives to boot the good drive first, and then I planned to use DFT to wipe out the data on the bad drive to fix it. But before I took the leap, I tried mounting that bad partition again - just for kicks. Not only did it mount properly, but all the data was there in perfect condition. Needless to say, I restored the previously lost data. And you can bet that I wrote those backup scripts immediately.
But that isn't the whole story.
Only a few Linux files are drive-specific, so you don't have to change much to boot from a mirrored drive. Once I had confirmed the mirror worked, I re-edited those files to let me swap the drives. I powered down, switched the master/slave jumpers on my two drives, and booted up the server. The new VarLinux.org was online and running fine.
Or so I had thought. As luck would have it, I published a controversial story on VarLinux.org shortly afterward. The story was picked up by the popular Slashdot site (http://slashdot.org), and page views started rolling in. What is commonly known as the "Slashdot effect" has brought down many a Web server, but I thought my wimpy DSL line would be enough of a bottleneck to prevent VarLinux.org from being overloaded by page hits. But the server froze.
I rebooted, logged in, and ran the "top" command to see what was happening. As the Web server requests mounted up, they filled all available RAM. The server seized up without writing anything to swap.
I had created a swap partition on the new drive, but I never initialised it. I hadn't noticed before because the server didn't need any swap space until I got "Slashdotted".
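A check like the one I should have run can be sketched in a few lines of Python. This is only an illustration, not what I actually used: it assumes a Linux system that exposes active swap areas in /proc/swaps (one header line, then one line per area with the size in kilobytes in the third column), and the function name active_swap_kb is my own invention.

```python
def active_swap_kb(proc_swaps_text):
    """Return the total size, in KB, of the swap areas listed in
    /proc/swaps-style text. Returns 0 if no swap area is active."""
    total = 0
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; each later line is one swap area.
    for line in lines[1:]:
        fields = line.split()
        # Assumed layout: Filename, Type, Size, Used, Priority.
        if len(fields) >= 3 and fields[2].isdigit():
            total += int(fields[2])
    return total


if __name__ == "__main__":
    try:
        with open("/proc/swaps") as f:
            size = active_swap_kb(f.read())
    except OSError:
        # Not a Linux system, or /proc is unavailable.
        size = None

    if size is None:
        print("Could not read /proc/swaps")
    elif size == 0:
        print("WARNING: no active swap space")
    else:
        print("Active swap: %d KB" % size)
```

If it reports no active swap, the usual remedy is to initialise the partition with mkswap and enable it with swapon. Running a check like this from a boot script would have flagged my mistake long before the traffic did.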
Once I initialised the swap partition, the system handled the load, but just barely. It was usually too busy swapping to respond to any commands that I issued. I had fallen victim to the now infamous virtual memory performance characteristics of the latest 2.4 Linux kernels. The Linux kernel developers are currently working diligently on this issue, but I bypassed it for only $100. All I needed was another 256MB of RAM to move the bottleneck from disk swapping to the DSL line.
Moral: initialise your swap space; keep an eye on the new kernels; and buy enough RAM - it's cheap, and you can never have too much, even if you run Linux.