Thursday, November 16, 2006

Datacenter Fiascos

I worked for a very large company from 2001 to 2003 as a mainframe operator and midrange administrator. The company datacenter consisted of multiple mainframes, roughly a thousand servers, a ridiculous amount of disk space, multiple tape silos, and 24/7 monitoring. Our job (98% of the time, anyways) was to drool on our keyboards in twelve-hour shifts while we waited for some key system to blow up.

On January 25th, 2003, another system administrator from our Windows group contacted the data center regarding a worm that was spreading like wildfire through the Internet. The worm was named Sapphire, although it came to be known as “Slammer” because it completely saturated local networks while scanning for other hosts to infiltrate, rendering those networks completely unresponsive. The worm propagated via a buffer overflow in Microsoft's “SQL Server” database software. The exploit had been well documented: a Microsoft security bulletin outlined the overflow in July of 2002, and a patch was issued shortly thereafter.
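Part of what made Slammer so vicious was its propagation strategy: it simply fired copies of itself at random IP addresses as fast as the network would carry them. A toy simulation (with made-up numbers, nothing like the worm's real parameters or the real Internet's size) shows why random scanning produces that kind of explosive growth:

```python
import random

def simulate_spread(address_space=10_000, vulnerable=100,
                    probes_per_tick=200, ticks=12, seed=7):
    """Toy model of a random-scanning worm. Addresses 0..vulnerable-1
    are vulnerable; address 0 starts out infected. Each tick, every
    infected host fires probes at uniformly random addresses, and any
    probe that lands on a vulnerable host infects it."""
    rng = random.Random(seed)
    infected = {0}
    history = [len(infected)]           # infection count per tick
    for _ in range(ticks):
        newly_hit = set()
        for _host in range(len(infected)):
            for _probe in range(probes_per_tick):
                target = rng.randrange(address_space)
                if target < vulnerable:
                    newly_hit.add(target)
        infected |= newly_hit
        history.append(len(infected))
    return history

if __name__ == "__main__":
    print(simulate_spread())
```

Each newly infected host immediately becomes another scanner, so the infection count compounds every tick until the vulnerable population is exhausted; meanwhile every one of those probes is traffic on the wire, which is exactly why our networks ground to a halt.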

Checking our servers in the data center, I determined that our systems had not been patched (almost six months after the fact, I might add) and were completely vulnerable. They had not been compromised only because a substantial firewall stood between the datacenter and general network traffic. I strongly advised my coworker to patch the vulnerability as soon as possible.

Fast forward six months: the entire datacenter network crashed spectacularly after Slammer somehow found its way inside the internal network. All IP traffic inside the datacenter was brought to a screeching halt as the worm completely saturated network resources. We could not even log in to the Windows servers to shut them down; the only course of action at that point was to walk through the entire datacenter and unplug every computer running SQL Server. Several “mission critical” systems were offline for hours, which cost a dollar figure I dare not mention. The source turned out to be a lone developer who had logged into the internal network via a VPN connection. It was truly an “OMG WTF!?” sort of moment, to use the parlance of our times. We all had to wipe the spittle off of our keyboards.

It is tempting to point a finger at Microsoft, but realistically that's not valid. The issue had been well documented a year prior, so there is little to blame beyond our own inaction. Had our Windows administrators not been in a state of narcolepsy, this would have been a non-issue. Furthermore, the internal network had insufficient security; just because the traffic was internal did not mean it should have been allowed to travel without scrutiny. Had internal company routers been set up to block extraneous network traffic on several key UDP ports, the entire issue would have been mitigated.
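For the record, Slammer spread over UDP port 1434, SQL Server's resolution service. A single filtering rule at internal segment boundaries, something along these lines (iptables syntax here purely as an illustration of the kind of rule we lacked, not the actual gear we ran), would have contained it:

```shell
# Drop SQL Server resolution-service traffic crossing this router.
# Ordinary SQL clients talk to the database over TCP 1433, so blocking
# UDP 1434 between internal segments costs almost nothing.
iptables -A FORWARD -p udp --dport 1434 -j DROP
```

The broader point stands regardless of the tooling: an internal network segmented with even minimal egress and transit filtering turns a worm outbreak into a single-subnet nuisance instead of a datacenter-wide outage.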

It is equally tempting to fault the people who wrote the worm. The truth, however, is that they exploited little more than complete incompetence.