Learn from my mistakes. Just a little planning can go a long way toward averting a security upgrade disaster.
Experience is the best teacher. Or that's what I am trying to convince myself of after living through an upgrade from hell. It was a crucial piece of security hardware--our firewall--so critical that if it went down in flames, I was going down with it. Mission accomplished? Not quite. What should have been an ordinary maintenance window became a lesson in what not to do.
At first, everything appeared fine; traffic was going into the firewall, and counters showed that firewall rules were being used. But it wasn't long before our Blackberries began buzzing with news that something was wrong--services that rely on the backup site were not working. After three frustrating hours, we finally worked out the problem, but not before management knew that our failover site had been out of commission. As it turned out, an unknown bug in the vendor's firewall code was the culprit, but that didn't diminish management's embarrassment over the outage.
With better planning before undertaking the upgrade, we could have navigated safely around the problem until we diagnosed it, without an outage. Here are our lessons:
Lesson 1: Never assume. Before the upgrade, I should have verified my assumptions about the firewall system and validated that we had an accurate configuration baseline. For instance, the firewalls are in clusters, but prior to starting the upgrade, we did not verify that both members of the cluster were actually synchronizing. If I had required that we regularly test the failover capabilities from one side of the cluster to the other, we would have noticed this flaw before it became a crucial issue during the maintenance window.
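One cheap way to check the "are both cluster members really in sync?" assumption is to export the configuration from each member and compare the two. The following is a minimal sketch, assuming each member's configuration can be exported as plain text. The function names and the sample rule lines are hypothetical and are not taken from any vendor's syntax. Matching configurations do not prove that failover works, so this supplements a real failover test rather than replacing it.

```python
import difflib


def normalize(config_text):
    """Drop blank lines and trailing whitespace so cosmetic differences are ignored."""
    return [line.rstrip() for line in config_text.splitlines() if line.strip()]


def config_drift(primary_text, secondary_text):
    """Return the lines that differ between two exported configurations.

    Lines starting with "- " exist only on the primary member,
    lines starting with "+ " exist only on the secondary member.
    An empty list means the two exports match.
    """
    diff = difflib.ndiff(normalize(primary_text), normalize(secondary_text))
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical exports; real ones would be read from files pulled off each member.
    primary = "rule 1 allow web\nrule 2 deny telnet\n"
    secondary = "rule 1 allow web\nrule 3 deny telnet\n"
    for line in config_drift(primary, secondary):
        print(line)
```

Running a comparison like this on a schedule, and not only on the night of the upgrade, is what turns an assumption into a baseline.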
Lesson 2: Have a testing regimen. Because we did not accurately baseline what was in the field, we did not see that this upgrade was a bigger jump than what we analyzed in the lab. Nor had we worked out what constituted a proper test, so we did not have test cases that matched the field conditions. Instead of just checking to see if the upgrade took, we needed to test specific services from one end of the system's boundaries to the other. Defining what services to examine and what constituted the success or failure of a test helped us discover what was afflicting the upgrade: a bug in the anti-DoS options in Juniper Networks' NetScreen ScreenOS 5.0. During our do-over, we knew the expected outcomes for our test regimen, so even when we hit some sporadic firewall failures, we were able to isolate the nature of the problem more quickly.
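A test regimen of this kind can be written down as data: which service, reached through the firewall, and what outcome counts as a pass. The sketch below is one possible shape for that, assuming a plain TCP connection attempt is an adequate probe. The `ServiceCheck` type, the host names, and the ports are hypothetical examples. A real regimen would probably also exercise the application behind each port.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCheck:
    name: str
    host: str
    port: int
    expect_open: bool = True  # some ports are supposed to stay blocked


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_regimen(checks, probe=tcp_probe):
    """Return the names of checks whose observed result differs from the expected one."""
    return [c.name for c in checks if probe(c.host, c.port) != c.expect_open]


if __name__ == "__main__":
    # Hypothetical services on the far side of the firewall.
    checks = [
        ServiceCheck("backup-site web", "backup.example.internal", 443),
        ServiceCheck("telnet stays blocked", "backup.example.internal", 23, expect_open=False),
    ]
    failures = run_regimen(checks)
    print("FAILED: " + ", ".join(failures) if failures else "all checks passed")
```

Because each check records its expected outcome, the same list can be run before the upgrade to establish a baseline and again afterward to see exactly which services changed behavior.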
Lesson 3: Have a solid maintenance process, including a back-out plan. Our past success in upgrading our firewall system masked our lack of a well-designed workflow. We had assumed success and not adequately planned for failure. When failure occurred, we were caught off guard. We hadn't worked out how long to continue troubleshooting before deciding that we had to back out. For the retry, I broke out a copy of Microsoft's drawing program Visio and created a logical flow of the processes, including pre-determined fallbacks for each critical juncture. Even though we hit more snags, we were able to safely diagnose the problem after spending several hours on the phone with Juniper's support staff. We couldn't have done that unless we had a plan for running another firewall on the old code during the diagnosis.
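The "how long do we keep troubleshooting?" question is easier to answer before the window opens than at 2 a.m. in the middle of it. As an illustration only, the rule can be reduced to two pre-agreed numbers: a troubleshooting limit and the time a rollback takes. The function name, parameters, and the sample times below are hypothetical and are not the workflow from our Visio diagram.

```python
from datetime import datetime, timedelta


def should_back_out(now, failure_detected, window_end,
                    troubleshoot_limit, rollback_duration):
    """Decide whether to stop troubleshooting and start the rollback.

    Back out when either the agreed troubleshooting limit has been used up,
    or the time left in the maintenance window is no longer than the time
    the rollback itself needs.
    """
    limit_reached = now - failure_detected >= troubleshoot_limit
    window_too_short = window_end - now <= rollback_duration
    return limit_reached or window_too_short


if __name__ == "__main__":
    # Hypothetical maintenance window: failure seen at 01:00, window closes at 04:00.
    detected = datetime(2024, 1, 1, 1, 0)
    window_end = datetime(2024, 1, 1, 4, 0)
    decision = should_back_out(
        now=datetime(2024, 1, 1, 1, 30),
        failure_detected=detected,
        window_end=window_end,
        troubleshoot_limit=timedelta(minutes=45),
        rollback_duration=timedelta(minutes=30),
    )
    print("back out" if decision else "keep troubleshooting")
```

The point is not the arithmetic but that the numbers are agreed on in advance, so the decision to back out is not argued over while the outage clock is running.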
So do as I say, not as I do. Planning will keep you from suffering similar consequences.