Sometimes we have to make releases that we know are going to fail. This usually means that we’re looking at a change so substantial that there’s no effective way to test and ensure that everything is set up right the first time. In an ideal world, all of the software distribution and configuration management would be automated, but many systems I work with are not part of the ideal world: they contain thousands of lines of configuration and hundreds of interconnections, and there’s never enough time to complete testing. I’ve mentioned all of this before.
To make matters worse, these sorts of releases aren’t always driven by the business. Sometimes there are major process improvements that we want to implement from the support side, and these are a hard sell to the business. Telling them that a change will increase reliability doesn’t mean much if somebody is always staying up and making the system work, such that the “apparent reliability” is already quite high.
But as I was saying, sometimes these large architectural changes are important to implement. For example, transitioning from a system where job distribution is controlled by 180 scheduled tasks on an array of machines to a centralized distribution system with a single point of initiation, control, and termination (I know … centralized and distribution don’t seem to make sense in the same sentence, but let’s just take this for granted) is a big gain for being able to manage an environment. Implementing a change like this is not trivial. To start, it means shutting off 180 scheduled tasks on individual machines, with no automated means of doing so. If so much as one of them is not shut off, the whole thing falls apart. Then there’s the problem of all of the reports that need to be changed to accommodate changes to the distribution scheduling, infrastructure, and so forth. We did a migration like this recently, and had maybe an 85% success rate. The natural “correct” response is to roll everything back, fine-tune the changes, and re-deploy on another weekend.
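Since one missed scheduled task is enough to break the cutover, a check that refuses to proceed until every task is accounted for and disabled would help. What follows is only a minimal sketch in Python, assuming an inventory of tasks has been collected somehow (by hand or by script). The `ScheduledTask` record, the host names, and the job names are all hypothetical and not taken from the actual environment.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScheduledTask:
    host: str      # machine the task lives on (hypothetical naming)
    name: str      # task name on that machine (hypothetical naming)
    enabled: bool  # True if the task would still fire


def still_enabled(tasks: Iterable[ScheduledTask]) -> List[ScheduledTask]:
    """Return every task that has not been shut off yet."""
    return [t for t in tasks if t.enabled]


def ready_for_cutover(tasks: Iterable[ScheduledTask], expected_count: int = 180) -> bool:
    """True only if the inventory is complete and nothing is still enabled."""
    tasks = list(tasks)
    return len(tasks) == expected_count and not still_enabled(tasks)


# Made-up inventory: 18 machines x 10 tasks, one of which was missed.
inventory = [
    ScheduledTask(host=f"host{h:02d}", name=f"job{j:02d}", enabled=False)
    for h in range(18)
    for j in range(10)
]
inventory[42] = ScheduledTask(host="host04", name="job02", enabled=True)

for task in still_enabled(inventory):
    print(f"STILL ENABLED: {task.host} / {task.name}")
print("ready for cutover:", ready_for_cutover(inventory))
```

The count check matters as much as the enabled check: a task that never made it into the inventory is just as dangerous as one that was left running.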
To make things more interesting, the guy who is responsible for most of the development on the system to which things are being migrated is in Taiwan for a month, with no phone. I know, in a proper development team, this wouldn’t matter, because knowledge would be shared and so forth … but the development team with which I work tends to be pretty atomic. While I recognize the power of the individual, I don’t think it’s a healthy approach for a large business environment.
The problem is that the changes took about 10 hours straight the first time around, and it would take another 6 hours to back them out. On the other hand, it would take 2 hours to fix things for good, and then we get all of the performance and usability gains, which is a big bonus. By the book, just fixing it in place isn’t the right thing to do, but it seems to be in the best interest of the system, and drawing that line is a little tricky. Further, because the release isn’t a business improvement, it has to be sponsored by the support side, and support has to take it in the ass in terms of taking responsibility for anything that breaks as a result. But, at the end of the day, sometimes you have to take it in the ass in the interest of system betterment.
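For what it’s worth, the arithmetic behind that call can be written down in a few lines. This is a rough sketch only: the 10, 6, and 2 hour figures come from above, while the assumption that a second deployment would take about as long as the first is mine, and none of this accounts for the risk of the fix itself going wrong.

```python
DEPLOY_HOURS = 10   # how long the first deployment took
BACKOUT_HOURS = 6   # estimated time to roll everything back
FIX_HOURS = 2       # estimated time to fix the failures in place

# Assumption: re-deploying on another weekend costs about the same as the first attempt.
REDEPLOY_HOURS = DEPLOY_HOURS


def rollback_and_redeploy_hours(backout: int, redeploy: int) -> int:
    """Remaining effort if we back out now and try again later."""
    return backout + redeploy


def fix_forward_hours(fix: int) -> int:
    """Remaining effort if we leave the change in and repair what broke."""
    return fix


rollback = rollback_and_redeploy_hours(BACKOUT_HOURS, REDEPLOY_HOURS)
forward = fix_forward_hours(FIX_HOURS)
print(f"roll back and redeploy: {rollback} hours")
print(f"fix forward:            {forward} hours")
print(f"difference:             {rollback - forward} hours")
```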