So you’re working on Project Pluto. Call it that because that was a really neat project. Call it that because it’s a really scary project. If you have to pick one, read the second link. Point being, this dichotomy is an important characteristic of all important projects. Neat and scary!
Project Pluto has a problem. It’s too slow. Everything looks great, on schedule, the world is your oyster. Except Project Pluto doesn’t perform well. Ted Merkle (Project Pluto’s director) comes to you and says “Pluto is too slow; it takes too long to run its calculations. The engineers have fast workstations that mask this fact, but with our normal computers, it’s too slow. You need to do something.”
So, you put on your management cap and start scrambling engineers. You tell them “Pluto is too slow on real hardware, you’ve been lulled into a false sense of security because your workstations are fast.” You tell them this because it makes sense. The reason you have fast workstations is that software development on normal computers is horribly slow, especially with all of the virus scanners, disk encryptors, automatic backup agents, policy audits, software updaters, and so forth that form the modern corporate computer ecosystem.
The engineers follow your orders. They instrument everything. They use profilers. They figure out where the big slowdown is coming from. They start proving that the performance characteristics of Project Pluto do indeed indicate that much of the calculation performance is bounded by hardware performance. Everything adds up. It doesn’t just make sense, the data actually supports it, and it’s clear what needs to be done.
The problem is there isn’t any low-hanging fruit. There’s a thousand little inefficiencies.
You say “make a thousand little improvements!” The engineers make a thousand little improvements. They put in long hours. The weeks drag on. They prod and poke and squeeze every ounce of performance out of Pluto that they can, but each improvement is marginal. Single-digit stuff. Weeks go by. Pluto is ten times as good as it was weeks ago, but … the calculations are still too slow on Mr. Merkle’s computer.
You build a test environment just like Mr. Merkle’s. You don’t run anything special on it; it’s just the same hardware to keep variables to a minimum. Performance on it isn’t great. You set some performance goals, and start testing hypotheses. You start tuning Pluto to meet those performance goals. Hypotheses are proven and disproven, and more improvements are made. You’re getting close to your goals. You’re happy. You’re succeeding. You weren’t crazy all along, nuclear cruise missiles really are the answer. It’s all going to be okay.
You present your findings to director Merkle, and discover that they’ve already tested the newest version of Pluto. It’s slow. You suck. It’s even slower than before.
You’ve controlled the environment and proven that the process works; the software is fast. It must be something hostile that didn’t manifest in the lab environment. You try to figure out the difference between the two environments. Nothing makes sense. You started the day with the end in sight. You end the day starting to formulate contingency plans and control fallout.
Naturally, there are other awesome and scary projects, like Project Orion. The engineering director on Project Orion is talking shop with Mr. Merkle and the discussion turns to Project Pluto. “That’s weird, what happens if you try to run the thing without any plutonium in it?” That’s stupid, of course. The problem is that the computers are slow to calculate results, duh. Think about it. Would Project Pluto really work without PLUTOnium?
Luckily, at this point anybody will try anything, and it’s a good thing: Project Orion’s director did your job. The calculations are lightning fast in this silly nonsensical scenario. The specific test was silly because it described an atypical use case of the project. Yet, it made something very clear very quickly: The problem has nothing to do with the performance of Mr. Merkle’s computer. The entirety of the delay is caused by an upstream system that hooks into the project. As soon as that system is removed, the problem goes away. The fix for that may be hard, but it’s not something that had any chance of being fixed by making Project Pluto itself faster or better.
You’re an idiot and you suck. You forgot to think like an engineer. You thought you could just manage and direct.
Here’s what happened.
Somebody outside of engineering came to you with two things: A problem and a reason for that problem. You did the “right thing” and directed your team to put all of their energy into looking into this reason and fixing it. Things didn’t get better, so you doubled down and invested in proving that your efforts were treating the reason.
Then a real fucking engineer came along and casually did what you should have.
The problem was real and needed to be fixed; the reason was a distraction. You fell for it.
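If you want the no-plutonium test in code form, here is a rough sketch. Everything in it is made up for illustration: `slow_upstream_lookup` stands in for whatever upstream system hooks into your project (the sleep fakes its delay), `no_upstream` is the stub that takes it out, and `run_calculation` is a placeholder for the real work. The point is only the shape of the experiment: run the same calculation with and without the dependency, and compare the clocks before you tune anything.

```python
import time

def slow_upstream_lookup(x):
    """Hypothetical upstream system; the sleep simulates its delay."""
    time.sleep(0.05)
    return x

def no_upstream(x):
    """Stub that removes the upstream system from the picture."""
    return x

def run_calculation(values, upstream):
    """Placeholder calculation: local arithmetic plus one upstream call per value."""
    return sum(upstream(v) * v for v in values)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

values = list(range(10))
with_result, with_upstream_s = timed(run_calculation, values, slow_upstream_lookup)
without_result, without_upstream_s = timed(run_calculation, values, no_upstream)

print(f"with upstream:    {with_upstream_s:.3f}s")
print(f"without upstream: {without_upstream_s:.3f}s")
```

If the second number is tiny and the first one isn't, no amount of tuning inside `run_calculation` was ever going to help.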