Aaron N. Tubbs

Dragon chaser.

About seven years ago, I went on a cruise. One day, we received notice that our balcony railing would be refinished that day while we were ashore. The railing was in pretty good shape. I was irritated. From my vantage point, it could have waited until the next time the boat was in for routine maintenance or our stateroom was not booked.

Cruise ships don’t work this way. An empty room or ship that’s not sailing means lost revenue. Instead, a cruise ship is always undergoing routine maintenance. An attempt is made to conceal much of this work, but in some cases it’s better to have a WET PAINT placard.

I should not have been irritated. This sort of guest-visible maintenance is a positive sign. It’s a signal that the operator cares not just about acquiring a nice boat. They also care about keeping it a nice boat for years to come. If the operator is taking the time to make sure the railing is always glossy and smooth, I am confident the kitchens are spotless and the waste management system is functioning.

I am, of course, appealing to the same intuition as the Broken Windows Theory. I have a tremendous amount of anecdotal resonance with this theory.

Painting the Halls

When we moved into our apartment building, we were fascinated to find out that they employed a full-time porter. The porter’s job, among other things, was to walk the halls each day. While doing so, they would patch scuffs and touch up paint. If they discovered dirty carpet, they would ensure it was shampooed. They would check the trash rooms and chutes and make sure everything was spotless.

It was, we reflected, as if we were living in a hotel.

A few months after we moved in, there were some management changes, and the porter disappeared. The walls are no longer regularly patched and painted. The daily cleaning includes a vacuuming, but stains and spills just remain on the carpet. The trash rooms (which hold only trash chutes, so there should be nothing in them) are full of junk, empty boxes, and spills. The halls are dim as the light bulbs wear out without replacement.

Numerous other changes have happened in parallel. Dog excrement is likely to be found in the parking garage. Dumpsters are often overflowing. Communications aren’t reliably sent out. The concierge often leaves early on short notice or overstays their meal breaks by an hour or more. Dozens of similar first world problems have developed that were not present when we moved in.

In summary: Get off my lawn, it was better when we were young, and all that jazz.

Now, I’m not attempting to suggest that the dingy, dimly lit halls are a causal event as much as a correlated one. The broken-window intuition is appealing, however. Since dog owners are no longer being scolded or fined for failing to pick up their excrement from the dog walk, what’s the incremental cost of allowing the dog to shit in the parking garage without cleaning it up, either?

Back to the paint. Eventually, the hallway walls got so dingy that something had to be done. A commercial painter was brought in to completely repaint the halls. They did a sloppy job, but things look better than they have for years. There’s a positive effect to this, but the carpet is still dirty and the lights are still dim and the parking garage is still littered with dog shit. Things got so bad that it was obvious something had to be done. However, unless a culture of routine preventative maintenance and cleaning is adopted, the same solution will be necessary in the future: Ignore it until it’s so bad that it cannot be ignored.

The threshold at which people notice the problems is much lower than the threshold at which something must be done.

My position here is somewhat allegorical: This complete repainting of the walls is not unlike hitting the EMERGENCY STOP button in software development. Hold the release train, freeze the code, no new enhancements! Engineers may only fix bugs for the next month. Once things are under control, it’s back to business as usual and all is well!

Except, all is never well. Inevitably not everybody will treat the freeze or stoppage the same way. And, if cultural changes are not made in the wake of the stoppage, the only solution down the road will be another stoppage. The taxes always have to be paid.

The Conway Connection

The Pragmatic Programmer brings up Broken Windows theory:

We’ve seen clean, functional systems deteriorate pretty quickly once windows start breaking. There are other factors that can contribute to software rot, and we’ll touch on some of them elsewhere, but neglect accelerates the rot faster than any other factor.

You may be thinking that no one has the time to go around cleaning up all the broken glass of a project. If you continue to think like that, then you better plan on getting a dumpster, or moving to another neighborhood. Don’t let entropy win.

The authors go on to make an interesting comparison to the boiling frog metaphor:

Note that the frog’s problem is different from the broken windows issue discussed in Section 2. In the Broken Window Theory, people lose the will to fight entropy because they perceive that no one else cares. The frog just doesn’t notice the change.

My favorite adage to extend and overuse is Conway’s Law. My central thesis in this post (and indeed in other ones) is to invoke the logical fallacy of affirming the consequent and extrapolate beyond what is reasonable. This is of course bad practice, but I have two excuses:

  • Conway’s law isn’t even a law! It’s just an adage. Abundant anecdotal experience suggests it’s evident.
  • It’s my blog, so I can pluralize anecdotes into data and conflate correlation and causality all day long.

Soap Dispenser Allegory

Now that that’s out of the way, my converse extrapolation is that everything I’ve talked about thus far is related. Here’s a mental exercise.

Let’s think about a soap dispenser in a bathroom. Maybe there are two or three for redundancy. If one of those soap dispensers is empty, it’s not a big deal. Just slide to the adjacent sink for a quick squirt, and all is well. This might mean there’s some sink contention or some awkward banging on the handle to no effect, but it works out. The next time the janitor comes in, they check all the dispensers and realize one is empty, and the problem’s resolved.

Or, the soap dispenser remains empty all day, increasing the contention on an adjacent dispenser. Now it too becomes empty before the janitor shows up and resolves the situation. The contention increases and people start to notice that the dispenser remains empty all day long.

Now, maybe what should happen is that the frustrated people should call the building and ask them to replenish the soap. Instead, what happens is they get pissed off and angry and leave the bathroom with dirty wet hands (more dangerous than dirty dry hands) and a bad mood.

If this happens regularly, week after week, the frustration turns to resignation: It’s just taken as a given that hand-washing isn’t going to be reliable. Some people don’t even attempt it, and those that do don’t always pull it off.

In this allegory, a statistical increase in the unhealthiness and unhappiness in the office is induced as germs spread and people get sick more often. Even if the soap dispensers stop suffering depletion at some point in the future, the problem doesn’t go away: The habits and resignation have already set in. It takes a while for the organization to revert to its old cleaner ways even if the original problem goes away.

There are some odd feedback loops that accelerate the problems here. If the soap depletion means people stop using soap, it runs out less often and the janitor learns that they do not need to check it as often. Thus, when there is an outage, it goes even longer before being replenished.

The problem can be minimized from the start by ensuring the dispensers are replenished as soon as they are empty. The better solution is to top off the dispensers every day. The even better solution is to also test the dispenser when this is done (or notice that one dispenser’s level hasn’t dropped since the last refill). After all, a broken dispenser and an empty dispenser are the same thing from a user standpoint.
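For the software-minded, the janitor's daily round above can be sketched in a few lines of Python. This is only an illustration of the idea, not a real monitoring tool: the `Dispenser` fields, the `low_mark` threshold, and the function names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Dispenser:
    # All fields are hypothetical, chosen only to mirror the allegory.
    name: str
    level_now: float          # fraction full at today's check, 0.0 to 1.0
    level_last_refill: float  # fraction full right after the previous refill
    squirt_works: bool        # result of physically testing the pump


def needs_attention(d: Dispenser, low_mark: float = 0.2) -> list[str]:
    """Return the reasons a dispenser should be serviced (empty if none)."""
    reasons = []
    # Broken and empty look the same to the user, so test the pump too.
    if not d.squirt_works:
        reasons.append("pump test failed")
    # Top off before the dispenser is actually empty.
    if d.level_now <= low_mark:
        reasons.append("running low")
    # A level that has not dropped since the last refill is suspicious:
    # either nobody can get soap out, or nobody is trying any more.
    if d.level_now >= d.level_last_refill:
        reasons.append("level unchanged since last refill")
    return reasons


def daily_round(dispensers: list[Dispenser]) -> dict[str, list[str]]:
    """Check every dispenser, every day, and report only those with problems."""
    report = {}
    for d in dispensers:
        reasons = needs_attention(d)
        if reasons:
            report[d.name] = reasons
    return report
```

The point of the sketch is that the check runs on a schedule rather than waiting for a complaint, and that "nobody is using it" is treated as a signal rather than as good news.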

Extrapolation Time

The soap dispenser has the potential to create entropy (in The Pragmatic Programmer vernacular) in the organization. Hell-bent on Conway and converse-Conway, I contend that entropy in the organization is correlated with entropy in the system. The pile of dead bugs in the corner of the cube farm may be mirrored by a pile of dead (long unaddressed) bugs in the system.

But let’s try to find some more supporting anecdotes. Professional cooks clean as they go:

Although the extra effort of cleaning as you go may first seem like a drag, over time it becomes habit. And isn’t any habit that creates a more peaceful and tidy environment a good one?

By just keeping the workstation clean, the kitchen is always clean, the food prep environment is always sanitary, and the meal is safe, delicious, and attractive. It may not actually be that simple, but restaurants that consistently excel at service and food have constantly immaculate kitchens. Great things can come out of dirty kitchens (I make a mean bag of steamed mixed vegetables, after all), but I haven’t had many experiences where horrible things come out of clean kitchens.

The Phoenix Project is full of allegory that plays into this premise:

Looking around, I find a blank Post-it note on my desk and write in large letters, “DO NOT INSERT LAPTOP UNTIL POWERED ON!!!” and put it on the docking station to avert my next act of time-wasting stupidity.

It would clearly be more productive if the laptop just worked whether docked with the power on or off. The workaround at first seems like a clever hack. Being a hack, however, it papers over the underlying issue and leaves it unresolved.

In the book, the factory is clean and orderly; the organization and processes here function well. By comparison, the organization and systems related to IT are in varying states of neglect, chaos, and disarray. The soap dispenser appears about 20% of the way through the book:

Exasperated, I shout, “Goddamnit, Wes. That was completely preventable! Get one of your junior guys to look at the logs every day for drive failures. Maybe even have him visually inspect the drives and count all the goddamned blinking lights. It’s called preventative maintenance for a reason!”

For those playing along at home, it continues predictably:

Wes says defensively, “Hey, it’s actually a little more complicated than that. We put in the order for replacement drives, but they’ve been stuck in Procurement for weeks. We had to get one of our vendors to give it to us on credit. This wasn’t our fault.”

Later in the text, the topic of preventative maintenance resurfaces:

“Precisely,” Erik says. “Properly elevating preventive work is at the heart of programs like Total Productive Maintenance, which has been embraced by the Lean Community. TPM insists that we do whatever it takes to assure machine availability by elevating maintenance. As one of my senseis would say, ‘Improving daily work is even more important than doing daily work.’ The Third Way is all about ensuring that we’re continually putting tension into the system, so that we’re continually reinforcing habits and improving something.

That might sound a little hand-wavy, so the authors continue.

“Mike Rother says that it almost doesn’t matter what you improve, as long as you’re improving something. Why? Because if you are not improving, entropy guarantees that you are actually getting worse, which ensures that there is no path to zero errors, zero work-related accidents, and zero loss.”

And that ties everything back nicely with the notion of entropy espoused by Hunt and Thomas. Of course, it might also be complete BS. Who knows.