How things fail
Things break on the internet. The question is: is there anyone left to fix them? And how much of the resulting friction causes people to leave you for a competitor?
Friction, like entropy, is ever-increasing
Code is forever. It should keep working. But it requires maintenance. Why?
Because it does not exist in a vacuum.
Operating systems keep changing. Code is connected to databases that continue to update, using connections based on evolving technology. Traffic ebbs and flows. Because the environment keeps evolving, the code needs maintenance: somebody who knows how it works and how to keep it working.
Even though my blog is a fairly simple piece of technology, it failed recently. I complained to my Web guy, whose staff took a few hours to determine that two plugins had become incompatible as they updated. This is as common and ubiquitous as grime: stuff just stops working, and you need an expert to figure out why.
What does failure look like?
An app fails when switching from WiFi to mobile connectivity.
A page takes 25 seconds to load instead of 1.5 seconds.
A feature that works in the app fails in a browser — not all browsers, but just one, say, Safari on a Macintosh.
If the user works at it, there’s usually a workaround. (Have you tried turning it off and turning it on again?)
But if you force that customer to work around the problem for long, they’ll give up. Convenience is no longer a luxury. It’s what everyone expects. If they don’t get it, they’ll find it elsewhere.
Monopolies are exempt. That’s why I stayed on hold with Comcast for half an hour yesterday after three failed calls, calls that fell afoul of some system that wasn’t maintained as it should have been. It’s why I didn’t bolt from Twitter when bookmarks stopped working. But even in a monopoly-driven world, there are competitors coming up with friction-free alternatives. Add enough friction, fail to maintain systems, and people will switch.
Maintenance is boring, but ignoring it is dangerous
Failures of maintenance keep happening, because maintenance doesn’t generate revenue (although the lack of it can cause revenue to leak away).
The Boston transit system known as the MBTA is about to open a long-awaited new branch to the suburb of Medford. And it is awaiting new train cars. But the lack of proper maintenance has made the system slow, forced a big part of it into a monthlong shutdown, and caused a train car full of passengers to catch fire on a bridge over a river. People are excited to use the new branch, but afraid to use the whole system. That’s not good.
My “new” house needed a new roof and will soon need new windows. It’s drafty as heck and heating is expensive. It still works, but unless I invest, it’s not going to work the way I want it to.
This is what’s going to happen at Twitter. All those people who left knew how stuff worked and how to fix it. The work they did didn’t generate revenue. But now that they’re gone, things will take longer to load, and other things will stop functioning properly. There will be workarounds. But eventually, bunches of people will give up.
That’s how things die. Not from an explosive and instantaneous failure. From lack of maintenance.
Thanks for this explanation. I was wondering why, with automatic scaling, a site like Twitter can’t keep working, as if on autopilot.
Automatic scaling is for things in the cloud. Twitter has its own data centers. Their main use of AWS is for CloudFront (i.e., data distribution), not for Twitter’s servers.
True for so many businesses, and in so many parts of our lives, and not just tech as you point out.
Having worked for four different airlines, each of which filed for bankruptcy at least once (not my fault!!) and only one of which still exists, I have firsthand experience of what happens when leadership fails to invest in the business. Given the headline topic of the importance of maintenance, I feel compelled to add that none of these airlines ever skimped on our maintenance or safety training. The carriers failed for various individual reasons, but the common causes were failures to invest in new aircraft, in technology, in their products, in their people and in their marketing (always one of the first two departments targeted for budget cuts when revenue isn’t meeting expectations).
They may not have failed to maintain the planes, but they failed to maintain the business.
The clock is ticking…
So much for the mass outage. The biggest mistake any of us can make is to think we are irreplaceable.