Planning for things to go wrong is important. Even with all the ounces of prevention, there still needs to be some cure to catch what slips through the cracks. When anything breaks, there are two parts: knowing what broke and having a solution to get things working again. I’ll take three examples: a bicycle, an application, and a server.
I got a flat tire, and it took me longer than I’d like to admit to notice it. I probably rode on it for half a mile before I started thinking that the road was bumpier than normal. Once I found out, it wasn’t fun to walk the rest of the way back, but I knew what I had to do so I could get back to riding as soon as possible without causing more damage. I pulled the small rocks and splinters out of the tire, replaced the tube, and thought I’d be OK for a bit. It wasn’t long before I got another flat. I wasn’t surprised, angry, or sad; I’ve accepted that getting a flat is a normal part of riding like I do. I figured it was probably related to tire wear, but I wasn’t sure. I decided to patch the puncture to see if it got worse. When it gradually went flat again and I couldn’t find a leak, I decided to wait for a new tire to ship rather than replace the tube again in vain. I was confident the new tire would fix it, so I rode a spare bike until it came in and I had time to replace it.
After the first incident I had to keep testing and learning to find a solution that would stick. I could have prevented it by changing my tire after the first puncture, but in most cases that would have me replacing tires far more often than necessary to prevent flats from bad tires, at a very high financial cost to myself. I probably could have lessened the damage and extended the lifetime of the tire by not riding on it flat for so long, but once that happened I needed a reactive solution. I normally don’t even bother with patch kits for flats, but I knew that a patch would be a good test for the problem I was having while I was looking at other solutions. I used the tools that I had to expand the solution space I was searching. Eventually I fixed the root cause and went on my merry way, but the entire process had pain points that I’ll eventually encounter again. I’m not so risk-averse that I’d change my behavior so that “it never happens again”; I treated this as an acceptable failure.
My bike was my primary mode of transportation, but I also had two backups: walking home and a slower, heavier bike that’s been more reliable. While I could have handled each step better, I consider this a normal part of being a thrifty bicycle commuter.
When one of my custom applications broke in a new way, I debugged the problem I saw, then added logging around other related problems that I thought might come up. When other problems surfaced, some of that logging was useful and some of it was noise, so I dropped the noisy logs and kept improving the application in other ways. Once I knew what broke the most, I added more layers to report errors and refactored to treat the root causes. The plan wasn’t to completely stop it from ever breaking, but to be able to quickly identify and fix errors when they came up. Eventually all of those improvements added up and it stopped breaking as often, but I never would have made the choices and designs I did up front without the knowledge of what ended up breaking. Being reactive instead of predictive made the application evolve differently. It would have been difficult to pick out the real failure conditions from all of the other possible failures ahead of time.
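As a rough illustration of that kind of error-reporting layer, here is a minimal sketch in Python. It is not the code from the application described above; the names `report_failures`, `failure_counts`, `parse_record`, and the `"import"` component label are all made up for the example. The idea is only to log every failure and keep a running tally so that "what breaks the most" becomes something you can look up instead of guess.

```python
import logging
from collections import Counter
from functools import wraps

logger = logging.getLogger("app.failures")  # hypothetical logger name

# Tally of failures keyed by (component, exception type name).
failure_counts = Counter()


def report_failures(component):
    """Log and count any exception raised by the wrapped function, then re-raise it."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                failure_counts[(component, type(exc).__name__)] += 1
                # logger.exception records the message plus the traceback.
                logger.exception("%s failed in %s", component, func.__name__)
                raise  # reporting only; the caller still decides how to recover
        return wrapper
    return decorator


@report_failures("import")  # hypothetical component label
def parse_record(line):
    """Toy example: parse 'name=123' into ('name', 123)."""
    name, value = line.split("=")
    return name, int(value)


def most_common_failures(n=3):
    """Return the n most frequent (component, exception) pairs seen so far."""
    return failure_counts.most_common(n)
```

The wrapper deliberately re-raises instead of swallowing the error: the goal here is to identify failures quickly, not to hide them. If a particular component turns out to be pure noise, removing its decorator is a one-line change.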
This is a two-parter, because it evolved after I first started writing it.
My CI/CD VM died and wouldn’t boot into anything. I noticed that the disk had filled up before it started having problems, and when I rebooted, it never made it back. I wasn’t experienced in troubleshooting OS failures, so I read a few dozen guides on Windows boot recovery. Eventually I exhausted myself, mixing and matching all the advice I could find without success.
I didn’t have any plan for a VM not booting like this. I had all the data backed up, but couldn’t respond to this as quickly as I would have liked. Instead of fighting to recover and trace the root cause, I quickly accepted defeat, built a new VM, and attached the backed-up data. There weren’t any preventative measures to report whatever went wrong. I wasn’t thinking about the stability of a hastily constructed non-production VM that I inherited from a previous dev. I had enough documentation and experience from troubleshooting build issues that I knew the layout of the machine well, and I honestly thought about starting from scratch just to clean the slate. But that was before I found an application that encrypted some files via the Windows login, and since I couldn’t get the VM to boot, that encrypted data was as good as lost. When setting it up the next time, I made sure to back up the data along with the key so that it wouldn’t happen again.
This VM failure was nearly catastrophic because I needed knowledge of the whole system to functionally fix just one error. This is the sort of error that I now mitigate with more preventative steps, like not using a persistent VM and attempting to restore backups before they are needed. I’m no ops expert, but it gave me great respect for the amount of time a failure like this can take to recover from.
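For the "restore backups before they are needed" step, a trial restore can be as simple as copying the backup into a scratch directory and comparing checksums against the live data. The sketch below is only an illustration under some loud assumptions: the backup is a plain directory copy (a real tool would have its own restore command in place of `shutil.copytree`), and the function names `file_digest`, `build_manifest`, and `verify_restore` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, POSIX style) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def verify_restore(source_dir, backup_dir):
    """Trial-restore backup_dir into a scratch directory and compare it to source_dir.

    Returns a sorted list of relative paths that are missing, extra, or different.
    An empty list means the restored copy matches the source.
    """
    expected = build_manifest(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        restored_root = Path(scratch) / "restored"
        # Assumption: copying the directory stands in for the real restore step.
        shutil.copytree(backup_dir, restored_root)
        restored = build_manifest(restored_root)
    all_paths = expected.keys() | restored.keys()
    return sorted(p for p in all_paths if expected.get(p) != restored.get(p))
```

A check like this would flag a missing key file next to encrypted data as a mismatch, but only if the key lives somewhere under the source directory in the first place. It says nothing about whether the restored machine would actually boot.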
A test server (which was really a repurposed desktop) that had slowly accumulated some useful services had a primary disk failure. A UPS failure caused the box to reboot, and it complained that it couldn’t find the bootloader. I hadn’t seen that failure before, so I just reinstalled GRUB and everything appeared to be fine. The next day I came in to an error message saying the filesystem was in read-only mode, and I knew I had to back up what I could before it became inaccessible. I was able to back up everything except the database of the most-used test service, which was unfortunate.
The preventative measures worked in some sense, but I hadn’t used the first warning to evaluate what would happen if the box did fail completely. The failure was the impetus for promoting some heavily used test services to a better-supported environment. The recovery was less painful than the VM’s because the services were easily restored. Instead of sticking to exactly the setup I had before, I upgraded everything about the test machine and made a project of evaluating which tests were ready for more formal treatment.
Evolving features of ecosystems
All of these failures and recoveries are based around growth. A single failure doesn’t need to change how everything is done, and frequent failures can often be safer than rare ones, if only because more time must be invested into the system during the recovery. A broken system can be an important part of growth. Attempting to avoid unnecessary breakages is good, but taking that too far is as silly as attempting to future-proof something. While a breakage has an immediate negative effect, in the long run it can be good to know what didn’t work and how to fix it. When exploring a new space, ‘fail fast’ is as helpful for business plans as it is for complex technical systems. Knowing up front that something won’t work is far better than having it limp along and then die without any hope of recovery. Multiple frequent simple failure modes are better than a single rare complex failure, but it’s obviously impossible to build a system that never experiences the latter.
The most interesting post-mortems are very tightly scoped and have a mix of process and technical knowledge. The team explains what went wrong, how it wasn’t prevented, how it impacted customers, how it was resolved in the short term, and how to prevent future outages. When the failure is a freak accident that was exacerbated by the complexity of the preventative measures, it shows that attempting to prevent failures can be more failure-prone than the standard. I’ve always wanted a broader picture than the affected systems. A single post-mortem could cause technical or process changes in affected systems based on attitudes towards failures, even if the goal wasn’t to reduce the frequency of failures.