When building a system, you can either take the view that things are going to work, or that they are going to fail.
You need to be defensive when building systems, but you want to avoid overdoing it. When you get overly defensive, you get product out the door later and you raise your total cost of ownership by creating an overly complex, difficult-to-maintain system.
In this month’s column I want to tell you the story about how a team I was part of leaned too far in the defensiveness direction, and what we did to bring it back.
A Messaging System
The team had an architecture that looked something like this:
Messages would come in one end, and transformed messages for specific recipients would come out the other. Simple enough.
Not surprisingly, the legacy system we were replacing had “issues” (mostly around things blowing up). So when it came to replacing it, there were strong opinions about error handling and logging. We started off expecting things to fail.
Well, it didn’t take long before we had so much error handling and logging that it became hard to see (and troubleshoot) what the system was actually doing. We were so afraid of failing that it felt like we forgot what we were building in the first place.
Why were we being so timid? Were we justified in our level of skepticism? Was all this support code and logic really necessary? It felt wrong and it felt strange. Yet we persisted.
Then one day, someone asked:
What if messages came in one end, and the correctly transformed message came out the other? What if we stopped expecting failure? Don’t accept any failures. Just make the bloody thing work. What would that mean to our approach to the system?
What a beautiful, simple, powerful question. Asking ourselves “What if it just worked?” got us looking at the system in a new light.
Instead of expecting it to fail at every step, we now expected it to succeed.
We realized we could get rid of a lot of errors up front by simply not letting them into the system in the first place.
* What if we tightened up validation on messages coming in?
* What if we rejected all messages that didn’t conform and notified production operations instead of trying to handle it ourselves?
If we coded and tested the system right in the first place, we could fix a lot of the core problems that riddled the legacy system instead of expecting and handling them when they happened.
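To make the idea concrete, here is a minimal Python sketch of validating once at the boundary. It is an illustration under assumptions, not the code from the system in this story: the `recipient` and `body` fields, the `RejectedMessage` exception, and the `notify_ops` callback are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class RejectedMessage(Exception):
    """Raised when an incoming message does not conform (hypothetical name)."""


@dataclass(frozen=True)
class Message:
    # Hypothetical fields; a real message would carry whatever the schema requires.
    recipient: str
    body: str


REQUIRED_FIELDS = ("recipient", "body")


def validate(raw: dict) -> Message:
    """Check a raw message once, at the edge, and return a trusted Message."""
    for field in REQUIRED_FIELDS:
        value = raw.get(field)
        if not isinstance(value, str) or not value.strip():
            raise RejectedMessage(f"missing or empty field: {field}")
    return Message(recipient=raw["recipient"], body=raw["body"])


def transform(message: Message) -> str:
    """Core logic expects a valid Message, so it carries no defensive checks."""
    return f"To {message.recipient}: {message.body.upper()}"


def process(raw: dict, notify_ops: Callable[[str], None]) -> Optional[str]:
    """Validate, then transform. Rejections go to operations, not into the system."""
    try:
        message = validate(raw)
    except RejectedMessage as err:
        notify_ops(f"rejected message: {err}")
        return None
    return transform(message)


alerts = []
print(process({"recipient": "billing", "body": "invoice ready"}, alerts.append))
# To billing: INVOICE READY
print(process({"recipient": "billing"}, alerts.append))
# None
print(alerts)
# ['rejected message: missing or empty field: body']
```

The point of the split is that `transform` gets to expect success. All of the suspicion lives in `validate`, and anything that fails there is handed to operations instead of being patched up further down the line.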
I know, it seems obvious now. It is obvious. But somehow it wasn’t so obvious at the time. Yet this simple, obvious shift in perspective profoundly changed how we looked at the system.
Instead of expecting failure, we started to expect success. This simplified and lightened our code base, while leaving just enough checking there to let us know if things went wrong.
So should we stop expecting failure altogether? No, not quite. Our systems are going to fail, and we need to handle those failures. All I am saying is that if we only focus on failure, we can really overengineer and overcomplicate things. That’s expensive, because then we need to carry the baggage of this overengineering around with us forever, making the system harder to maintain and to change.
It’s a balancing act. We do need to expect failure. We just don’t want to overengineer for it to the point where we lose sight of what we’re really doing. Which should be delivering valuable, working software to our customers.
Not wanting to leave you with just a story, I want to offer you two surefire ways I’ve seen teams harden their systems to ensure that they aren’t overly confident before rolling into production.
Work with Real Data
When you throw real, live production data at your system, good things start to happen.
1. You discover where the holes in your data model are.
2. You discover which of your assumptions are valid, and which are wrong.
3. You see which edge cases in the data you missed, and discover that French characters don’t always encode the way you’d expect them to.
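The encoding surprise is easy to reproduce. The short Python sketch below uses a made-up sample value and assumes one common mismatch: the sender writes UTF-8 and the receiver reads the bytes as Latin-1. Other mismatches are possible, and your system may hit a different one.

```python
# Hypothetical sample value; real production data supplies cases like this for free.
original = "Québec"

# The sender encodes the text as UTF-8 ...
wire_bytes = original.encode("utf-8")

# ... but the receiver wrongly assumes Latin-1, so no error is raised
# and the text is silently corrupted.
misread = wire_bytes.decode("latin-1")
print(misread)  # QuÃ©bec

# Decoding with the encoding the sender actually used round-trips cleanly.
print(wire_bytes.decode("utf-8") == original)  # True
```

Test data made up of plain ASCII would pass through both decoders unchanged, which is why this kind of problem tends to show up only once real names and addresses arrive.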
There are few better things that you can do to see how your system is going to handle going live than to throw some real data at it. The other thing you can do is to get something into production.
Get Something into Production
You don’t have to flip the switch and really “go live” (though you obviously will at some point). What I am talking about is getting your system into production before it needs to be there.
This does a couple of things for you:
1. You can throw some of that real data at it and see how it behaves.
2. You can work all the kinks out of your automated build and deploy scripts.
3. You can work out all the network and infrastructure issues.
But best of all, you get into the habit of releasing. The more you push to production, the less scary it is. Do it enough and it will soon be like breathing.
Context Is Everything
Take what I am saying here with a grain of salt. You still have to judge for yourself what level of error handling and defensiveness is right for your application.
But if you keep things simple, and build things so that you expect them to work, you can keep yourself from overengineering a system that doesn’t need it, while making your application easier and simpler to maintain.