Pessimistic Engineering

Published On 2012/09/06 | By Kurt Bales | Rant

I don’t know about you, but I am regularly told “You worry too much” or “You don’t need to worry about that”. Sometimes it’s “What are the chances of that ever happening?”. These are things that I’ve heard from many people over the years and the best I can come up with is:

“That’s what you pay me for! I’m here to think of the worst case scenario and then mitigate against that.”

This is usually followed by confused looks from those around me who do not seem to grasp what I am getting at here.

The way I see it, it is my job to constantly be thinking about worst case scenarios.

  • “What happens if we lose this device/site/cable?”
  • “What happens if all backups are lost?”
  • “What if…”

These sorts of questions are exactly the reason why I fell in love with Networking as a discipline within IT. The very fact that I have the ability to build redundant systems that take serious effort to bring down draws me deeper in (assuming I’m given appropriate budget 😉 ).
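As a rough illustration of the first question in that list, here is a minimal Python sketch that asks "what happens if we lose this device?" of every device in a topology. Everything in it is an assumption made for the example: the device names, the link list, and the function names are hypothetical, and a real network would need links, sites and shared-risk groups modelled as well. It simply removes one device at a time and checks whether the remaining devices can still reach each other.

```python
from collections import deque


def is_connected(adj, removed):
    """Return True if every device except `removed` can still reach the others."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            # Skip the failed device and anything already visited.
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)


def single_points_of_failure(links):
    """List the devices whose loss splits the rest of the network."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return sorted(n for n in adj if not is_connected(adj, n))


# Hypothetical topology: access2 is single-homed to dist2.
LINKS = [
    ("core1", "core2"),
    ("core1", "dist1"), ("core2", "dist1"),
    ("core1", "dist2"), ("core2", "dist2"),
    ("dist1", "access1"), ("dist2", "access1"),
    ("dist2", "access2"),
]

if __name__ == "__main__":
    print(single_points_of_failure(LINKS))
```

In this made-up topology the only device flagged is dist2, because access2 hangs off it alone. Dual-homing access2 to dist1 would clear the list.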

Why do I build many of my networks like a Service Provider network? Because I have found that these basic design principles are usually the most robust. Configurations utilising OSPF to carry core routing information and BGP to provide end user routes stand up to some serious beating – and they are extensible too!

Why do I cry when I hear vendors pronouncing “With our new Wonder Fabric Technology you can now turn off Spanning Tree”? I cry because I feel that this is sending the wrong message. I have a whole other post coming on that topic, but please people don’t just turn off spanning tree. Are all your edge ports protected? Can you ensure that nobody will ever mis-cable? (And don’t even get me started on VMware’s view about filtering BPDUs!)

Why do I prefer two standalone systems providing redundant network services over a single HA unit? Devices with redundant power, REs and line cards, but with a shared management plane are still susceptible to risk of incorrect configuration causing a service interruption. Switch Stacks, Virtual-Chassis, VSS and whatever other similar technology all suffer from this problem. I would rather use a technology such as Multi-Chassis Link Aggregation, Virtual Port Channels, or even VRRP/HSRP or anycasted services to provide the desired network redundancy. Sometimes this is “harder”, but again – this is what you pay me for 🙂

Mop and bucket

I know that I am predisposed to negative and pessimistic views, but am I the only one who feels that “Worst Case Scenario Thinking” is one of the prime reasons people pay us? My wife could easily plug in a couple of cables and “make intarwebz happen”, as is proven by the millions of home-user CPE devices out there, but true network design and redundancy come from thinking about the worst that can happen and how to mitigate those risks.

I’m curious as to the thoughts of those of you out there.


About The Author

Kurt is a Network Engineer based out of Sydney, Australia. He is the Senior Network Engineer at ICT Networks where he spends his time building Service Provider networks and enterprise WAN environments. He holds his CCNP and JNCIE-ENT, and is currently studying for his JNCIE-SP and CCIE R&S. Find out more about Kurt at http://network-janitor.net/ and you can follow him on Twitter.
  • DC STP free 🙂 LACP everywhere.

    Access, spanning tree on.

    I agree we need to cater for worst case scenarios, but I am not a fan of designing for incompetence where only professionals should have access.

  • You know why we design for worst case scenarios? It is because we care. We are passionate. We want the best. Networks are a foundation fundamental that needs to be right. Get it wrong? It cracks and all the layers placed upon it (VMware, Application, pewpew) crumble. As does your job.

  • Yes, simple physical redundancy beats complex systems. Complex systems have complex failures.

  • Ethan Banks

    It's always about the worst case scenarios. Why? Because they happen. And because they happen, and because the IT infrastructure halts when the network is unavailable, businesses need to think about how best to keep the network alive.

    I believe the network is unique in the IT space because of its all-pervasive nature. There's not an app a business needs that can run effectively without the network. Period. That makes the network (arguably) the single most important element in a corporation's IT infrastructure. Lose a server or service? Other servers and services continue. Lose a storage shelf? Probably a bigger impact than a server, but still not the end of the world. Lose the network? Business stops.

    So, can planning for the worst-case scenario go too far? I'm sure it can. There's only so much money available in a budget, and only so much complexity that a design can tolerate before it becomes inherently more failure-prone. But someone telling me, "Don't worry about that," is just proving that they don't understand the risk implied by network downtime.

  • It’s exactly that kind of pessimistic downer thinking that ended up gaining me the nickname “the little gray storm cloud”…. and led me into a very happy career as a security professional. You’re doing it right!

  • Paranoia is your friend for sure. You're pretty much guaranteed that the folks who are willing to cut the corners are not carrying the pager. When you've done a lot of network support, you build a huge knowledge base of failure scenarios. Come design time, you end up translating all these risks into implicit design requirements.

    I'm a big fan of calling out these risks explicitly in the design document as design requirements, then designing solutions which address those explicit risks. This gives the stakeholders a little more context as to why you're designing such a bullet-proof network.

    A fun approach is to ask the person "saying it will be fine" if you can include their name in your formal design document as the owner of the risk. ;-)

  • I agree completely. It is something that Management needs to understand. And it is also something we should be spending the most time on. When I design, or engineer something, I am going to take a whole lot of time thinking and researching and questioning. The implementation is the relatively quick part. It’s easy to type in a few commands. Sure someone can get a network up and running in two days, but at what cost?

    We need time to think through all the worst case scenarios, and the what-ifs.

    I am curious, you mentioned you design networks “like” SP networks. Do you use SP design techniques with Enterprise and DC networks? If so, what does that look like?

  • "with a shared management plane are still susceptible to risk of incorrect configuration causing a service interruption. Switch Stacks, Virtual-Chassis, VSS and what ever other similar technology all suffer from this problem". << THIS

    I think it's too easy for some engineers to overlook this aspect and focus on the benefit of a "single point of management". If you're implementing virtual chassis/stacking/etc, you've got to understand the risks just as much as the benefits.

  • I've been spending a lot of time recently fixing a LOT of things in our core that have either no redundancy or very slow redundancy. I get asked a lot why I even bother, as it's been running like that for so long without problems. But problems happen, randomly. Not only do I not want the failure of a circuit or even a core router to mean catastrophic problems, but thinking this way makes me a better engineer in the long run. Think of all the things that could go wrong, what would happen to all the traffic on top, and how do I ensure that without human intervention everything routes around the problems. How do I then speed that up? I've significantly changed a number of things which I thought were poor implementations.

    What have I got out of it? First of all, I've learned a lot. I've also slept a lot more, as I've been called out of hours only twice the whole year (and I'm on call every day). Fewer customer complaints and more sleep for me is a win-win.