On the Premature Death of Spanning Tree and the Indiscriminate Killing of Canaries

Published On 2012/12/07 | By Kurt Bales | Rant

I have a bee in my bonnet. After my last post full of love and bromance, this one is full of hate and vitriol – and I don’t apologise! We have all seen many presentations on each vendor’s latest and greatest “fabric” technologies over the past 18 months. It doesn’t matter which vendor, whether the presenter is sales or tech, or even enterprise or service provider focused – at some point almost every one of them declares that their solution is “the end of spanning tree”. It gets worse when they actively advise you not to run spanning tree in your environment at all.

And I don’t buy it 🙁

The Premature Death of Spanning Tree

Spanning Tree: noun – A pox on the house of networking

Everybody loves to hate on Spanning Tree. Haters gonna hate. While we’ve all been bitten by something horrible happening related to spanning tree, I have seen many more things go wrong because people *didn’t* configure Spanning Tree properly.

Vendors knew how painful it was and went to great lengths to ensure that we didn’t need to do anything so that it would “just work”. Which is great… right up until the point that you run into a limitation of STP PixieDust Mode. Often this happens in VLAN-dense environments when you max out the total number of spanning tree instances that your devices will handle. Oh that’s easy – let’s roll out Multiple Spanning Tree!

I can hear the gasps from many people now. If people hated Spanning Tree, then they have a full lynch mob ready with pitchforks and stakes at the mere mention of poor MSTP. And they hate it for a reason. MSTP makes you think about how spanning tree works again and all the PixieDust goes away. And networks become hard again.

I will let you in on a little secret:

I actually really like MSTP and I implement it every chance I can get.

Yes, it’s a little harder and requires a little more forethought, but I would rather do that at design time than have to overhaul my design later to meet some new need well after my network has reached “critical mass”. I have spent many hours rebuilding spanning tree designs because I needed more than 128 instances. Sadly, in more than one case I needed to work out the best way to deal with a group of Catalyst 3750s running PVST+ with 1000 VLANs configured (and 900 of those VLANs with STP disabled).
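For those who have never touched it, the extra forethought MSTP demands mostly comes down to mapping your VLANs into a small number of instances up front. A minimal Cisco IOS-style sketch (the region name, revision number, VLAN ranges and priorities here are hypothetical examples – whatever you pick must match on every switch in the region):

```
spanning-tree mode mst
!
spanning-tree mst configuration
 name DC-REGION-1          ! hypothetical region name - must match on all switches
 revision 1                ! must match on all switches in the region
 instance 1 vlan 1-500     ! map VLANs into a handful of instances...
 instance 2 vlan 501-1000  ! ...instead of one STP instance per VLAN
!
! Make the instance roots deterministic rather than leaving them to an election
spanning-tree mst 1 priority 4096
spanning-tree mst 2 priority 8192
```

Two instances comfortably cover 1000 VLANs instead of blowing through a 128-instance PVST+ limit – but get the name, revision or mapping wrong on one switch and it silently falls into its own region, which is exactly the “networks become hard again” part.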

And things get messy, and things get hard. So let’s find a new solution.

The Magical Healing Powers of Woven Unicorn Hair Fabrics

Somewhere over the rainbow, far beyond the Dark Forest of Broccoli Despair, many magical elves have worked hard to deliver us the perfect solution to the problems I listed above. Vendors have taken this creation and moulded it into their own “Fabric Solutions”. Some created skinny jeans, others an uncomfortable sweater vest. Sadly, most of the time they have just presented us with a sensible pair of slacks that the sales people try to sell as a three piece suit.

A sensible pair of slacks (Unicorn Hair or otherwise) is perfectly apt when used as intended, but if you drape them over your shoulders and call them a shirt then you’re wrong (or a hipster, in which case your cardigan is probably over the top of your shirt-slacks).

And so it is with data centre fabrics. I agree that most of these solutions will allow us to disable spanning tree on our core/fabric facing interfaces. We will get many of the benefits of multi-path layer 2 and sometimes efficiencies gained by avoiding the flooding of L2 addressing information. Turning off spanning tree into the fabric core makes sense. I’m happy with that.

So what about all those edge interfaces?

Do we live in a world where end users never plug two ports together?

A client PC never bridges interfaces?

How about “Oh my VoIP phone has two network ports, let me just… BOOM!”

Maybe you have no requirement to integrate with other networking infrastructure, but end stations can still do bad things – and that’s usually when you don’t want them to.

The Indiscriminate Killing of Canaries

So how do we go about detecting these loops? Well over the past couple of decades we’ve presented ourselves with a whole cage full of canaries that can alert us to loops or other similar problems in the network. These are our early warning signals that “Something bad just happened…” and better yet “… so let me just fix that for you!”. And sadly, many of these have been built around the functionality that Spanning Tree provides.

Let’s take the BPDU Guard feature as an example. BPDU Guard is set on an access port or any other port where you do not expect to see Spanning Tree packets (BPDUs). If a BPDU is detected, the switch will usually log a message and err-disable the port. In the scenarios listed previously the offending port is now taken out of action and the loop is removed. If we have disabled Spanning Tree on all ports then the BPDU will never be sent or received and our little bridging loop will happily continue. Well, at least until your switch is a bubbling blob on the bottom of your rack.
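As a sketch, enabling this canary on a Cisco IOS switch looks something like the following (the interface name is a hypothetical example, and the recovery timer is optional – it just saves a trip to the err-disabled port):

```
! Enable BPDU Guard globally on all PortFast-enabled edge ports
spanning-tree portfast bpduguard default
!
interface GigabitEthernet0/1
 description Hypothetical user-facing edge port
 spanning-tree portfast
 spanning-tree bpduguard enable     ! per-port alternative to the global default
!
! Optionally bring err-disabled ports back automatically after 5 minutes
errdisable recovery cause bpduguard
errdisable recovery interval 300
```

If a rogue BPDU ever arrives on that port, the switch logs it and shuts the port down – the canary sings, then dies, and the loop dies with it.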

Another feature available on most switches is BPDU Filter. With BPDU Filter enabled on a port the switch will pass all traffic on the port but silently drop the BPDU messages. Now I agree that there are certain times when this feature is useful, such as when interconnecting with a 3rd party that you “know” can never form a loop with you, and you do not want to either learn a STP root from them or go into blocking due to election issues.
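For completeness, the filter looks something like this on Cisco IOS (interface name hypothetical – and note that, unlike BPDU Guard, this port keeps forwarding with no alarm at all):

```
interface GigabitEthernet0/2
 description Hypothetical 3rd-party interconnect
 spanning-tree bpdufilter enable    ! silently drop BPDUs in both directions
```

It is the networking equivalent of taping the canary’s beak shut: nothing chirps, no matter what the air is like.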

Sadly, our good friends at VMware love to advocate that we implement BPDU Filter on the ports facing our VMware hosts. Unfortunately I have been bitten by loops coming from inside a VMware environment due to a Microsoft guest bridging two vNICs in separate VLANs. A BPDU came in from the physical NIC on VLAN A, into the vNIC in that VLAN, and back out through the other vNIC into VLAN B. Thankfully when this happened, my canary (BPDU Guard) signalled that there was a problem, then promptly died in its cage and disabled the port to the VMware host. Yes, this had some undesirable effects on all the other guests on that host, but we were alerted to the problem and could fix it. With BPDU Filter in place, those alerts would have been filtered out and the loop would have continued unnoticed.

So what other methods do we have to detect possible bridging loops that do not require Spanning Tree to be operational? I have the following list as a start to some ideas, and I am looking for others that you might know of too:

  • Broadcast Storms
    • Possible Mitigation: Storm Control
  • Multiple MAC Addresses on a Port
    • Possible Mitigation: Max MAC address restrictions
  • MAC Address Flapping
    • Possible Mitigation: MAC Flap Dampening
  • High CPU Usage (in some cases)

How do we best monitor these details and present them in a useful way to our NOC and Service Desk people, so they know when something bad is happening without the tools we originally created? How do we mitigate these issues so that we can maintain some of the “self healing” we had with our previous tools?
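As a starting point, the first two mitigations above can be sketched on a Cisco IOS edge port like this (the threshold, MAC limit and interface name are hypothetical examples – tune them to your own traffic profile before trusting them):

```
interface GigabitEthernet0/3
 description Hypothetical user-facing edge port
 !
 ! Broadcast storm mitigation: err-disable the port if broadcast
 ! traffic exceeds 1% of link bandwidth (threshold is an example)
 storm-control broadcast level 1.00
 storm-control action shutdown
 !
 ! Multiple MAC addresses: err-disable if more than 2 MACs appear
 ! (e.g. a phone plus the PC daisy-chained behind it)
 switchport port-security
 switchport port-security maximum 2
 switchport port-security violation shutdown
```

Neither of these is a real substitute for a BPDU-based canary – a quiet, low-volume loop can sit under a storm-control threshold for a long time – but they at least give the NOC something to alarm on.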

Mop and Bucket

Yes, I’ve written this post at 2am, but it’s been something that I have been thinking about for the past 8 or 9 months.

I can see that Spanning Tree doesn’t have an indefinite future, but calling it dead today is premature. If you are looking at fabric technologies – or worse still, you don’t have a new-fangled fabric but hate spanning tree so badly that you have just turned it off – then ask yourself how you will detect loops in the edge networks and how you will mitigate them.

Take your canaries with you, let them do their job, and don’t strangle them at the top of the mine shaft.

If you do, you might just find that the Emperor’s new clothes are just a sensible pair of slacks!


About The Author

Kurt is a Network Engineer based out of Sydney, Australia. He is the Senior Network Engineer at ICT Networks, where he spends his time building Service Provider networks and enterprise WAN environments. He holds his CCNP and JNCIE-ENT, and is currently studying for his JNCIE-SP and CCIE R&S. Find out more about Kurt at http://network-janitor.net/ and you can follow him on Twitter.
  • Love what you've written here. Even in my shiny new MLAG-based core where every links forwards and "you could never have a loop", I am still running a full STP configuration. I define roots and enforce with root guard. I use BPDU guard. Etc. I don't happen to have a reason for MSTP, because I just don't have many VLANs in my environment, but would if I needed it. Storm-control? Check. MAC address control? On certain ports, especially user-facing where they've got multiple live jacks at hand, yes. Loop prevention still matters. Salespeople prey on FUD & ignorance regarding STP, because some people think "STP causes STP loops, so if we can shut it off, that's good, right?!?" Sigh.

    • I agree with your sentiments. And the way I see it, the "TRILLs" of the industry are not used to replace STP, but rather supplement it. What I mean is, the problem we have to solve is not avoiding loops, but rather how can we utilize our expensive links to the fullest extent.

      If I am a CxO I want to use the Network Kit I just spent gobs of money on as much as I can. If you're telling me we have a port just sitting there doing nothing as a "backup", well that screams waste to me. I believe this mentality started to come about after VMWare took off. Think about it, you now have servers that are being utilized "100%" of the time. I don't have an application sitting there only working the CPU to 30% anymore. So naturally the question comes about, "why can't we utilize our network gear more?"

      At least from a business perspective it's about using all our links; from a nerd's perspective, it's about making the network more scalable, and helping to meet the business case.

  • Mick

    Hi Kurt,

    You know your stuff – that's clear.

    Peers doing exactly the same work clearly understand where you're coming from.

    With a tiny bit of effort you could embellish your gripe into an article that can be consumed by more than expert peers.

    Soon you'll be publishing your own books !


  • I always thought the selling points for "no more spanning tree" were more related to data centre interconnects and other uplinks in the data centre. The idea is that you don't end up with expensive links going in to blocking mode.

    Have people been advocating the end of spanning tree on edge networks? I don't think there has been much directed at this market.

    That's what I always thought.. no one actually hates spanning tree on your typical LAN except that your redundant data centre links end up in blocking mode, wasting a lot of money. It's a life saver on a user-facing network!

    PS: Writing this drunk from Bali. I see your 2am post and raise :p

  • Hi Kurt,

    Well done. I totally agree. I don't see spanning tree going away any time soon. The more interesting point you raise is about it making networking 'hard'. That comment is without a doubt totally true. I know lots and lots of brilliant network engineers that shiver at configuring spanning tree. Better yet are the ones that just assume it's there and that it works. It's too easy this day and age to be concerned with only layer 3 networking. I for one, embrace STP and don't see it going away (at least at my org) for a very very long time.

    Reading your post and the comments also raises another interesting point. Do people still build networks with blocking links in the core and distribution layers? I guess I'm biased since most of the networks I work on run layer 3 to the edge, so between that and port-channels we don't let any links sit idle.

    Interesting post, great to see you posting again!

    Take care – Jon

  • You're way too smart in subjects I know nothing about.

  • ijdod

    I couldn't agree more. I've seen a couple of STP-less implementations, and I'm not sold. Often the fear of STP in those environment can be traced back to lack of knowledge of what STP does, or in some cases very dubious implementations by vendors. So that vendor sold the company on this neat STP-less design… which requires a boatload of proprietary technology to prevent things going pear-shaped if things don't work as planned (like switches dropping to factory default configs without admin interference or hardware failure. Same vendor… ). All of which haven't actually prevented things going pear-shaped on occasion.