Cisco’s WAN / MPLS Redundancy Silver Bullet

How many of you have ever heard of OER? How many times have you had redundant paths where an immediate failover would have saved you 180 seconds of BGP hold timer? When we're talking about an Internet transit link, even one of several Internet connections for a company, it's typically acceptable to suffer through the rare three-minute hold timer on a dead BGP peer. But when you are running VOIP across your redundant MPLS WAN, you can bet the failures will only occur while the CEO is talking to the head sales VP, and trust me, a three-minute outage window is not going to make anyone happy.

We've been running along for years with a redundant MPLS WAN and have relied exclusively on BGP to handle failover. In a sense we are lucky to have redundant carriers and the ability to recover relatively quickly from a failure. Even so, we suffer through short but extremely painful periods when things go wrong. Worse, we all know this is a best-case scenario, and we really can't do much except manually intervene when performance across a given cloud changes or degrades. Sure, you can monitor these things with Cisco IP SLA tests that calculate MOS and ICPIF scores, but acting on those changes BEFORE the CEO's call gets disrupted or untenable is the glory we all seek, the kind that keeps us invisible.

We would also like to believe the investments we make in infrastructure and telco are well placed. It weakens the argument for redundant WAN carriers when one "side" typically sits idle and unused. The core of the argument when you purchase that second WAN link is redundancy and the elimination of a single point of failure. Any frugal network engineer or business owner will eventually circle back around and ask how many times that redundancy actually carried the business through a failure. A cost-benefit analysis might reveal a lopsided investment despite the utopia redundant connections bring. Wouldn't it be better if you could tell them that both infrastructures are in use at all times, providing real-time benefits on a daily basis?

Enter OER. OER is Optimized Edge Routing, and it is being renamed to PfR, or Performance Routing; Cisco's site uses the two names interchangeably. I am still learning the extent of what OER can do, but essentially it has the ability to alter your local routing table. It makes decisions based on several performance metrics, like MOS, RTT, delay, and packet loss, according to how you configure those metrics for each application you describe to it. It can also passively learn about network prefixes flowing through it. You can identify applications based on NBAR, prefix lists, or access lists. It's like a fluid, intelligent policy-based routing engine.

We have been investigating OER for a while now because, like many companies, we are facing converged data, video, and voice over IP, which is raising demand on WAN links where bandwidth is still relatively expensive. In essence, what I have been looking for is a way to leverage our entire investment in technology. I didn't find OER, another engineer did, and I am glad he did. There are a lot of wild ideas you can come up with, like running OSPF over GRE tunnels on top of your BGP infrastructure, and very few of those ideas are really workable in a production environment.

We were given the opportunity to explore OER with Cisco in their Raleigh, N.C. lab last month, and we jumped on it. If you've ever had the chance to tour the Cisco campus on either the East or West coast, you know how hard it is to come away without having a lot of faith in the Cisco brand. They have really spent a great deal of time and money developing technology and turning out support for that technology in a workable way. Their Customer Proof of Concept (CPOC) lab was a great experience and allowed us to mock up our entire WAN and test OER under several different scenarios. It's not just having the massive amount of equipment available to test with, it's having the expert people to work with that makes the trip worthwhile. If we'd deployed OER ourselves, we might not have found the bug in the latest version of IOS that would have caused problems, but our engineer talked to one of the primary OER developers when he saw the problem and was able to make a solid recommendation to avoid trouble before we got to it. I realize they like to tie CPOC trips to major spends, and we're looking at Cisco VOIP, but the experience was fantastic. I encourage the use of the CPOC. I'll bet you won't find a better engineer than Keith Brister, either.

After our two-day CPOC lab discovery with Cisco and a scheduled maintenance window to bring our current primary router up to the recommended OER-enabled code (12.4(15)T7), it came time to turn on OER. I decided to make this change going into a weekend for a couple of reasons. Normally I don't make changes on a Friday, for obvious reasons, but I didn't want to create a problem for myself in the middle of the week; folks want the WAN to work when they are at work, after all. With a weekend of long-running backups coming up, it was a unique opportunity to watch a lot of data hit the circuits. I have not been disappointed in the least.

We have 5 remote sites, 4 with a T1 from each provider and the 5th with a 9 meg connection from each. Our Atlanta corporate site has a 12 meg circuit to each provider, connected to a single Cisco 7204VXR router running an NPE-400. Our plan is to deploy a second VXR to handle one of the hand-offs, which will obviously step up our redundancy a good bit. Our remote sites were just upgraded to 2811s, so they were perfect for the OER rollout as well.

I planned the deployment and came up with the following criteria:

  1. I don’t want it to learn prefixes. I’ll define what I want it to know about.
  2. I want to define VOIP as an application but the rest will be blanket definitions of data networks.
  3. I want to test data networks with echo probes and make decisions solely on latency and health.
  4. I want to test MOS scores for VOIP and make decisions based on loss and delay.
  5. I do not want to do anything with load sharing; I believe we will get that inherently from the OER methods.
  6. I want to start out with OER in route observation mode.

Based on these few criteria, I followed these steps to bring OER on-line.

  1. I created a prefix list named for each site. I put all the prefix lists on all the routers just to keep things consistent.
  2. I turned on the IP SLA responder in each location with the “ip sla responder” command.
  3. I built a key chain for each site. This key is used by OER to authenticate the conversation between the Master Router (MR) and the Border Router (BR).
  4. I configured oer master with logging and described the border router for each location. In our case, the MR and BR are the same router. In any case I used the loopback interface for OER.
  5. I configured oer border in each location, pointing to the master configuration and turning on logging there, too. This is the shortest part, as the border just does what the master says. The master is where all the guts of the configuration really go. A rough sketch of this base configuration follows below.
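
For reference, here is a rough sketch of what steps 1 through 5 look like on the Atlanta router. The prefix values (other than the LA data block that shows up later in the traffic-class output), the key string, and Loopback0 are placeholders I made up for illustration; the interface names come from the border detail output further down. Adjust everything to your own environment.

    ! Step 1: site prefix lists (subnets besides 10.50.1.0/24 are made up)
    ip prefix-list LA seq 10 permit 10.50.1.0/24
    ip prefix-list LAVOIP seq 10 permit 10.50.2.0/24
    !
    ! Step 2: answer SLA probes sent from the other sites
    ip sla responder
    !
    ! Step 3: key chain used to authenticate the MR/BR conversation
    key chain OER
     key 1
      key-string SOMESECRET
    !
    ! Step 4: master controller (the MR and BR are the same box here)
    oer master
     logging
     border 10.105.105.8 key-chain OER
      interface FastEthernet2/0 external
      interface Serial3/0 external
      interface Port-channel10 internal
      interface Port-channel10.30 internal
      interface Port-channel10.401 internal
     mode route observe
     ! prefix learning is simply left unconfigured (criterion 1)
    !
    ! Step 5: border process, pointing back at the master over the loopback
    oer border
     logging
     local Loopback0
     master 10.105.105.8 key-chain OER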

Once all that was done, I checked OER in each location using the “show oer master border detail” command. This command will check the external and internal links and tell you if OER itself is functional:

Border           Status   UP/DOWN             AuthFail  Version
10.105.105.8     ACTIVE   UP       2d23h          0  2.1
 Fa2/0           EXTERNAL UP             
 Po10.401        INTERNAL UP             
 Po10.30         INTERNAL UP             
 Po10            INTERNAL UP             
 Se3/0           EXTERNAL UP             

 External            Capacity      Max BW   BW Used    Load Status          Exit Id
 Interface            (kbps)       (kbps)    (kbps)    (%)                         
 ---------           --------      ------   ------- ------- ------           ------
 Fa2/0           Tx     12000        9000       121       1 UP                    2
                 Rx                 12000      5996      49
 Se3/0           Tx     12000        9000       178       1 UP                    1
                 Rx                 12000      1975      16

This output shows OER is up, sees the internal and external links, and is aware of the utilization of each link. The most interesting part is the first two lines. It's hard to see here, but the first line is the column headers for the second. It's saying border router 10.105.105.8 is ACTIVE and UP for 2 days, 23 hours with 0 authentication failures, and it's running OER version 2.1. This is all good. Now we can start talking about applications and network prefixes, as well as what we're going to do about each performance metric.

This is also where it looks like you can get artistic in how you manage your OER deployment. My primary concern with OER early on was that I didn't want to deploy anything that would do things in unexpected ways and become hard to manage overall. The fact of the matter is that while you can get fairly elaborate in what tests you perform or how many you run for each oer-map criterion, the interface to OER is simple and easy to manage. Here is what I did, in brief:

  1. In Atlanta, I added an oer-map for NY VOIP and LA VOIP, each matching its respective prefix list. I did this across two separate but identical oer-maps because the decision I make about routing to LA may need to be different from the one for NY. Here's what my oer-map looked like:
    oer-map 10 10
     match traffic-class prefix-list NYVOIP
     set delay threshold 750
     set mode monitor fast
     set resolve mos priority 2 variance 10
     set resolve delay priority 3 variance 10
     set resolve loss priority 4 variance 10
     set loss relative 500
     set jitter threshold 15
     set mos threshold 3.76 percent 30
     set active-probe jitter 10.105.105.5 target-port 1025 codec g729a
     set probe frequency 2
    
  2. I then went ahead and added a data network oer-map for each remote site. There’s of course no reason to describe Atlanta to Atlanta, so I left it out. Here is a representative sample:
    oer-map 10 40
     match traffic-class prefix-list NY
     set delay threshold 750
     set mode monitor fast
     set active-probe echo 10.105.105.5
     set probe frequency 2
    
  3. Once I did this, while still in mode route observe under oer master, I added the policy group to OER. I named mine unimaginatively (10), but you can call yours anything you like. Once you tie the policy group to OER, the tests begin. You can see the results by issuing a "show ip sla statistics" command.
    Round Trip Time (RTT) for       Index 788
            Latest RTT: 24 milliseconds
    Latest operation start time: 20:01:08.520 EST Sun Nov 2 2008
    Latest operation return code: OK
    Number of successes: 88
    Number of failures: 0
    Operation time to live: Forever
    
    Round Trip Time (RTT) for       Index 793
            Latest RTT: 32 milliseconds
    Latest operation start time: 20:01:08.560 EST Sun Nov 2 2008
    Latest operation return code: OK
    RTT Values:
            Number Of RTT: 50               RTT Min/Avg/Max: 27/32/38 milliseconds
    Latency one-way time:
            Number of Latency one-way Samples: 50
            Source to Destination Latency one way Min/Avg/Max: 18/21/26 milliseconds
            Destination to Source Latency one way Min/Avg/Max: 8/11/14 milliseconds
    Jitter Time:
            Number of SD Jitter Samples: 49
            Number of DS Jitter Samples: 49
            Source to Destination Jitter Min/Avg/Max: 0/2/8 milliseconds
            Destination to Source Jitter Min/Avg/Max: 0/3/5 milliseconds
    Packet Loss Values:
            Loss Source to Destination: 0           Loss Destination to Source: 0
            Out Of Sequence: 0      Tail Drop: 0
            Packet Late Arrival: 0  Packet Skipped: 0
    Voice Score Values:
            Calculated Planning Impairment Factor (ICPIF): 11
    MOS score: 4.06
    Number of successes: 88
    Number of failures: 0
    Operation time to live: Forever
    
  4. These are results for two of the tests. What you wind up with is a set of results for each exit interface for each oer-map. In my case, I have two exit links, so for each oer-map that tests jitter to a remote site, I get two jitter test results. In case you didn't know, you can graph the jitter test results with your favorite SNMP monitoring system (Cacti, in my case). It's nice to have these trends recorded when you start troubleshooting jitter issues in the future. You can also run these tests without OER if all you want to do is monitor; there is a sketch of a standalone probe after this list. By the way, if you want to pinpoint which results belong to which remote site, it won't be obvious unless you do something like run them over different ports. I didn't think of this until I was finished, so I guess I just have to figure out which exit link and site each test is really for. Oh well.
  5. Take a look at "show oer master traffic-class" as well, which will tell you exactly what OER is doing with each prefix you are working with.
    wan-7204#show oer master traffic-class 
    OER Prefix Statistics:
     Pas - Passive, Act - Active, S - Short term, L - Long term, Dly - Delay (ms),
     P - Percentage below threshold, Jit - Jitter (ms), 
     MOS - Mean Opinion Score
     Los - Packet Loss (packets-per-million), Un - Unreachable (flows-per-million),
     E - Egress, I - Ingress, Bw - Bandwidth (kbps), N - Not applicable
     U - unknown, * - uncontrolled, + - control more specific, @ - active probe all
     # - Prefix monitor mode is Special, & - Blackholed Prefix
     % - Force Next-Hop, ^ - Prefix is denied
    
    DstPrefix           Appl_ID Dscp Prot     SrcPort     DstPort SrcPrefix         
               Flags             State     Time            CurrBR  CurrI/F Protocol
             PasSDly  PasLDly   PasSUn   PasLUn  PasSLos  PasLLos      EBw      IBw
             ActSDly  ActLDly   ActSUn   ActLUn  ActSJit  ActPMOS
    --------------------------------------------------------------------------------
    10.50.1.0/24              N defa    N           N           N N                 
                              OOPOLICY     @105      10.105.105.8    Se3/0      BGP
                   N        N        N        N        N        N        N        N
                2678     2525   600000   491525        N        N
    <..snip..>
  6. In this example, I have a prefix (10.50.1.0/24, my LA data network) out of policy (OOPOLICY). Further down the output I also see that my VOIP network for LA was recently moved to the other exit interface. Something must be going on with one of my WAN connections in LA. The best part is that all of this gets managed by OER based on your criteria in the oer-map sections of your master configuration(s). In a subsequent run of the "show oer master traffic-class" command, I see the data network was moved over as well. The VOIP network likely moved quicker because I am looking at delay, loss, and jitter for voice, not just delay as I am with the data network blocks.
  7. Once you have OER set up, though, you should give it a long time to cycle through all of its tests, settle out, and be ready to be put in control. OER is deliberate in its attempts to intelligently test and manipulate things. I didn't run into a single issue where routes got screwed up or anything, but patience is certainly helpful. Also, until you put it into route control mode, you won't see any OER routes, and the state entries in the traffic-class output will all have *'s beside them, indicating OER is not controlling those prefixes.
  8. The most exciting part is putting OER in control. To do that, go into oer master configuration mode and type "mode route control". I am sure you have an ITIL-compliant change control request already scheduled, right? Anyway, no one will notice, except that you will become a little more invisible because fewer problems are going to be noticed by anyone, especially the CEO on that VOIP call.
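
On the standalone-monitoring idea from step 4 above, here is a hedged sketch of what a manual udp-jitter probe looks like on 12.4T code. The operation numbers, tags, destination ports, and the LA target address are made up for illustration; the IP SLA responder from the base setup has to be enabled on the far end, and OER generates its own probes regardless, so this is purely for graphing and trending.

    ! Manual jitter probe toward the NY responder (operation 101 and port 16384 are arbitrary)
    ip sla 101
     udp-jitter 10.105.105.5 16384 codec g729a
     tag NY-VOIP
     frequency 60
    ip sla schedule 101 life forever start-time now
    !
    ! A second site on a different port and tag so the results are easy to tell apart
    ip sla 102
     udp-jitter 10.50.1.10 16385 codec g729a
     tag LA-VOIP
     frequency 60
    ip sla schedule 102 life forever start-time now

The tag at least makes each operation easy to identify when you look back at the configuration; sorting out which of OER's own probes belongs to which exit link is still a separate exercise.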

My Own Results

It's not my intention to sing the praises of one equipment vendor over another, but there are many reasons why Cisco is the market leader. Cisco has built some of the most robust network gear on the planet. I know there is faster gear out there for the mid-to-large business market, but sometimes the real need is rock-solid reliability. This doesn't mean other vendors are inferior, but it does mean there is a lot of intellectual horsepower built into those green boxes out there.

My own results with OER have been extraordinary. Remember I told you there were long-running backups on the weekend? I noticed on Saturday that 4 of my T1-connected sites were backing up over one WAN cloud while NY, with its 9 mbit connectivity, was backing up over the other. OER basically gave NY a dedicated pipe for its backups and consolidated some of the smaller sites across the other cloud. I couldn't have asked for a better result. I have noticed OER moving other networks around because of loss and delay as well, but I haven't really dug into what was going on while those events occurred. I did notice tonight that it moved my two west coast offices over to one MPLS provider but left everything else alone. Was there a problem with the other provider? Not sure. It is highly recommended to have a good NetFlow analysis tool handy, though. Scrutinizer by Plixer International is a great tool; the map of our WAN is a quick and easy reference for finding where OER is pushing bandwidth to and from.

Other Notes

OER is included in IOS. We didn't pay for an extra license for it to be in our router code, but it seems like a great thing to have. Here's another thing to like about it: since it tests link performance and moves things based on that, if you have a link go down or a BGP peer otherwise die, OER will see that and move things within a few seconds, based on your probe frequency. Otherwise you would have to wait 180 seconds for the BGP hold timer to expire and pull those routes from the table.

Oh, and I told it to choose the best exit based on test results, not just any good link. That way, if both links are failing, it will choose the least bad link.
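
If memory serves, the knob for that behavior is the exit-selection mode under oer master; a one-line sketch, assuming the 12.4T syntax:

    oer master
     ! prefer the best-performing in-policy exit over the first acceptable one
     mode select-exit best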

I’m glad you read this post. Isn’t OER cool?
