Archive for the 'Networking' Category

Cisco’s WAN / MPLS Redundancy Silver Bullet

How many of you have ever heard of OER? How many times have you had redundant paths where an immediate failover would have saved you from waiting out a 180-second BGP hold timer? For an Internet transit link, even at a company with multiple Internet connections, it's usually acceptable to suffer through the rare three-minute wait on a dead BGP peer. But when you run VOIP across a redundant MPLS WAN, you can bet the failure will happen while the CEO is talking to the VP of sales, and trust me, a three-minute outage is not going to make anyone happy.

We've been running a redundant MPLS WAN for years and have relied exclusively on BGP to handle failover. In a sense we are lucky to have redundant carriers and the ability to recover relatively quickly from a failure. Even so, we suffer short but extremely painful outages when things go wrong. Worse, that is the best-case scenario: when performance across one carrier's cloud degrades without failing outright, there is little we can do except intervene manually. Sure, you can monitor these things with Cisco IP SLA tests that calculate MOS and ICPIF scores, but acting on those changes BEFORE the CEO's call is disrupted or becomes unusable is the glory we all seek, the kind that makes us invisible.

We would also like to believe the investments we make in infrastructure and telco are well placed. The case for redundant WAN carriers weakens when one "side" typically sits idle. The core argument for buying that second WAN link is redundancy and the elimination of a single point of failure. Any frugal network engineer or business owner will eventually circle back and ask how many times that redundancy actually carried the business through a failure. A cost-benefit analysis might reveal a lopsided investment, despite the utopia redundant connections promise. Wouldn't it be better to tell them that both links are in use at all times, delivering real benefits every day?

Enter OER. OER stands for Optimized Edge Routing, and it is being renamed Performance Routing (PfR); Cisco's site uses the two names interchangeably. I am still learning the extent of what OER can do, but essentially it can alter your local routing table. It makes decisions based on performance metrics such as MOS, round-trip delay and packet loss, according to how you configure those metrics for each application you describe to it. It can also passively learn about network prefixes flowing through it. You can identify applications with NBAR, prefix lists or access lists. Think of it as a fluid, intelligent policy-based routing engine.
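To make the threshold idea concrete, here is a minimal Python model of an out-of-policy check for one traffic class on one exit. It is a sketch under my own assumptions, not how IOS implements OER: the `Measurement` and `Policy` classes and the `out_of_policy` function are hypothetical, the default values mirror the NYVOIP oer-map entry later in this post, and I'm reading "set mos threshold 3.76 percent 30" as "out of policy if more than 30% of MOS samples fall below 3.76". Relative loss is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Measurement:
    """Hypothetical probe results for one traffic class on one exit link."""
    delay_ms: float                                         # average round-trip delay
    jitter_ms: float                                        # average jitter
    mos_samples: List[float] = field(default_factory=list)  # individual MOS readings


@dataclass
class Policy:
    """Thresholds modeled on the NYVOIP oer-map entry (assumed semantics)."""
    delay_threshold_ms: float = 750.0   # set delay threshold 750
    jitter_threshold_ms: float = 15.0   # set jitter threshold 15
    mos_threshold: float = 3.76         # set mos threshold 3.76 ...
    mos_percent: float = 30.0           # ... percent 30


def out_of_policy(m: Measurement, p: Policy) -> List[str]:
    """Return the metrics that violate the policy; an empty list means in policy."""
    reasons = []
    if m.delay_ms > p.delay_threshold_ms:
        reasons.append("delay")
    if m.jitter_ms > p.jitter_threshold_ms:
        reasons.append("jitter")
    if m.mos_samples:
        below = sum(1 for s in m.mos_samples if s < p.mos_threshold)
        if 100.0 * below / len(m.mos_samples) > p.mos_percent:
            reasons.append("mos")
    return reasons


# Numbers similar to the jitter probe output later in this post: in policy.
print(out_of_policy(Measurement(delay_ms=32, jitter_ms=3, mos_samples=[4.06]), Policy()))
```

An empty list means the class stays where it is; any entry in the list is a reason OER would consider the class out of policy on that exit.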

We have been investigating OER for a while now because, like many companies, we are converging data, video and voice onto IP, which raises the demand on WAN links where bandwidth is still relatively expensive. In essence, I have been looking for a way to leverage our entire investment in technology. I didn't find OER; another engineer did, and I am glad he did. You can come up with a lot of wild ideas, like running OSPF over GRE tunnels on top of your BGP infrastructure, but very few of them are workable in a production environment.

We were given the opportunity to explore OER with Cisco in their Raleigh, N.C. lab last month, and we jumped on it. If you've ever toured the Cisco campus on either the East or West coast, you know how hard it is to come away without a lot of faith in the Cisco brand. They have spent a great deal of time and money developing technology and supporting it in a workable way. Their Customer Proof of Concept (CPOC) lab was a great experience and let us mock up our entire WAN and test OER under several different scenarios. It's not just the massive amount of equipment available to test with; it's the expert people you get to work with that make the trip worthwhile. If we'd deployed OER ourselves, we might not have found a bug in the latest IOS release that would have caused us problems. When our CPOC engineer saw the problem, he talked to one of the primary OER developers and made a solid recommendation to avoid the trouble before we ever hit it. I realize Cisco likes to tie CPOC trips to major spends, and we're looking at Cisco VOIP, but the experience was fantastic. I encourage the use of the CPOC. I'll bet you won't find a better engineer than Keith Brister, either.

After our two-day CPOC lab and a scheduled maintenance window to bring our current primary router up to the recommended OER-enabled code (12.4(15)T7), it was time to turn on OER. I decided to make this change going into a weekend for a couple of reasons. Normally I don't make changes on a Friday, for obvious reasons, but I didn't want to create a problem for myself in the middle of the week; folks want the WAN to work when they are at work, after all. A weekend of long-running backups was also coming up, which made it a unique opportunity to watch a lot of data hit the circuits. I have not been disappointed in the least.

We have five remote sites: four with a T1 from each provider and the fifth with a 9 Mbps connection from each. Our Atlanta corporate site has a 12 Mbps circuit to each provider, both connected to a single Cisco 7204 VXR router running an NPE-400. We plan to deploy a second VXR to handle one of the hand-offs, which will step up our redundancy a good bit. Our remote sites were just upgraded to 2811s, so they were perfect for the OER rollout as well.

I planned the deployment around the following criteria:

  1. I don’t want it to learn prefixes. I’ll define what I want it to know about.
  2. I want to define VOIP as an application, but the rest will be blanket definitions of data networks.
  3. I want to test data networks with echo probes and make decisions solely on latency and health.
  4. I want to test MOS scores for VOIP and make decisions based on loss and delay.
  5. I do not want to do anything with load sharing. I believe we will get that inherently from the way OER works.
  6. I want to start out with OER in route observation mode.

Based on these few criteria, I followed these steps to bring OER on-line.

  1. I created a prefix list for each site, named after the site, and put all the prefix lists on every router to keep things consistent.
  2. I turned on the IP SLA responder in each location with the “ip sla responder” command.
  3. I built a key chain for each site. This key is used by OER to authenticate the conversation between the Master Router (MR) and the Border Router (BR).
  4. I configured oer master with logging and described the border router for each location. In our case, the MR and BR are the same router. In every case I used the loopback interface for OER.
  5. I configured oer border in each location, pointing it at the master and turning on logging there, too. This is the shortest part, since the border just does what the master says; the guts of the configuration go on the master.
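If you want the same prefix lists and a similar OER skeleton on every router, a small script that stamps out each router's stanzas helps keep things consistent. This is a hypothetical Python sketch: the NY prefix, the key chain name, the key string and the full interface names are placeholders (the 10.105.105.8 border address, the LA data prefix and the Fa2/0, Se3/0 and Po10 interfaces come from output elsewhere in this post). The command syntax reflects my understanding of the 12.4T OER commands, so check it against Cisco's OER configuration guide for your release before pasting anything.

```python
# Hypothetical per-site data: NY's prefix is a placeholder; LA's comes from
# the traffic-class output later in this post.
SITES = {
    "NY": "10.20.1.0/24",
    "LA": "10.50.1.0/24",
}


def prefix_lists(sites):
    """Step 1: one prefix list per site, named after the site."""
    return [f"ip prefix-list {name} seq 5 permit {prefix}" for name, prefix in sites.items()]


def key_chain(name, key_string):
    """Step 3: key chain used to authenticate MR <-> BR communication."""
    return [f"key chain {name}", " key 1", f"  key-string {key_string}"]


def oer_master(border_ip, chain, external, internal):
    """Step 4: master stanza with logging, describing one border router."""
    lines = ["oer master", " logging", f" border {border_ip} key-chain {chain}"]
    lines += [f"  interface {ifname} external" for ifname in external]
    lines += [f"  interface {ifname} internal" for ifname in internal]
    return lines


def oer_border(master_ip, chain, local_if="Loopback0"):
    """Step 5: border stanza pointing at the master, sourced from the loopback."""
    return ["oer border", " logging", f" local {local_if}", f" master {master_ip} key-chain {chain}"]


def site_config(loopback_ip, chain, key_string, external, internal):
    """Assemble one router's config where the MR and BR are the same box."""
    lines = prefix_lists(SITES)
    lines.append("ip sla responder")  # step 2
    lines += key_chain(chain, key_string)
    lines += oer_master(loopback_ip, chain, external, internal)
    lines += oer_border(loopback_ip, chain)
    return "\n".join(lines)


print(site_config("10.105.105.8", "OER-KEYS", "changeme",
                  ["FastEthernet2/0", "Serial3/0"], ["Port-channel10"]))
```

Generating the text rather than typing it per router keeps the prefix list names identical everywhere, which matters because the oer-map entries match on those names.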

Once all that was done, I checked OER in each location with the “show oer master border detail” command, which checks the external and internal links and tells you whether OER itself is functional:

Border           Status   UP/DOWN             AuthFail  Version
10.105.105.8     ACTIVE   UP       2d23h          0  2.1
 Fa2/0           EXTERNAL UP
 Po10.401        INTERNAL UP
 Po10.30         INTERNAL UP
 Po10            INTERNAL UP
 Se3/0           EXTERNAL UP             

 External            Capacity      Max BW   BW Used    Load Status          Exit Id
 Interface            (kbps)       (kbps)    (kbps)    (%)
 ---------           --------      ------   ------- ------- ------           ------
 Fa2/0           Tx     12000        9000       121       1 UP                    2
                 Rx                 12000      5996      49
 Se3/0           Tx     12000        9000       178       1 UP                    1
                 Rx                 12000      1975      16

This output shows that OER is up, sees the internal and external links and knows the utilization of each external link. The most interesting part is the first two lines: the first line holds the column headers for the second. Border router 10.105.105.8 is ACTIVE and has been UP for 2 days, 23 hours, with 0 authentication failures, running OER version 2.1. This is all good. Now we can start talking about applications and network prefixes, and what we're going to do about each performance metric.
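The Load column is easy to sanity-check by hand. Dividing BW Used by the 12000 kbps capacity and truncating reproduces every value in the table (5996 / 12000 is 49.97%, shown as 49), so the Python sketch below assumes that is how it is computed. I'm also assuming the 9000 kbps Max BW figure (75% of capacity) is a transmit ceiling, and computing headroom against it.

```python
def load_percent(used_kbps: int, capacity_kbps: int) -> int:
    """Utilization as a whole percentage, truncated (matches the Load column above)."""
    return used_kbps * 100 // capacity_kbps


# Values copied from the "show oer master border detail" output above.
LINKS = {
    "Fa2/0": {"capacity": 12000, "max_bw": 9000, "tx": 121, "rx": 5996},
    "Se3/0": {"capacity": 12000, "max_bw": 9000, "tx": 178, "rx": 1975},
}

for name, link in LINKS.items():
    tx = load_percent(link["tx"], link["capacity"])
    rx = load_percent(link["rx"], link["capacity"])
    headroom = link["max_bw"] - link["tx"]  # kbps of transmit left before Max BW
    print(f"{name}: Tx {tx}%  Rx {rx}%  Tx headroom {headroom} kbps")
```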

This is also where you can become artistic in how you manage your OER deployment. My primary concern early on was that I didn't want to deploy anything that would behave in unexpected ways and become hard to manage overall. The fact is that while you can get fairly elaborate in which tests you perform, and how many you run for each oer-map entry, the OER interface is simple and easy to manage. Here is what I did, in brief:

  1. In Atlanta, I added oer-map entries for NY VOIP and LA VOIP, each matching its respective prefix list. I used two separate but identical entries because the routing decision for LA may need to differ from the one for NY. Here's what the NY entry looked like:
    oer-map 10 10
     match traffic-class prefix-list NYVOIP
     set delay threshold 750
     set mode monitor fast
     set resolve mos priority 2 variance 10
     set resolve delay priority 3 variance 10
     set resolve loss priority 4 variance 10
     set loss relative 500
     set jitter threshold 15
     set mos threshold 3.76 percent 30
     set active-probe jitter 10.105.105.5 target-port 1025 codec g729a
     set probe frequency 2
    
  2. I then added a data network oer-map entry for each remote site. There's no reason to describe Atlanta to Atlanta, so I left it out. Here is a representative sample:
    oer-map 10 40
     match traffic-class prefix-list NY
     set delay threshold 750
     set mode monitor fast
     set active-probe echo 10.105.105.5
     set probe frequency 2
    
  3. Once I did this, with oer master still in “mode route observe”, I added the policy group to OER. I named mine unimaginatively (10), but you can call yours anything you like. Once you tie the policy group to OER, the tests begin. You can see the results with the “show ip sla statistics” command.
    Round Trip Time (RTT) for       Index 788
            Latest RTT: 24 milliseconds
    Latest operation start time: 20:01:08.520 EST Sun Nov 2 2008
    Latest operation return code: OK
    Number of successes: 88
    Number of failures: 0
    Operation time to live: Forever
    
    Round Trip Time (RTT) for       Index 793
            Latest RTT: 32 milliseconds
    Latest operation start time: 20:01:08.560 EST Sun Nov 2 2008
    Latest operation return code: OK
    RTT Values:
            Number Of RTT: 50               RTT Min/Avg/Max: 27/32/38 milliseconds
    Latency one-way time:
            Number of Latency one-way Samples: 50
            Source to Destination Latency one way Min/Avg/Max: 18/21/26 milliseconds
            Destination to Source Latency one way Min/Avg/Max: 8/11/14 milliseconds
    Jitter Time:
            Number of SD Jitter Samples: 49
            Number of DS Jitter Samples: 49
            Source to Destination Jitter Min/Avg/Max: 0/2/8 milliseconds
            Destination to Source Jitter Min/Avg/Max: 0/3/5 milliseconds
    Packet Loss Values:
            Loss Source to Destination: 0           Loss Destination to Source: 0
            Out Of Sequence: 0      Tail Drop: 0
            Packet Late Arrival: 0  Packet Skipped: 0
    Voice Score Values:
            Calculated Planning Impairment Factor (ICPIF): 11
    MOS score: 4.06
    Number of successes: 88
    Number of failures: 0
    Operation time to live: Forever
    
  4. These are the results for two of the tests. You will wind up with results for each exit interface for each oer-map entry. In my case, I have two exit links, so for each entry that runs a jitter test to a remote site, I get two jitter results. In case you didn't know, you can graph the jitter test results with your favorite SNMP monitoring system (mine is Cacti). It's nice to have these trends recorded when you start troubleshooting jitter issues in the future. You can also turn these tests on without OER if all you want to do is monitor. By the way, if you want to pinpoint which results belong to which remote site, it won't be obvious unless you do something like run each test over a different port. I didn't think of this until I was finished, so I still have to figure out which exit link and site each test is really for. Oh well.
  5. Also take a look at “show oer master traffic-class”, which tells you exactly what OER is doing with each prefix you are working with.
    wan-7204#show oer master traffic-class
    OER Prefix Statistics:
     Pas - Passive, Act - Active, S - Short term, L - Long term, Dly - Delay (ms),
     P - Percentage below threshold, Jit - Jitter (ms),
     MOS - Mean Opinion Score
     Los - Packet Loss (packets-per-million), Un - Unreachable (flows-per-million),
     E - Egress, I - Ingress, Bw - Bandwidth (kbps), N - Not applicable
     U - unknown, * - uncontrolled, + - control more specific, @ - active probe all
     # - Prefix monitor mode is Special, & - Blackholed Prefix
     % - Force Next-Hop, ^ - Prefix is denied
    
    DstPrefix           Appl_ID Dscp Prot     SrcPort     DstPort SrcPrefix
               Flags             State     Time            CurrBR  CurrI/F Protocol
             PasSDly  PasLDly   PasSUn   PasLUn  PasSLos  PasLLos      EBw      IBw
             ActSDly  ActLDly   ActSUn   ActLUn  ActSJit  ActPMOS
    --------------------------------------------------------------------------------
    10.50.1.0/24              N defa    N           N           N N
                              OOPOLICY     @105      10.105.105.8    Se3/0      BGP
                   N        N        N        N        N        N        N        N
                2678     2525   600000   491525        N        N
    <..snip..>
  6. In this example, my LA data network prefix (10.50.1.0/24) is out of policy (OOPOLICY). Further down the output, I also see that my LA VOIP network was recently moved to the other exit interface, so something must be going on with one of my WAN connections in LA. The best part is that OER manages all of this based on the criteria in the oer-map sections of your master configuration(s). A subsequent run of “show oer master traffic-class” showed the data network had been moved over as well. The VOIP network likely moved more quickly because its policy looks at delay, loss and jitter, while the data network blocks are judged on delay alone.
  7. Once your OER setup is in place, give it a long time to cycle through all of its tests and settle out before you put it in control. OER is deliberate in how it tests and manipulates things. I didn't run into a single issue where routes got messed up, but patience certainly helps. Also, until you put it into route control mode, you won't see any OER routes, and every entry in the State column of the traffic-class output will have a * beside it, indicating OER is not controlling that prefix.
  8. The most exciting part is putting OER in control. To do that, go into oer master configuration mode and enter “mode route control”. I'm sure you already have an ITIL-compliant change request scheduled, right? Anyway, no one will notice, except that you become a little more invisible, because fewer problems get noticed by anyone, especially the CEO on that VOIP call.
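Step 4 mentions trending these results. The usual route is SNMP polling (for example, Cacti against the IP SLA statistics in Cisco's RTTMON MIB), but if you just want to pull a few numbers out of saved “show ip sla statistics” text, a small parser does the job. This is a hypothetical Python sketch keyed to the output format shown above; other IOS releases may word these lines differently, so treat the regular expressions as assumptions.

```python
import re
from typing import Dict


def parse_sla_stats(text: str) -> Dict[int, dict]:
    """Map each IP SLA operation index to a few values pulled from its output block."""
    results: Dict[int, dict] = {}
    # Every operation block begins with "Round Trip Time (RTT) for  Index <n>".
    for block in re.split(r"(?=Round Trip Time \(RTT\) for\s+Index)", text):
        index = re.search(r"Index\s+(\d+)", block)
        if not index:
            continue  # text before the first block
        entry: dict = {}
        latest = re.search(r"Latest RTT:\s+(\d+)", block)
        if latest:
            entry["latest_rtt_ms"] = int(latest.group(1))
        rtt = re.search(r"RTT Min/Avg/Max:\s+(\d+)/(\d+)/(\d+)", block)
        if rtt:  # only the jitter operations report min/avg/max
            entry["rtt_min_avg_max_ms"] = tuple(int(x) for x in rtt.groups())
        icpif = re.search(r"\(ICPIF\):\s+(\d+)", block)
        if icpif:
            entry["icpif"] = int(icpif.group(1))
        mos = re.search(r"MOS score:\s+([\d.]+)", block)
        if mos:
            entry["mos"] = float(mos.group(1))
        results[int(index.group(1))] = entry
    return results


# Trimmed copy of the output shown above.
SAMPLE = """\
Round Trip Time (RTT) for       Index 788
        Latest RTT: 24 milliseconds
Number of successes: 88
Round Trip Time (RTT) for       Index 793
        Latest RTT: 32 milliseconds
RTT Values:
        Number Of RTT: 50               RTT Min/Avg/Max: 27/32/38 milliseconds
Voice Score Values:
        Calculated Planning Impairment Factor (ICPIF): 11
MOS score: 4.06
"""

for idx, values in parse_sla_stats(SAMPLE).items():
    print(idx, values)
```

Writing the per-index values to a CSV on a schedule gives you a crude trend line, and it also makes it easier to work out which index belongs to which site and exit.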

My own results

It's not my intention to sing the praises of one equipment vendor over another, but there are many reasons why Cisco is the market leader. Cisco has built some of the most robust network gear on the planet. I know there is faster gear out there for the mid-to-large business market, but sometimes the real need is rock-solid reliability. That doesn't make other vendors inferior, but there is a lot of intellectual horsepower built into those green boxes.

My own results with OER are extraordinary, in my opinion. Remember the long-running weekend backups? On Saturday I noticed that four of my T1-connected sites were backing up over one WAN cloud, while NY, with its 9 Mbps connectivity, was backing up over the other. OER effectively gave NY a dedicated pipe for its backups and consolidated the smaller sites onto the other cloud. I couldn't have asked for a better result. I have also seen OER move other networks around because of loss and delay, but I haven't really dug into what was going on during those events. Tonight it moved my two West Coast offices over to one MPLS provider and left everything else alone. Was there a problem with the other provider? I'm not sure. I highly recommend having a good NetFlow analysis tool handy. Scrutinizer by Plixer International is a great one; its map of our WAN is a quick and easy reference for where OER is pushing bandwidth to and from.

Other Notes

OER is included in IOS. We didn't pay for an extra license to get it in our router code, and it's a great thing to have. Here's another thing to like about it: since it tests link performance and moves traffic based on the results, if a link goes down or a BGP peer otherwise dies, OER will see that and move things within a few seconds, depending on your probe frequency. Otherwise you could wait as long as 180 seconds for the BGP hold timer to expire and pull those routes from the table.

Oh, and I told it to choose the best exit based on test results, not just any good link. That way, if both links are failing, it will choose the least bad one.
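The difference between "best" and "good" exit selection is easy to see in a toy model. In IOS this choice appears to be the `mode select-exit best` versus `mode select-exit good` setting under oer master; the post doesn't show which command was used, so treat that mapping as an assumption. The `select_exit` function and the scores below are hypothetical, with lower scores (say, delay in ms) being better.

```python
from typing import Dict, Optional


def select_exit(scores: Dict[str, float], threshold: float, mode: str = "best") -> Optional[str]:
    """Choose an exit from per-exit scores where lower is better (e.g. delay in ms).

    "good": the first exit, in the order given, that meets the threshold; None if
            no exit qualifies (in this toy model, leave routing alone).
    "best": the lowest-scoring exit, even when every exit misses the threshold.
    """
    if mode == "good":
        for name, score in scores.items():
            if score <= threshold:
                return name
        return None
    if mode == "best":
        return min(scores, key=scores.__getitem__)
    raise ValueError(f"unknown mode: {mode!r}")


# Both links degraded past a 750 ms delay threshold (hypothetical numbers).
degraded = {"Serial3/0": 900.0, "FastEthernet2/0": 1200.0}
print(select_exit(degraded, 750, "good"))   # None
print(select_exit(degraded, 750, "best"))   # Serial3/0
```

With both links out of policy, "good" has nothing to offer, while "best" still steers traffic to the least bad link, which is the behavior described above.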

I’m glad you read this post. Isn’t OER cool?


Open Source TCP Load Balancer

I find myself with less and less time to geek out lately. Tonight was a rare instance where I could carve out a couple of minutes to browse, and I found something really cool. All the credit for my two finds tonight goes to the folks who develop and document the Crossroads load balancer project. I downloaded their latest stable release and built it on my local Linux box. While reading through the documentation, a really cool configuration example caught my eye.

Crossroads can “stick” requests to whatever back-end servers are there to listen. Someone clever decided to run it locally with all the proxy servers they encounter on a daily basis set up as back ends. That way, when they move their laptop from home to the office, they don't have to reconfigure their web browser's proxy. They just point the browser at the local Crossroads instance and let it “load balance” them to whichever proxy happens to be available. I love it! Go to the Crossroads site and get it compiled and installed. Then search their documentation for “lazy” to find the passage I am referring to.
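The idea behind that "lazy" setup boils down to: try the proxies in order and hand the browser to the first one that answers. This is not Crossroads code or configuration; it's a hypothetical Python sketch of that selection decision only, with made-up proxy addresses. Crossroads itself does the actual TCP forwarding.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def tcp_reachable(addr: Address, timeout: float = 1.0) -> bool:
    """True if a TCP connection to addr opens within timeout seconds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure or timeout
        return False


def first_available(proxies: Iterable[Address],
                    check: Callable[[Address], bool] = tcp_reachable) -> Optional[Address]:
    """Return the first proxy that accepts a connection, or None if none do."""
    for addr in proxies:
        if check(addr):
            return addr
    return None


# Hypothetical proxies: the office proxy first, a home proxy second.
PROXIES = [("proxy.office.example", 8080), ("192.168.1.10", 3128)]
```

Calling `first_available(PROXIES)` returns whichever proxy answers first; the `check` parameter lets you swap in a different health test without touching the selection loop.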

This is a double-purpose post, though. If you went chasing the shiny Crossroads quarter above, you may have already met Charles. Charles is an HTTP/HTTPS proxy written in Java that lets you see everything involved in loading a web site. It's very easy to set up and use, and it's chock full of goodness. Check it out, too.


Wireless Networking Notes

This post is purely note taking for me. If it helps you, I am sorry. :)

Here’s a calculator to help determine how to design a given wireless link. This will calculate a theoretical maximum SOM (System Operating Margin).
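For reference, the arithmetic behind a SOM calculator is a basic link budget: the received signal level (transmit power, minus cable losses, plus both antenna gains, minus free-space path loss) compared with the receiver's sensitivity. Here is a minimal Python sketch using the standard free-space path loss formula with distance in km and frequency in MHz. The radio numbers in the example (15 dBm out, 2 dB of cable loss per side, 14 dBi at both ends, -82 dBm sensitivity) are hypothetical placeholders, not measurements from my link.

```python
import math

MILE_KM = 1.609344


def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss in dB (distance in km, frequency in MHz)."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44


def som_db(tx_power_dbm: float, tx_cable_loss_db: float, tx_gain_dbi: float,
           rx_gain_dbi: float, rx_cable_loss_db: float, rx_sensitivity_dbm: float,
           distance_km: float, freq_mhz: float) -> float:
    """System operating margin: received signal level minus receiver sensitivity."""
    rsl_dbm = (tx_power_dbm - tx_cable_loss_db + tx_gain_dbi
               - fspl_db(distance_km, freq_mhz)
               + rx_gain_dbi - rx_cable_loss_db)
    return rsl_dbm - rx_sensitivity_dbm


# 0.6 miles on channel 6 (2437 MHz) with the hypothetical radio numbers above.
margin = som_db(15, 2, 14, 14, 2, -82, 0.6 * MILE_KM, 2437)
print(f"SOM: {margin:.1f} dB")
```

With those inputs the margin comes out around 21 dB. Real calculators may subtract further losses (connectors, fade margin), so treat this as an upper bound on the same inputs.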

Here is a page with much discussion and many words. Good words. Stuff to use in designing a wireless link.

And a word of sadness. I retired my original antenna pictured on this site. It was made from stove pipe material and worked like a champ in the sunshine. The problem was that it funneled water into the cable no matter what I tried. I sealed it up pretty well with microwave-safe Tupperware, which just trapped condensation and made it rain inside the antenna every morning. It was a great learning experience, but it has been retired in favor of the 14 dBi Hawking Tech patch antenna. This is a nice kit that includes a pole mount and a lightning arrestor. The antenna is incredibly stable. I know that sounds strange, but the signal doesn't dance around nearly as much; the SNR is dead even on my link right now, where it used to move around a lot. I don't know why.

And those old square grid antennas? I am using one of them again on my side. The Yagi I had would get wet inside its shell. Who knew water would be such a terrible opponent in this game? The problem I had with these antennas before is that the beam is only 8 degrees wide. Aiming an 8-degree beam at a site 0.6 miles away takes a lot of precision, which is quite difficult. These antennas do work quite well, though. Mine came from wifi-link.com via eBay.
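A quick way to see why the aiming is so fussy: treat the antenna's 8-degree beamwidth as a simple cone and compute how wide it is at the far end. This is a rough Python sketch under that simplifying assumption; real antenna patterns aren't sharp-edged cones.

```python
import math

MILE_M = 1609.344  # metres per statute mile


def beam_width_m(distance_m: float, beamwidth_deg: float) -> float:
    """Width of a cone-shaped beam with the given full angle at distance_m."""
    return 2 * distance_m * math.tan(math.radians(beamwidth_deg / 2))


width = beam_width_m(0.6 * MILE_M, 8)
print(f"beam width at 0.6 mi: {width:.0f} m ({width * 3.28084:.0f} ft)")
```

At 0.6 miles the beam is roughly 135 m (about 440 ft) across. That sounds generous, but it leaves only about 4 degrees of pointing error in either direction at the dish, and the other end has the same problem.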


Apple AirPort Express


If I may, allow me to tout the Apple AirPort Express. This device is probably one of my favorite guilty pleasures. Guilty? Yeah, a little. It’s a $99 extremely frivolous purchase, or at least that’s the way it started out.

It has a headphone jack for connecting speakers or hooking it up to your home theatre system, and, just like on the MacBook Pro, the same little jack doubles as an optical digital output. Why does it have that? Because you can play your iTunes music library to it over the network.

It has a USB port because it can act as a print server, shared best via Bonjour, and it does this very well.

It has an Ethernet port as well as wireless. It can stand alone, be a full-fledged access point, extend a wireless network or do just about anything else you want... put it in your DMZ and make an unencrypted guest wireless network at your house.

I can’t think of a single reason not to buy one of these little guys. The question is not whether or not to buy one, but whether or not to start collecting them. A monthly Airport Express budget? Maybe. I jest…


802.11b Waveguide Antenna

I'm working on a wireless link between our house and my sister's house across the pasture down the hill from us. I measured the distance at about 0.6 miles and decided to start with some cheap experimentation. This is the first antenna I have ever built, and it does a pretty good job of getting the signal where I need it. It's too ghetto for me, though, and I know Ellyn would hate it if she happened upon it out in the yard. I have ordered a set of grid parabolic antennas, which should make the connection far more reliable. These pictures are funny, though. The antenna is made from 4″ dryer duct, a 4″ to 6″ adapter and a panel-mount N-type female connector. Literally everything else is scrounged. I did buy a weatherproof box to house the access point, which runs on Power over Ethernet, so the only thing I had to run to the entire rig was a single Cat 5 Ethernet cable. Check out this site; it's where I wound up when I was researching this thing. Turns out this has been done before.
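For a can-style waveguide antenna, the usual design arithmetic starts from the tube diameter: it sets the cutoff frequency of the dominant TE11 mode, which sets the guide wavelength, which in turn sets where the probe goes (a quarter guide wavelength from the closed end) and how long the tube should be. Here is a minimal Python sketch of those textbook circular-waveguide formulas, assuming the 4″ duct section is the waveguide and the link runs on channel 6 (2.437 GHz); the 3/4 guide-wavelength tube length is a common rule of thumb rather than a hard limit.

```python
import math

C = 299_792_458.0  # speed of light, m/s


def te11_cutoff_hz(diameter_m: float) -> float:
    """Cutoff of the dominant TE11 mode in a circular waveguide."""
    return 1.8412 * C / (math.pi * diameter_m)


def tm01_cutoff_hz(diameter_m: float) -> float:
    """Cutoff of the next mode up, TM01."""
    return 2.4048 * C / (math.pi * diameter_m)


def guide_wavelength_m(freq_hz: float, diameter_m: float) -> float:
    """Wavelength inside the tube; longer than in free space."""
    fc = te11_cutoff_hz(diameter_m)
    if freq_hz <= fc:
        raise ValueError("frequency is below the TE11 cutoff; the tube won't propagate it")
    return (C / freq_hz) / math.sqrt(1 - (fc / freq_hz) ** 2)


def cantenna_dims(freq_hz: float, diameter_m: float) -> dict:
    lg = guide_wavelength_m(freq_hz, diameter_m)
    return {
        "probe_from_back_m": lg / 4,          # feed position, measured from the closed end
        "probe_length_m": (C / freq_hz) / 4,  # quarter free-space wavelength
        "min_tube_length_m": 0.75 * lg,       # rule-of-thumb minimum length
    }


print(cantenna_dims(2.437e9, 4 * 0.0254))
```

For a 4″ tube this puts the probe about 44 mm from the back, with a probe roughly 31 mm long. The same formulas put the TM01 cutoff near 2.26 GHz, below channel 6, so a 4″ duct can also carry that higher mode; cantenna calculators usually aim for a diameter where the operating frequency falls between the TE11 and TM01 cutoffs.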


IP Multicast Addresses and Resources

If you are not familiar with multicast, these are good resources for stumbling your way through it. I have a couple of PIM environments, so these are here for my reference.

multicast addresses
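Two address facts worth keeping handy: IPv4 multicast groups live in 224.0.0.0/4, and each group maps to an Ethernet MAC address by placing its low-order 23 bits behind the 01:00:5e prefix, so 32 groups share every MAC. A minimal Python sketch using the standard library's `ipaddress` module:

```python
import ipaddress

MULTICAST_MAC_BASE = 0x01005E000000  # 01:00:5e:00:00:00


def is_ipv4_multicast(addr: str) -> bool:
    """True for addresses in 224.0.0.0/4."""
    return ipaddress.IPv4Address(addr).is_multicast


def multicast_mac(group: str) -> str:
    """Ethernet MAC for an IPv4 multicast group: 01:00:5e plus the group's low 23 bits."""
    ip = ipaddress.IPv4Address(group)
    if not ip.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    mac = MULTICAST_MAC_BASE | (int(ip) & 0x7FFFFF)
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))
```

For example, `multicast_mac("224.0.0.5")` (OSPF's AllSPFRouters group) gives `01:00:5e:00:00:05`, and so does 224.128.0.5, because the top bits of the group are discarded in the mapping.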

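PIM and DVMRP build multicast distribution trees and are protocols for delivering multicast datagrams. The main difference between them is that PIM is protocol independent: it doesn't run a unicast routing protocol of its own and instead uses whatever unicast routing table the router already has for its reverse path forwarding (RPF) checks, while DVMRP carries its own RIP-like distance-vector routing. DVMRP builds a separate source tree for every multicast source/group pair, as does PIM dense mode; PIM sparse mode starts with shared trees rooted at a rendezvous point and can switch over to source trees.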
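PIM leans on the unicast routing table for its reverse path forwarding (RPF) check: a multicast packet is accepted only if it arrived on the interface the router would use to reach the packet's source. Here is a toy Python sketch of that check using a longest-prefix match; the route table and interface names are hypothetical, and real routers do this in their forwarding tables rather than in a loop.

```python
import ipaddress
from typing import Dict, Optional


def rpf_interface(source: str, unicast_routes: Dict[str, str]) -> Optional[str]:
    """Longest-prefix match: the interface the unicast table uses to reach source."""
    src = ipaddress.ip_address(source)
    best_net, best_iface = None, None
    for prefix, iface in unicast_routes.items():
        net = ipaddress.ip_network(prefix)
        if src in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_iface = net, iface
    return best_iface


def rpf_check(source: str, arrival_iface: str, unicast_routes: Dict[str, str]) -> bool:
    """Accept a multicast packet only if it came in on the RPF interface for its source."""
    return rpf_interface(source, unicast_routes) == arrival_iface


# Hypothetical unicast table (prefix -> outgoing interface).
ROUTES = {
    "0.0.0.0/0": "Serial3/0",
    "10.50.0.0/16": "FastEthernet2/0",
    "10.50.1.0/24": "Serial3/0",
}
```

A packet from 10.50.2.1 passes the check only if it arrives on FastEthernet2/0; the same source arriving on Serial3/0 fails and would be dropped, which is how RPF keeps multicast from looping.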

Foundry JetCore gear supports hardware forwarding of multicast datagrams. The command is:

ip multicast-perf

1 million foot overview

A much more thorough overview is here: Multicast HOWTO
