Why do we need routing redundancy?

Redundant routes have two purposes:

■ To minimize the effect of link failures

■ To minimize the effect of an internetworking device failure

Redundant routes might also be used for load balancing when all routes are up.

Load Balancing

By default, Cisco IOS load-balances across a maximum of four equal-cost paths for IP. Using the maximum-paths maximum-path router configuration command, you can request that up to 16 equally good routes be kept in the routing table (set maximum-path to 1 to disable load balancing).
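As a minimal sketch (assuming OSPF process 1, a hypothetical choice), the following router configuration raises the number of equal-cost routes kept in the routing table from the default of four to six:

    router ospf 1
     ! Keep up to six equal-cost routes per destination (default is 4);
     ! maximum-paths 1 would disable load balancing entirely
     maximum-paths 6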

When a packet is process-switched, load balancing over equal-cost paths occurs on a per-packet basis. When packets are fast-switched, load balancing over equal-cost paths is on a per-destination basis.
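On newer IOS releases, Cisco Express Forwarding (CEF) is typically the default switching path, and its load-sharing behavior can be chosen per interface. A hedged sketch, assuming a hypothetical interface GigabitEthernet0/1:

    interface GigabitEthernet0/1
     ! CEF defaults to per-destination load sharing; per-packet spreads
     ! traffic more evenly across equal-cost paths but can reorder packets
     ip load-sharing per-packet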

To support load balancing, keep the bandwidth consistent within a layer of the hierarchical model so that all paths have the same metric. Cisco's EIGRP includes the variance feature to load-balance traffic across multiple routes that have different metrics.
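As an illustrative sketch (assuming EIGRP autonomous system 100, a hypothetical value), variance allows routes whose metric is within a multiple of the best metric to be used for load balancing:

    router eigrp 100
     ! Accept feasible-successor routes with a metric up to twice the
     ! best (successor) metric for unequal-cost load balancing
     variance 2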

Possible ways to make the connection redundant include the following:

■ Parallel physical links between switches and routers

■ Backup LAN and WAN links (for example, DDR backup for a leased line)

The following are possible ways to make the network redundant:

■ A full mesh to provide complete redundancy and good performance

■ A partial mesh, which is less expensive and more scalable

The common approach when designing route redundancy is to implement partial redundancy by using a partial mesh instead of a full mesh and backup links to the alternative device. This protects only the most vital parts of the network, such as the links between the layers and concentration devices.

A full-mesh design provides any-to-any connectivity and is ideal for connecting a reasonably small number of devices. However, as the network topology grows, the number of links required to maintain a full mesh grows quadratically. (The number of links in a full mesh is n(n-1)/2, where n is the number of routers; for example, 6 routers require 15 links, whereas 20 routers require 190.) As the number of router peers increases, the bandwidth and CPU resources devoted to processing routing updates and service requests also increase.

A partial-mesh network is similar to the full-mesh network with some of its connections removed. A partial-mesh backbone might be appropriate for a campus network in which traffic predominantly goes into one centralized Server Farm module.

Figure 3-19 illustrates an example of route redundancy in a campus. In this example, the access layer switches are fully meshed with the distribution layer switches. If a link or distribution switch fails, an access layer switch can still communicate with the distribution layer. The multilayer switches select the primary and backup paths between the access and distribution layers based on the link's metric as computed by the routing protocol algorithm in use. The best path is placed in the forwarding table, and, in the case of equal-cost paths, load sharing takes place.
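A minimal sketch of the access-switch side of such a design, assuming hypothetical routed uplinks, addressing, and OSPF process 1 in area 0; keeping both uplinks at the same bandwidth yields identical metrics, so both paths are installed and load sharing occurs:

    interface GigabitEthernet1/0/1
     description Uplink to Distribution-A
     no switchport
     ip address 10.0.1.1 255.255.255.252
    !
    interface GigabitEthernet1/0/2
     description Uplink to Distribution-B
     no switchport
     ip address 10.0.1.5 255.255.255.252
    !
    router ospf 1
     ! Identical interface bandwidth yields identical OSPF costs,
     ! so routes learned over both uplinks are installed as equal-cost paths
     network 10.0.1.0 0.0.0.7 area 0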

Figure 3-19 Campus Infrastructure Redundancy Example (building access and building distribution layers with multilayer switching)

NOTE Chapter 7, "Selecting Routing Protocols for the Network," discusses routing protocols in detail.


When the network goes down, everything stops. For some enterprises, a few minutes of downtime isn't critical. For others, including those whose business depends on a customer-facing website, a few minutes of downtime means lost revenue and, possibly, lost customers. It's critical for these enterprises to design a network that stays up despite a component failure.

Enterprises in which even a brief downtime has a major effect must add redundant equipment and contract for redundant services. But adding network redundancy increases cost and complexity. Each enterprise must consider the tradeoff of downtime costs against the cost of adding devices and services.

Below are seven factors network teams should evaluate when building their network redundancy designs.

Switches and routers

Switches and routers are quite reliable, but they do sometimes fail. Some organizations find it's sufficient to keep an extra switch or router on the shelf so they can quickly swap out a failing unit. Organizations with more critical needs must have redundant equipment up and running in the network.

Network protocols

Network standards organizations have developed network protocols that provide rapid switchover to backup devices when a failure occurs. Adding redundancy at Layer 2 requires teams to connect more than a single switch to each subnet segment.

These additional switches create multiple paths through the network which, left unchecked, form forwarding loops that flood the network with multiple copies of each frame. The Spanning Tree Protocol resolves this by blocking redundant links so that a single loop-free path remains. Unfortunately, traditional Spanning Tree can take almost a minute to converge on a new path after a failure. While this time frame might be acceptable for some networks, others require more rapid recovery.
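On Cisco switches, one common way to shorten that recovery time is to run the rapid variant of Spanning Tree. A hedged sketch, assuming a Catalyst distribution switch and hypothetical VLANs 10 and 20:

    spanning-tree mode rapid-pvst
    ! Make this switch the preferred root bridge for the listed VLANs
    spanning-tree vlan 10,20 root primary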

Newer protocols, including Multi-chassis Link Aggregation (MLAG), Transparent Interconnection of Lots of Links (TRILL) and Shortest Path Bridging (SPB), have been developed to support faster recovery. Network teams that build network redundancy designs and require faster recovery must determine which option works best for their network.


Subnet connections

The next step in adding redundancy is to connect subnets. Again, it's necessary to provide multiple paths between the subnets. Routers connect the subnets within a network and to external destinations. Each subnet must be connected to multiple routers to provide redundancy. Protocols such as Open Shortest Path First (OSPF) and Enhanced Interior Gateway Routing Protocol (EIGRP) define how routers inform each other of the current optimum path to each destination.
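A minimal sketch of attaching a subnet to OSPF (hypothetical process ID 1, area 0, and subnet 192.168.10.0/24); the same subnet would also be configured on a second router to provide the redundant path:

    router ospf 1
     ! Advertise the locally attached subnet so neighbors learn a path to it
     network 192.168.10.0 0.0.0.255 area 0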

Routers determine that a neighboring router is down when no hello packets arrive from that router before its dead or hold timer expires. However, both OSPF and EIGRP take more time to recover than some networks can accept. Hot Standby Router Protocol (HSRP) and Virtual Router Redundancy Protocol (VRRP) reduce the time needed to recover from the failure of a default gateway router.
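As an illustrative sketch (hypothetical VLAN 10 and addressing), HSRP lets two routers share a virtual gateway address so hosts keep forwarding traffic if the active router fails:

    interface Vlan10
     ip address 192.168.10.2 255.255.255.0
     ! Hosts use the virtual address 192.168.10.1 as their default gateway
     standby 1 ip 192.168.10.1
     standby 1 priority 110
     ! Reclaim the active role when this router returns to service
     standby 1 preempt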

A network connection can be disconnected for many reasons -- e.g., someone pulls the wrong wire, knocks a connection loose when adding a new connection or brushes against a cable while moving behind equipment. IEEE 802.3ad link aggregation defines how to use two cables for a single connection. Traffic can be shared across the two cables as long as both are connected, but it continues to flow when one is disconnected.
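On Cisco IOS switches, 802.3ad bundles are negotiated with LACP; a minimal sketch, assuming two hypothetical member ports:

    interface range GigabitEthernet1/0/23 - 24
     ! Negotiate an IEEE 802.3ad (LACP) bundle; traffic is shared across
     ! both links and continues flowing if one is disconnected
     channel-group 1 mode active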

Backup

A disk failure that results in loss of data can cripple an enterprise. A daily backup is enough for some organizations, while others can't accept the loss of one day's data and the time required to recover.

RAID protects against the failure of a single disk, and multiple levels of protection are available. At the simplest level, each item of data is written to two different disks (mirroring). Higher protection levels add further disks and parity information, which lets teams reconstruct the correct data when a disk fails or when two copies that should be identical disagree.

Continuous cloud backup has advantages over even the highest RAID levels: because each data update is sent to the cloud as it occurs, no data is lost. The individual RAID disks sit in a single cabinet, so damage to that cabinet can wipe out every disk at once. It's also still necessary to back up the array periodically, and any data accumulated after the most recent backup is lost if the array is destroyed.

Processors

Processors can fail just like other components, so it's important to consider them in network redundancy designs. In addition to the possibility of failure, processors must be regularly updated with the latest system software release. It's necessary for organizations to have sufficient extra processing resources to guarantee continuous network operation.

Moving all processing and storage to a public cloud can simplify the task of designing in redundancy. Clouds have many processors and storage units, applications can quickly move to another processor in the event of failure, and redundant storage can be configured. If some event shuts down an entire facility, processing can move to a distant location.

Power

Obviously, nothing works without power, which can fail because of a storm, a pole knocked down by a car or any number of other reasons. Battery backup can take over quickly in the event of a failure, but this option can require a large number of backup units for large facilities.

Switching to a generator takes more time, but it can pick up the load if the blackout lasts beyond battery capacity. In some cases, it's also possible to connect to two different supplier circuits to survive wire damage along one of the supplier's routes.

WAN and SD-WAN

WAN connections have always been important, but the growth of cloud computing and the importance of remote users have made WAN reliability increasingly critical. One option for enterprises is to procure connections to two different network service providers. While this adds expense, it protects against failure along the link to the provider and failure within the provider's network.
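One hedged way to express this on a Cisco edge router is a primary default route tracked with IP SLA plus a floating static backup through the second provider (all addresses hypothetical):

    ip sla 1
     icmp-echo 203.0.113.1
    ip sla schedule 1 life forever start-time now
    track 1 ip sla 1 reachability
    ! Primary default route via provider A, withdrawn if the probe fails
    ip route 0.0.0.0 0.0.0.0 203.0.113.1 track 1
    ! Floating static route via provider B (higher administrative distance)
    ip route 0.0.0.0 0.0.0.0 198.51.100.1 250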

Software-defined WAN (SD-WAN) provides an additional way to add network redundancy. MPLS circuits are quite reliable and guarantee a specified level of quality of service (QoS), but they can fail. An SD-WAN controller can switch traffic to the internet in the event of failure. The public internet doesn't provide the same level of reliability or QoS guarantees, but it provides a way to get data to its destination. Another advantage of SD-WAN is it can move less critical traffic to the internet during times of maximum load, rather than driving teams to contract for the maximum level of MPLS bandwidth required during the year.

Adding redundancy increases expense and complexity. Designers shouldn't build in more network redundancy than necessary, but they also can't design in less than required: even a short disruption can make the difference between enterprise success and failure.