How long does BGP take to converge?

Data-plane convergence is different from control-plane convergence within a node. This document defines a methodology for testing data-plane convergence on a single BGP device, using a test topology of three or four nodes that is sufficient to recreate the convergence events used in the various tests of this document.

To maintain reliable connectivity within a domain or across domains, fast recovery from failures is critical. To keep traffic loss to a minimum, many service providers require BGP implementations to converge the entire Internet routing table within sub-seconds at the FIB level.

Furthermore, to compare these numbers across devices, service providers are also looking for ways to standardize the convergence measurement methods. This document offers test methods for simple topologies. These simple tests provide a quick, high-level check of BGP data-plane convergence across multiple implementations from different vendors. Methodologies that test control-plane convergence are out of scope for this document.

Benchmarking Testing

To ensure that the results obtained in tests are repeatable, careful setup of initial conditions and exact steps are required. This document proposes these initial conditions, test steps, and result checking. For the sake of clarity and continuity, it adopts the general template for benchmarking terminology set out in Section 2 of [RFC].

Definitions are organized in alphabetical order and grouped into sections for ease of reference. The test setups have three or four nodes, covering a basic test setup and a setup for the eBGP multihop test scenario. For eBGP, the route is installed on the DUT with the remote interface address as the next hop, except in the multihop test case, as specified in that test.

The peers are established before the tests begin. Additional peers can be added based on the testing requirements. The number of peers enabled during the testing should be well documented in the report matrix.
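As a minimal illustration (not part of the methodology text), a DUT-side eBGP session toward an emulator or Helper Node might be configured in IOS-style syntax as sketched below; the addresses, AS numbers, and interface names are placeholders.

  ! Hypothetical DUT configuration: one eBGP peer toward the tester / Helper Node
  interface GigabitEthernet0/0
   ip address 192.0.2.1 255.255.255.252
  !
  router bgp 64500
   bgp log-neighbor-changes
   ! emulator / Helper Node peer
   neighbor 192.0.2.2 remote-as 64511
   neighbor 192.0.2.2 description Tester-1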

"Number of Routes per Peer" is defined as the number of routes advertised or learned by the DUT per session, or through a neighbor relationship with an emulator or Helper Node.

Each test run must identify the route stream in terms of route packing, route mixture, and number of routes. This route stream must be well documented in the reporting format. [RFC] defines these terms. If multiple peers are used, it is important to precisely document the timing sequence between the peers sending routes, as defined in [RFC]. Additional runs may be done with the policy that was set up before the tests began.

Exact policy settings MUST be documented as part of the test. There are configured parameters and timers that may impact the measured BGP convergence times. The benchmark metrics MAY be measured at any fixed values for these configured parameters. All optional BGP settings MUST be kept consistent across iterations of any specific test. Examples of the configured parameters that may impact measured BGP convergence time include, but are not limited to:

1. Interface failure detection timer
2. BGP keepalive timer
3. BGP holdtime
4. BGP update delay timer
5. ConnectRetry timer
6. TCP segment size
7. Route flap damping parameters
8. Maximum TCP window size
9. MTU

The basic-test settings for these parameters should be:

1. Interface failure detection timer: 0 ms
2. BGP keepalive timer: 1 min
3. BGP holdtime: 3 min
4. BGP update delay timer: 0 s
5. ConnectRetry timer: 1 s
6. TCP segment size (bytes)
7. Route flap damping parameters: off
8. TCP Authentication Option: off
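As a rough illustration only (the methodology does not prescribe vendor syntax, and not every parameter has a one-to-one CLI equivalent), some of the basic-test settings above might map to IOS-style configuration as follows; the AS number is a placeholder and supported value ranges vary by release.

  router bgp 64500
   ! keepalive 1 min, holdtime 3 min
   timers bgp 60 180
   ! BGP update delay 0 s (check the supported range on your release)
   bgp update-delay 0
   ! route flap damping off (the default)
   no bgp dampening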

Interface Types

The type of media dictates which test cases may be executed; each interface type has a unique mechanism for detecting link failures, and the speed at which that mechanism operates will influence the measurement results. All interfaces MUST be of the same media and throughput for all iterations of each test case.

Measurement Accuracy

Since observed packet loss is used to measure the route convergence time, the time between two successive packets offered to each individual route is the highest possible accuracy of any packet-loss-based measurement.

When packet jitter is much less than the convergence time, it is a negligible source of error and hence is treated as within tolerance. This specification defines an exterior measurement on the input media, such as Ethernet.

Measurement Statistics

The benchmark measurements may vary from trial to trial due to the statistical nature of timer expirations, CPU scheduling, etc.

It is recommended to repeat each test multiple times. Evaluation of the test data must be done with an understanding of generally accepted testing practices regarding repeatability, variance, and the statistical significance of a small number of trials. For any repeated tests that are averaged to remove variance, all parameters MUST remain the same. The processing of the authentication hash, particularly in devices with a large number of BGP peers and a large amount of update traffic, can have an impact on the control plane of the device.

If authentication is enabled, it MUST be documented correctly in the reporting format.

Convergence Events

Convergence events, or triggers, are defined as abnormal occurrences in the network that initiate route flapping and hence force the reconvergence of a steady-state network. In a real network, a series of convergence events may cause the convergence latency that operators desire to test. These convergence events must be defined in terms of the sequences defined in [RFC]. This document begins all tests from an initial router setup.

Additional documents will define BGP data-plane convergence based on peer initialization. The convergence events may or may not be tied to the actual failure. For cases where redundancy cannot be disabled, the results are no longer comparable, and the level of impact on the measurements is out of scope of this document.

Test Cases

All tests defined under this section assume the following: a. BGP peers are in the Established state.

Consider a simple example: the current best path is through PE2, and PE2 goes down due to a power failure or hardware failure. Without any faster failure detection, the local BGP speaker has to rely on the protocol timers, which are normally 60 seconds for keepalives and 180 seconds for the hold time.

This means that it can take up to three minutes for BGP to detect that the peer is gone. Detecting that a next hop is no longer valid via the BGP Scanner process takes up to 60 seconds, depending on when the process last ran; the problem is that the BGP Scanner process is not event driven. BGP Next-Hop Tracking (NHT), by contrast, reacts to events such as the next hop becoming reachable or unreachable, or the metric to the next hop changing. A change of metric for the next hop is considered a non-critical event and a change of the next hop is considered a critical event, although only IOS-XR makes this separation; IOS reacts the same way to both events.
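As an illustration (my sketch, not from the original text; the AS number is a placeholder and defaults vary by release), the scanner interval and NHT behavior are controlled on IOS roughly like this:

  router bgp 64500
   ! BGP Scanner interval, 60 seconds by default
   bgp scan-time 60
   ! next-hop tracking, enabled by default
   bgp nexthop trigger enable
   ! wait 5 s (the default) before reacting to next-hop events
   bgp nexthop trigger delay 5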

If the next hop goes away for a route, then the route is no longer valid. The NHT trigger delay is meant to give the IGP a chance to flood information, aggregate events, and converge. BGP normally selects only a single best path, but if a backup path has been pre-computed and installed, then once the IGP reacts and the primary path becomes invalid due to the unreachable next hop, BGP immediately reacts and switches to the backup path.

This means that convergence can be very fast: as fast as your IGP, combined with how fast you can detect the failure, for example through loss of signal (LoS) or Bidirectional Forwarding Detection (BFD). There are other optimizations to NHT, such as only tracking and installing next hops of a certain prefix length, which are out of scope for this blog. This brings us to another important caveat of NHT: it relies on the IGP to report that the next hop is gone, and if the next hop is hidden behind a summary or default route, the IGP will not report it. In that case we have to fall back to BGP converging by itself, which can be very slow. One alternative to handle this could be multi-hop BFD.

If your IGP is properly tuned for fast convergence, it may be safe to tune down the NHT delay a bit, although the defaults are often the values recommended by Cisco.

eBGP sessions are normally established between directly adjacent routers.

A keepalive is sent every 60 seconds and the hold time is 180 seconds. If the directly connected interface goes down, however, the eBGP session is torn down immediately instead of waiting for the hold time to expire; the same holds true if the next hop goes away for an eBGP multihop peering. This is often referred to as bgp fast-external-fallover. This behavior is not desirable for iBGP, because iBGP sessions are generally established between loopbacks and there are often several IGP-calculated paths to reach that loopback. Therefore, this feature is not enabled for iBGP by default.
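For reference (my sketch, not from the original text; the AS number and interface are placeholders), on IOS this behavior is enabled by default and can be controlled globally and, on some releases, per interface:

  router bgp 64500
   ! drop eBGP sessions immediately on link down (the default)
   bgp fast-external-fallover
  !
  interface GigabitEthernet0/1
   ! optional per-interface override
   ip bgp fast-external-fallover deny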

It is desirable to have loss of signal (LoS) or some other form of link-down event on both sides of the link if the link or a router fails. The second-best option is to use BFD to detect that a peer has gone away (a sketch follows below). Convergence is much slower if we have to wait for BGP itself to send updates and withdrawals. For this reason it is important to have diversity of paths, but we will discuss this point later. To achieve fast convergence, we also need a tuned IGP.
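As a minimal sketch (addresses, timers, and AS numbers are placeholders), registering a directly connected BGP neighbor with BFD on IOS looks roughly like this:

  interface GigabitEthernet0/1
   ! 50 ms hellos, roughly 150 ms detection time
   bfd interval 50 min_rx 50 multiplier 3
  !
  router bgp 64500
   neighbor 192.0.2.2 remote-as 64511
   ! tear the session down as soon as BFD declares the peer down
   neighbor 192.0.2.2 fall-over bfd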

This post will not go into detail on how to tune your IGP, but the key point is to react quickly: the best way is event-driven detection such as LoS, and the second-best option is BFD. It is important, though, to throttle the flooding and SPF runs if there is a lot of churn. A feature such as IP event dampening could also be used to further reduce churn (see the sketch below).
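As a rough illustration only (OSPF is assumed purely as an example, and the timer values are placeholders rather than recommendations), SPF throttling and interface event dampening are configured on IOS along these lines:

  router ospf 1
   ! initial / hold / maximum SPF wait, in milliseconds
   timers throttle spf 10 100 5000
  !
  interface GigabitEthernet0/1
   ! IP event dampening with default parameters
   dampening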

BGP will normally set the next hop to itself on eBGP peerings, unless third-party next-hop modification is used, which is most common in IX scenarios. The idea behind NHT is to make the BGP process register the next-hop values with the RIB "watcher" process and request a "call-back" every time information about the prefix corresponding to a next hop changes. The reachability change is more important and is reported faster than a metric change. Overall, BGP delays reacting to a reported event for the duration of the bgp nexthop trigger delay interval, which is 5 seconds by default.

This allows multiple consecutive events received from the IGP to be processed together, effectively implementing event aggregation. The delay is helpful in various "fate sharing" scenarios where a facility failure affects multiple links in the network, and BGP needs to ensure that all IGP nodes have reported the failure and the IGP has fully converged. Normally, you should set the NHT delay slightly above the time it takes the IGP to fully converge after a change in the network.

In a fast-tuned IGP network, you can set this delay as low as 0 seconds, so that every IGP event is reported immediately, though this requires careful underlying IGP tuning to avoid oscillations. See [6] for more information on tuning the IGP protocol settings; in short, you need to tune the SPF delay value in the IGP to be conservative enough to capture all changes that could be caused by a failure in the network.
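For illustration (my example values, not a recommendation): if the IGP reliably converges in, say, under two seconds, the NHT delay could be aligned with it on IOS roughly as follows.

  router bgp 64500
   ! slightly above the measured IGP convergence time
   bgp nexthop trigger delay 3
   ! or, with a very aggressively tuned IGP, react to every event immediately:
   ! bgp nexthop trigger delay 0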

When a next hop does change as a result of an IGP event, BGP must recompute and reinstall every prefix that used that next hop, which can take a significant amount of time depending on the number of prefixes associated with it.

For example, if an AS has two connections to the Internet and receives full BGP tables over both connections, then a single exit failure will force a full-table walk over hundreds of thousands of prefixes.

The last, less visible contributor to faster convergence is the hierarchical FIB. Look at the figure below: it shows how the FIB can be organized as either "flat" or "hierarchical". In the "flat" case, BGP prefixes have their forwarding information directly associated with each prefix entry. Any change to a BGP next hop may then require updating a large number of prefixes sharing the same next hop, which is a time-consuming process. Even if the next-hop value remains the same and only the output interface changes, the FIB update process still needs to walk over all BGP prefixes and reprogram their forwarding information. In the "hierarchical" case, BGP prefixes instead point to a shared next-hop entry, so a change to the next hop or its output interface only requires updating that single shared entry. The use of the hierarchical FIB is automatic and does not require any special commands.

All major networking equipment vendors support this feature.

Summarization hides detailed information and may conceal changes occurring in the network: if the failed next hop is covered by a summary, the BGP process will not be notified of the IGP event and will have to detect the failure and re-converge using BGP-only mechanics.

Look at the figure below: because of summarization, R1 will not be notified of R2's failure, and the BGP process at R1 will have to wait until the session times out. Aside from avoiding summarization for the prefixes used for iBGP peering, an alternative solution could be multi-hop BFD [15] (sketched below). Additionally, there is some work in progress to allow the separation of routing and reachability information natively in IGP protocols.
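As a hedged sketch only (IOS-XE-style syntax; the template name, addresses, timers, and AS number are placeholders, and support varies by platform and release), multi-hop BFD for a loopback-to-loopback iBGP session might look like this:

  bfd-template multi-hop BFD-MH
   interval min-tx 100 min-rx 100 multiplier 3
  !
  ! remote loopback / local loopback mapped to the template
  bfd map ipv4 10.0.0.2/32 10.0.0.1/32 BFD-MH
  !
  router bgp 64500
   neighbor 10.0.0.2 remote-as 64500
   neighbor 10.0.0.2 update-source Loopback0
   neighbor 10.0.0.2 fall-over bfd multi-hop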

This fast convergence process effectively covers core link and node failures, as well as edge link and node failures, provided that all of these can be detected by the IGP. You may want to look at [1] for detailed convergence breakdowns.

Note that edge link failures require special handling. If your edge speaker changes the next-hop value to self for the routes received from another autonomous system, then the IGP will only be able to detect failures of paths going to the BGP speaker's own IP address, not failures of the edge link itself.

The best approach in this case is to leave the eBGP next-hop IP address unmodified and advertise the edge link into the IGP using the passive-interface feature or redistribution. This allows the IGP to respond to a link-down condition by quickly propagating the new LSA and synchronously triggering re-convergence on all BGP speakers in the system by informing them of the failed next hop.
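A minimal sketch of this idea (OSPF assumed purely as an example; the addresses and interface names are placeholders): the edge subnet is advertised into the IGP while no adjacency is formed on the external link.

  router ospf 1
   ! edge link toward the eBGP peer: advertised, but no adjacency formed
   passive-interface GigabitEthernet0/1
   network 203.0.113.0 0.0.0.3 area 0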

Previously, we discussed how having multiple equal-cost BGP paths could be used for redundancy and fast failover at the forwarding engine level, without involving any BGP best-path selection. What if the paths are unequal - is it possible to use them for backup? In fact, since BGP treats the local AS as a single hop, all BGP speakers select the same path consistently, and changing from one path to another synchronously among all speakers should not create any permanent routing loops.

Thus, even in scenarios where equal-cost BGP multipath is not possible, the secondary paths may still be used for fast failover, provided that a signaling mechanism exists to detect the primary path failure. This switchover does not require any BGP table walks or best-path re-election; it is simply a matter of changing the forwarding information, provided that the hierarchical FIB is in use.

BGP PIC can be used any time there are multiple paths to the destination prefix, such as on R1 in the example below, where the target prefix is reachable via multiple paths. We have already stated the problem with multiple paths: only one best path is advertised by a BGP speaker, and the receiving speaker will only keep one path for a given prefix from a given peer.

If a BGP speaker receives multiple paths for the same prefix within the same session, it simply uses the newest advertisement, implicitly withdrawing the previous one. The BGP "Add Paths" capability addresses this: a special 4-byte path identifier is added to NLRIs to differentiate multiple paths for the same prefix sent across a peering session.

Notice that BGP still considers all paths comparable from the viewpoint of the best-path selection process: all paths are stored in the BGP RIB and only one is selected as the best path. The additional NLRI identifier is only used when prefixes are sent across a peering session, to prevent implicit withdrawals by the receiving peer. These identifiers are generated locally and independently for every peering session that supports the capability.

Alternatively, if backup paths are required but the "Add Paths" feature is not implemented, one of your options could be using a full mesh of BGP speakers, such as in the figure below. Another is to ensure that different exit points are selected by the RRs, which may require IGP metric manipulation or other techniques, such as different RD values for multi-homed site attachment points. First, look at the topology diagram: R9 is advertising a prefix, and R5 and R6 receive this prefix via the RRs.

We then tune the scenario by disabling the connections between R1 and R4 and between R2 and R3, so that R3 has a better cost to exit via R1 and R4 has a better cost to exit via R2. This will make the RRs elect different best paths and propagate them to their clients. The following is the key piece of configuration for enabling fast backup-path failover, to be applied to every router in the AS. The BGP nexthop trigger delay is set to 0 seconds, thus fully relying on the IGP to aggregate the underlying events. In any production environment, you should NOT use these values; pick your own, matching your IGP scale and convergence rate.

The command bgp additional-paths install, when executed in a non-BGP-multipath environment, allows installing backup paths in addition to the best one elected by BGP. This, of course, requires that the additional paths have been advertised by the BGP route reflectors. At the time of writing, Cisco IOS does not support the "Add Paths" capability, so you need to make sure the BGP RRs elect different best paths in order for the edge routers to be able to use additional paths.
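A sketch of what this key configuration might look like on IOS, based on the commands named in the text (the AS number and address family are placeholders):

  router bgp 64500
   ! rely entirely on the (well-tuned) IGP to aggregate events
   bgp nexthop trigger delay 0
   address-family ipv4
    ! install a backup path alongside the best path
    bgp additional-paths install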


