Re: BGP Path Selection weirdness regarding next hops from Brian Dennis on 2012-12-01 (Ccielab archives 12/2012)

From: Brian Dennis <bdennis_at_ine.com>
Date: Sat, 1 Dec 2012 01:13:42 -0600

John,
Let me see if I can sum this up:

Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising the
same prefix via iBGP to Peer 3. From Peer 1 the NH (Next-Hop) metric is
10 to reach the prefix and from Peer 2 the NH metric is 20. The BGP
decision process is selecting Peer 1 over Peer 2 due to the lower NH
metric. You also have a default route learned via iBGP from another peer
and let's say it has a NH metric of 5.

Now Peer 2 goes down and within about 35 to 45 seconds the IGP converges
and the NH is removed from Peer 3's RIB. Okay fine, so what, you might
think, as you're not using Peer 2 to reach that prefix anyway. The BGP
peering session is still up because the default BGP keepalive/hold timers
are 60/180 seconds in IOS. So time-wise we are about 60 seconds after the
failure of Peer 2.

After the default delay timer of 5 seconds the NHT (Next-Hop-Tracking) in
BGP on Peer 3 kicks in (or BGP scan process in older IOS versions) to look
and see if the NH to Peer 2 is still reachable via another route in the
RIB. The only route available to reach the NH advertised by Peer 2 is the
default route. The default route's NH metric is less than the NH metric
to Peer 1 which is the current best path. This means that BGP updates the
NH metric to Peer 2's NH with the IGP metric to reach the default route
(5). Now the prefix advertised by Peer 2 becomes the best path since it
now has a lower NH metric. At this point we are about 66 seconds after
the failure of Peer 2 but we still have 114 more seconds until BGP detects
that Peer 2 is actually down so Peer 2's prefix is still useable by BGP.
The real problem, as we know, is that Peer 2 is actually down but Peer 3
will not detect it until the BGP hold time finally expires (180 seconds).
Once Peer 2 is declared down, Peer 3 will switch back to Peer 1's prefix.
The result is that traffic is at best sub-optimally routed, or at worst
"black holed", while this is going on, because Peer 3 isn't using the
"best" path (Peer 1).
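
To make the metric comparison concrete, here is a minimal Python sketch of
the behavior described above. It is only an illustration, not how IOS
implements it: the `resolve` and `best_path` helpers and the dictionary
layout are made up for this example, the loopback addresses are borrowed
from the lab output further down, and every best-path step before the IGP
metric to the NH is assumed to tie.

```python
import ipaddress

def resolve(rib, next_hop):
    """Longest-prefix match of next_hop against the RIB (None if no match)."""
    addr = ipaddress.ip_address(next_hop)
    matches = [r for r in rib if addr in ipaddress.ip_network(r["prefix"])]
    return max(
        matches,
        key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen,
        default=None,
    )

def best_path(rib, paths):
    """Pick the path with the lowest metric to its NH (earlier steps assumed tied)."""
    usable = []
    for path in paths:
        route = resolve(rib, path["next_hop"])
        if route is not None:  # NH is considered reachable
            usable.append((route["metric"], path["peer"]))
    return min(usable)[1] if usable else None

# Two iBGP paths for the same prefix (addresses are illustrative).
paths = [
    {"peer": "Peer 1", "next_hop": "10.3.3.3"},
    {"peer": "Peer 2", "next_hop": "10.5.5.5"},
]

# Peer 3's RIB before the failure: both loopbacks known via the IGP,
# plus a default route learned via iBGP with a NH metric of 5.
rib_before = [
    {"prefix": "10.3.3.3/32", "metric": 10, "protocol": "ospf"},
    {"prefix": "10.5.5.5/32", "metric": 20, "protocol": "ospf"},
    {"prefix": "0.0.0.0/0", "metric": 5, "protocol": "bgp"},
]

# After the IGP converges, Peer 2's /32 is gone but the default remains.
rib_after = [r for r in rib_before if r["prefix"] != "10.5.5.5/32"]

best_before = best_path(rib_before, paths)
best_after = best_path(rib_after, paths)
print(best_before, best_after)
```

Before the failure the sketch picks Peer 1 (metric 10 vs. 20). After the
/32 disappears, Peer 2's NH resolves through the default route, inherits
its metric of 5, and Peer 2 wins even though it is actually down.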

This is actually a common problem with a simple solution. Using BGP
Selective NHT, just don't allow BGP to resolve the next hop via any route
with a mask of /31 or shorter (or whatever length you want) and/or via
another BGP route.

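router bgp 10
 bgp nexthop route-map RM_NH_FILTER
!
ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
!
! Deny any resolving route with a mask of /31 or shorter
route-map RM_NH_FILTER deny 10
 match ip address prefix-list PL_NH_FILTER
!
! Deny any resolving route that was installed by BGP
route-map RM_NH_FILTER deny 20
 match source-protocol bgp 10
!
! Anything else (for example an IGP /32) may resolve the next hop
route-map RM_NH_FILTER permit 30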

With this configuration Peer 3 will show the NH to Peer 2's prefix as
"inaccessible" as it cannot use any route with a mask shorter than a /32
and/or a route that's installed in the RIB via BGP. This can also be used
to not allow BGP to use a discard route for the NH, which is another
common problem.
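
The route-map logic can be sketched in a few lines of Python. This is a
simplified model under assumptions, not the IOS implementation:
`nh_route_allowed` and `resolve_next_hop` are hypothetical helper names,
the RIB is a plain list, the 10.0.0.0/8 discard route is invented for
illustration, and I'm assuming only the longest-match route covering the
NH is run through the route-map.

```python
import ipaddress

def nh_route_allowed(route):
    """Model of RM_NH_FILTER: may this RIB route resolve a BGP next hop?"""
    prefixlen = ipaddress.ip_network(route["prefix"]).prefixlen
    if prefixlen <= 31:             # deny 10: prefix-list 0.0.0.0/0 le 31
        return False
    if route["protocol"] == "bgp":  # deny 20: match source-protocol bgp 10
        return False
    return True                     # permit 30: everything else

def resolve_next_hop(rib, next_hop):
    """Longest-prefix match, then apply the filter; None means inaccessible."""
    addr = ipaddress.ip_address(next_hop)
    matches = [r for r in rib if addr in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    best = max(matches, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)
    return best if nh_route_allowed(best) else None

# Peer 3's RIB after Peer 2's loopback /32 has been withdrawn by the IGP.
rib = [
    {"prefix": "10.3.3.3/32", "metric": 10, "protocol": "ospf"},
    {"prefix": "0.0.0.0/0", "metric": 5, "protocol": "bgp"},
    # Hypothetical discard (Null0) route, also rejected because of its length.
    {"prefix": "10.0.0.0/8", "metric": 0, "protocol": "static"},
]

peer1_route = resolve_next_hop(rib, "10.3.3.3")  # resolves via the OSPF /32
peer2_route = resolve_next_hop(rib, "10.5.5.5")  # only short routes match
print(peer1_route, peer2_route)
```

In this model Peer 1's NH still resolves through its OSPF /32, while Peer
2's NH comes back as `None`, which corresponds to the "inaccessible" state
in the show output.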

I'll just do a quick blog post on this tomorrow. I labbed it all up
already to verify what I talked about above, but the post will have to wait
until tomorrow. Wife is telling me it's 11pm and time to get off the
computer as it's Friday night ;-) There are a few minor caveats to this
that I'll mention this weekend in the blog post. Also, I pasted my tests
below.

-- 
Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
bdennis_at_ine.com
INE, Inc.
http://www.INE.com
******************************************************
  R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
******************************************************
Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
BGP routing table entry for 50.0.0.0/8, version 3
Paths: (2 available, best #2, table Default-IP-Routing-Table)
  Advertised to update-groups:
        2
  200, (Received from a RR-client)
    10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
  200, (Received from a RR-client)
    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 2
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Advertised to update-groups:
        2
  Local, (Received from a RR-client)
    10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
Rack1R6#show ip route 10.3.3.3
Routing entry for 10.3.3.3/32
  Known via "ospf 1", distance 110, metric 10, type intra area
  Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
  Routing Descriptor Blocks:
  * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
      Route metric is 10, traffic share count is 1
Rack1R6#show ip route 10.5.5.5
Routing entry for 10.5.5.5/32
  Known via "ospf 1", distance 110, metric 20, type intra area
  Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
  Routing Descriptor Blocks:
  * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
      Route metric is 20, traffic share count is 1
Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 2
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Advertised to update-groups:
        2
  Local, (Received from a RR-client)
    10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
Rack1R6#
******************************************************
Now let's shut down the interface R5 uses to reach R6.
******************************************************
Rack1R5#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
Rack1R5(config)#int fa0/0.456
Rack1R5(config-subif)#shut
Rack1R5(config-subif)#^Z
Rack1R5#
%OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL
to DOWN, Neighbor Down: Interface down or detached
%OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from
FULL to DOWN, Neighbor Down: Interface down or detached
%SYS-5-CONFIG_I: Configured from console by console
Rack1R5#
******************************************************
          Now wait for OSPF to converge
******************************************************
Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
BGP routing table entry for 50.0.0.0/8, version 4
Paths: (2 available, best #1, table Default-IP-Routing-Table)
Flag: 0x900
  Advertised to update-groups:
        2
  200, (Received from a RR-client)
    10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
  200, (Received from a RR-client)
    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
      Origin incomplete, metric 0, localpref 100, valid, internal
Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
BGP routing table entry for 0.0.0.0/0, version 2
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Advertised to update-groups:
        2
  Local, (Received from a RR-client)
    10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
Rack1R6#
******************************************************
           Now roughly 180 seconds later
******************************************************
Rack1R6#
%BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time
expired) 0 bytes 
Rack1R6#
%BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
BGP routing table entry for 50.0.0.0/8, version 5
Paths: (1 available, best #1, table Default-IP-Routing-Table)
Flag: 0x900
  Advertised to update-groups:
        2
  200, (Received from a RR-client)
    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
Rack1R6#
******************************************************
 But with the Selective NHT config this is the result
******************************************************
Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
BGP routing table entry for 50.0.0.0/8, version 5
Paths: (2 available, best #2, table Default-IP-Routing-Table)
  Advertised to update-groups:
        2
  200, (Received from a RR-client)
    10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
  200, (Received from a RR-client)
    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
Rack1R6#
On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>I posted this question to the Cisco NSP list and I've also talked to a
>couple of guys from Cisco Advanced Services and I'm still stumped about
>something. I'll try my best to phrase it in a way that makes sense.
>
>Router A is learning about a prefix from two route reflector clients. In
>both cases, the next hop for the prefix is the loopback address of the
>advertising routers. Their loopback addresses are being advertised into
>OSPF.
>
>So, from the perspective of Router A, its BGP table for this prefix has
>two paths:
>
>1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner due
>   to lower IGP metric
>2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
>
>Now for the weirdness to begin. A network event occurs that causes the
>loopback address of Router C to go away. This shouldn't affect Router A
>because it is already selecting the shortest path to the network via
>Router B (4.4.4.4).
>
>However, Router A is also learning a default via BGP. That means that even
>though 5.5.5.5 (loopback of Router C) disappeared and is unreachable, the
>router is doing a recursive lookup and keeps the path in the BGP table;
>5.5.5.5 is still reachable, it thinks, by using the default route.
>
>The weird thing is that this causes Router A to start using the wrong
>path! It seems to be preferring a path with a next hop learned via BGP to
>a path with a next hop learned via OSPF. Why would it do this? I see no
>documentation that would explain why a BGP-learned next hop is preferred
>over an IGP-learned next hop.
>
>Is the router still comparing IGP metrics even though the "wrong" path now
>has no IGP metric?
>
>It's not changing due to router ID, cluster length, or neighbor IP
>address. I checked. So, why is it switching?
>
>As soon as the BGP session from Router A to Router C times out, the
>extraneous path gets removed from the BGP table and the router goes back
>to using the correct path it should have been using all along.
>
>So, is a BGP-learned next hop preferred over an IGP-learned next hop? If
>so, why? If not, any idea why my router switches paths? I've turned on BGP
>debugging and IP routing debugging and haven't found a suitable
>explanation for the switch.
>
>John
>
>
>Blogs and organic groups at http://www.ccie.net
>
>_______________________________________________________________________
>Subscription information may be found at:
>http://www.groupstudy.com/list/CCIELab.html

Received on Sat Dec 01 2012 - 01:13:42 ART
