Cisco OTV (Part II)
June 28, 2011
This is a follow-on post from OTV (Part I).
Edge Devices do take part in STP by sending and receiving BPDUs on their internal interface as would any other layer2 switch.
But an OTV Edge Device will not originate or forward BPDUs on the overlay network. OTV thus limits the STP domain to the boundaries of each site. This means an STP problem in the control plane of a given site will not affect the remote data centers. This is one of the biggest benefits of OTV in comparison to other DCI technologies. It is made possible because MAC reachability information is advertised and learned via the control plane protocol instead of through the typical MAC flooding behavior.
With STP separated between sites, OTV makes it possible for different sites to use different STP technologies. I.e., one site can run MSTP while another runs RSTP. In the real world this is a nifty enhancement.
OTV allows multiple Edge Devices to co-exist in the same site for load-sharing purposes. (With NX-OS 5.1 that is limited to 2 OTV Edge Devices per site.)
With multiple OTV Edge Devices per site and no STP across the overlay to shut down redundant links, the possibility of end-to-end loops between sites is created. The absence of STP between sites holds valuable benefits, but a loop prevention mechanism is still required, so an alternative method was used. The folks who wrote OTV decided on electing a master device responsible for traffic forwarding (similar to some non-STP protocols).
With OTV this master elected device is called an AED (Authoritative Edge Device).
An AED is an Edge Device that is responsible for forwarding the extended VLAN frames in and out of a site, from and to the overlay network. It is very important to understand this before carrying on. Only the AED will forward traffic out of the site onto the overlay. With optimal traffic replication in a transport network, a site's broadcast and multicast traffic will reach every Edge Device in the remote site, but only the AED in the remote site will forward traffic from the overlay into the remote site. The AED thus ensures that traffic crossing the site-overlay boundary does not get duplicated or create loops when a site is multi-homed.
Coming back to the original point, an AED enables the load-sharing of traffic when multiple Edge Devices are present in a site. An AED is elected dynamically or statically, currently on a per-VLAN basis. There is talk of a per-flow option, but that is still future talk. Load-sharing is thus achieved by one AED being authoritative for some extended VLANs and another AED for the other extended VLANs.
The AED is elected using a deterministic algorithm, by assigning ordinal numbers (indicating numerical order in a certain position) to each Edge Device in the local site based on the OTV System-ID (by default derived from the system MAC address). With two Edge Devices in a site, the result is a split of odd and even VLANs between the two devices. More specifically, the Edge Device with the lower System-ID becomes authoritative for all the even extended VLANs, whereas the Edge Device with the higher System-ID becomes authoritative for the odd extended VLANs. Since traffic may only leave via the AED for a given extended VLAN, it is strongly recommended to set up a vPC peer-link between the Edge Devices for optimal redirection. The vPC peer-link will be leveraged in the initial release to steer the traffic to the AED device.
From a redundancy perspective, if one AED fails, the remaining AED will realize the failure and assume authority for all extended VLANs.
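The per-VLAN AED role and the overlay adjacencies can be checked from the Edge Devices themselves. A hedged sketch of the relevant show commands (exact output format varies by NX-OS release):

```
! Verify which Edge Devices were discovered across the overlay and on the site VLAN
show otv adjacency

! List each extended VLAN and whether this Edge Device is the AED for it
show otv vlan
```

Checking `show otv vlan` on both Edge Devices after a failover is a quick way to confirm the surviving device has taken authority for all extended VLANs.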
Must the Site VLAN be defined even if only one Edge Device is present in a site? Yes.
The Site VLAN is a VLAN used for communication between local OTV Edge Devices within a site (not via the overlay). It is used to facilitate the role election of the AED. The Site VLAN should NOT be extended across the overlay. It does not need to match between sites, but it is recommended to use the same VLAN. The VLAN used as the Site VLAN must be active, else the overlay will stay down and not pass traffic.
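To make this concrete, a minimal multicast-transport overlay configuration might look like the sketch below. The interface numbers, VLAN IDs, and group addresses are illustrative, not taken from any particular lab:

```
feature otv
otv site-vlan 99                    ! local Site VLAN, never extended across the overlay

interface Overlay1
  otv join-interface Ethernet1/1    ! physical interface facing the transport network
  otv control-group 239.1.1.1       ! ASM group carrying OTV hellos and broadcasts
  otv data-group 232.1.1.0/26       ! SSM range used for extended multicast traffic
  otv extend-vlan 100-150           ! VLANs stretched between the sites
  no shutdown
```

Note that the Site VLAN is configured globally, not under the Overlay interface, underlining that it is purely local to the site.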
Finally, the AED is also responsible for advertising the MAC reachability information for the extended VLANs for which it is authoritative.
Handling of Traffic Types
An AED will decide to forward a layer2 unicast, multicast, or broadcast packet over the overlay interface when the overlay control plane has placed the overlay interface in the forwarding table. OTV attempts to optimize inter-site traffic while retaining resiliency, stability, and scalability. Let's look at the forwarding behavior of the different traffic types.
Known unicast layer2 frames destined via the overlay will be sent directly to the join interface of the AED in the remote site that advertised reachability for the destination MAC address.
An unknown unicast frame is a frame received with a destination MAC address that has not been learned yet, and as a result it is flooded out all other valid STP interfaces (or Internal interfaces in the case of OTV). This is normal Ethernet behavior. But with OTV, unknown unicast layer2 frames are not flooded between OTV sites. This is generally not an issue since hosts become known the moment they emit a single packet, and the time for the MAC reachability information to propagate between all sites is nominal. The assumption with OTV is that no host on the network is completely silent. Again, there is talk of a feature in future NX-OS releases to enable selective flooding. On NX-OS 5.1 all unknown unicast frames are blocked from crossing the logical overlay.
A broadcast layer2 frame originated at an OTV site will be delivered to all remote sites. Broadcast frames are forwarded in the same manner as the OTV hellos, leveraging the same ASM multicast group in the transport network. Future NX-OS releases promise broadcast-based policy control mechanisms such as broadcast suppression, broadcast white-listing, etc.
Multicast layer2 frames have their group addresses mapped to SSM (Source Specific Multicast) groups in the transport network. The SSM groups used are configured as the OTV data-group. These dynamic mappings are communicated via the OTV control plane between sites. Multicast traffic via the overlay is thus destined to the joined SSM group addresses.
ARP optimization is another traffic enhancement OTV offers to reduce the amount of broadcast traffic between sites. IP ARP is a layer2 broadcast frame used to determine the MAC address of the host with a particular IP address. ARP requests are sent across the OTV overlay to all remote sites, with the hope that they reach the host with that particular IP. The intended host will respond to the originating host's ARP request with an ARP reply, which will pass via the original OTV Edge Device that forwarded the ARP request. OTV Edge Devices are capable of snooping ARP reply traffic and caching the contained mapping information in a local data table called the ARP ND (Neighbor Discovery) cache. Any subsequent ARP broadcast request that has a match in the ARP ND cache will be served from there and will not be sent across the overlay.
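The ARP ND cache can be inspected, and the optimization turned off per overlay if it causes trouble. A hedged sketch, since command availability varies by NX-OS release:

```
! View the snooped IP-to-MAC mappings held by this Edge Device
show otv arp-nd-cache

! Disable ARP optimization on a given overlay (it is on by default)
interface Overlay1
  no otv suppress-arp-nd
```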
One caveat to be aware of is the relation between the MAC aging-timer and the ARP cache timer. The ARP cache timer should always be lower than the MAC aging-timer, else traffic might be black-holed. With the default NX-OS values, and provided the default gateway resides on a Nexus 7000, this should never be an issue.
The defaults on Nexus 7000 platforms for these timers are:
- OTV ARP aging-timer: 480 seconds / 8 minutes
- MAC aging-timer: 1800 seconds / 30 minutes
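If the defaults are tuned, the safe relationship can be preserved by keeping the MAC aging-timer well above the ARP cache timer. A sketch (the value shown is simply the NX-OS default, restated explicitly):

```
! Keep MAC entries alive longer than the OTV ARP cache entries
mac address-table aging-time 1800
```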
One of the great features of Ethernet is the ability to adapt dynamically to MAC addresses moving around. This ability must also be achieved when a MAC moves from one OTV site to another. VMotion is one common example of MAC mobility. VMotion occurs when the virtual machine moves from one site to another. OTV uses a metric value to support seamless MAC mobility.
If an AED has a MAC address stored in the MAC forwarding table which points to the overlay interface, it means that an Edge Device in another site has explicitly advertised the MAC address as being local to its site. When that MAC appears in a new site, after being previously advertised by another site, the AED in the new site will advertise the MAC address (newly learned on its internal interface) with a metric value of 0. When the Edge Device in the site the MAC has moved from hears this advertisement, it will withdraw the MAC address that it had previously advertised. Once the MAC address is withdrawn, the Edge Device in the site the MAC has moved to will change the metric value to 1. All remote sites sending to this MAC address will start using the new Edge Device as soon as they hear its MAC advertisement with metric 0.
Care should be taken when using static MAC addresses.
Most of this post has focused on activity at layer2. Let's take a step up to the layer3 breakouts and focus more specifically on the FHRPs (First Hop Redundancy Protocols). HSRP (Hot Standby Router Protocol), VRRP (Virtual Router Redundancy Protocol) and GLBP (Gateway Load Balancing Protocol) are very often implemented to provide a common IP address to be used as a default gateway, providing redundancy and load-balancing to the clients in that subnet. These are established technologies that work well within the local segment of a network. Since OTV extends VLANs over multiple sites, the 'local' segment grows and is not so 'local' anymore. The FHRP protocols are also extended between sites, which opens the likely possibility that the FHRP 'active' device responsible for forwarding traffic beyond layer2 boundaries is no longer locally located. Inter-VLAN traffic could traverse a remote site where the FHRP gateway resides even though both the source and destination hosts reside in the local site. An increase in latency between sites could additionally affect the stability of the FHRP protocols.
This problem is addressed by what is called FHRP isolation, which limits/isolates all FHRP frames to each local site. With the absence of site-to-site FHRP communication, each site would choose a local FHRP ‘active’ member as the default gateway. This means that the outbound traffic will be able to follow the optimal and shortest path, always leveraging the local default gateway.
FHRP isolation is currently configured manually; future releases promise to achieve it with far fewer commands.
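For reference, the manual method Cisco documents for HSRP combines a VACL that drops HSRP hellos at the Edge Device with an OTV IS-IS filter that stops the HSRP virtual MAC from being advertised across the overlay. A condensed, hedged sketch follows; the list and policy names are illustrative, and exact syntax may differ per release:

```
! Match HSRPv1/v2 hellos and drop them on the extended VLANs
ip access-list ALL_IPs
  10 permit ip any any
ip access-list HSRP_IP
  10 permit udp any 224.0.0.2/32 eq 1985
  20 permit udp any 224.0.0.102/32 eq 1985
vlan access-map HSRP_Localization 10
  match ip address HSRP_IP
  action drop
vlan access-map HSRP_Localization 20
  match ip address ALL_IPs
  action forward
vlan filter HSRP_Localization vlan-list 100-150

! Keep the HSRP virtual MAC (0000.0c07.acXX) out of the OTV control plane
mac-list HSRP_VMAC_Deny seq 10 deny 0000.0c07.ac00 ffff.ffff.ff00
mac-list HSRP_VMAC_Deny seq 20 permit 0000.0000.0000 0000.0000.0000
route-map Stop_HSRP permit 10
  match mac-list HSRP_VMAC_Deny
otv-isis default
  vpn Overlay1
    redistribute filter route-map Stop_HSRP
```

VRRP and GLBP need equivalent entries for their own virtual MAC ranges and control-traffic addresses, which is part of why this method is considered command-heavy today.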
I have covered some of the limitations of the still youthful OTV. But before we recap, there is one more to cover.
OTV SVI coexistence
With the current implementations of NX-OS, the OTV encapsulation (i.e., the Edge Device) and the SVI exit point for a given VLAN must be on separate devices. This separation can be achieved by using separate physical devices, or alternatively the OTV Edge Devices can be deployed as separate VDCs (Virtual Device Contexts).
- OTV Appliance on a Stick : A common set of uplinks from the Routing VDC are used for both the routing and DCI extension.
- Inline OTV Appliance: A dedicated link from the OTV VDC is used for the DCI extension.
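As a sketch, carving out a dedicated OTV VDC from the default VDC might look like the following (VDC and interface names are illustrative):

```
! From the default (admin) VDC: create the OTV VDC and hand it its interfaces
vdc OTV
  allocate interface Ethernet1/1-2   ! join interface and internal interface

! Then attach to the new VDC and enable OTV there
switchto vdc OTV
feature otv
```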
NX-OS OTV limitations hopefully removed soon in future releases:
- Unicast support for the transport network.
- Support for an IPv6 transport network.
- The join interface must be a physical interface; a Loopback interface would be preferred.
- Per-Flow based AED load-balancing.
- Selective Flooding Mechanism.
- Broadcast Suppression.
- FHRP Isolation done using native commands.
- More than two Edge Devices per site.
- More than six Edge Devices in all sites.
- More than three supported OTV sites.
- More than 128 VLANs per overlay.
OTV is a brand new kid on the block, and although it does present some exciting and needed enhancements to the DCI realm, it still needs some work before it is fully complemented with all the promised features.
OTV is easy to configure and implement, and in my opinion it appears to be a solid technology that is here to stay. The initial releases of OTV have so far been very stable with almost no quirks or major bugs. But keep in mind that I set this up in an isolated lab environment. Integration into existing data centers with a vast set of applications, as with any new technology, will have its teething problems.
I personally consider the ability to use different STP technologies in different sites very practical. With a better attempt to control inter-site traffic, where OTV is at and where OTV is heading most definitely seem very promising. So if you consider using OTV in production, do the necessary testing or a POC first!
As cool as OTV is, it is sadly still an enterprise technology. It is not really geared for multi-tenant data centers, or a mix of old and new data centers where the same VLAN might belong to different customers. Yes, there are obscure ways to achieve that, but it would be a huge benefit if OTV, or at least the Nexus 7000 switches, supported VLAN rewrites. I suppose let's first hope for the addition of LDP in NX-OS, coming (promised) in the much awaited 5.2 release.
In the next post I will cover the lab setup I used to test OTV, along with the configs and outputs. I hope you have found this informative :)