This article is a follow-up to my previous post, FortiGate inter-vDOMs communication using EMAC vLAN.
As the FortiGates in this setup run several routing protocols (in my example, OSPF for the back-end vDOMs, BGP between the vPROXY and vWAN vDOMs, and OSPF again over the IPsec tunnels), routing convergence, and therefore session maintenance after an FGCP HA failover, can be somewhat of a challenge.
Routing convergence indeed takes time, and that is perfectly normal. Furthermore, as far as I can tell, neither Graceful Restart nor BFD is supported on inter-vDOM link interfaces, whether regular inter-vdom-links or EMAC vLANs.
Well, after many hours of trial and error chasing a sub-second FortiGate cluster failover involving multiple routing protocols, multiple vDOMs and so on, I stumbled upon the magical "David Copperfield" parameter.
Simply put, I maxed out this parameter's value:
config global
    config system ha
        set route-ttl 3600
    end
end
Here is Fortinet's explanation of the parameter:
The time to live for routes in a cluster unit routing table.
The time to live range is 0 to 3600 seconds. The default time to live is 0 seconds.
The time to live controls how long routes remain active in a cluster unit routing table after the cluster unit becomes a primary unit. To maintain communication sessions after a cluster unit becomes a primary unit, routes remain active in the routing table for the route time to live while the new primary unit acquires new routes.
Normally, the route-ttl is 0 and the primary unit must acquire new routes before it can continue processing traffic. Normally acquiring new routes occurs very quickly so only a minor delay is caused by acquiring new routes.
If the primary unit needs to acquire a very large number of routes, or if for other reasons, there is a delay in acquiring all routes, the primary unit may not be able to maintain all communication sessions. You can increase the route time to live if communication sessions are lost after a failover so that the primary unit can use routes that are already in the routing table, instead of waiting to acquire new routes.
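To see the effect in practice, you can check the setting and then watch the routing table on the new primary right after a failover; the stale routes should still be present while the protocols reconverge. Commands along these lines should do it (exact output varies by FortiOS version, and vWAN is just the vDOM name from my setup):

# verify the route-ttl setting (global context)
config global
    get system ha | grep route-ttl
end

# inspect the routing table of a given vDOM right after failover
config vdom
    edit vWAN
        get router info routing-table all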
Indeed, this time buffer lets all our routing protocols converge peacefully while the routes previously learned in the routing table remain readily available on the newly elected master node. And it does the job wonderfully well: every session remains in place, and RDP connections to hosts sitting behind the most remote vDOMs stay steady and survive a complete cluster failover. Bingo!!
Not a single ICMP packet lost during a failover. I've been doing this all day, failover after failover after failover... Love it !!!
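For reference, this is the kind of test I mean: a long continuous ping from one vDOM toward a host behind the most remote vDOM while triggering failovers. A sketch from the FortiGate CLI itself (the vDOM name and target IP below are purely illustrative):

config vdom
    edit vPROXY
        # keep the ping running across several failovers
        execute ping-options repeat-count 10000
        execute ping 192.0.2.10

Running the same ping from an end host behind the back-end vDOMs works just as well, and exercises the full inter-vDOM path.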
Thanks for reading. And please don't be shy, send me your hate mail at the email provided below =)
Picture credits: Alien Investigation by Geoffroy Thoorens