I've written enough of these 'A madman's thoughts' posts that my longtime readers will recognize the pattern: they're usually a story of me overcoming a difficulty. Sometimes it's "DUH" and other times it's just bizarre.
This one in my mind is a bit of both.
This one starts with me helping a customer establish the proper settings for a FortiGate to Cisco ASA IPSEC VPN tunnel. Normally this would be a 'how-to', but some of the behavior was just insane, and how we learned why that behavior happened is the 'DUH' component of this story.
First things first, a little bit about the customer's environment:
Spoke Site FortiGate running FortiOS 6.2.5
Hub ASA running 9.8(4)
VTI/route-based VPN was used - the customer wanted to use BGP for internal routing, so this was a must
With that out of the way, let's cover how you do a FortiGate to ASA VTI tunnel.
FortiGate to ASA VTI tunnel:
Here's our topology: the FortiGate's wan1 interface sits at 10.10.10.2, the ASA's outside interface at 172.31.0.2, and the VTI itself is addressed out of 10.200.0.0/30 (FortiGate 10.200.0.2, ASA 10.200.0.1).
FortiGate configuration:
IPSEC tunnel configuration
config vpn ipsec phase1-interface
edit "to-asa"
set interface "wan1"
set ike-version 2
set keylife 43200
set peertype any
set net-device disable
set passive-mode enable
set proposal aes256-sha256
set dpd on-idle
set npu-offload disable
set remote-gw 172.31.0.2
set psksecret FakeKey2020!
next
end
config vpn ipsec phase2-interface
edit "to-asa"
set phase1name "to-asa"
set proposal aes128-sha1
set keepalive enable
set keylifeseconds 86400
next
end
Tunnel interface configuration:
config system interface
edit to-asa
set ip 10.200.0.2/32
set remote-ip 10.200.0.1/30
next
end
Policy configuration - remember, the tunnel will not come up without the policy:
config firewall policy
edit 0
set name "from.asa
set srcintf "to-asa"
set dstintf "lan"
set srcaddr "all"
set dstaddr "all"
set action accept
set schedule "always"
set service "ALL"
set logtraffic all
next
edit 0
set name "to-asa"
set srcintf "lan"
set dstintf "to-asa"
set srcaddr "all"
set dstaddr "all"
set action accept
set schedule "always"
set service "ALL"
set logtraffic all
next
end
ASA Configuration:
! Phase 1 settings
crypto ikev2 policy 5
encryption aes-256
integrity sha256
group 14 5
prf sha256
lifetime seconds 43200
!
! Phase 2 Proposal
crypto ipsec ikev2 ipsec-proposal FORTIGATE_P2_PROPOSAL
protocol esp encryption aes-128
protocol esp integrity sha-1
!
! Phase 2 profile
crypto ipsec profile FORTIGATE_IPSEC_PROFILE
set ikev2 ipsec-proposal FORTIGATE_P2_PROPOSAL
set pfs group14
set security-association lifetime kilobytes unlimited
set security-association lifetime seconds 86400
!
! Tunnel Group Policy
group-policy 10.10.10.2 internal
group-policy 10.10.10.2 attributes
vpn-tunnel-protocol ikev2
!
! Tunnel Group
!
tunnel-group 10.10.10.2 type ipsec-l2l
tunnel-group 10.10.10.2 general-attributes
default-group-policy 10.10.10.2
tunnel-group 10.10.10.2 ipsec-attributes
ikev2 remote-authentication pre-shared-key FakeKey2020!
ikev2 local-authentication pre-shared-key FakeKey2020!
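One piece I haven't shown is the ASA-side tunnel interface itself, and a VTI won't pass traffic without one. I don't have the exact interface config handy, but it typically looks something like this (the Tunnel number, the nameif, and the assumption that the WAN interface is named "outside" are my own placeholders):
! VTI tunnel interface (numbering and names are placeholders)
interface Tunnel1
 nameif VTI-TO-FORTIGATE
 ip address 10.200.0.1 255.255.255.252
 tunnel source interface outside
 tunnel destination 10.10.10.2
 tunnel mode ipsec ipv4
 tunnel protection ipsec profile FORTIGATE_IPSEC_PROFILE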
And that's all that's really needed to get the tunnel up. You can set your firewall rules accordingly.
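And since the whole reason for going route-based here was BGP, here's a minimal sketch of what the FortiGate side of the peering over the VTI addresses above could look like (the AS numbers are made up, and you'd still add whatever networks or redistribution your design calls for - the customer's actual routing setup isn't part of this story):
config router bgp
set as 65002
config neighbor
edit "10.200.0.1"
set remote-as 65001
next
end
end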
Disclaimer:
I am not proficient with the ASA (IOS is a different story); I haven't used one since the 7/8 code days, and in retrospect, not all that well. If this seems like a "DUH" write-up, I apologize for the two brain cells you lost in the process.
The challenge:
Attempt #1:
The customer reached out and reported that the tunnel was initially dropping once an hour. After some digging, they found that their phase 2 timers were off: the ASA had a 1-hour phase 2 SA lifetime, and the FortiGate had 1 day.
I know what you're thinking, and I'll get to it. Normally this would not be an issue: if the FortiGate receives a rekey request, it will simply rekey. That was not happening here - this is important, and we'll revisit it at the end.
Working with the customer, we made sure the timers lined up, and sure enough, the tunnel didn't drop for the next 2 hours. We decided to call it done.
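For reference, aligning the timers is a one-line change on either box. Here's a sketch assuming the FortiGate is brought down to match the ASA's 1-hour phase 2 lifetime (raising the ASA's security-association lifetime to 86400 instead works just as well; which side the customer changed doesn't really matter for the story):
config vpn ipsec phase2-interface
edit "to-asa"
set keylifeseconds 3600
next
end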
Attempt #2:
If you've read any of these Madman series, you'd know it doesn't just stop there, there's madness after all.
A few days later, I followed up with my customer and found that the drops were still occurring, now once every 12 hours. At my suggestion, we opened cases with both Fortinet's and Cisco's TAC and started working the issue. After an escalation on each side, we were still no further along.
Here are the commands we used to debug the tunnel:
FortiGate:
diagnose vpn ike log-filter dst-addr4 x.x.x.x
diagnose debug application ike -1
diagnose debug enable
Cisco:
debug crypto condition peer x.x.x.x
debug crypto ikev2 protocol 255
debug crypto ipsec 255
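Not part of the original notes, but don't forget to turn the debugs back off once you've collected what you need:
FortiGate:
diagnose debug disable
diagnose debug reset
diagnose vpn ike log-filter clear
Cisco:
undebug all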
After analyzing the debugs, I realized that while the symptom was the same, the issue was slightly different: this time phase 1 was being deleted. Odd.
Normally when phase 1/2 rekeys, you'll see a message stating "requesting_child_sa"; when that child SA is built, both peers switch to it to resume communication, the old SA is deleted, and communication never ceases.
I decided to buy my own ASA 5506-X to see if I could replicate the issue in the lab, and I did manage to get a tunnel to drop every 2 hours. But when I started collecting my own debugs and lining them up against the customer's, the behavior didn't match. What I was seeing was something different - that's for a different post.
At this point the tunnel was failing reliably at ~12 hours. TAC suggested switching the phase 1 roles of the FortiGate and ASA. The roles are initiator and responder; they don't seem to matter for establishing a session, but they do matter for rekeying, as the initiator will always initiate the rekey. This too is important, and we'll revisit it at the end.
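On the FortiGate side, the knob that influences this is the passive-mode setting from the phase 1 config above - enabled, the FortiGate only ever responds; disabled, it's allowed to initiate. A sketch of flipping it (I'm not showing whatever was adjusted on the ASA side to match):
config vpn ipsec phase1-interface
edit "to-asa"
set passive-mode disable
next
end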
We were also working directly with Fortinet's third level of TAC, which has access to the engineering team that writes FortiOS, in case anything we found turned out to be a bug. In short, it did not, but when cases go that high, it's usually the result of one.
When we tried this, the predictability went out the window: the tunnel would drop 5 hours in, then 6 hours in. We quickly switched back to the FortiGate being the initiator, and predictability resumed. We reported this to TAC, but they wanted debug info to understand why it happened.
So we provided additional debugs. They were still unable to pinpoint why the tunnel failed every 12 hours, but decided to see if switching roles again would do any good - and, predictably, the unpredictability resumed.
The customer was understandably frustrated at this point. I was mentally exhausted, so I can only imagine the customer's state. Cisco TAC pointed out from the debugs that the FortiGate wasn't answering the ASA's rekey requests, and we provided that info to Fortinet's TAC. They set up additional debugging AND packet captures, and we waited for the next failure.
We saw the same debugs on the ASA side, we saw the packets leave the ASA, we saw the packets arrive on the FortiGate, we did not see the rekey in the debugs.
Wait, what?!
That's right, the packets made it to the FortiGate, but didn't show up in the IKE daemon. That's when the Fortinet TAC engineer requested an updated copy of the configs from the customer, and found the following:
config firewall local-in-policy
edit 1
set intf "UNTRUSTED"
set srcaddr "HQ_IP"
set dstaddr "all"
set action accept
set service "HTTPS" "SSH" "PING"
set schedule "always"
next
edit 2
set intf "UNTRUSTED"
set srcaddr "all"
set dstaddr "all"
set service "ALL"
set schedule "always"
next
end
This explains the unpredictability. Here's why (note that rule 2 has no action set, so it falls back to the local-in policy default of deny - it's effectively a catch-all drop rule):
IPsec VPNs have a feature to detect dead peers called Dead Peer Detection (DPD)
The ASA only supports on-idle DPD, which means the ASA or FortiGate will send a DPD only if nothing has traversed the tunnel for a bit (~10 seconds normally, though this is configurable)
If there's traffic going over the tunnel, no DPDs are sent
IKE traffic (initiation, rekeys, DPDs) uses udp/500, while IPsec/tunnel traffic rides over udp/4500
UDP is not stateful, but the FortiGate will listen for return UDP traffic from the destination for 3 minutes by default; if it hears nothing, it closes the "session"
The customer states that there is always traffic going over the tunnel, so no DPDs are sent - meaning no udp/500 is sent after initiation until it's time to rekey
A FortiGate local-in policy does not control traffic originated by the device, only traffic destined for it
In short, once the tunnel was initiated by the FortiGate (regardless of role), traffic started to flow over udp/4500 and no DPDs were sent within 3 minutes, so the FortiGate closed the udp/500 "session"
The ASA attempts to rekey at approximately half the lifetime of the tunnel (~5-6 hours), and since rekeys use udp/500, that traffic was getting dropped by the local-in policy
EUREKA!!!
The customer added IKE to the local-in policy, and saw the rekey....
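I don't have the exact change the customer made, but it amounts to allowing FortiOS's predefined IKE service (udp/500 and 4500) in from the peer. Something along these lines, reusing the objects from their rule 1 and assuming the HQ_IP object covers the ASA's address - a dedicated rule for the peer works just as well:
config firewall local-in-policy
edit 1
set intf "UNTRUSTED"
set srcaddr "HQ_IP"
set dstaddr "all"
set action accept
set service "HTTPS" "SSH" "PING" "IKE"
set schedule "always"
next
end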
Attempt #3:
...and 5 minutes later the tunnel dropped.
Wait, what?!
So we set up debugs and pcaps for the next cycle (in 12 hours) and waited. We got the debugs, and it looked like the packets were flowing both ways, but I saw something on the FortiGate side:
2020-11-09 07:16:28.649553 ike 0:to-asa:41698: received informational request
2020-11-09 07:16:28.649576 ike 0:to-asa:41698: processing delete request (proto 1)
2020-11-09 07:16:28.649607 ike 0:to-asa:41698: deleting IKE SA 96272xxxxxx1a4/82be83xxxxxxcc9e
And on the ASA:
IKEv2-PROTO-4: (1119): Sending DELETE INFO message for IKEv2 SA [ISPI: 0x96272xxxxxx1a4 RSPI: 82be83xxxxxxcc9e]
IKEv2-PROTO-4: (1119): Building packet for encryption.
(1119):
Payload contents:
(1119): DELETE(1119): Next payload: NONE, reserved: 0x0, length: 8
What gives man?!?!?!
So at this point we asked Fortinet and Cisco TAC to clarify what we were doing wrong. They couldn't see anything wrong per se, but they did ask us to enable the IKEv2 platform debugs by issuing:
debug crypto ikev2 platform 255
in addition to the debugs above; the next time it rekeyed, we should have an answer.
The next morning, the customer gathered the logs and uploaded them to both TACs. Fortinet's L3/engineering team came back and pointed to this:
IKEv2-PLAT-4: (1134): idle timeout disable for VTI session
IKEv2-PLAT-4: (1134): session timeout set to: 720
IKEv2-PLAT-4: (1134): group policy set to 10.10.10.2
720 x 60 = 43200 seconds. Wait, that's the same as our phase 1 lifetime! So the VPN session is capped at a 12-hour lifetime. Ok, this explains why we saw the tunnel rekey but still drop at the 12-hour mark regardless.
Cisco came back acknowledging this and analyzed the customer's config further. They found that the group policy for the FortiGate tunnel did not have a session timeout configured (which is fine on its own), so it inherited one from the default group policy - and the default group policy had a 720-minute maximum session life.
Normally a session lifetime is useful for remote-access VPN (a user VPNing in), but it's not a good idea for a site-to-site VPN, for obvious reasons.
So TAC recommended we apply this to our FortiGate group policy:
group-policy 10.10.10.2 attributes
vpn-session-timeout none
This was applied during the day yesterday. The tunnel bounced once, as expected, since it had been established under the old group-policy settings; the new tunnel came up with the max lifetime set to unlimited. This morning we woke up to a tunnel that was still up.
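If you're checking this on your own gear, the standard status commands are enough to confirm the SA lifetimes and how long the tunnel has been up (nothing here is specific to this case, just the usual views):
FortiGate:
get vpn ipsec tunnel summary
diagnose vpn ike gateway list name to-asa
Cisco:
show crypto ikev2 sa
show vpn-sessiondb detail l2l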
The takeaway here is that we had a couple of misconfigurations on both sides. In my experience it's always something small that causes the biggest headaches, and this case was no different. Also, use your resources: Fortinet TAC went out of their way to debug the issue not only on the Fortinet side but also on the Cisco side, helping guide us to ask the right questions. Last but not least, to my customer, who has the patience of a saint: thank you, thank you very much for being so flexible.
I hope this helps someone, somewhere, pounding their head against the desk wondering why the heck their tunnel keeps bouncing.