MTU woes in IPsec tunnels and how you can fix them

Today I ran into a problem with IPsec Xauth PSK and the built-in Android VPN client (Android 4.1.2), resulting in some sites (such as www.yahoo.com) not loading through the VPN tunnel. Turns out I was dealing with MTU issues. When the Android VPN is started, it sets the MTU to 1500 on the tun0 interface:

$ ip link show tun0
33: tun0: <POINTOPOINT,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 500 link/none

Looking at the Android source, it appears that someone forgot to handle IPsec Xauth PSK. With PPTP and L2TP based VPNs, the MTU is reduced to 1400 (lines 758–778).

In comparison:

strongSwan Android client: MTU 1400
OS X / iOS 7 built-in IPsec client: MTU 1280 (for what it’s worth, 1280 is also the minimum link MTU that IPv6 requires, so it is the smallest MTU that still lets IPv6 work)
Windows 7 built-in IPsec client: MTU 1400
Cisco VPN client: MTU 1300

Among the tested clients, only the connection through the Android VPN client was causing the issue with stalling websites. In a nutshell, I was able to fix it with the following on the VPN server:

$ iptables -t mangle -A FORWARD -o eth0 \
 -p tcp -m tcp --tcp-flags SYN,RST SYN \
 -m tcpmss --mss 1361:1536 \
 -j TCPMSS --set-mss 1360

$ echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc

And here is why.

IP fragmentation – The original problem.

On the VPN server side, we have the interface set to the standard Ethernet MTU of 1500. In the scenario with the Android client, the MTU along the entire path is 1500. After subtracting 20 bytes for the IPv4 header and 20 bytes for the TCP header, this leaves room for up to 1460 bytes of data payload per packet (also referred to as the maximum segment size, MSS).

Keep in mind that IPsec in tunnel mode adds an ESP header and an additional IP header for tunneling the packet (usually with an additional size of around 70-80 bytes). When a packet is nearly the size of the MTU and when you tack on this encapsulation overhead, it is likely to exceed the MTU of the outbound link. That’s where IP fragmentation kicks in – which could lead to performance degradation of your VPN tunnel. Or worse…
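The arithmetic behind these numbers can be sketched in a few lines of shell. This is only an illustration: the function names are made up for this example, the header sizes assume IPv4 and TCP without options, and the 80 bytes of overhead are simply the upper end of the estimate above, not a measured value.

```shell
#!/bin/sh
# Assumed header sizes in bytes (IPv4 and TCP, both without options).
IP_HDR=20
TCP_HDR=20

# mss_for_mtu MTU
# Largest TCP payload that fits into a single packet of MTU bytes.
mss_for_mtu() {
    echo $(( $1 - $IP_HDR - $TCP_HDR ))
}

# fits_after_esp PACKET_SIZE OVERHEAD LINK_MTU
# Succeeds if the packet still fits into the link MTU after encapsulation.
fits_after_esp() {
    [ $(( $1 + $2 )) -le "$3" ]
}

mss_for_mtu 1500                  # prints 1460

# A full-sized 1500-byte packet plus an assumed 80 bytes of ESP overhead
# on a link with MTU 1500:
if fits_after_esp 1500 80 1500; then
    echo "fits"
else
    echo "needs fragmentation"    # 1500 + 80 = 1580 > 1500
fi
```

With an MTU of 1400 on the tunnel interface, the same packet would come out at no more than 1480 bytes after encapsulation, which is why the other clients in the list above do not run into the problem.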

Path MTU discovery (PMTUD). A failed solution.

To avoid IP fragmentation, many TCP/IP stacks implement path MTU discovery (PMTUD). To tell you right away: It doesn’t work for me, and it’s not going to work for you either. PMTUD attempts to discover the largest IP datagram that can be sent along an IP path without fragmentation. Instead of fragmenting a too-large IP packet, the VPN server is told (through the Don’t Fragment (DF) flag that the sender sets in the IP header) to discard the packet and reply with an ICMP “fragmentation required” (type 3, code 4) message.

Recap: The sender is the website that you try to load on your VPN client.

When the sender receives this ICMP packet, it learns to use a smaller MTU for packets sent to our VPN server. In theory. In reality, many websites (senders like www.yahoo.com) stupidly implement ICMP filters that break PMTUD. And that’s where all hell breaks loose. The sender is expecting an acknowledgment for the original packet from our server, but since the packet was discarded, the acknowledgment never comes. Time goes by, then the sender retransmits the too-large packet. The result? The VPN server discards the packet again, sends yet another ICMP message, and so on. Meanwhile our client on the other end of the VPN tunnel cannot tell what’s happening and is desperately waiting for some data. Everything appears to be stalled – a state which is also referred to as a black hole connection.

Because PMTUD doesn’t always work on the Internet, relying on it only makes sense in a site-to-site VPN, where one operator maintains the networks and can make sure that all firewalls in between forward the “fragmentation required” ICMP (type 3, code 4) notifications.
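If you want to see for yourself where packets stop fitting, you can probe a path with DF-flagged pings of decreasing size. The sketch below only prints the commands instead of running them. It assumes the Linux iputils ping (where -M do sets the DF bit), and both the target host and the function name are placeholders for this example.

```shell
#!/bin/sh
# ping_payload_for_mtu MTU
# ICMP echo payload that yields an IPv4 packet of exactly MTU bytes:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Hypothetical target; replace it with a host you reach through the tunnel.
TARGET=www.example.com

for mtu in 1500 1400 1360; do
    size=$(ping_payload_for_mtu "$mtu")
    # Printed, not executed: -M do forbids fragmentation, -s sets the payload.
    echo "ping -M do -s $size -c 1 $TARGET"
done
```

A size that gets no reply at all, while a smaller one does, hints at a black hole somewhere on the path, although a host that simply ignores pings looks the same.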

MSS size reduction. A working solution.

So we can rule out PMTUD, but there is another way to assure that our VPN connection is working properly.

$ iptables -t mangle -A FORWARD -o eth0 \
 -p tcp -m tcp --tcp-flags SYN,RST SYN \
 -m tcpmss --mss 1361:1536 \
 -j TCPMSS --set-mss 1360

This iptables rule reduces the maximum packet size by rewriting the MSS in TCP SYN packets. The --set-mss value explicitly sets the MSS to 1360, which is a customary size for IPsec IPv4 interfaces. The --mss option restricts the match to MSS values between 1361 and 1536 bytes (we don’t want to rewrite the MSS of a VPN client that already uses a smaller MSS). The result in a tcpdump:

05:01:56.795798 IP 172.16.16.1.38695 > r2.ycpi.vip.ac4.yahoo.net.http: Flags [S], seq 2621580326, win 14600, options [mss 1460,sackOK,TS val 56614 ecr 0,nop,wscale 6], length 0
05:01:56.795865 IP vpn.zeitgeist.se.38695 > r2.ycpi.vip.ac4.yahoo.net.http: Flags [S], seq 2621580326, win 14600, options [mss 1360,sackOK,TS val 56614 ecr 0,nop,wscale 6], length 0
05:01:56.802695 IP r2.ycpi.vip.ac4.yahoo.net.http > vpn.zeitgeist.se.38695: Flags [S.], seq 3057551576, ack 2621580327, win 14480, options [mss 1460,sackOK,TS val 1410945796 ecr 56614,nop,wscale 8], length 0

In line 1, you see the request from our Android client through the tunnel to yahoo.com. Notice the MSS is 1460. In the second line, it is our VPN host initiating the TCP handshake with the external site. Only now the MSS value is rewritten to 1360, thanks to our iptables rule. As a consequence, the TCP connection will use the lower MSS of the two endpoints, which is 1360. Voilà!
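To make the matching logic of the rule explicit, here is a small shell function that mimics what the tcpmss match and the TCPMSS target do with an advertised MSS. It is a plain re-implementation of the decision for illustration, not something iptables itself runs, and the function name is made up for this example.

```shell
#!/bin/sh
# rewrite_mss ADVERTISED_MSS
# Mimics "-m tcpmss --mss 1361:1536 -j TCPMSS --set-mss 1360":
# values inside the matched range are set to 1360, everything else passes.
rewrite_mss() {
    if [ "$1" -ge 1361 ] && [ "$1" -le 1536 ]; then
        echo 1360
    else
        echo "$1"
    fi
}

rewrite_mss 1460    # prints 1360 (the Android client from the tcpdump)
rewrite_mss 1360    # prints 1360 (already at the target, not matched)
rewrite_mss 1240    # prints 1240 (a client with a smaller MSS is left alone)
```

Note that an MSS above 1536 falls outside the matched range as well and would pass through unchanged.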

You can go further and restrict the iptables rule to rewrite only those packets that are forwarded from our VPN client:

$ iptables -t mangle -A FORWARD -o eth0 \
 -p tcp -m tcp --tcp-flags SYN,RST SYN \
 -s 172.16.16.0/24 \
 -m tcpmss --mss 1361:1536 \
 -j TCPMSS --set-mss 1360

This assumes that the address pool of your virtual IPs is 172.16.16.0/24 (see rightsourceip in /etc/ipsec.conf).

There is also an option to determine the MSS dynamically (“MSS clamping”, via the --clamp-mss-to-pmtu option), but it wouldn’t fix IPsec for clients that set their MTU too high (like in the Android example).

The MSS iptables rule doesn’t help UDP applications. UDP is a connectionless protocol; hence there is no handshake in which an MSS could be negotiated. The only solution to guarantee that UDP works is to clear the Don’t Fragment (DF) bit in the IP header of the sender. This will allow our VPN server to fragment any UDP packet, if necessary. In Linux, you do it like this:

$ echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc

The VPN server will fragment outgoing UDP packets that exceed the interface MTU. That is not great for performance or reliability, but at least it doesn’t break the tunnel connection. Fortunately, most folks don’t use UDP for anything much greater than DNS.
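To get a feeling for how often fragmentation would actually hit, the following sketch estimates the number of IPv4 fragments for a UDP datagram of a given payload size. It is a rough model under stated assumptions: a 20-byte IPv4 header without options, an 8-byte UDP header, and no IPsec overhead taken into account. The function name is made up for this example.

```shell
#!/bin/sh
# udp_fragments UDP_PAYLOAD MTU
# Estimated number of IPv4 fragments for one UDP datagram.
udp_fragments() {
    total=$(( $1 + 8 ))                # UDP header plus payload
    max_last=$(( $2 - 20 ))            # data capacity of a single packet
    per_frag=$(( max_last / 8 * 8 ))   # non-final fragments carry multiples of 8
    if [ "$total" -le "$max_last" ]; then
        echo 1
    else
        echo $(( 1 + (total - max_last + per_frag - 1) / per_frag ))
    fi
}

udp_fragments 512 1500     # prints 1 (a typical small DNS reply)
udp_fragments 1472 1500    # prints 1 (exactly fills one 1500-byte packet)
udp_fragments 1473 1500    # prints 2 (one byte too many)
```

Small DNS-sized datagrams stay in a single packet, so the fragmentation penalty only applies to applications that send large UDP payloads.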

Source: zeitgeist.se