In a recent, ongoing installation, we encountered a wide variety of sporadic network traffic issues affecting the VMs connected to NSX Logical Switches (VXLANs).
Some of the symptoms were:
- Can ping a device, but cannot get a complete tracert
- Can connect to a server over HTTPS, but not to its neighbor (both are webservers)
- vCAC Gugent cannot connect to the vCAC server
- Cannot vmkping through the VXLAN TCP/IP stack with more than 1470 bytes
The last bullet made it pretty clear that the issue was related to the MTU. We had no visibility into the configuration of the north-south layer 3 devices, but had been assured that they were configured for 1600 byte frames.
In the NSX for vSphere implementation of VXLAN, each frame sent by a VM is wrapped in additional outer headers. That encapsulation adds roughly 50 bytes, so a full-size 1500-byte packet grows to about 1550 bytes and no longer fits through a standard 1500-byte MTU.
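The arithmetic behind that overhead can be sketched in a few lines of Python. This is only a back-of-the-envelope model, assuming VXLAN over IPv4 with the standard header sizes and no outer VLAN tag counted; the function names are made up for illustration.

```python
# Standard header sizes in bytes (VXLAN over IPv4).
INNER_ETHERNET = 14  # the VM's own Ethernet header, carried inside the tunnel
VXLAN_HEADER = 8
UDP_HEADER = 8
IPV4_HEADER = 20
ICMP_HEADER = 8


def required_transport_mtu(guest_mtu: int = 1500) -> int:
    """IP MTU the physical network needs to carry a full-size guest packet."""
    return guest_mtu + INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + IPV4_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    print(required_transport_mtu(1500))  # 1550
    print(max_ping_payload(1500))        # 1472
    print(max_ping_payload(1600))        # 1572
```

On a path limited to 1500 bytes, an unfragmented ping tops out at about 1472 bytes of payload, which is in line with the vmkping failures we saw just above 1470. A path that really passes 1600 bytes should allow roughly 1572.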
I checked the UCS service profiles and the vNIC Templates; they looked something like this:
It certainly looks like it's set for jumbo frames, but also notice the second red ellipse there: QoS Policy. If you pay attention to things like that, you might also notice the warning about the MTU size.
The QoS policy assigned to the vNIC (template) uses an egress priority based on a QoS System Class.
The QoS System Classes specify not only a class-of-service and a weight, but also an MTU! In my case, I changed the vNIC Template QoS Policy to one with a System Class whose MTU is 9216. Once this change was made, the VMs behaved as expected.
A couple of notes:
- If your vNICs (or vNIC templates) do not specify a QoS Policy, UCS appears to use the MTU given on the vNIC itself
- If you do not have an enabled QoS System Class with an MTU of 9216, you’ll have to type the value in; the dropdown list only contains “normal” and “fc”
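To make the interaction between the two MTU settings concrete, here is a small Python model of the behavior described in this post. It is not a UCS API: the class names, the `effective_mtu` function, and the example values (a 9000-byte vNIC, a 1500-byte class called "best-effort") are all hypothetical, and the rule that the lower of the two MTUs wins is an assumption based on what we observed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QosSystemClass:
    name: str  # hypothetical label, e.g. "best-effort"
    mtu: int


@dataclass
class Vnic:
    mtu: int
    qos_class: Optional[QosSystemClass] = None  # None means no QoS Policy assigned


def effective_mtu(vnic: Vnic) -> int:
    """Assumed rule: with a QoS Policy, the System Class MTU caps the vNIC MTU."""
    if vnic.qos_class is None:
        return vnic.mtu
    return min(vnic.mtu, vnic.qos_class.mtu)


def fits_vxlan(vnic: Vnic, required_mtu: int = 1600) -> bool:
    """True if the modeled vNIC can carry frames of the required size."""
    return effective_mtu(vnic) >= required_mtu


if __name__ == "__main__":
    before = Vnic(mtu=9000, qos_class=QosSystemClass("best-effort", 1500))
    after = Vnic(mtu=9000, qos_class=QosSystemClass("jumbo", 9216))
    print(effective_mtu(before), fits_vxlan(before))  # 1500 False
    print(effective_mtu(after), fits_vxlan(after))    # 9000 True
```

The "before" case mirrors our situation: the vNIC looked like it was set for jumbo frames, but the System Class behind its QoS Policy kept the usable MTU too small for VXLAN.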
This is another of those posts where I just stumbled upon the fix and needed to write it up before I forgot. Hopefully this will save someone a lot of time later.