Hyper-V: Live Migration causes network problems (Windows Firewall "InterfaceQuarantine")

Problem

After a Live Migration of a VM with Windows Firewall enabled, the VM is unreachable over the network for about 10 seconds; it remains available only for ping (ICMP) and UDP traffic. The problem occurs only when live-migrating VMs whose in-guest Windows Firewall is enabled.

Affected Systems:
  • The problem occurs on VMs (Windows Server 2012 R2 & Windows Server 2016) running on Hyper-V 2012 R2 & Hyper-V 2016
  • With Windows Firewall disabled in the VM there is no problem (at most one lost ping)



Cause

In the Live Migration scenario, the way the interface “reconnection” is handled was changed compared with a classic NIC disconnect/reconnect. The goal is to speed up the reconnection without modifying anything in the TCP/IP stack. Windows receives an NDIS_STATUS_NETWORK_CHANGE indication from the NDIS driver, which ends up in the firewall code with only minimal information to update.
However:

  • The firewall waits for a special event that signals that all changes/checks have been completed. While it waits, it enters “interface quarantine” mode, in which all new incoming connections are refused.
  • This event never reaches the waiting thread, so the firewall waits the full 7 seconds (hardcoded value) before leaving the interface quarantine state.


Solution

With the following solution, the event with ID 14 (InGuest) is no longer generated, so the issue does not occur. I hope Microsoft implements a simpler solution in the future.

  • Install [KB4338822] on the Hyper-V node
  • Create the following registry value on the Hyper-V node:
Key: HKLM\System\CurrentControlSet\Services\VmsMp\Parameters
Value name: SendLmNetworkChangeIndication
Type: DWORD, value data: 0
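
The two steps above can be scripted from an elevated command prompt on the Hyper-V node. This is a minimal sketch rather than a verified procedure: the `wmic` check is only a convenience (if a later cumulative update has superseded the KB, the filter may print nothing even though the fix is present), and this page does not say whether a restart is needed afterwards.

```bat
rem Check whether KB4338822 is listed among the installed hotfixes (convenience check only).
wmic qfe get HotFixID | findstr /i "KB4338822"

rem Create SendLmNetworkChangeIndication = 0 (REG_DWORD) on the Hyper-V node.
reg add "HKLM\System\CurrentControlSet\Services\VmsMp\Parameters" /v SendLmNetworkChangeIndication /t REG_DWORD /d 0 /f

rem Confirm that the value was written.
reg query "HKLM\System\CurrentControlSet\Services\VmsMp\Parameters" /v SendLmNetworkChangeIndication
```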



Workaround 1 (Recommended)

  • Add the registry value IntfQuarantineEnabled:
Key: HKLM\SYSTEM\CurrentControlSet\services\SharedAccess\Parameters\FirewallPolicy
Value name: IntfQuarantineEnabled
Type: DWORD, value data: 0

Note: I suggest this configuration only until Microsoft fixes the problem, not as a permanent solution. It disables only the specific feature known as “interface quarantine”, which is applied for a time after the Live Migration.
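
Workaround 1 could be applied, and later reverted, from an elevated command prompt as sketched below. Two assumptions are made here that this page does not confirm: that the value is set inside the affected VM (the quarantine is applied by the guest's firewall), and that the value does not exist by default, so deleting it restores the original behaviour.

```bat
rem Disable interface quarantine: IntfQuarantineEnabled = 0 (REG_DWORD).
reg add "HKLM\SYSTEM\CurrentControlSet\services\SharedAccess\Parameters\FirewallPolicy" /v IntfQuarantineEnabled /t REG_DWORD /d 0 /f

rem Later, once a proper fix is in place: remove the value again (assumed to restore the default).
reg delete "HKLM\SYSTEM\CurrentControlSet\services\SharedAccess\Parameters\FirewallPolicy" /v IntfQuarantineEnabled /f
```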


Workaround 2

  • Disable "DHCP Media Sense"
netsh interface ipv4 set global dhcpmediasense=disabled
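
To check the current setting before changing it, and to undo the change later, the same `netsh` context can be used. A short sketch; where to run it (presumably inside the VM) is an assumption.

```bat
rem Show the global IPv4 parameters, which include the DHCP Media Sense state.
netsh interface ipv4 show global

rem Undo Workaround 2 by re-enabling DHCP Media Sense.
netsh interface ipv4 set global dhcpmediasense=enabled
```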


Workaround 3

  • Disable the "Network List Service" service
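
One possible way to do this from an elevated command prompt, assuming the service's short name is `netprofm` (the usual name of the Network List Service; the first command verifies it):

```bat
rem Confirm the service name and its current state.
sc query netprofm

rem Prevent the service from starting at boot (the space after "start=" is required).
sc config netprofm start= disabled

rem Stop the running instance; this can fail while dependent services are still running.
sc stop netprofm
```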



Additional Information / Explanation

Interface quarantining is intended to secure network communications on networks that have not yet been classified. The idea is that once a network interface changes its network connection (connects to another network), its inbound connections must be restricted (quarantined) until the network is classified and the firewall engine sets the proper active profile. After the active profile filters have been applied, the quarantine restrictions must be removed (the interface is un-quarantined).

The quarantining policy is applied from the layer named FW_PLUMBER_SUBLAYER_QUARANTINE. To quarantine interfaces, this layer plumbs several persistent filters with a weight of 0. The resulting policy blocks all inbound traffic for non-loopback, non-tunneled connections, but allows inbound ICMPv6 ‘Neighbor Solicitation’ messages. To un-quarantine interfaces, non-persistent filters are applied per interface (and per protocol stack). These filters have a low weight, but one greater than 0, so that they override the blocking policy. As a result, the un-quarantining policy allows all inbound connections for the interface with the specific LUID (used as a condition in these filters), so the active profile policy gets full control over the traffic.

To distinguish interfaces that require quarantining from other interfaces, a new interface property was implemented. The property is named ‘epoch’ and is updated each time the interface receives a network event (media connect). It is literally a random number that changes with every new event (for the loopback pseudo-interface the epoch is always 0 and never changes).

The epoch value is used in the un-quarantining filters to bind them to a particular interface while it is connected to a particular network. When the connection changes, the epoch value changes as well, so the ‘old’ un-quarantining filters are no longer valid and the interface goes into quarantine under the persistent policy until new un-quarantining filters are applied to it.
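
To look at the filters described above on a live system, the current WFP filter set can be dumped with `netsh wfp`. This is only a sketch: the output path is arbitrary, and whether the quarantine filters can be recognised in the dump by the layer name or by the word "quarantine" is an assumption this page does not confirm.

```bat
rem Dump all WFP filters to an XML file (run from an elevated prompt),
rem then open the file in an editor and search it for the quarantine filters.
netsh wfp show filters file="%TEMP%\wfp-filters.xml"
```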

The time between a network event (quarantining) and un-quarantining is about 3 seconds (about 2 seconds between quarantining and the profile change in the “normal” case of a connection to a wired network). To increase reliability, an additional scenario implements a timeout for quarantining. In this scenario (‘Quarantine Safety Timeout’), all interfaces are un-quarantined 7 seconds after the last network event. This timeout was chosen because:

  • It is long enough to give NLM a chance to report on the network, at least to say that it is identifying it
  • It is short enough for a TCP retry to still succeed, even on a noisy network.

UDP-based applications, however, will suffer from this 7-second blocking period.