Recovering from a hostd unresponsive event

The hostd service is the lifeblood of your ESXi host. Because hostd is involved in so much of the behind-the-scenes work of ESXi, it is easily affected by changes in storage or network topology.

On a recent service call, a customer had decommissioned a SAN-attached LUN nearly a week before. One of their ESXi hosts suddenly became very interested in this now-unregistered and detached device. This resulted in the dreaded Permanent Device Loss (PDL) event. While a host is attempting to recover from a PDL or the equally painful All Paths Down (APD) event, hostd will become unresponsive. The host can then no longer communicate with vCenter, and vCenter displays the host and its associated VMs and resources as inaccessible, disconnected, or inactive.

In this instance, the VMs continued to run unaffected. Had the lost LUN been in use, the VMs utilizing that LUN would have been impacted, while any other VMs should continue to function as normal.

Step 1 – Clear the PDL or APD
The hostd process is obsessive about regaining access to the missing LUN. To clear this now dead path from the host’s configuration, we will make use of the localcli command. The localcli command has nearly the exact same functionality and syntax as the esxcli command. The only real difference is the localcli command bypasses hostd when running the commands.

DISCLAIMER: Use of the localcli command is not recommended without the assistance of VMware support. Please proceed with caution.

ESXi:~ # localcli storage core adapter rescan --all

This command will cause the host to rescan all storage adapters and simply accept what is discovered as the new configuration. In this instance, the following log entries were written to vobd.log and vmkernel.log.

vobd.log:

2016-03-04T19:29:42.070Z: [scsiCorrelator] 3733053265426us: [vob.scsi.scsipath.remove] Remove path: vmhba33:C0:T16:L0
2016-03-04T19:29:42.070Z: [scsiCorrelator] 3733053265588us: [vob.scsi.scsipath.remove] Remove path: vmhba33:C1:T16:L0

vmkernel.log:

2016-03-04T19:29:42.070Z cpu8:33096)ScsiPath: 6020: DeletePath : adapter=vmhba33, channel=0, target=16, lun=0
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: NMP: vmk_NmpSatpIssueTUR:1043: Device Unregistered Device path vmhba33:C1:T16:L0 has been unmapped from the array
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "Unregistered Device".
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: NMP: vmk_NmpSatpIssueTUR:1043: Device Unregistered Device path vmhba33:C1:T16:L0 has been unmapped from the array
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "Unregistered Device".
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: NMP: nmpPathClaimEnd:1204: Device, seen through path vmhba33:C1:T16:L0 is not registered (no active paths)
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: ScsiPath: 6098: Remove path: vmhba33:C0:T16:L0
2016-03-04T19:29:42.070Z cpu8:33096)ScsiPath: 6020: DeletePath : adapter=vmhba33, channel=1, target=16, lun=0
2016-03-04T19:29:42.070Z cpu8:33096)NMP: nmp_UnregisterDeviceEvents:2010: Failed to unregister events 0x180 on device "Unregistered Device". Not found
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: NMP: nmpUnclaimPath:1590: Physical path "vmhba33:C1:T16:L0" is the last path to NMP device "Unregistered Device". The device has been unregistered.
2016-03-04T19:29:42.070Z cpu8:33096)WARNING: ScsiPath: 6098: Remove path: vmhba33:C1:T16:L0
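If you would rather confirm which paths were dropped without reading the raw log, a small filter can pull the path names out of the vobd.log entries. The following is only a sketch: the function name removed_paths is ours, and it assumes the removal entries keep the exact "[vob.scsi.scsipath.remove] Remove path:" wording shown above.

```shell
# Print the unique runtime path names that vobd reported as removed.
# Reads vobd.log-style text on stdin (helper name is hypothetical).
removed_paths() {
    sed -n 's/.*\[vob\.scsi\.scsipath\.remove\] Remove path: \(vmhba[0-9]*:C[0-9]*:T[0-9]*:L[0-9]*\).*/\1/p' | sort -u
}

# Example using the two vobd.log entries from this incident.
removed_paths <<'EOF'
2016-03-04T19:29:42.070Z: [scsiCorrelator] 3733053265426us: [vob.scsi.scsipath.remove] Remove path: vmhba33:C0:T16:L0
2016-03-04T19:29:42.070Z: [scsiCorrelator] 3733053265588us: [vob.scsi.scsipath.remove] Remove path: vmhba33:C1:T16:L0
EOF
```

On a live host the same function could be fed the log file directly, for example removed_paths < /var/log/vobd.log, assuming the log lives at its usual location.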

Step 2 – Recover the management services
Since the host is no longer looking for the nonexistent LUN, we’ll turn our attention to the ailing ESXi services. Removing the dead storage paths from the configuration will likely still leave your host in a disconnected state. We now need to restart the management services to regain access to the host. We highly recommend following VMware’s KB article for restarting the management agents. Unfortunately, in this instance that process was not able to recover the services, so here is the modified process we followed.

ESXi:~ # /etc/init.d/hostd stop
ESXi:~ # /etc/init.d/vpxa stop
ESXi:~ # ps -Tcjstv | grep -i /sbin/snmpd
35759  35759  sh                             35759  User,Native    WAIT    UWAIT   0-31      0.8279    /bin/sh /sbin/watchdog.sh -s snmpd -u 60 -q 5 -t 10 /sbin/snmpd ++min=0,group=snmpd,securitydom=0
35769  35769  snmpd                          35769  User,Native    WAIT    UPOL    0-31      0.479833  /sbin/snmpd
43571  43571  grep                           43571  User,Native    WAIT    UPIPER  0-31      0.0       grep -i /sbin/snmpd
ESXi:~ # kill -9 35759
ESXi:~ # kill -9 35769
ESXi:~ # services.sh stop
ESXi:~ # services.sh start

It is important to note that the services.sh start command did not complete. It hung while starting the Likewise Agent but did start enough of the services to allow us to manage the host in vCenter.
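In the session above we copied the snmpd PIDs by hand from the ps output before killing them. As a rough sketch, the same lookup can be done with a filter. The function name snmpd_pids is ours, and it assumes the ps -Tcjstv column layout matches the output shown above (PID in the first column, process name in the third).

```shell
# Print the PID of every process whose command line mentions /sbin/snmpd,
# skipping the grep process itself (helper name is hypothetical).
snmpd_pids() {
    awk '/\/sbin\/snmpd/ && $3 != "grep" { print $1 }'
}

# Example using the ps output captured during this incident.
snmpd_pids <<'EOF'
35759  35759  sh                             35759  User,Native    WAIT    UWAIT   0-31      0.8279    /bin/sh /sbin/watchdog.sh -s snmpd -u 60 -q 5 -t 10 /sbin/snmpd ++min=0,group=snmpd,securitydom=0
35769  35769  snmpd                          35769  User,Native    WAIT    UPOL    0-31      0.479833  /sbin/snmpd
43571  43571  grep                           43571  User,Native    WAIT    UPIPER  0-31      0.0       grep -i /sbin/snmpd
EOF
```

On a live host you could pipe ps -Tcjstv into snmpd_pids and review the list before passing each PID to kill -9. Reviewing first matters, because the filter matches the watchdog shell as well as snmpd itself.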

Step 3 – vCenter
If you are patient, the host should reconnect to vCenter on its own. If not, you may want to force vCenter to reconnect to the host. To accomplish this, right-click on the disconnected host and click Connect.

Assuming the host is now connected, migrate the VMs off of the host. This can be done with manual vMotion, or automatically with maintenance mode and DRS. Once the machines have migrated, ensure the host has achieved maintenance mode and reboot the host. Once the host comes back online, verify storage connectivity and check for any dead paths.
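One way to check for dead paths after the reboot is to filter the output of esxcli storage core path list. The sketch below is not part of the original procedure: the function name dead_paths is ours, and it assumes each path record prints a "Runtime Name:" field followed later by a "State:" field, with dead paths reported as "dead".

```shell
# Print the runtime name of every path whose State is "dead".
# Reads `esxcli storage core path list` style output on stdin
# (helper name is hypothetical; field names are assumed as described above).
dead_paths() {
    awk -F': ' '
        $1 ~ /^ *Runtime Name$/          { name = $2 }   # remember the current path
        $1 ~ /^ *State$/ && $2 ~ /^dead/ { print name }  # report it if it is dead
    '
}

# Intended use on the host (not run here):
#   esxcli storage core path list | dead_paths
```

If the function prints nothing, no path was reported as dead. It is still worth confirming that the expected datastores are mounted before returning the host to service.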

If you are working with support, this would be a great opportunity to collect ESXi system logs. Right-click on the host and choose Export System Logs. Otherwise, you may bring the host back into service by taking it out of maintenance mode. Once out of maintenance mode, manually vMotion machines back to the host or allow DRS to balance the load of the cluster automatically.
