Recently I got a call from a customer: he was no longer able to log in to his ESXi 5.5 hosts through ssh, and could not vMotion VMs anymore. It seemed like the ssh daemon had died, and trying to start it again did not work.
I was able to log on to one of the hosts (DL380 G8) and have a look at the vmkernel.log file.
In the log file I saw a line that read:
WARNING: Heap: 3058: Heap_Align(globalCartel-1, 136/136 bytes, 8 align) failed. caller: 0x41802a2ca2fd
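If you want to check a copy of vmkernel.log for this warning without scrolling through it by hand, a small filter like the one below could help. This is only a sketch: the regular expression is derived from the single line quoted above, the function name find_heap_failures is made up for illustration, and other builds may format the message differently.

```python
import re

# Pattern built from the one warning line quoted in this post (an assumption:
# other ESXi builds may word or space the message differently).
HEAP_FAIL = re.compile(
    r"WARNING: Heap: \d+: Heap_Align\((?P<heap>[^,]+), "
    r"\d+/\d+ bytes, \d+ align\) failed"
)


def find_heap_failures(lines, heap="globalCartel-1"):
    """Return the lines that report a failed Heap_Align for the given heap."""
    hits = []
    for line in lines:
        match = HEAP_FAIL.search(line)
        if match and match.group("heap") == heap:
            hits.append(line.rstrip("\n"))
    return hits


sample = [
    "WARNING: Heap: 3058: Heap_Align(globalCartel-1, 136/136 bytes, 8 align) "
    "failed. caller: 0x41802a2ca2fd",
    "some unrelated vmkernel line",
]
hits = find_heap_failures(sample)
print(len(hits), "matching line(s)")
```

To scan a real file, pass an open file object instead of the sample list, for example find_heap_failures(open("vmkernel.log")) on a copy of the log.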
Google brought me to VMware KB article 2085618 with the title “ESXi host cannot initiate vMotion or enable services and reports the error: Heap globalCartel-1 already at its maximum size.Cannot expand” which sounded exactly like our problem, and seems to be caused by a memory leak in the hp-ams service.
And that’s where the fun started ….
By the way, hp-ams ended up on the hosts because they were installed with a custom HP vSphere installation ISO. To be more precise, VMware-ESXi-5.5.0-Update1-1746018-HP-5.74.27-Jun2014.iso was used, which contains the affected hp-ams vib, version 522.214.171.124-12.1198610.
The KB article suggests logging in through ssh and issuing the command /etc/init.d/hp-ams.sh stop
But my ssh daemon was not running anymore and could not be started.
So the next option was to use iLO to open a console connection and log in on the console. That succeeded on some hosts, but for every command I entered I got a message back saying “can’t fork”, so there was no way to shut down the hp-ams service ….
I tried some PowerCLI magic to see if I could stop this service: no go. I then tried to shut down some non-production VMs to free up some resources in the console (we had this issue on all 3 hosts in each of the two clusters, so six hosts in total), but I still was not able to shut down the hp-ams service, which is the first step of fixing the issue.
So the only alternative that seemed to be left was a full shutdown of all production VMs and a restart of all hosts … which did not sound like a good plan 🙁
Not being able to vMotion does really bite in a situation like this.
Talking to the customer, we figured out that both clusters had one host that happened to run only VMs that were not critical for production, so we decided to try and fix these hosts first. We shut down all VMs on them, rebooted the host, issued the /etc/init.d/hp-ams.sh stop command, removed the hp-ams vib and, just to be sure, rebooted the host a second time (probably not necessary, but better safe than sorry).
So these two hosts were back up, one in each cluster.
And now the good news … vMotion “to” an affected host does not work, but vMotion “from” an affected host to the fixed host did! So now we were able to evacuate an affected host to the fixed host, fix the now empty host, and move on to the next host.
So although we had to shut down some VMs, we were able to fix all hosts without having to shut down ALL VMs …
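For what it's worth, the rolling order we ended up with can be written down as a few lines of Python. This is just a planning sketch, not something we ran against vCenter: the function name rolling_fix_plan, the step labels and the host names esx1 to esx3 are all made up for illustration, and it assumes the already fixed hosts have enough capacity to take the VMs.

```python
def rolling_fix_plan(hosts, first_fixed):
    """Return the ordered steps for one cluster as (action, host, targets).

    hosts       -- all host names in the cluster (hypothetical names)
    first_fixed -- the host with only non-critical VMs, fixed by shutting
                   its VMs down and rebooting
    """
    fixed = [first_fixed]
    steps = [("shut_down_vms_and_fix", first_fixed, ())]
    for host in hosts:
        if host in fixed:
            continue
        # vMotion *from* an affected host to an already fixed host works.
        steps.append(("vmotion_vms_out", host, tuple(fixed)))
        # The host is now empty: reboot, stop hp-ams, remove the vib.
        steps.append(("fix_empty_host", host, ()))
        fixed.append(host)
    return steps


plan = rolling_fix_plan(["esx1", "esx2", "esx3"], first_fixed="esx3")
for step in plan:
    print(step)
```

Only the first host in each cluster costs VM downtime; every later host is emptied with vMotion before it is touched.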