Recently I ran into a situation where a customer suffered severe performance issues on a virtualized SQL server. Inside the SQL server we saw high CPU utilization, but the underlying ESX hosts showed only relatively low CPU utilization for this VM.
Debugging the VM performance issue with esxtop showed very high co-stop (%CSTP) values.
According to the vSphere Monitoring and Performance guide, %CSTP is:
Percentage of time a resource pool spends in a ready, co-deschedule state.
NOTE You might see this statistic displayed, but it is intended for VMware use only.
Funny how VMware states that this metric is intended for VMware use only.
Usually the %CSTP value is a good indicator of over-provisioning of vCPUs for a VM, as described for instance in Duncan Epping's ESXTOP post on yellow-bricks.com.
But in this case the VM was actually using all of its vCPUs and was the only 4 vCPU VM running on a 12-core host, so over-provisioning did not explain the high %CSTP values.
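If you capture esxtop in batch mode (esxtop -b writes CSV), a small script can flag the worlds with high co-stop for you. The sketch below is only an illustration under a few assumptions: that the co-stop columns contain the text "% CoStop" in their header, that a threshold of 3 is a sensible starting point for your environment, and the host and VM names in the sample are made up. The function name high_costop_columns is hypothetical.

```python
import csv
import io


def high_costop_columns(csv_text, threshold=3.0, marker="% CoStop"):
    """Return {column_name: peak_value} for co-stop columns above threshold.

    Assumption: co-stop columns can be recognised by `marker` in the header.
    Adjust `marker` and `threshold` to match your own esxtop output.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Indices of all columns whose header mentions the co-stop marker.
    wanted = [i for i, name in enumerate(header) if marker in name]

    peaks = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip empty or malformed samples
            peaks[header[i]] = max(peaks.get(header[i], 0.0), value)

    # Keep only the columns whose peak exceeds the threshold.
    return {name: peak for name, peak in peaks.items() if peak > threshold}


# Made-up sample data, shaped like a tiny batch-mode capture.
SAMPLE = "\n".join([
    r'"Time","\\esx01\Group Cpu(1001:sqlvm)\% CoStop","\\esx01\Group Cpu(1002:webvm)\% CoStop"',
    '"10:00:00","12.5","0.2"',
    '"10:00:05","18.1","0.4"',
])

for name, peak in high_costop_columns(SAMPLE).items():
    print(f"{name}: peak co-stop {peak}%")
```

On a real capture you would read the CSV file into a string and pass that in instead of SAMPLE.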
Eventually we found a VMware KB article called “High co-stop (%CSTP) values seen during virtual machine snapshot activities”.
The VM where we saw the issues was a SQL database server and had two large chained snapshots, taken because of some upgrades on the databases.
We were fully aware that having a snapshot has some performance impact on a VM, but in this case the impact on the SQL database was dramatic, way more than anticipated.
So if you have high I/O VMs, please only use snapshots for very limited periods. As the KB article says:
As the size and number of snapshots on a virtual machine increase, so does the number of storage command operations within vmkernel. For each storage command issued by the virtual machine guest OS, multiple storage command operations may be necessary to traverse the entire snapshot chain to read the most appropriate block of data.
On production VMs with snapshots and high storage I/O (storage commands), or VMs with multiple snapshots in a chain, the size and complexity of snapshots can increase rapidly. This, in turn, can lead to an increase in the storage I/O required to complete each guest OS I/O action.
We had both, which killed the performance for our SQL server …
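
To make the KB explanation a bit more tangible, here is a toy model of a snapshot chain. It is not how the vmkernel actually implements delta disks (real disks use grain tables, caching and so on); it only illustrates the idea that a single guest read may have to visit several layers before the most recent copy of a block is found. All names (read_block, base, snap1, snap2) are made up for this example.

```python
def read_block(chain, block):
    """Look up a block in a snapshot chain, newest layer first.

    `chain` is a list of dicts ordered from base disk to newest delta.
    Returns (data, lookups), where lookups is the number of layers visited.
    """
    lookups = 0
    for layer in reversed(chain):
        lookups += 1
        if block in layer:
            return layer[block], lookups
    raise KeyError(block)


# Base disk with three blocks, plus two chained snapshots (deltas)
# that each hold only the blocks written after they were taken.
base = {0: "base-0", 1: "base-1", 2: "base-2"}
snap1 = {1: "snap1-1"}
snap2 = {2: "snap2-2"}

no_snapshots = [base]
two_snapshots = [base, snap1, snap2]

for block in (0, 1, 2):
    _, plain = read_block(no_snapshots, block)
    data, chained = read_block(two_snapshots, block)
    print(f"block {block}: {plain} lookup without snapshots, "
          f"{chained} with two snapshots ({data})")
```

In this model a block that was never rewritten (block 0) is the worst case: every read of it has to walk the whole chain down to the base disk, and each extra snapshot adds another step.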