Recently I saw some unexplained LUN trespasses on an EMC VNX that is used in a vSphere 5 environment where we use VAAI.
Since we use pools on the VNX, it is advised to keep a LUN on the owning SP, to prevent unnecessary traffic over the internal bus between SPA and SPB. EMC says:
Avoid trespassing pool LUNs. Trespassing the pool LUNs to another SP may adversely affect performance. After a pool LUN trespass, a pool LUNs private information remains under control of the original owning SP. This will cause the trespassed LUNs I/Os to continue to be handled by the original owning SP. When this happens both SPs being used in handling the I/Os. Involving both SPs in an I/O increases the time used to complete an I/O.
When we investigated the cause of these trespasses, we noticed that they seemed to happen during a Storage vMotion. We use EMC PowerPath/VE in this environment, and the vmkernel.log suggested that PowerPath/VE was responsible for the trespasses:
PowerPath: EmcpEsxLogEvent:1252: Info:Mpx:Assigned volume 6006016029302F00A8445F18F351E111 to SPA
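To spot these events in a larger log, a small script can collect the messages per volume. The following is only a minimal sketch: it assumes the messages look exactly like the line above (the rest of a real vmkernel.log line, such as timestamps, is simply ignored), and the function name find_sp_changes is made up for this example.

```python
import re
from collections import defaultdict

# Matches the PowerPath/VE message shown above. The pattern is an assumption
# based on that single sample line; other PowerPath versions may log differently.
ASSIGNED = re.compile(r"Mpx:Assigned volume ([0-9A-Fa-f]{32}) to (SP[AB])")


def find_sp_changes(lines):
    """Return {volume_id: [SP, SP, ...]} in the order the messages appear."""
    history = defaultdict(list)
    for line in lines:
        match = ASSIGNED.search(line)
        if match:
            history[match.group(1)].append(match.group(2))
    return dict(history)


if __name__ == "__main__":
    sample = [
        "PowerPath: EmcpEsxLogEvent:1252: Info:Mpx:Assigned volume "
        "6006016029302F00A8445F18F351E111 to SPA",
        "some unrelated vmkernel message",
    ]
    for volume, owners in find_sp_changes(sample).items():
        print(volume, "->", ", ".join(owners))
```

A volume that shows up with alternating SPA and SPB entries is a candidate for the back-and-forth trespassing described further down.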
We could not understand why PowerPath decided to trespass these LUNs.
We opened a case at EMC and were pointed to EMC KB article emc287223.
(EMC confirmed this is also true for PowerPath/VE 5.7.)
This article says:
So what seems to happen is this: when you do a Storage vMotion on a VNX from a datastore on one SP to a datastore on the other SP, and a certain amount of traffic has passed over the internal bus between the SPs, the VNX decides it is better if both LUNs that hold the datastores are owned by the same SP, so that the traffic caused by the Storage vMotion stays on one SP.
The effect of this decision is that the Storage vMotion traffic no longer traverses the internal bus, but as a side effect all host traffic to the trespassed LUN, which I consider more important than the Storage vMotion traffic, now traverses the internal bus.
The scary part is this part:
In the event where there are both high application I/O and VAAI XCOPY I/Os, it is possible that the LUNs will trespass back and forth.
That is not what we would like to see.
I would prefer to see some more intelligence in the VNX that would say: “OK, I know this is Storage vMotion traffic I am sending over my internal bus, so if it takes a little longer to complete, that is fine, as long as I keep the path for my host traffic optimal by keeping the LUN on the owning SP.” In other words, optimize for host traffic, not for Storage vMotion traffic.
The article also says the message that PowerPath/VE writes to the vmkernel.log file is not the correct message. Instead of “Assigned volume to SPx”, which suggests PowerPath/VE is responsible for the trespass, it should read “followed to SPx”.
I was told EMC engineering is aware of this issue, and I hope they are working on a solution.
By the way, we think this happens not only with PowerPath/VE but also with VMware’s native multipathing, since this seems to be a VNX issue and not a PowerPath/VE specific issue. We have not confirmed this in tests yet.
So to complete this story, why is this only happening when I use VAAI?
That’s pretty simple to explain. If I do a Storage vMotion without VAAI, the host is responsible for moving the data between both datastores. So the host reads from a LUN on SPA and writes to a LUN on SPB, and no traffic passes the internal bus. With VAAI XCOPY the copy is offloaded to the array, so the data moves between the SPs over the internal bus instead.
So if this is an issue in your environment, you have two possible workarounds:
- Only do Storage vMotions between LUNs on the same SP, like EMC suggests. This could be an issue, especially when using Storage DRS (unless, of course, you only create Datastore Clusters with LUNs on the same SP, so one Datastore Cluster for LUNs on SPA and one Datastore Cluster for LUNs on SPB)
- Disable the VAAI XCOPY primitive (HardwareAcceleratedMove)
See the VMware KB article “Disabling the VAAI functionality in ESX/ESXi” for how to do this.
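As an illustration of the second workaround, here is a small sketch that builds the esxcli command for the HardwareAcceleratedMove setting. The option path /DataMover/HardwareAcceleratedMove and the esxcli syntax are the ones VMware documents for ESXi 5.x; the helper names xcopy_command and set_xcopy are made up for this example, and the script assumes it runs somewhere esxcli is on the path (for example the ESXi shell). By default it only returns the command as text and changes nothing.

```python
import subprocess

# Advanced setting that controls the VAAI XCOPY primitive on ESXi 5.x.
XCOPY_OPTION = "/DataMover/HardwareAcceleratedMove"


def xcopy_command(enabled):
    """Build the esxcli argument list that enables (1) or disables (0) XCOPY."""
    value = "1" if enabled else "0"
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "--int-value", value,
        "--option", XCOPY_OPTION,
    ]


def set_xcopy(enabled, dry_run=True):
    """Return the command line; only execute it when dry_run is False."""
    cmd = xcopy_command(enabled)
    if not dry_run:
        # Assumption: esxcli is available on this machine.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # Dry run: show the command that would disable XCOPY on this host.
    print(set_xcopy(enabled=False))
```

The setting is per host, so it has to be applied on every ESXi host that can perform a Storage vMotion between the affected datastores.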
If you do not see this as a major problem in your environment, just log in to your VNX now and then and trespass the LUNs back to their owning SPs.
Thank you so much for this post. I have been working with EMC on this issue for the past 6 weeks. I can verify that this does happen with NMP without PowerPath. Our LUNs were trespassing during a Storage vMotion and it would bring our VMs down to a crawl since all I/Os were going through one SP!!!! Disabling VAAI worked for me. I get that the host is now doing the Storage vMotion and that it is slower, but I would rather have this than LUNs trespassing and choking our VMs. Thank you!!!!!