Over the last few weeks I have been working on some serious issues in an environment running vSphere 5 with an EMC VNX storage array. Everything seemed to run fine, but whenever we started a Storage vMotion, we noticed all kinds of strange errors we were not expecting at all.
We saw messages regarding write-quiesced VMFS volumes, we lost paths, and in some cases the Storage vMotions did not complete at all.
During these Storage vMotions we noticed datastore latency peaked at more than 5 seconds on both the source and destination LUNs.
When we took a closer look at the vmkernel.log files on the hosts, we noticed the problems always seemed to start a little after the Storage vMotion was started, just after a trespass of the LUN on the VNX (for the reason behind this trespass, see my previous article Unexplained LUN trespasses on EMC VNX explained …)
The VNX SP collects were analyzed by EMC, but did not show any information that pointed to a possible issue on the array.
We then ran several tests to try to narrow down the problem and find a solution.
In this setup we use EMC PowerPath/VE, so one of the things we did was change the claim rules on one of the hosts so that VMware's NMP was used for the LUNs, to make sure we were not running into an issue with PowerPath. We also enabled debug logging on the HBAs to see if that would provide us some more details on what was happening.
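For reference, switching a device from PowerPath/VE to NMP can be done from the ESXi shell with a device-level claim rule. This is only a sketch: the rule number and the naa device ID below are placeholders, not the values we actually used, and you should check for free rule numbers on your own host first.

```shell
# Add a device-level claim rule so NMP (instead of PowerPath/VE) claims this LUN.
# Rule 200 and the naa ID are placeholders for illustration.
esxcli storage core claimrule add --rule 200 --type device \
    --device naa.60060160a0b01234567890abcdef1234 --plugin NMP

# Load the new rule into the VMkernel, release the device, and re-run claiming.
esxcli storage core claimrule load
esxcli storage core claiming unclaim --type device \
    --device naa.60060160a0b01234567890abcdef1234
esxcli storage core claimrule run

# Verify which multipathing plugin now owns the device.
esxcli storage core device list --device naa.60060160a0b01234567890abcdef1234
```

Unclaiming only succeeds when the device is not in active use, so this is easiest on a host in maintenance mode or on a LUN without running VMs.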
This seemed to rule out PowerPath as a possible cause for our problems.
When we did a Storage vMotion where both the source and destination LUN were owned by the same SP, so the trespass did not occur, we had no issues.
EMC suggested disabling VAAI completely, but if we did that, the trespass would not happen in the first place, so we would probably not see this issue at all.
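While testing scenarios like this, it helps to know exactly which VAAI primitives a device currently reports. A minimal sketch from the ESXi 5.x shell, again with a placeholder naa ID:

```shell
# Show which VAAI primitives (ATS, Clone/XCOPY, Zero, Delete) the device supports.
esxcli storage core device vaai status get \
    --device naa.60060160a0b01234567890abcdef1234

# List the host-side advanced option that controls the VAAI full-copy offload
# used by Storage vMotion.
esxcli system settings advanced list --option /DataMover/HardwareAcceleratedMove
```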
During our testing, “Jeeves” pointed us to EMC KB article emc283239:
In this environment we use Pool LUNs, we use VAAI, and the RecoverPoint enablers were indeed installed on our system. Since we had been troubleshooting this issue for about four weeks and were basically running out of options, we decided it was worth a try and uninstalled the RecoverPoint enabler.
It did not take long before we could confirm that we had indeed hit this particular issue on the VNX.
Where a Storage vMotion of a VM used to take 10 minutes with lots of errors and warnings in the log, we now completed two VMs in less than 3 minutes with normal latency.
According to the KB article, EMC is working on a permanent fix for this problem.
If you suffer from this issue, you can use one of the two workarounds suggested in emc283239: uninstall the RecoverPoint enabler if you are not using RecoverPoint, or disable the VAAI HardwareAcceleratedMove primitive if you do use RecoverPoint.
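For the second workaround, the full-copy (XCOPY) primitive can be toggled per host through an advanced setting; a minimal sketch for ESXi 5.x, where 0 disables the offload and 1 re-enables it:

```shell
# Disable the VAAI full-copy (XCOPY) offload used by Storage vMotion.
esxcli system settings advanced set \
    --option /DataMover/HardwareAcceleratedMove --int-value 0

# Check the current value.
esxcli system settings advanced list --option /DataMover/HardwareAcceleratedMove

# Re-enable the offload once a fixed VNX OE release is in place.
esxcli system settings advanced set \
    --option /DataMover/HardwareAcceleratedMove --int-value 1
```

Note this disables the copy offload for all datastores on that host, so Storage vMotions fall back to software data movement until you re-enable it.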
Thanks to all the people involved in finding the root cause for the problems we experienced.
The EMC Primus case was updated on the 11th of May and now states this should be fixed in VNX OE 05.31.000.5.709 for Block and CX4 FLARE 04.30.000.5.524.
We have not confirmed this yet.
On the 29th of May we reinstalled the RecoverPoint enabler and ran some additional tests on our VNX with VNX OE 05.31.000.5.709. Although latency was not as high as with the previous VNX OE version, we still suffered from lost paths and virt-resets during our Storage vMotions, so it seems this issue is not fixed in VNX OE 05.31.000.5.709.
EMC is currently investigating.
Had a couple of debugging sessions with some very sharp EMC engineers, who ran debugging commands while we performed Storage vMotions with VAAI enabled.
EMC confirmed they see some unexpected messages just after the implicit trespass that happens during a Storage vMotion when the source and target LUNs are on different SPs, and confirmed our issues do indeed seem to be related to this implicit trespass. We have another session scheduled in a few days to collect more debug info to help EMC get to the root cause of our issues.
Made a backlink to your site at http://www.vclouds.nl
Nice site, now you just need to fill it 🙂
Excellent posts from in the field! Keep them coming
Same problem here, although I was chasing the wrong primitive. We were told that it was the thin recovery stuff, which we dutifully disabled. Still saw problems for months, off and on. I happened to discover the other day that I had fully disabled VAAI on 3 of my 6 hosts, while leaving it on for the other 3. So I went, hey, let’s try turning it back on. BAD CHOICE. I don’t know if I fully panicked the SPs last night, but I was doing a lot of svmotions, and the thing just went bananas. Thanks for the article. Please update if you get a fix!
Any update on your problems?
Unfortunately not really.
EMC was not able to reproduce our exact problem (though they tried very hard, I must say), and we were not able to do more additional testing in our production environment …