Updated: Problem with downloading VIBs from https://vmwaredepot.dell.com/index.xml

Update as of January 31st

Seems like Dell fixed their repository and vLCM can now download the VIBs again as it should when adding a VIB to an image.

Here is the original article I wrote:

We have been using the Dell depot in vLCM for a while, but for the past couple of days we have been seeing issues when we try to include some VIBs in an image. Including the VIB in the image works, but as soon as we try to remediate a host, vCenter tries to download the VIB, which fails, and thus remediating the host fails.

We added the Dell repository to our vLCM config:

The next thing we do is add a VIB from this repository to the image we will apply to our cluster:

But when we try to remediate the cluster, it tries to fetch the VIB from the Dell repo but fails …

It seems like the VIB that needs to be downloaded is not available at the URL vLCM gets from the Dell repository …

I am able to access https://vmwaredepot.dell.com from my vCenter Appliance, but trying to fetch the VIB gives me a 404 error. (In this screenshot a proxy is used, but even with a direct connection it fails.)
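You can reproduce this by hand from the vCenter appliance shell with curl. A minimal sketch; the VIB path below is a hypothetical example, use the URL from your own index.xml or from the vLCM error:

    # the depot index itself is reachable
    curl -I https://vmwaredepot.dell.com/index.xml

    # fetching a VIB at the path listed in the depot metadata returns HTTP 404
    # (hypothetical path, for illustration only)
    curl -I https://vmwaredepot.dell.com/DEL/vib20/example-driver/example-driver.vib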

Seems to me like Dell changed the directory structure on https://vmwaredepot.dell.com but forgot to update the XML index file … I am getting in touch with Dell to get this fixed and will update this article when I get more info.

Dell PowerStore warning when using two storage networks (can be ignored ….)

While configuring a Dell PowerStore 1000T for NVMe over TCP, I opted to use two separate storage networks for availability purposes, similar to the two storage fabrics used in a Fibre Channel SAN. So each host has two NVMe over TCP NICs, each connecting to a separate switch, and each switch only carries the VLAN of one of the two storage networks.
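On the ESXi host side, this means one VMkernel interface per storage network, tagged for NVMe over TCP. A minimal sketch of what that looks like in the ESXi shell; vmk1/vmk2, the vmhba names, and the IP addresses are hypothetical and depend on your own setup:

    # tag one VMkernel interface per storage network for NVMe/TCP
    esxcli network ip interface tag add -i vmk1 -t NVMeTCP
    esxcli network ip interface tag add -i vmk2 -t NVMeTCP

    # discover the PowerStore NVMe subsystem over each storage network
    esxcli nvme fabrics discover -a vmhba65 -i 192.168.10.10 -p 4420
    esxcli nvme fabrics discover -a vmhba66 -i 192.168.20.10 -p 4420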

For this to work (you can only configure the same subnet on the same front-end port of each node), the cabling in my scenario has to be as follows: port 2 of both Node A and Node B connects to the first switch, which only carries storage network 1, and port 3 of both Node A and Node B connects to the second switch, which only carries storage network 2.

So far so good, but as soon as I connected both ports 2 to switch 1 and both ports 3 to switch 2, the PowerStore showed me an error: “Appliance port pairs are connected to the same data switch”. Which is by design in my case, and in line with Dell’s best practice of using two storage networks for NVMe …

But luckily there is a Dell knowledge base article (this one talks about iSCSI but it’s the same for NVMe) that says:

So in this scenario, “the warning can be safely ignored” which is what I did 🙂

Possible front-end port oversubscription on Dell PowerStores

While architecting an environment with a Dell PowerStore for a customer, I noticed some interesting details in the “Dell PowerStore: Best Practices Guide”.

First, a part about the on-board mezzanine card:

This means the ports on the mezzanine card have about 63.04 Gbit/sec of bandwidth available in total. When using 4 ports at 25 Gbit/sec, we cannot use the full bandwidth these ports offer. So what about the I/O module slots?

This part is about the I/O module slots: slot 0 is 16-lane PCIe Gen3, which gives you about 126 Gbit/sec and should be sufficient for the 4-port 25 Gbit/sec card …

But ….. Regarding the I/O module:

So the 4 x 25 Gbit/sec I/O module itself is 8-lane PCIe Gen3 … Even in slot 0 you cannot use the full bandwidth of this card …
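For reference, these numbers follow from the PCIe Gen3 lane math; a quick back-of-the-envelope check:

    PCIe Gen3: 8 GT/s per lane x 128b/130b encoding ≈ 7.88 Gbit/sec per lane
     8 lanes:  8 x 7.88 ≈ 63.04 Gbit/sec  (mezzanine card, or an 8-lane I/O module)
    16 lanes: 16 x 7.88 ≈ 126 Gbit/sec    (slot 0)
    4 ports x 25 Gbit/sec = 100 Gbit/sec of raw port bandwidth

So 100 Gbit/sec of ports behind an 8-lane connection is oversubscribed, while 16 lanes would have been enough.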

Since a later PowerStore OS release there is a new 2-port Ethernet card that supports speeds of up to 100 Gb/s. This 100 GbE card is supported on PowerStore 1000-9200 models in I/O module slot 0, the 16-lane PCIe Gen3 slot, which is still limited to 126 Gbit/sec …

If you want to combine Ethernet connectivity and Fibre Channel, it gets even more interesting, since the preferred slot for the 4-port 32 Gbit/sec FC I/O module is slot 0 … In slot 0 it can run at full capacity, but when adding more FC ports in slot 1, the combined port bandwidth is again limited to 63.04 Gbit/sec.

So when designing a high bandwidth PowerStore environment, keep this in the back of your mind ….

The ramdisk ‘var’ is full on an ESXi host, and the fix (without maintenance mode or reboot)

I recently encountered an issue where vMotions on a host would fail, the host would disconnect from vCenter, and some other strange errors occurred.

This was an HP host installed with the HP ISO image, but I am not sure whether that is the cause of this issue.

When investigating the logs on the host, I noticed that /var on the ramdisk was full.

When issuing vdf -h, the available space for /var on the ramdisk was 0%.

Looking in /var/log, I noticed all logfiles were symlinks to /scratch, except for the EMU directory, where some Emulex process seemed to be filling up a log file …

After removing the logfile /var/log/EMU/mili/mili2d.log and restarting hostd, space was freed up on /var in the ramdisk, but the logfile returned and started filling up again.
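For reference, this is roughly what that temporary cleanup looks like in the ESXi shell (just a sketch of the steps described above, and as noted, it only buys time since the log comes back):

    # check ramdisk usage: 'var' shows 0% free
    vdf -h

    # everything in /var/log is symlinked to /scratch, except the EMU directory
    ls -l /var/log

    # remove the Emulex log filling the ramdisk, then restart hostd to release the space
    rm /var/log/EMU/mili/mili2d.log
    /etc/init.d/hostd restart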

Googling around, I found a suggestion to remove the Emulex VIBs when not using an Emulex HBA, but these hosts did have Emulex HBAs.

After some more research I found a fix that did not need a reboot or maintenance mode (which is great, since vMotion stopped working on these hosts):

Continue reading

Fixing problems with in-context log viewing from vROps to Log Insight

I had an issue at a customer where I was not able to use the in-context log viewing in vROps to view the logs for the ESXi servers in Log Insight. This was with vROps 6.6.1 and Log Insight 4.5.1.

The first part of the solution was to use FQDNs on the ESXi hosts. Only short hostnames were configured on the ESXi hosts, which probably caused Log Insight to be unable to match the logs it received from the ESXi hosts to the registered hosts it learned about from vCenter. Because of this, all hosts were missing the vmw_vr_ops_id metadata, which vROps passes to Log Insight to find the logs for the correct host.
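Switching a host from a short name to an FQDN can be done from the ESXi shell; a minimal sketch, where esx01 and example.com are of course placeholder names:

    # set host name and domain so the host reports a proper FQDN
    esxcli system hostname set --host=esx01 --domain=example.com

    # verify the result
    esxcli system hostname get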

After fixing this, one host still had no vmw_vr_ops_id metadata.

Seems like, for whatever reason, matching the ESXi hostname to the name used to register the host in vCenter is case sensitive. After changing the case of the hostname on the ESXi server, the match was made, the metadata was added, and the in-context log search worked … Probably a bug …

vSphere Distributed Switch refused to upgrade to version 6.5

Just after vSphere 6.5 was released I decided to upgrade my lab to 6.5. Most of the upgrade went pretty smoothly, but two of my three distributed switches refused to upgrade. Googling for a solution did not help much, probably since the product had been released just a day before 🙂 When I tried to upgrade, I got a message that the vDS config could not be read. I also noticed I was not able to upgrade these switches to enhanced LACP.

I did find some KB articles regarding wrong vCenter database entries for LACP in previous upgrades, so I had a feeling this was related to LACP (which I do not use) … Continue reading

vCenter 6.5 upgrade did not recognise vSphere 6.0 Platform Services Controller version

When I tried to upgrade my lab environment from vCenter 6.0 with an external PSC to vCenter 6.5, I ran into an annoying issue. I tried to upgrade my PSC, but the installer was not able to determine the version of my current PSC. It assumed it was 5.5 and asked me to confirm this, which of course I did not. There was no way to tell it that it was really 6.0 …

Continue reading

Invalid credentials message when registering vCenter Server with external Platform Services Controller in vSphere 6

When vSphere 6 was released, I decided to delete the RTM versions of my external Platform Services Controller and vCenter Server appliances and replace them with the GA versions.

Installing the PSC went fine, but when installing the vCenter appliance, I was not able to register it with the PSC. I kept getting the message “Invalid credentials” every time I entered the SSO administrator password. I redeployed the PSC several times, using different passwords, but had no luck registering the VCSA.

Continue reading

Nasty HP software bug hits vSphere 5.1 and 5.5, and helpful info to fix it

Recently I got a call from a customer: he was not able to log in to his ESXi 5.5 hosts through SSH anymore, and could not vMotion VMs anymore. It seemed like the SSH daemon had died, and trying to start it again did not work.
I was able to log on to one of the hosts (DL380 G8) and have a look at the vmkernel.log file.
In the log file I saw a line that read:
WARNING: Heap: 3058: Heap_Align(globalCartel-1, 136/136 bytes, 8 align) failed. caller: 0x41802a2ca2fd
Google brought me to VMware KB article 2085618, titled “ESXi host cannot initiate vMotion or enable services and reports the error: Heap globalCartel-1 already at its maximum size.Cannot expand”, which sounded exactly like our problem and seems to be caused by a memory leak in the hp-ams service.
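A quick way to check whether a host is affected is to grep the vmkernel log for these heap warnings (a sketch, assuming the default log location):

    # look for globalCartel heap allocation failures
    grep globalCartel /var/log/vmkernel.log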

And that’s where the fun started ….

Continue reading

Adding RecoverPoint 4.1 with the SRM RecoverPoint SRA 2.2 fails with error “SRA command ‘discoverArrays’ failed” UPDATED

During the installation and configuration of an SRM solution for a customer based on EMC RecoverPoint 4.1, I ran into an interesting issue.

When I tried to add the RecoverPoint Clusters on both sites using the RecoverPoint SRA 2.2 I received the following error message:

[Screenshot: “SRA command ‘discoverArrays’ failed” error]

Continue reading