The case of intermittent storage on vSphere

Recently, I upgraded my home lab to vSphere 4 from ESX 3.5U5.  The upgrade went smoothly.
My older system is an HP DL585 G1 with 32GB RAM.  It has the HP Insight agents installed.  I am aware of the VMware 1a update that fixes various problems with HP agents and the vSphere upgrade process.  In short, the VMware article about it describes far worse things than what happened to me.
When I performed the upgrade (before anyone knew of the 1a issue) I had no problems.  After I found out about the issues, I quickly upgraded to 1a (or so I thought, as you’ll see later).  A short time thereafter I started having problems.  At first they looked like virtual switching problems.
I ran continuous pings against my VMs, and every 5 to 10 minutes all of the VMs would drop off the network for about 60 seconds.  My VM backups also failed, because each backup needed longer to finish than the gap between connectivity losses.  I created new vSwitches, moved NICs around, checked cabling, swapped in known-good cabling, etc.  I tried everything at the network layer.
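A rough sketch, not part of the original troubleshooting, of how continuous ping results could be boiled down to outage windows. It assumes each ping has already been recorded as a (timestamp in seconds, succeeded) pair; the function name find_outages and the sample format are made up for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical sample format: (timestamp in seconds, True if the ping got a reply)
Sample = Tuple[float, bool]


def find_outages(samples: Iterable[Sample],
                 min_duration: float = 0.0) -> List[Tuple[float, float]]:
    """Return (start, end) windows covering runs of consecutive failed pings.

    start is the time of the first failed ping in a run; end is the time of
    the first successful ping after it (or the last failed ping if the data
    ends while the target is still unreachable).
    """
    outages: List[Tuple[float, float]] = []
    start = None      # first failed ping of the current run, if any
    last_fail = None  # most recent failed ping
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            # A reply after a run of failures closes the window.
            if ts - start >= min_duration:
                outages.append((start, ts))
            start = None
    # Data ended in the middle of an outage.
    if start is not None and last_fail - start >= min_duration:
        outages.append((start, last_fail))
    return outages
```

Running this over the ping history of every VM and comparing the windows would show quickly whether all VMs drop at the same moments, which points at the host rather than at any one VM.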
After about 2 days of troubleshooting turmoil, I was ready to give up, save my VMs off to remote storage and rebuild from scratch…when voila!  I tried to download all the VM folders and noticed that during the "outage" I couldn’t access storage.  I was thinking, why can’t I access this?  It’s all DAS…no NAS or iSCSI.  That’s when I put down the networking "issue" and started to think about 1a, the software HP uses, and how it might affect storage.
The VMware article doesn’t mention anything about intermittent connectivity to VMs (or as we now know, storage, which is causing the connectivity "issue").  As the article mentions, it has to do with HP agents, rpmdb and "locking".  I took a guess that something was getting locked temporarily, then being released, after which everything worked again.
I looked at various logs but couldn’t find anything that would point the finger at any particular process (partly because I didn’t have time to comb through a 224MB log dump).
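With a log dump that size, one way to cut it down is to keep only the lines written around the known outage times. The following is a minimal sketch under stated assumptions, not something used in the original troubleshooting: it assumes each log line begins with a syslog-style timestamp such as "Jun 10 12:05:10" (no year, English month names), and the name lines_near_outages, the year argument and the 30 second margin are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, Sequence, Tuple

Window = Tuple[datetime, datetime]  # (outage start, outage end)


def lines_near_outages(lines: Iterable[str],
                       windows: Sequence[Window],
                       year: int,
                       margin: timedelta = timedelta(seconds=30)) -> Iterator[str]:
    """Yield only the log lines stamped within `margin` of an outage window.

    Assumption: the first 15 characters of a line are a syslog-style
    timestamp ("Mon DD HH:MM:SS"). Syslog omits the year, so the caller
    supplies it. Lines that do not parse are skipped.
    """
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation line or a different log format
        if any(start - margin <= stamp <= end + margin for start, end in windows):
            yield line
```

Because it works on any iterable of lines, an open file object can be passed in directly, so the whole dump never has to be loaded into memory at once.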
So then I took a look at "re-patching" 1a.  That’s where Update Manager (UM) gets involved.  UM showed that my host was compliant with all updates, but it was really just kidding.  Here’s the process I used to "fix" my intermittent storage/networking connectivity issues:

1. Uninstalled UM

2. Reinstalled UM

3. Re-downloaded all the updates (for this problem, the VMware-only updates would have sufficed)

4. Re-scanned the host (this time it showed as out of compliance)

5. Because of the hiccups with storage, and thus networking, I used the “Staging” option to copy the 1a update to the host before remediating

6. Powered down the VMs and put the host into maintenance mode (I don’t have a second host)

7. Remediated the host; UM rebooted it

8. Powered on the VMs

9. Everything has been stable ever since

I haven’t had a problem since!  So there’s more to the update than meets the eye.  And UM appears to have that same old VMware read/refresh problem.  :>)


Hopefully, my experience is another nugget to put into the tool belt.
Take care>>>Dustin