PAS with NSX-T Tip: use a fresh IP Block

I’ve fought with this for an embarrassingly long time.  I had a failed PAS (Pivotal Application Services) deployment (I’d missed several of the NSX configuration requirements), removed the cruft, and tried again and again and again.  In each case, PAS and NCP would deploy but fail on the PAS smoke_test errand. The error message said more detail was in the log.

Which Log?!

I ssh’d into the clock_global VM and found the smoke_test logs. They stated that the container for instance {whatever} could not be created, with error NCP04004. This pointed me to the Diego cells (where the containers would be created), so I poked around in the /var/vcap/sys/log/garden logs there. They stated that the interface for the instance could not be found. Ok, this is sounding more like an NSX problem.
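
If you’re retracing these steps, this is roughly how I chased the logs. The deployment name (cf) is hypothetical; use `bosh deployments` to find yours, and the exact log file names under /var/vcap/sys/log can vary by PAS version.

    # hop onto the clock_global VM (deployment name is hypothetical)
    bosh deployments
    bosh -d cf ssh clock_global/0

    # search the job logs for the smoke test failure
    sudo grep -ri "NCP04004" /var/vcap/sys/log/

    # on a Diego cell, dig through the garden logs mentioned above
    sudo grep -ri "could not be found" /var/vcap/sys/log/garden/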

I ended up parsing through the NSX Manager event log and found this gem:

IP Block Error

Ah-ha! Yup, I’d apparently allocated a couple of /28 subnets from the IP Block. So when the smoke test tried to allocate a /24, the “fixed” subnet size had already been set to /28, causing the error.

The resolution was simply to remove all of the allocated subnets from the IP Block. This could have been avoided by either not reusing an existing IP Block or using the settings in the NCP configuration to create a new IP Block with a given CIDR.
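
If you want to see (and clean up) what has been carved out of an IP Block without clicking through the UI, the NSX-T Manager API exposes the allocated subnets. Here is a rough sketch with curl; the manager address is a placeholder, and you should confirm the exact endpoints against the API guide for your NSX-T version.

    # list IP Blocks to find the block ID
    curl -k -u admin 'https://nsxmgr.lab.local/api/v1/pools/ip-blocks'

    # list subnets allocated from that block
    curl -k -u admin 'https://nsxmgr.lab.local/api/v1/pools/ip-subnets?block_id=<block-id>'

    # delete a leftover subnet allocation
    curl -k -u admin -X DELETE 'https://nsxmgr.lab.local/api/v1/pools/ip-subnets/<subnet-id>'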


Removing NSX-T VIBs from ESXi hosts

I’d wanted to revert my environment from (an incomplete install of) NSX-T v2.0 back to NSX for vSphere v6.3.x, but found that the hosts would not complete preparation.  The logs indicated that something was “claimed by multiple non-overlay vibs”.

Error in esxupdate.log

I found that the hosts still had the NSX-T VIBs loaded, so to remove them, here’s what I did:

  1. Put the host in maintenance mode.  This is necessary to “de-activate” the VIBs that may be in use.
  2. Log in to the host via SSH.
  3. Run the following to stop and remove the NSX services:

    /etc/init.d/netcpad stop
    /etc/init.d/nsx-ctxteng stop remove
    /etc/init.d/nsx-da stop remove
    /etc/init.d/nsx-datapath stop remove
    /etc/init.d/nsx-exporter stop remove
    /etc/init.d/nsx-hyperbus stop remove
    /etc/init.d/nsx-lldp stop remove
    /etc/init.d/nsx-mpa stop remove
    /etc/init.d/nsx-nestdb stop remove
    /etc/init.d/nsx-platform-client stop remove
    /etc/init.d/nsx-sfhc stop remove
    /etc/init.d/nsx-support-bundle-client stop remove
    /etc/init.d/nsxa stop remove
    /etc/init.d/nsxcli stop remove

  4. Run this all as one line; note that the order of the VIBs is important:

    esxcli software vib remove -n nsx-ctxteng -n nsx-hyperbus -n nsx-platform-client -n nsx-nestdb -n nsx-aggservice -n nsx-da -n nsx-esx-datapath -n nsx-exporter -n nsx-host -n nsx-lldp -n nsx-mpa -n nsx-netcpa -n nsx-python-protobuf -n nsx-sfhc -n nsx-support-bundle-client -n nsxa -n nsxcli -n nsx-common-libs -n nsx-metrics-libs -n nsx-nestdb-libs -n nsx-rpc-libs -n nsx-shared-libs -n nsx-python-gevent -n nsx-python-greenlet

  5. Reboot the host. (A quick verification command follows below.)
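
After the reboot, a quick check confirms whether any NSX VIBs are still installed; this is just standard esxcli, nothing specific to this procedure.

    # list any remaining NSX VIBs on the host; no output means they're gone
    esxcli software vib list | grep -i nsx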

A downside to VVols

I picked up a Dell EqualLogic PS6000 for my homelab, updated it to the latest firmware, and discovered it’s capable of VVols.  Yay!  I created a container and (eventually) migrated nearly everything to it.  Seriously, every VM except Avamar VE.  I started creating and destroying VMs, and DRS was happily moving VMs among the hosts.

UNTIL (dun dun dun)

The EqualLogic VSM, which runs the VASA storage provider, got stuck during a vMotion.  Hmm. I noticed that all of the powered-off VMs now had a status of “inaccessible”.  On the hosts, the VVol “datastore” was inaccessible.

Ok, that’s bad.  Thank goodness for Cormac Hogan’s post about this issue.  It boils down to a chicken-and-egg problem: vCenter relies on the VASA provider to supply information about the VVols, so if the VASA provider itself resides on VVols, there’s no apparent way to recover it.  There’s no datastore in which to find the vmx and re-register it, and the connections to the VVols are established per VM, so if the VM isn’t running, there’s no connection to it.

To resolve it, I had to create a new instance of the EqualLogic VSM, re-register it with vCenter, re-register it as a VASA provider, and add the EqualLogic group.  Thankfully, the array itself is the source of truth for the VVol configuration, so the new VSM picked it up seamlessly.

So your options are apparently to place the VSM/VASA provider on a non-VVol datastore or to build a new one every time it shuts down.  Not cool.

 

vRealize Automation DEM worker cannot connect to Orchestrator

In vRA 6.2, using vRO 6.0, you may find that the data collection and other vRO workflows fail with the error “You must have at least one properly configured vCenter Orchestrator endpoint that is reachable”.  The IaaS/Monitoring/Log will show which DEM worker threw the error.  When you check the DEM worker logs for that instance, if you find the message “Could not create SSL/TLS secure channel. —> System.Net.WebException: The request was aborted: Could not create SSL/TLS secure channel“, you have probably been affected by VMKB 2123455 and MS KB 3061588.

Although both articles seem to suggest that removing the offending patch will solve the problem, figuring out exactly which patch to remove is rather awkward.  The easier fix is to apply a quick registry hack to your DEM workers (and wherever the vRA Designer runs).

  1. Log on with an account that has admin rights (I suggest the account your IaaS services run under)
  2. Locate or add the key

    HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\KeyExchangeAlgorithms\Diffie-Hellman

  3. Add/update the DWORD value ClientMinKeyBitLength and set it to 512 decimal (200 hex); see the example after these steps
  4. Restart the DEM worker service
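
If you’d rather script it than click through regedit, the same change can be made from an elevated command prompt with the built-in reg utility; this is just one way to set the value described above.

    rem set the minimum DHE client key size accepted by SCHANNEL to 512 bits
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\KeyExchangeAlgorithms\Diffie-Hellman" /v ClientMinKeyBitLength /t REG_DWORD /d 512 /f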

 

Notes:
The Microsoft patch sets the default minimum group size to 1024.  It appears that the vRO 6.0.x appliances use something less than that.  This registry hack indicates that SCHANNEL should accept keys as small as 512 bits.  I suggest only applying this to the necessary and affected machines since it does lower the bar for the DHE security requirements.

Thanks to Zach Milleson for reminding me that this workaround may not resolve everyone’s issue, depending on which MS patches are installed.  If this workaround doesn’t work for you, you may have to locate and remove the offending patch.  YMMV.

Resolving Site Recovery Manager Error “Unable to create protection group. No VRM registered with vCenter Server”

Just a quick note here about this error.  I felt kinda silly once I realized what I’d done.

Scenario:

  1. Installed a vCenter Server at both sites
  2. Deployed a vSphere Replication 6.1 appliance at both sites.  Connected each to its corresponding PSC and vCenter Server
  3. Installed Site Recovery Manager 6.1 on each site, registered to corresponding vCenter Server
  4. Completed mappings and basic configuration in SRM, and confirmed VR was available at each site

    SRM configured with VR
  5. Attempted to create a protection group
  6. Received the error “Unable to create protection group. No VRM registered with vCenter Server …”
  7. Googled frantically

It’s not clear from the documentation, but you have to configure at least one VM for replication using the vSphere Replication section of the Web Client first.  Totally counter-intuitive.  Just right-click the target VM, select “All vSphere Replication Actions | Configure Replication”, and walk through the wizard.  Once complete, you can return to the SRM configuration and successfully set up a Protection Group and subsequent Recovery Plan.

Weak Diffie-Hellman key in vRealize Orchestrator

{Edited Oct 19 2015 to reflect updated information in VMKB 2131619}

Recent versions of Google Chrome and Mozilla Firefox have begun rejecting connections using SSLv3 ciphers. Chrome complains of a weak ephemeral Diffie-Hellman public key, calling it a “disastrous misconfiguration”.  Firefox’s message also complains of a weak ephemeral Diffie-Hellman key in Server Key Exchange, but doesn’t foreshadow impending doom.

Interestingly (I guess), Internet Explorer 11 still happily connects…

Firefox message on vRO configuration page
Chrome message on vRO configuration page

Let’s fix Orchestrator so that we can use FF and Chrome…

Procedure

Confirmed this works on the vCO Appliance v5.5.2.1 through v6.0.2.1 and on the vRealize Automation Appliance v6.2.x

  1. SSH into the appliance
  2. Enter this to change to the directory holding the configuration page’s settings

    cd /etc/vco/configuration

  3. Enter this to back up the server.xml file

    cp ./server.xml ./serverxml.backup

  4. Use vi, or whatever you’re familiar with, to edit server.xml and replace the line that reads (as one line)

    ciphers="TLS_DHE_RSA_WITH_AES_256_CBC_SHA,
    TLS_DHE_DSS_WITH_AES_256_CBC_SHA,
    TLS_RSA_WITH_AES_256_CBC_SHA,
    TLS_DHE_RSA_WITH_AES_128_CBC_SHA,
    TLS_DHE_DSS_WITH_AES_128_CBC_SHA,
    TLS_RSA_WITH_AES_128_CBC_SHA"

    with (again, as one line)

    ciphers="TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
    TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
    TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,
    TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA,
    TLS_ECDH_RSA_WITH_AES_256_CBC_SHA,
    TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA,
    TLS_RSA_WITH_3DES_EDE_CBC_SHA"

  5. Save the file
  6. Repeat the steps above for /etc/vco/app-server/server.xml
  7. Restart the vco-server and vco-configurator services

    service vco-server restart
    service vco-configurator restart

That’s it; you should be good to go. There are probably other VMware applications that will need the same treatment, though.
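
To verify the change took effect, you can check which cipher a client negotiates using openssl; the hostname is a placeholder, and 8283 is the default vCO/vRO configuration-interface port on the versions I was using, so adjust both if yours differ.

    # show the protocol and cipher negotiated with the configuration interface
    openssl s_client -connect vro.lab.local:8283 < /dev/null 2>/dev/null | grep -E 'Protocol|Cipher'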

Dang, wrong Hellman

Fix – Unable to import vCAC/vRA certificates into Orchestrator

Problem:

While in the vRealize Orchestrator Client, you find that the Library/Configuration/SSL Trust Manager/”Import a certificate from URL” workflow returns the error “InternalError: handshake alert: unrecognized_name” when provided a URL that resolves to the load-balancer VIP for the vCAC/vRA appliances.

 

Background:

A signed SSL certificate is installed on the vCAC/vRA appliance, SSL passthrough is configured on the NSX/vCNS load balancer, and the vCAC/vRA Settings/Hostname is set to a name that resolves to the VIP and matches the SSL certificate.

 

Fix:

  1. SSH into the vCAC appliance as root
  2. Back up /etc/apache2/vhosts.d/vcac.conf to vcac.conf.bak
  3. Use vi to edit /etc/apache2/vhosts.d/vcac.conf
  4. Scroll down to <VirtualHost _default_:443>
  5. Add these lines (see the sketch after these steps)

    ServerName fqdn.of.appliance.node

    ServerAlias load.balancer.name

  6. Scroll further to ensure these directives aren’t set elsewhere; remove or revise them if so
  7. Save the file and exit vi
  8. Restart the vCAC/vRA services
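
For context, here is roughly what the edited portion of vcac.conf should look like; the hostnames are hypothetical, and the rest of the virtual host block stays as it was.

    <VirtualHost _default_:443>
        # this appliance node's FQDN (hypothetical name)
        ServerName vra-app-01.lab.local
        # the name clients use, which resolves to the load-balancer VIP (hypothetical name)
        ServerAlias vra.lab.local

        # ... existing directives in the virtual host remain unchanged ...
    </VirtualHost>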

NSX-v and vCNS Coexistence

It may or may not be apparent, but NSX for vSphere is in many ways the next version of vCNS. In my lab, I attempted to keep vCNS while adding NSX to the same vCenter Server. The license keys or configuration apparently overlap: if vShield Manager boots first, NSX indicates that it’s not licensed; if NSX Manager boots first, vShield Manager states that it’s not licensed.

I did not, but should have, performed an upgrade from vCNS to NSX and now will have to add NSX Edge Gateways to replace the vShield Edges.

Use Cisco Nexus 1000V for virtual hosts in nested ESXi

The native VMware vSwitch and Distributed vSwitch do not perform MAC learning. It was removed because the vSwitch is already aware of the VMs attached to it and the MAC addresses they use. As a result, if you nest ESXi under a standard vSwitch and power on VMs under the nested instance, those VMs will be unable to communicate because their MACs are hidden behind the virtual host and the vSwitch is not aware of them.

Workaround options:

  1. Enable Promiscuous mode on the vSwitch (see the sketch after this list).  This works, but it should never be used in production: it adds a lot of unnecessary traffic and work to the physical NICs, makes troubleshooting difficult, and is a security risk.
  2. Attach your virtual hosts to a Cisco Nexus 1000V.  The 1000V retains MAC learning, so VMs on nested virtual ESXi hosts can communicate because the switch learns the nested MAC addresses.
  3. If your physical servers support virtual interfaces, create additional “physical” interfaces and pass them through to the virtual instances.  This allows you to place the virtual hosts on the same switch as the physical hosts if you choose.  There is obviously a finite number of virtual interfaces you can create in the service profile, but I think this is a clean, low-overhead solution for environments using Cisco UCS, HP C7000, or similar.
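
If you do go the promiscuous-mode route in a lab, it can be set per vSwitch from the ESXi shell; the vSwitch name below is a placeholder, and it’s worth confirming the exact flag syntax with --help on your ESXi build.

    # allow promiscuous mode on the vSwitch hosting the nested ESXi VMs (lab use only)
    esxcli network vswitch standard policy security set --vswitch-name=vSwitch1 --allow-promiscuous=true

    # confirm the change
    esxcli network vswitch standard policy security get --vswitch-name=vSwitch1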

Conclusion

The Nexus 1000V brings back important functionality for nested ESXi environments, especially those environments that do not have access to features like virtual interfaces and service profiles.

Helpful links:

Standing Up The Cisco Nexus 1000v In Less Than 10 Minutes by Kendrick Coleman

Expanding a VMDK for OpenFiler

In my lab, I have an OpenFiler 2.99.1 VM running on the physical host providing storage via iSCSI to my virtual hosts.

Increasing the size of the VMDK used by the OpenFiler VM does not equate to more storage shared by the OpenFiler. I banged my head against the wall for a few hours figuring it out; here’s how I did it.

  1. Expand VMDK
  2. Download GParted Live CD
  3. Stop anything consuming storage provided by OpenFiler
  4. Shut Down OpenFiler VM
  5. Boot OpenFiler from GParted Live CD
  6. Create an additional LVM2 PV in the unused space
  7. Apply changes
  8. Unmount the GParted ISO and reboot OpenFiler
  9. In the OpenFiler web interface, navigate to Volume Groups
  10. Add the new PV to the Volume Group
  11. Navigate to Manage Volumes
  12. Select the VG and edit the volume, entering the new size (the volume group’s total space, in my case)
  13. Restart the iSCSI service
  14. In vSphere, view the properties of the iSCSI datastore to increase its size
What a pain. Why is this necessary?
There is apparently an uncorrected bug in OpenFiler where it will not create additional partitions on a block device. Attempting to create the PV/partition from the CLI using parted would not accept the cylinders I provided, instead attempting to make the volume half as big as I asked. If someone knows why this is and how to correct it, please comment.
In the future, if my OpenFiler needs more storage to share, I’ll just add a new VMDK, create the PV on it, add it to the Volume Group, and grow the volume that way (sketched below).
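
For reference, that future approach maps to a handful of standard LVM commands once the new VMDK shows up in the guest; the device, volume group, and logical volume names below are hypothetical, so match them to your own layout.

    # new VMDK appears as /dev/sdb (hypothetical device name)
    pvcreate /dev/sdb

    # add the new PV to the existing volume group (hypothetical VG name)
    vgextend vg_openfiler /dev/sdb

    # grow the volume backing the iSCSI LUN into the new free space (hypothetical LV name)
    lvextend -l +100%FREE /dev/vg_openfiler/lv_iscsi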