Archive

Posts Tagged ‘Configuration’

Automating PKS Upgrades

05/22/2018 Comments off

Last night, Pivotal announced new versions of PKS and Harbor, so I thought it was time to simplify the upgrade process. Here is a Concourse pipeline that essentially aggregates the upgrade-tile pipeline so that PKS and Harbor are upgraded in one go.

What it does:

  1. Runs on a schedule – you set the time and days it may run
  2. Downloads the latest versions of PKS and Harbor from PivNet – you set the major.minor version range
  3. Uploads the PKS and Harbor releases to your BOSH director
  4. Determines whether the new release is missing a stemcell, downloads it from PivNet and uploads it to BOSH director
  5. Stages the tiles/releases
  6. Applies changes

What you need:

  1. A working Concourse instance that is able to reach the Internet to pull down the binaries and repo
  2. The fly CLI and credentials for your Concourse.
  3. A token from your PivNet account
  4. An instance of PKS 1.0.2 or 1.0.3 AND Harbor 1.4.x deployed on Ops Manager
  5. Credentials for your Ops Manager
  6. (optional) A token from your GitHub account

How to use the pipeline:

  1. Download params.yml and pipeline.yml from here.
  2. Edit the params.yml by replacing the values in double parentheses with the actual values. Each line has a bit explaining what it’s expecting.  For example, ((ops_mgr_host)) becomes opsmgr.pcf1.domain.local
    • Remove the parens
    • If you have a GitHub Token, pop that value in, otherwise remove ((github_token))
    • The current pks_major_minor_version regex will get the latest 1.0.x.  If you want to pin it to a specific version, or when PKS 1.1.x is available, you can make those changes here.
    • The ops_mgr_usr and ops_mgr_pwd credentials are those you use to log on to Ops Manager itself.  Typically set when the Ops Manager OVA is deployed.
    • The schedule params should be adjusted to a convenient time to apply the upgrade.  Remember that in addition to the PKS Service being offline (it’s a singleton) during the upgrade, your Kubernetes clusters may be affected if you have the “Upgrade all Clusters” errand set to run in the PKS configuration, so schedule wisely!

  3. Open your CLI and log in to Concourse with fly

    fly -t concourse login -c http://concourse.domain.local:8080 -u username -p password

  4. Set the new pipeline. Here, I’m naming the pipeline “PKS_Upgrade”. You’ll pass the pipeline.yml with the “-c” param and your edited params.yml with the “-l” param

    fly -t concourse sp -p PKS_Upgrade -c pipeline.yml -l params.yml

    Answer “y” to “Apply Configuration”…

  5. Unpause the pipeline so it can run in the scheduled window

    fly -t concourse up -p PKS_Upgrade

  6. Log in to the Concourse web UI to see our shiny new pipeline!

    If you don’t want to deal with the schedule and simply want it to upgrade on-demand, use pipeline-nosched.yml instead of pipeline.yml; just be aware that when you unpause the pipeline, it’ll start doing its thing (see the CLI sketch below).  YMMV, but for me, it took about 8 minutes to complete the upgrade.
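If you want to confirm the result from the CLI, or kick off a run without waiting for the schedule, a couple of fly commands help. This is just a sketch; the job name is a placeholder, since it depends on the pipeline definition:

    # List pipelines and their paused status to confirm PKS_Upgrade was created and unpaused
    fly -t concourse pipelines

    # Trigger a job manually (useful with pipeline-nosched.yml); replace "upgrade-products"
    # with the actual job name shown in the Concourse UI
    fly -t concourse trigger-job -j PKS_Upgrade/upgrade-products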

Behind the scenes
It’s not immediately obvious how the pipeline does what it does. When I first started out, I found it frustrating that there just isn’t much to the pipeline itself. To that end, I tried making pipelines that were entirely self-contained. This was good in that you can read the pipeline and see everything it’s doing; plus it can be made to run in an air-gapped environment. The downside is that there is no separation: one error in any task and you’ll have to edit the whole pipeline file.
As I learned a little more and poked around in what others were doing, it made sense to split the “tasks” out, keep them in a public GitHub repo and pull them down to run on-demand.

Pipelines generally have two main sections: resources and jobs.
Resources are objects that are used by jobs. In this case, the binary installation files, a zip of the GitHub repo and the schedule are resources.
Jobs are (essentially) made up of plans, and plans have tasks.
Each task in most pipelines uses another source yml. This task.yml indicates which image Concourse should build a container from and what it should do in that container (typically, run a script). All of these task components are in the GitHub repo, so when the pipeline job runs, it clones the repo and runs the appropriate task script in a container built from an image pulled from Docker Hub.
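Once the pipeline is set, you can see this structure for yourself with a couple of fly commands (using the pipeline name from the steps above):

    # List the jobs the pipeline defines
    fly -t concourse jobs -p PKS_Upgrade

    # Render the full pipeline (resources, jobs, tasks) as Concourse sees it
    fly -t concourse get-pipeline -p PKS_Upgrade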

More info
I’ve got several pipelines in the repo.   Some of them do what they’re supposed to. 🙂 Most of them are derived from others’ work, so many thanks to Pivotal Services and Sabha Parameswaran


PKS and NSX-T: I did everything wrong

05/15/2018 Comments off

I’ve fought with PKS and NSX-T for a month or so now. I’ll admit it: I did everything wrong, several times. One thing is for certain: I know how NOT to configure it. So, now that I’ve finally gotten past my configuration issues, it makes sense to share the pain, er, lessons learned.

  1. Set your expectations correctly. PKS is literally a 1.0 product right now. It’s getting a lot of attention and will make fantastic strides very quickly, but for now, it can be cumbersome and confusing. The documentation is still pretty raw. Similarly, NSX-T is very young. The docs constantly refer you to the REST API instead of the GUI – this is fine of course, but is a turn-off for many. The GUI has many weird quirks (for example, when entering a tag, you have to tab off of the value field after entering a value, since it is only checked onBlur).
  2. Use Chrome Incognito. NSX-T does not work in Firefox on Windows. It works in Chrome, but I had issues where the cache would cause problems (the web GUI would indicate that backup is not configured until I closed Chrome, cleared the cache and logged in again).
  3. Do not use an exclamation point in the NSX-T admin password. Yep, learned that the hard way. Supposedly, this is resolved in PKS 1.0.3, but I’m not convinced, as my environment did not wholly cooperate until I reset the admin password to something without an exclamation point in it.
  4. Tag only one IP Pool with ncp/external. I needed to build out several foundations on this environment and wanted to keep them in discrete IP space by creating multiple “external IP Pools” and assigning each to its own foundation. Currently, the nsx-cli.sh script that accompanies PKS with NSX-T only looks for the “ncp/external” tag on IP Pools; if more than one is found, it quits. I suppose you could work around this by forking the script and passing an additional “cluster” param, but I’m certain that the NSBU is working on something similar.
  5. Do not take a snapshot of the NSX Manager. This applies to NSX for vSphere and NSX-T, but I have made this mistake and it was costly. If your backup solution relies on snapshots (pretty much all of them do), be sure to exclude the NSX Manager and…
  6. Configure scheduled backups of NSX Manager. I found the docs for this to be rather obtuse. I spent a while trying to configure a FileZilla SFTP or even IIS-FTP server until it finally dawned on me that it really is just FTP over SSH. So, the missing detail for me was that you’ll just need a Linux machine with plenty of space that the NSX Manager can connect to – over SSH – and dump files to. I started with this procedure, but found that the permissions were too restrictive.
  7. Use Concourse pipelines. This was an opportunity for me to really dig into Concourse pipelines and embrace what can be done. One moment of frustration came when PKS 1.0.3 was released and I discovered that the parameters for vSphere authentication had changed. In PKS 1.0 through 1.0.2, there was a single set of credentials to be used by PKS to communicate with vCenter Server. As of 1.0.3, this was split into credentials for master and credentials for workers. So, the pipeline needed a tweak in order to complete the install. I ended up putting in a conditional to check the release version, so the right params are populated. If interested, my pipelines can be found at https://github.com/BrianRagazzi/concourse-pipelines
  8. Count your Load-Balancers. In NSX-T, the load-balancers can be considered a sort of empty appliance that Virtual Servers are attached to and that can itself attach to a Logical Router. The load-balancers in effect require pre-allocated resources that must come from an Edge Cluster. The “small” load-balancer consumes 2 CPU and 4GB RAM and the “Large” edge VM provides 8 CPU and 16GB RAM, so a 2-node Edge Cluster can support up to FOUR active/standby Load-Balancers. This quickly becomes relevant when you realize that PKS creates a new load-balancer when a new K8s cluster is created. If you get errors in the diego database with the ncp job when creating your fifth K8s cluster, you might need to add a few more edge nodes to the edge cluster.
  9. Configure your NAT rules as narrowly as you can. I wasted a lot of time due to mis-configured NAT rules. The log data from provisioning failures did not point to NAT mis-configuration, so wild geese were chased.  Here’s what finally worked for me (a quick reachability check is sketched after the rules):
    • Tier1 Router “PKS Management”
      • Priority 512, No NAT: [PKS Management CIDR] → [PKS Service CIDR], translated: Any (no NAT between management and services)
      • Priority 512, No NAT: [PKS Service CIDR] → [PKS Management CIDR], translated: Any
      • Priority 1024, DNAT: Any → [External IP for Ops Manager], translated: [Internal IP for Ops Manager] (so Ops Manager is reachable)
      • Priority 1024, DNAT: Any → [External IP for PKS Service], translated: [Internal IP for PKS Service] (obtain from the Status tab of PKS in Ops Manager) (so the PKS Service and UAA are reachable)
      • Priority 1024, SNAT: [Internal IP for PKS Service] → Any, translated: [External IP for PKS Service] (return traffic for the PKS Service)
      • Priority 2048, SNAT: [PKS Management CIDR] → [Infrastructure CIDR] (vCenter Server, NSX Manager, DNS Servers), translated: [External IP for Ops Manager] (so PKS Management can reach infrastructure)
      • Priority 2048, SNAT: [PKS Management CIDR] → [Additional Infrastructure] (NTP in this case), translated: [External IP for Ops Manager]
    • Tier1 Router “PKS Services”
      • Priority 512, No NAT: [PKS Service CIDR] → [PKS Management CIDR], translated: Any (no NAT between management and services)
      • Priority 512, No NAT: [PKS Management CIDR] → [PKS Service CIDR], translated: Any
      • Priority 1024, SNAT: [PKS Service CIDR] → [Infrastructure CIDR] (vCenter Server, NSX Manager, DNS Servers), translated: [External IP] (not the same as Ops Manager and PKS Service, but in the same L3 network) (so PKS Services can reach infrastructure)
      • Priority 1024, SNAT: [PKS Service CIDR] → [Additional Infrastructure] (NTP in this case), translated: [External IP]
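Once the DNAT rules are in place, a quick reachability check from outside the NSX-T networks helps confirm they behave as expected. A rough sketch; substitute your own external IPs, and note that the PKS API port may differ in your configuration:

    # Expect a response from Ops Manager through its DNAT rule
    curl -kI https://<External IP for Ops Manager>

    # Expect the PKS API to answer through its DNAT rule (9021 is the usual PKS API port)
    curl -kI https://<External IP for PKS Service>:9021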

Replacing the self-signed Certificate on NSX-T

I ran into difficulty trying to use the self-signed certificate that comes pre-configured on the NSX-T Manager. In my case, Pivotal Operations Manager refused to accept the self-signed certificate.

So, for NSX-T 2.1, it looks like the procedure is:

    1. Log on to the NSX Manager and navigate to System|Trust
    2. Click the CSRs tab and then “Generate CSR”, populate the certificate request details and click Save
    3. Select the new CSR and click Actions|Download CSR PEM to save the exported CSR in PEM format
    4. Submit the CSR to your CA to get it signed and save the new certificate. Be sure to save the root CA and any subordinate CA certificates too. In this example, certnew.cer is the signed NSX Manager certificate, sub-CA.cer is the subordinate CA certificate and root-CA.cer is the Root CA certificate
    5. Open the two (or three) .cer files in Notepad or Notepad++ and concatenate them in order of leaf cert, (subordinate CA cert), root CA cert
    6. Back in NSX Manager, select the CSR and click Actions|Import Certificate for CSR. In the window, paste in the concatenated certificates from above and click Save
    7. Now you’ll have a new certificate and CA certs listed under Certificates. The GUI only shows a portion of the ID by default; click it to display the full ID and copy it to the clipboard
    8. Launch RESTClient in Firefox (or use curl; see the sketch after this list).
      • Click Authentication|Basic Authentication and enter the NSX Manager credentials for Username and Password, click “Okay”
      • For the URL, enter https://<NSX Manager IP or FQDN>/api/v1/node/services/http?action=apply_certificate&certificate_id=<certificate ID copied in previous step>
      • Set the method to POST and click the SEND button
      • Check the response headers to confirm that the status code is 200
    9. Refresh browser session to NSX Manager GUI to confirm new certificate is in use
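If you’d rather not use the RESTClient add-on, the same call can be made with curl. A sketch, assuming basic authentication is enabled on the manager; substitute your manager address, admin password and the certificate ID copied in step 7:

    # Apply the imported certificate to the NSX Manager HTTP service (same API call as step 8)
    curl -k -u admin:'<admin password>' -X POST \
      "https://<NSX Manager IP or FQDN>/api/v1/node/services/http?action=apply_certificate&certificate_id=<certificate ID>"

    # A 200 response indicates success; refresh the GUI session to confirm (step 9)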

Notes:
I was concerned that replacing the certificate would break the components registered via the certificate thumbprint; this process does not break those registrations. They remain registered and trust the new certificate.

Removing NSX-T VIBs from ESXi hosts

10/31/2017 Comments off

I’d wanted to revert my environment from (an incomplete install of) NSX-T v2.0 back to NSX for vSphere v6.3.x, but found that the hosts would not complete preparation.  The logs indicated that something was “claimed by multiple non-overlay vibs”.

Error in esxupdate.log

I found that the hosts still had the NSX-T VIBs loaded, so to remove them, here’s what I did:

  1. Put the host in maintenance mode.  This is necessary to “de-activate” the VIBs that may be in use
  2. Log in to the host via SSH
  3. Run

    /etc/init.d/netcpad stop
    /etc/init.d/nsx-ctxteng stop remove
    /etc/init.d/nsx-da stop remove
    /etc/init.d/nsx-datapath stop remove
    /etc/init.d/nsx-exporter stop remove
    /etc/init.d/nsx-hyperbus stop remove
    /etc/init.d/nsx-lldp stop remove
    /etc/init.d/nsx-mpa stop remove
    /etc/init.d/nsx-nestdb stop remove
    /etc/init.d/nsx-platform-client stop remove
    /etc/init.d/nsx-sfhc stop remove
    /etc/init.d/nsx-support-bundle-client stop remove
    /etc/init.d/nsxa stop remove
    /etc/init.d/nsxcli stop remove

  4. Run this all as one line; note that the order of the VIBs is important

    esxcli software vib remove -n nsx-ctxteng -n nsx-hyperbus -n nsx-platform-client -n nsx-nestdb -n nsx-aggservice -n nsx-da -n nsx-esx-datapath -n nsx-exporter -n nsx-host -n nsx-lldp -n nsx-mpa -n nsx-netcpa -n nsx-python-protobuf -n nsx-sfhc -n nsx-support-bundle-client -n nsxa -n nsxcli -n nsx-common-libs -n nsx-metrics-libs -n nsx-nestdb-libs -n nsx-rpc-libs -n nsx-shared-libs -n nsx-python-gevent -n nsx-python-greenlet

  5. Reboot the host. (A quick check to confirm the NSX VIBs are gone is sketched below.)
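To double-check that nothing was left behind, you can list any remaining NSX VIBs before and after the removal; this is just a sanity check, not part of the official procedure:

    # List any NSX VIBs still installed on the host; an empty result means they are all gone
    esxcli software vib list | grep -i nsx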

Configuring NSX Load-Balancer for PCF

08/26/2016 Comments off

There’s not a lot of specific information out there for this configuration.  There’s some guidance from Pivotal and some how-tos from VMware, so with a little additional detail, we should be able to figure this out.

Edit – 2/1/17 – Updated with OpenSSL configuration detail
Edit – 3/20/17 – Updated SubjectAltNames in config

Preparation

  1. SSL Certificate. You’ll need the signed public cert for your URL (certnew.cer), the associated private key (pcf.key) and the public cert of the signing CA (root64.cer).
    1. Download and install OpenSSL
    2. Create a config file for your request – paste this into a text file:

      [ req ]
      default_bits = 2048
      default_keyfile = rui.key
      distinguished_name = req_distinguished_name
      encrypt_key = no
      prompt = no
      string_mask = nombstr
      req_extensions = v3_req

      [ v3_req ]
      basicConstraints = CA:FALSE
      keyUsage = digitalSignature, keyEncipherment
      extendedKeyUsage = serverAuth, clientAuth
      subjectAltName = DNS: *.pcf.domain.com, DNS:ServerShortName, IP:ServerIPAddress, DNS: *.system.pcf.domain.com, DNS: *.apps.pcf.domain.com, DNS:*.login.system.pcf.domain.com, DNS: *.uaa.system.pcf.domain.com

      [ req_distinguished_name ]
      countryName = US
      stateOrProvinceName = State
      localityName = City
      0.organizationName = Company Name
      organizationalUnitName = PCF
      commonName = *.pcf.domain.com

    3. Replace the example values (country, company, domain names, server name and IP address) with those appropriate for your environment. Be sure to specify the server name and IP address as the Virtual IP and its associated DNS record. Save the file as pcf.cfg.  You’ll want to use the wildcard “base” name as the common name and the server name, as well as the *.system, *.apps, *.login.system and *.uaa.system Subject Alt Names.
    4. Use OpenSSL to create the Certificate Signing Request (CSR) for the wildcard PCF domain.

      openssl req -new -newkey rsa:2048 -nodes -keyout pcf.key -out pcf.csr -config pcf.cfg

    5. Use OpenSSL to convert the key to RSA (required for NSX to accept it)

      openssl rsa -in pcf.key -out pcfrsa.key

    6. Submit the CSR (pcf.csr) to your CA (Microsoft Certificate Services in my case), retrieve the certificate (certnew.cer) and certificate chain (certnew.p7b) base-64 encoded.
    7. Double-click certnew.p7b to open certmgr. Export the CA certificate as Base-64 encoded X.509 to a file (root64.cer is the file name I use). A few optional OpenSSL sanity checks are sketched after this list.
  2. Networks.  You’ll need to know what layer 3 networks the PCF components will use.  In my case, I set up a logical switch in NSX and assigned the gateway address to the DLR. You should probably make this a /24 network, so there’s room to grow without reserving a ridiculous number of addresses. We’re going to carve up the address space a little, so make a note of the following:
    • Gateway and other addresses you typically reserve for network devices.  (eg:  first 9 addresses 1-9)
    • Address that will be assigned to the NSX load balancer.  Just need one (eg: 10)
    • Addresses that will be used by the PCF Routers.  At least two. These will be configured as members in the NSX Load Balancer Pool.
  3. DNS, IP addresses.  PCF will use “system” and “apps” subdomains, plus whatever names you give any apps deployed.  This takes some getting used to – not your typical application.  Based on the certificate we created earlier, I recommend just creating a “pcf” subdomain.  In my case, the network domain (using AD-DNS) is ragazzilab.com and I’ve created the following:
    • pcf.ragazzilab.com subdomain
    • *.pcf.ragazzilab.com A record for the IP address I’m going to assign to the NSX Load-Balancer
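Before importing anything into NSX, a few optional OpenSSL checks (referenced in step 7 above) can confirm the files are what you expect. File names follow the examples above:

    # Confirm the signed certificate chains to the CA you exported
    openssl verify -CAfile root64.cer certnew.cer

    # Confirm the subject/common name and that the wildcard matches what you requested
    openssl x509 -in certnew.cer -noout -subject

    # Confirm the converted private key is valid RSA (this is the key NSX needs)
    openssl rsa -in pcfrsa.key -check -noout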

NSX

Assuming NSX is already installed and configured.  Create or identify an existing NSX Edge that has an interface on the network where PCF will be / is deployed.

  1. Assign the address we noted above to the interface under Settings|Interfaces
  2. Under Settings|Certificates, add our SSL certificates
    • Click the Green Plus and select “CA Certificate”.  Paste the content of the signing CA public certificate (root64.cer) into the Certificate Contents box.  Click OK.
    • Click the Green Plus and select “Certificate”.  Paste the content of the signed public cert (certnew.cer) into the Certificate Contents box and paste the content of the RSA private key (pcfrsa.key) into the Private Key box. Click OK.
  3. Under Load Balancer, create an Application Profile. We need to ensure that NSX inserts the x-forwarded-for HTTP headers.  To do that, we need to be able to decrypt the request and therefore must provide the certificate information.  I found that Pool Side SSL had to be enabled, using the same Service and CA Certificates.
    Router Application Profile

  4. Create the Service Monitor.  What worked for me is a little different from what is described in the GoRouter project page. The key points are that we want to specify the user agent and look for a response of “ok” with a header of “200 OK”.

    Service Monitor for PCF Router

  5. Create the Pool.  Set it to ROUND-ROBIN using the Service Monitor you just created.  When adding the routers as members, be sure to set the port to 443, but the Monitor Port to 80.

    Router Pool

  6. Create the Virtual Server.  Specify the Application Profile and default Pool we just created.  Obviously, specify the correct IP Address.
    Virtual Server Configuration


PCF – Ops Manager

Assuming you’ve already deployed the Ops Manager OVF, use the installation dashboard to edit the configuration for Ops Manager Director.  I’m just going to highlight the relevant areas of the configuration here:

Networks.  Under “Create Networks”, be sure that the Subnet specified has the correct values.  Pay special attention to the reserved IP ranges.  These should be the addresses of the network devices and the IP address assigned to the load-balancer.  Do not include the addresses we intend to use for the routers though.  Based on the example values above, we’ll reserve the first 10 addresses.

Ops Manager Network Config

Ops Manager Director will probably use the first/lowest address in range that is not reserved.

PCF – Elastic Runtime

Next, we’ll install Elastic Runtime.  Again, I’ll highlight the relevant sections of the configuration.

  1. Domains.  In my case it’s System Domain = system.pcf.ragazzilab.com and Apps Domain = apps.pcf.ragazzilab.com
  2. Networking.
    • Set the Router IPs to the addresses (comma-separated) you noted and added as members to the NSX load-balancer pool earlier.
    • Leave HAProxy IPs empty
    • Select the point-of-entry option for “external load balancer, and it can forward encrypted traffic”
    • Paste the content of the signed certificate (certnew.cer) into the Certificate PEM field.  Paste the content of the CA public certificate (root64.cer) into the same field, directly under the certificate content.
    • Paste the content of the private key (pcf.key) into the Private Key PEM field.
    • Check “Disable SSL Certificate verification for this environment”.
  3. Resource Config.  Be sure that the number of Routers is at least 2 and equal to the number of IP addresses you reserved for them.

 

Troubleshooting

Help! The Pool Status is down when the Service Monitor is enabled.

This could occur if your routers are behaving differently from mine.  Test the response by sending a request to one of the routers  through curl and specifying the user agent as HTTP-Monitor/1.1

curl -v -A "HTTP-Monitor/1.1" "http://{IP of router}"

 

Testing router with curl

The value in the yellow box should go into the “Expected” field of the Service Monitor and the value in the red box should go into the “Receive” field. Note that you should not get a 404 response; if you do, check that the user agent is set correctly.

 

Notes

This works for me and I hope it works for you.  If you have trouble or disagree, please let me know.

vRealize Automation DEM worker cannot connect to Orchestrator

08/23/2016 Comments off

In vRA 6.2, using vRO 6.0, you may find that the data collection and other vRO workflows fail with the error “You must have at least one properly configured vCenter Orchestrator endpoint that is reachable”.  The IaaS/Monitoring/Log will show which DEM worker threw the error.  When you check the DEM worker logs for that instance, if you find the message “Could not create SSL/TLS secure channel. —> System.Net.WebException: The request was aborted: Could not create SSL/TLS secure channel“, you have probably been affected by VMKB 2123455 and MS KB 3061588.

Although both articles seem to suggest that removing the offending patch will solve the problem, I think figuring out exactly which patch is responsible is rather awkward.  The easier fix is to apply a quick registry hack to your DEM workers (and wherever the vRA Designer runs).

  1. Log on as an account with admin rights (I suggest the account your IaaS services run under)
  2. Locate or add the key

    HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\KeyExchangeAlgorithms\Diffie-Hellman

  3. Add/update the DWORD value ClientMinKeyBitLength and set the value to 512 decimal (200 hex); a one-line equivalent is sketched after this list
  4. Restart the DEM worker service
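Steps 2 and 3 can also be done in one line from an elevated command prompt on the DEM worker; same key and value as above, and deleting the value later reverts to the patched default:

    :: Set the minimum DHE client key size SCHANNEL will accept to 512 bits
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\KeyExchangeAlgorithms\Diffie-Hellman" /v ClientMinKeyBitLength /t REG_DWORD /d 512 /f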

 

Notes:
The Microsoft patch sets the default minimum group size to 1024.  It appears that the vRO 6.0.x appliances use something less than that.  This registry hack indicates that SCHANNEL should accept keys as small as 512 bits.  I suggest only applying this to the necessary and affected machines since it does lower the bar for the DHE security requirements.

Thanks to Zach Milleson for reminding me that this workaround may not resolve everyone’s issue, depending on which MS patches are installed.  If this workaround doesn’t work for you, you may have to locate and remove the offending patch.  YMMV.

Pivotal Cloud Foundry vApp startup order workflow

08/07/2016 Comments off

After installing Pivotal Cloud Foundry (PCF) on vSphere, you’ll have a collection of at least 21 (probably closer to 60!) VMs with names that probably don’t match anyone’s convention.  Although, as noted in the PCF documentation, there is a correct order to starting up and shutting down the VMs in PCF, the installer does not configure a vApp so that we can control that order.  So, I dragged all the PCF VMs into a vApp and started trying to determine which ones are in which role and quickly realized that it’s a pain.

Creating an AZ in Ops Manager on vSphere

As an aside, when you create your Availability Zone, you point it at a vSphere cluster and, optionally, a Resource Pool.  Unfortunately, if you specify a vApp Name instead of a Resource Pool name, BOSH will fail to deploy the VMs.  So, I typically leave the Resource Pool field blank and then drag the VMs into a vApp post-deployment.

I put together a workflow that will help place the PCF VMs into correct startup/shutdown groups for you.

Example PCF VMNames

Instructions for Use

  1. Download the package from here
  2. Import the package into vRealize Orchestrator
  3. If you haven’t already, create a new vApp in your cluster and drag the Ops Manager, Ops Manager Director and all of the Elastic Runtime VMs into the vApp
  4. Run the “PCFvAppStartupOrder” workflow, select your new vApp as the input, click Submit
  5. If the PCF installation is scaled out to more VMs, just drag them to the vApp and rerun the workflow

How it works/What it does

  • The correct order is stored in a string array
  • The deployment, job and director custom fields are read for each VM in the vApp to get the VM’s assigned role
  • For the Ops Manager, the Notes field is read and, if it matches, the VM is placed at the top of the startup sequence
  • Unknown VMs are assigned a startup order higher than the last in the array.  This way, they start last and power-off first
  • Unknown VMs are those where the “deployment” field does not start with “cf”, with exceptions for Ops Manager (Notes field) and Ops Manager Director (“director” field value is “bosh-init”)

Additional suggestions and notes

  • Adjust the resources for the vApp based on VMware best practices and what makes sense for your environment
  • Use this at your own risk, there is no implied warranty