Getting Started with CredHub in Concourse

First, some background – I promise to keep it short.  You should never have credentials in a public github repo.  Probably not good to have them in a private repo either.  At Pivotal, the github client is configured with credalert which complains when I try to push credentials to github.  I maintain compliance, I needed a way to update the stuff in the repo and have my credentials too.  Concourse supports a couple of AWS credential managers, Vault and Credhub.  Since CredHub is built-in to ops-manager-deployed BOSH director, we don’t have to spin anything else up.

The simplified diagram shows how this will work.  CredHub is on the BOSH director, so it’ll need to be reachable from the Concourse Web ATS service and anywhere the credhub cli will be used. If your BOSH director is behind a NAT, you may want to configure a DNAT, so it can be reached.

In this case, we’re using a “management/infrastructure” Operations Manager and BOSH director to deploy and manage concourse and minio.  The pipelines on concourse will be used to deploy and maintain other foundations in the environment.

Configure UAA

  1. Logon to the ops manager and navigate to status to record the IP address of the BOSH director.  If your BOSH director is behind a NAT, locate it’s DNAT instead.
  2. Navigate to the credentials tab.  We’re going to need the uaa_login_client_credentials password and the uaa_admin_client_credentials password.
  3. While here, save the ops manager root ca to your computer.  From the installation dashboard, click on your name in the upper right, select Settings.  Then click Advanced and Download Root CA.
  4. SSH into your ops manager:  ubuntu@<ops manager name or IP>
  5. Set uaac target

    uaac target https://<IP of BOSH director>:8443 –ca-cert /var/tempest/workspaces/default/root_ca_certificate

  6. Login to uaac – ok, this gets awkward

    uaac token owner get login -s <uaa_login_client_credentials>

    • Replace <uaa_login_client_credentials> with the value you saved
    • When prompted for a username enter admin
    • For password enter the uaa_admin_client_password value you saved
    • You should see “Successfully fetched token…”
  7. Create a uaac client for concourse to use with credhub

    uaac client add –name concourse-to-credhub –authorized-grant-types client_credentials –authorities credhub.read,credhub.write –access-token-validity 30 –secret MySecretPassword

    Please replace MySecretPassword with something else

  8. Create a uaac user for use with the CredHub cli

    uaac user add credhub –email credhub@whatever.com -p MySecretPassword

Try out Credhub cli

    1. Download and install the credhub cli.  On mac, you can use brew install credhub
    2. From a terminal/command line run this to point the cli to the credhub instance on the BOSH director:

      credhub api -s <IP of BOSH director>:8844 –ca-cert ./root_ca_certificate

      • Replace <IP of BOSH director> with the name or reachable IP of the director
      • root_ca_certificate is the root CA from ops manager you downloaded earlier
    3. Login to credhub:

      credhub login -u credhub -p MySecretPassword

      User and pass are from the User we added to uaa earlier

    4. Set a test value:

      credhub –type:value –name=/testval –value=hello

      Here’s we’re setting a key (aka credential) with the name /testval to the value “hello”. Note that all the things stored in credhub start with a slash and that there are several types of credentials that can be stored, the simplest being “value”

    5. Get our value:

      credhub –name /testval

      This will return the metadata for our key/credential

Configuring Concourse to use CredHub

Concourse TSA must be configured to look to credhub as a credential manager. I’m using BOSH-deployed concourse, so I’ll simply update the deployment manifest with the new params. if you’re using concourse via docker-compose, you’ll want to update the yml with the additional params as described here.

For concourse deployed via BOSH and using concouse-bosh-deployment, we’ll include the /operations/credhub.yml file and the additional params.  For me this looks like

bosh -e core deploy -d concourse concourse.yml \
-l ../versions.yml \
–vars-store cluster-creds.yml \
-o operations/static-web.yml \
-o operations/basic-auth.yml \
-o operations/scale.yml \
-o operations/privileged-http.yml \
-o operations/credhub.yml \
–var web_ip=192.168.100.205 \
–var external_url=http://concourse.ragazzilab.com \
–var network_name=INFRA \
–var web_vm_type=small.disk \
–var db_vm_type=small.disk \
–var azs=[BOSH] \
–var db_persistent_disk_type=10240 \
–var worker_vm_type=concourse.worker \
–var deployment_name=concourse \
–var local_user.username=myuser \
–var local_user.password=mypass \
–var web_instances=1 \
–var worker_instances=1 \
–var syslog_address=syslog.ragazzilab.com \
–var syslog_port=514 \
–var syslog_permitted_peer=syslog.ragazzilab.com \
–var credhub_url=”https://192.168.100.200:8844 ” \
–var credhub_client_id=concourse-to-credhub \
–var credhub_client_secret=MySecretPassword \
–var credhub_ca_cert=”$(cat root_ca_certificate)”

Test a pipeline

    1. Use credhub cli to create a value

      credhub set –name /concourse/main/hello-credhub/hello –value World

      Concourse has a default pattern for looking up interpolation values. It’s /concourse/<team name>/<pipeline name>/<key>

    2. Get the test pipeline from here.

      jobs:
      – name: hello-credhub
      plan:
      – do:
      – task: hello-credhub
      config:
      platform: linux
      image_resource:
      type: docker-image
      source:
      repository: ubuntu
      run:
      path: sh
      args:
      – -exc
      – |
      echo “Hello $WORLD_PARAM”
      params:
      WORLD_PARAM: ((hello))

    3. Use fly to set the test pipeline

      fly -t concourse login -c http://concourse -u myuser -p mypass -n main
      fly -t core sp -p hello-credhub -c hello-credhub.yml

    4. Run the test pipeline in concourse. If all goes well, it should say Hello World”

Manually creating a Kubernetes cluster with kubeadm

I’ve talked about Pivotal Container Service (PKS) before and now work for Pivotal, so I’ve frequently got K8s on my mind.  I’ve discussed at length the benefits of PKS and the creation of K8s clusters, but didn’t have much of a point of reference for alternatives.  I know about Kelsey Hightower’s book and was looking for something a little less in-the-weeds.

Enter kubeadm, included in recent versions of K8s.  With this tool, you’re better able to understand what goes into a cluster, how the master and workers are related and how the networking is organized.  I really wanted to stand up a K8s cluster alongside a PKS-managed cluster in order to better understand the differences (if any).  This is also a part of the Linux Foundation training “Kubernetes Fundamentals”.  I don’t want to spoil the course for you, but will point to some of the docs on kubernetes.io

 

Getting Ready

I used VMware Fusion on the Macbook to create and run two Ubuntu 18.04 VMs.  Each was a linked clone with 2GB RAM, 1vCPU.  Had to make sure that they had different MAC addresses, IP, UUIDs and host names. I’m sure you can use nearly any virtualization tool to get your VMs running.  Once running, be sure you can SSH into each.

Install Docker

I thought, “hey I’ve done this before” and just installed Docker as per usual, but that method does not leverage the correct cgroup driver, so we’ll want to install Docker with the script found here.

Install Tools

Once again, the kubernetes.io site provides commands to properly install kubeadm, kubelet and kubectl on our Ubuntu nodes.  Use the commands on that page to ensure kubelet is installed and held to the correct version.

Choose your CNI pod network

Ok, what?  CNI is the Container Network Interface is a specification for networking add-ons for K8s.  Kubeadm requires that we use a pod network addon that uses the CNI spec.  The pod network – we may have only one per k8s cluster – is the network that the pods communicate on; think of it as using a NAT rather than the network you’ve actually assigned to the Ubuntu nodes.  Further, this can be confusing, because this address space is not what is actually assigned to the pods.  This address space is used when we “expose” a service.  What pod-network-cidr you assign depends on which network add-on you select.  In my case, I went with Canal as it seems to be both powerful and flexible.  Also, the pod-network cidr used by Calico is “192.168.0.0/16”, which is already in use in my home lab – it may not have actually been a conflict, but it certainly would be confusing if it were in use twice.

Create the master node
Make sure you’re ssh’d into your designated “Master” Ubuntu VM, make sure that you’ve installed kubeadm, kubelet, kubectl and docker from the steps above.  If you also choose canal, you’ll initialize the master node (not on the worker node – we’ll have a different command for that one) by running

kubeadm init –pod-network-cidr=10.244.0.0/16

Exactly that CIDR. It’ll take a few minutes to download, install and configure the k8s components.  When the initialization has completed, you’ll see a message like this:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run “kubectl apply -f [podnetwork].yaml” with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.1.129:6443 –token zdjzrp.5jad4gihqjo46olg \
–discovery-token-ca-cert-hash sha256:90d1b349aa93a7130ee91668e4e763a4c29e5fc1502060191b38ea0e31d3cec8

Using, this, we’ll exit su and run the

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Make a note of the bottom section as we’ll need it in order to join our worker to the newly-formed cluster.

Sanity-check:

On the master, run kubectl get nodes.  you’ll notice that we have 1 node and it’s not ready:

Install the network pod add-on

Referencing the docs, you’ll note that Canal has a couple yaml files to be applied to our cluster.  So, again on our master node, we’ll run these commands to install and configure Canal:


kubectl apply -f https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/canal/rbac.yaml
kubectl apply -f https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/canal/canal.yaml

Sanity-check:

On the master, run kubectl get nodes.  You’ll now notice that we still have 1 node but now it’s ready:

Enable pod placement on master
By default, the cluster will not put pods on the master node. If you want to use a single-node cluster or use compute capacity on the master node for pods, we’ll need to remove the taint.

Default taint on master

We’ll remove the taint with the command

kubectl taint nodes –all node-role.kubernetes.io/master-

Join worker to cluster
You saved the output from kubeadm init earlier, right? We’ll need that now to join the worker to the cluster. On the worker VM, become root via sudo su – and run that command:

Join worker to cluster

Now, back on master, we run kubectl get nodes and can see both nodes!

Master and Worker Ready

Summary and Next Steps

At this point, we have a functional kubernetes cluster with 1 master and 1 worker.  Next in this series, we’ll deploy some applications and compare the behavior to a PKS-managed kubernetes cluster.

 

Hands-on with VxRail v4.7.100

Recently (yesterday!) upgraded the VxRail clusters in a lab my team uses for Pivotal Ready Architecture development and testing and immediately noticed many differences.

When trying to go to the VxRail Manager, I am redirected to the vSphere Web Client This was concerning at first, but quickly realized that the VxRail Manager interface is now embedded in the vSphere Web Client (hence the redirection!)

In this environment, we have a “management” cluster using the VxRail-managed vCenter Server. On this cluster is also another vCenter Server that is used by three additional VxRail clusters – when the “Availability Zone” clusters are deployed, they use this shared “external” vCenter Server.

After upgrading the management cluster, the VxRail extension was registered in the HTML5 vSphere Web Client.

From here, when selecting the “VxRail” choice, you’ll see the VxRail dashboard.  This allows you to see a quick status of the selected VxRail cluster, some support information and recent messages from the VxRail community.

Configure/VxRail

The most important features of the VxRail extension is found under the VxRail section of the Configure tab of a selected VxRail vSphere cluster:

  • The System section displays the currently-installed VxRail version and has a link to perform an update.  Clicking the “Update” link will launch a new browser tab and take you to the VxRail Manager web gui where you can perform the bundle update.
  • Next, the Market item will also launch a new browser tab on the VxRail Manager web gui where you can download available Dell EMC applications.  For now, it lists Data Domain Virtual Edition and Isilon SD Edge.
  • The Add VxRail Hosts item will display any newly-discovered VxRail nodes and allow you to add those nodes to a cluster
  • The Hosts item displays the hosts in the cluster.  One interesting feature here is that it displays the Service Tag and Appliance ID right here! You may not need this information often, but when you do, it’s super-critical.
    You’ll notice the “Edit” button on the hosts list; this allows you to set/change the host’s name and management IP.

Monitor/VxRail

  • On the Monitor tab for the selected vSphere VxRail cluster, the Appliances item provides a link to the VxRail Manager web gui where the hardware is shown and will highlight any components in need of attention.  Any faults will also be in the “All Issues” section of the regular web client, so the hardware detail will provide visual clues if needed.

Congratulations to the VxRail team responsible for this important milestone that brings another level of integration with vSphere and a “single-pane-of-glass”!

Using Helm and Dynamic PersistentVolumes with Multi-AZ PKS on vSphere

So, you’ve installed PKS and created a PKS cluster.  Excellent!  Now what?

We want to use helm charts to deploy applications.  Many of the charts use PersistentVolumes, so getting PVs set up is our first step.

There are a couple of complicating factors to be aware of when it comes to PVs in a multi-AZ/multi-vSphere-Cluster environment.  First, you probably have cluster-specific datastores – particularly if you are using Pivotal Ready Architecture and VSAN.  These datastores are not suitable for PersistentVolumes consumed by applications deployed to our Kubernetes cluster.  To work-around this, we’ll need to provide some shared block storage to each host in each cluster.  Probably the simplest way to do this is with an NFS share.

Prerequisites:

Common datastore; NFS share or iSCSI

In production, you’ll want a production-quality fault-tolerant solution for NFS or iSCSI, like Dell EMC Isilon. For this proof-of-concept, I’m going to use an existing NFS server, create a volume and share it to the hosts in the three vSphere clusters where the PKS workload VMs will run.  In this case, the NFS datastore is named “sharednfs” ’cause I’m creative like that.  Make sure that your hosts have adequate permissions to the share.  Using VMFS on iSCSI is supported, just be aware that you may need to cable-up additional NICs if yours are already consumed by N-VDS and/or VSAN.

Workstation Prep

We’ll need a handful of command-line tools, so make sure your workstation has the PKS CLI and Kubectl CLI from Pivotal and you’ve downloaded and extracted Helm.

PKS Cluster
We’ll want to provision a cluster using the PKS CLI tool.  This document assumes that your cluster was provisioned successfully, but nothing else has been done to it.  For my environment, I configured the “medium” plan to use 3 Masters and 3 Workers in all three AZs, then created the cluster with the command

pks create-cluster pks1cl1 --external-hostname cl1.pks1.lab13.myenv.lab --plan "medium" --num-nodes "3"


Logged-in
Make sure you’re logged into the Kubernetes cluster. In PKS, the easiest way to do this is via the PKS cli:

pks login -a api.pks1.lab13.myenv.lab -u pksadmin -p my_password --skip-ssl-validation
pks cluster pks1cl1
pks get-credentials pks1cl1
kubectl config use-context pks1cl1
kubectl get nodes -o wide

Where “pks1cl1″ is replaced by your cluster’s name,”api.pks1.lab13.myenv.lab” is replaced by the FQDN to your PKS API server, “pksadmin” is replaced by the username with admin rights to PKS and “my_password” is replaced with that account’s password.

Procedure:

  1. Create storageclass
    • Create storageclass spec yaml. Note that the file is named storageclass-nfs.yml and we’re naming the storage class itself “nfs”:
      kind: StorageClass
      apiVersion: storage.k8s.io/v1
      metadata:
        name: nfs
        annotations:
          storageclass.kubernetes.io/is-default-class: "true"
      provisioner: kubernetes.io/vsphere-volume
      parameters:
        diskformat: thin
        datastore: sharednfs
        fstype: ext3
      

    • Apply the yml with kubectl

      kubectl create -f storageclass-nfs.yml

    • Create a sample PVC (Persistent Volume Claim). Note that the file is names pvc-sample.yml, the PVC name is “pvc-sample” and uses the “nfs” storageclass we created above. This step is not absolutely necessary, but will help confirm we can use the storage.
      kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: pvc-sample
        annotations:
          volume.beta.kubernetes.io/storage-class: nfs
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
        storageClassName: nfs
      
    • Apply the yml with kubectl

      kubectl create -f pvc-sample.yml


      If you’re watching vSphere closely, you’ll see a VMDK created in the kubevols folder of the NFS datastore

    • Check that the PVC was created with

      kubectl get pvc

      and

      kubectl describe pvc pvc-sample

    • Remove sample PVC with

      kubectl delete -f pvc-sample

  2. Configure Helm and Tiller
    • Create Service Account for tiller with
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: tiller
        namespace: kube-system
      ---
      apiVersion: rbac.authorization.k8s.io/v1beta1
      kind: ClusterRoleBinding
      metadata:
        name: tiller
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: cluster-admin
      subjects:
        - kind: ServiceAccount
          name: tiller
          namespace: kube-system
      
    • Apply the service account yml with Kubectl

      kubectl create -f rbac-config.yml

    • Initialize helm and tiller with

      helm init --service-account tiller

    • Check that tiller is ready

      helm version


      Look for a version number for the version; note that it might take a few seconds for tiller in the cluster to get ready.

  3. Deploy sample helm chart
    • Update helm local chart repository. We do this so that we can be sure that helm can reach the public repo and to cache teh latest information to our local repo.

      helm repo update


      If this step results in a certificate error, you may have to add the cert to the trusted certificates on the workstation.

    • Install helm chart with ingress enabled. Here, I’ve selected the Dokuwiki app. The command below will enable ingress, so we can access it via routable IP and it will use the default storageclass we configured earlier.

      helm install --name dokuwiki \
      --set ingress.enabled="true",dokuwikiUsername=admin,dokuwikiPassword=password \
      stable/dokuwiki

      Edit – April 23 2019 – Passing the credentials in here makes connecting easier later.

    • Confirm that the app was deployed
      helm list
      kubectl get pods -n default
      kubectl get services -n default


      From the get services results, make a note of the external IP address – in the example above, it’s 192.13.6.73

    • Point a browser at the external address from the previous step and marvel at your success in deploying Dokuwiki via helm to Kubernetes!
      If you want to actually login to your Dokuwiki instance, first obtain the password for the user account with this command:

      kubectl get secret -n default dokuwiki-dokuwiki \
      -o jsonpath="{.data.dockuwiki-password}" | base64 --decode

      Then login with username “user” and that password.

       

      Edit – 04/23/19 – Login with the username and password you included in the helm install command

  4. Additional info
    • View Persistent Volume Claims with

      kubectl get pvc -n default


      This will list the PVCs and the volumes in the “default” namespace. Note the volume corresponds to the name of the VMDK on the datastore.

    • Load-Balancer
      Notice that since we are leveraging the NSX-T Container Networking Interface and enabled the ingress when we installed dokuwiki, a load-balancer in NSX-T was automatically created for us to point to the application.

This took me some time to figure out; had to weed through a lot of documentation – some of which contradicted itself and quite a bit of trial-and-error. I hope this helps save someone time later!

PAS with NSX-T Tip: use a fresh IP Block

I’ve fought with this for an embarrassingly long time.  Had a failed PAS (Pivotal Application Services) deployment (missed several of the NSX configuration requirements) but removed the cruft and tried again and again and again.  In each case, PAS and NCP would deploy, but fail on the PAS smoke_test errand. The error message said more detail is in the log.

Which Log?!

I ssh’d into the clock_global VM and found the smoke_test logs. They stated that the container for instance {whatever} could not be created and an error of NCP04004. This pointed me to the Diego Cells (where the containers would be created) and I poked around in the /var/vcap/sys/log/garden logs there. They stated that the interface for the instance could not be found. Ok, this is sounding more like an NSX problem.

I ended up parsing through the NSX Manager event log and found this gem:

IP Block Error

Ah-ha! Yup, I’d apparently allocated a couple of /28 subnets from the IP Block. So when the smoke test tried to allocate a /24, the “fixed” subnet size had already been set to /28, causing the error.

Resolution was to simply remove all of the allocated subnets from the IP block. This could have been avoided by either not reusing an existing IP Block or using the settings in the NCP configuration to create a new IP Block with a given CIDR.

NSX-T 2.2 – Error 100 when trying to enum Firewall Rules

After upgrading to NSX-T 2.2, my environment began throwing this error in the GUI when I tried to navigate to the firewall section or any router.  In addition, the nsx-cli shell script for cleanup was failing every time with a similar firewall-rule-related error.

Searching for a bitm I stumbled onto KB 56611: Upgrading NSX-T manager from 2.1.0.0 to 2.2.0.0 reports “General Error has occurred” on Firewall’s General UI section.

Down at the bottom of the KB, it essentially states that if you’ve already upgraded to 2.2 from 2.1, you’ll have to replace a jar file in order to resolve the problem.  Oh, and you have to open a ticket to get the .jar.

So, if you run into this – and you receive the nsx-firewall-1.0.jar file – here’s the steps for resolution:

    1. SSH into the NSX Manager as root (not admin)
    2. Navigate to /opt/vmare/proton-tomcat/webapps/nsxapi/WEB-INF/lib
    3. Copy the existing nsx-firewall-1.0.jar file elsewhere (I copied it to home and SCP’d it out from there)
    4. Copy the new nsx-firewall-1.0.jar file into this folder. (I put it on an local webserver and pulled it down with wget)
    5. Change the owner of the jar to uproton:

      chown uproton:uproton nsx-firewall-1.0.jar

    6. Change the permissions to match the other files:

      chmod o-r nsx-firewall-1.0.jar

    7. Reboot the NSX Manager
    8. Enjoy being able to see and edit firewall rules again!

 

Automating PKS Upgrades

Last night, Pivotal announced new versions of PKS and Harbor, so I thought it’s time to simplify the upgrade process. Here is a concourse pipeline that essentially aggregates the upgrade-tile pipeline so that PKS and Harbor are upgraded in one go.

What it does:

  1. Runs on a schedule – you set the time and days it may run
  2. Downloads the latest version of PKS and Harbor from Pivnet- you set the major.minor version range
  3. Uploads the PKS and Harbor releases to your BOSH director
  4. Determines whether the new release is missing a stemcell, downloads it from PivNet and uploads it to BOSH director
  5. Stages the tiles/releases
  6. Applies changes

What you need:

  1. A working Concourse instance that is able to reach the Internet to pull down the binaries and repo
  2. The fly cli and credentials for your Concourse.
  3. A token from your PivNet account
  4. An instance of PKS 1.0.2 or 1.0.3 AND Harbor 1.4.x deployed on Ops Manager
  5. Credentials for your Ops Manager
  6. (optional) A token from your GitHub account

How to use the pipeline:

  1. Download params.yml and pipeline.yml from here.
  2. Edit the params.yml by replacing the values in double-parentheses with the actual value. Each line has a bit explaining what it’s expecting.  For example, ((ops_mgr_host)) becomes opsmgr.pcf1.domain.local
    • Remove the parens
    • If you have a GitHub Token, pop that value in, otherwise remove ((github_token))
    • The current pks_major_minor_version regex will get the latest 1.0.x.  If you want to pin it to a specific version, or when PKS 1.1.x is available, you can make those changes here.
    • The ops_mgr_usr and ops_mgr_pwd credentials are those you use to logon to Ops Manager itself.  Typically set when the Ops Manager OVA is deployed.
    • The schedule params should be adjusted to a convenient time to apply the upgrade.  Remember that in addition to the PKS Service being offline (it’s a singleton) during the upgrade, your Kubernetes clusters may be affected if you have the “Upgrade all Clusters” errand set to run in the PKS configuration, so schedule wisely!

  3. Open your cli and login to concourse with fly

    fly -t concourse login -c http://concourse.domain.local:8080 -u username -p password

  4. Set the new pipeline. Here, I’m naming the pipeline “PKS_Upgrade”. You’ll pass the pipeline.yml with the “-c” param and your edited params.yml with the “-l” param

    fly -t concourse sp -p PKS_Upgrade -c pipeline.yml -l params.yml

    Answer “y” to “Apply Configuration”…

  5. Unpause the pipeline so it can run when in the scheduled window

    fly -t concourse up -p PKS_Upgrade

  6. Login to the Concourse web to see our shiny new pipeline!

    If you don’t want to deal with the schedule and simply want it to upgrade on-demand, use the pipeline-nosched.yml instead of pipeline.yml, just be aware that when you unpause the pipeline, it’ll start doing its thing.  YMMV, but for me, it took about 8 minutes to complete the upgrade.

Behind the scenes
It’s not immediately obvious how the pipeline does what it does. When I first started out, I found it frustrating that there just isn’t much to the pipeline itself. To that end, I tried making pipelines that were entirely self-contained. This was good in that you can read the pipeline and see everything it’s doing; plus it can be made to run in an air-gapped environment. The downside is that there is no separation, one error in any task and you’ll have to edit the whole pipeline file.
As I learned a little more and poked around in what others were doing, it made sense to split the “tasks” out, keep them in a GitHub public repo and pull it down to run on-demand.

Pipelines generally have two main sections; resources and jobs.
Resources are objects that are used by jobs. In this case, the binary installation files, a zip of the GitHub repo and the schedule are resources.
Jobs are (essentially) made up of plans and plans have tasks.
Each task in most pipelines uses another source yml. This task.yml will indicate which image concourse should build a container from and what it should do on that container (typically, run a script). All of these task components are in the GitHub repo, so when the pipeline job runs, it clones the repo and runs the appropriate task script in a container built on an image pulled from dockerhub.

More info
I’ve got a several pipelines in the repo.   Some of them do what they’re supposed to. 🙂 Most of them are derived from others’ work, so many thanks to Pivotal Services and Sabha Parameswaran

PKS and NSX-T: I did everything wrong

I’ve fought with PKS and NSX-T for a month or so now. I’ll admit it: I did everything wrong, several times. One thing for certain, I know how NOT to configure it. So, now that I’ve finally gotten past my configuration issues, it makes sense to share the pain lessons learned.

  1. Set your expectations correctly. PKS is literally a 1.0 product right now. It’s getting a lot of attention and will make fantastic strides very quickly, but for now, it can be cumbersome and confusing. The documentation is still pretty raw. Similarly, NSX-T is very young. The docs are constantly referring you to the REST API instead of the GUI – this is fine of course, but is a turn-off for many. The GUI has many weird quirks. (when entering a tag, you’ll have to tab off of the value field after entering a value, since it is only checked onBlur)
  2. Use Chrome Incognito  NSX-T does not work in Firefox on Windows. It works in Chrome, but I had issues where the cache would problems (the web GUI would indicate that backup is not configured until I closed Chrome, cleared cache and logged in again)
  3. Do not use exclamation point in the NSX-T admin password Yep, learned that the hard way. Supposedly, this is resolved in PKS 1.0.3, but I’m not convinced as my environment did not wholly cooperate until I reset the admin password to something without an exclamation point in it
  4. Tag only one IP Pool with ncp/external I needed to build out several foundations on this environment and wanted to keep them in discrete IP space by created multiple “external IP Pools” and assigning each to its own foundation. Currently the nsx-cli.sh script that accompanies PKS with NSX-T only looks for the “ncp/external” tag on IP Pools, if more than one is found, it quits. I suppose you could work around this by forking the script and passing an additional “cluster” param, but I’m certain that the NSBU is working on something similar
  5. Do not take a snapshot of the NSX Manager This applies to NSX for vSphere and NSX-T, but I have made this mistake and it was costly. If your backup solution relies on snapshots (pretty much all of them do), be sure to exclude the NSX Manager and…
  6. Configure scheduled backups of NSX Manager I found the docs for this to be rather obtuse. Spent a while trying to configure a FileZilla SFTP or even IIS-FTP server until it finally dawned on me that it really is just FTP over SSH. So, the missing detail for me was that you’ll just need a linux machine with plenty of space that the NSX Manager can connect to – over SSH – and dump files to. I started with this procedure, but found that the permissions were too restrictive.
  7. Use concourse pipelines This was an opportunity for me to really dig into concourse pipelines and embrace what can be done. One moment of frustration came when PKS 1.0.3 was released and I discovered that the parameters for vSphere authentication had changed. In PKS 1.0 through 1.0.2, there was a single set of credentials to be used by PKS to communicate with vCenter Server. As of 1.0.3, this was split into credentials for master and credentials for workers. So, the pipeline needed a tweak in order to complete the install. I ended up putting in a conditional to check the release version, so the right params are populated. If interested, my pipelines can be found at https://github.com/BrianRagazzi/concourse-pipelines
  8. Count your Load-Balancers In NSX-T, the load-balancers can be considered a sort of empty appliance that Virtual Servers are attached to and can itself attach to a Logical Router. The load-balancers in-effect require pre-allocated resources that must come from an Edge Cluster. The “small” load-balancer consumes 2 CPU and 4GB RAM and the “Large” edge VM provides 8 CPU and 16GB RAM. So, a 2-node Edge Cluster can support up to FOUR active/standby Load-Balancers. This quickly becomes relevant when you realize that PKS creates a new load-balancer when a new K8s cluster is created. If you get errors in the diego databse with the ncp job when creating your fifth k8s cluster, you might need to add a few more edge nodes to the edge cluster.
  9. Configure your NAT rules as narrow as you can. I wasted a lot of time due to mis-configured NAT rules. The log data from provisioning failures did not point to NAT mis-configuration, so wild geese were chased.  Here’s what finally worked for me:
    Router Priority Action Source Destination Translated Description
    Tier1 PKS Management 512 No NAT [PKS Management CIDR] [PKS Service CIDR] Any No NAT between management and services
    [PKS Service CIDR] [PKS Management CIDR]
    1024 DNAT Any [External IP for Ops Manager] [Internal IP for Ops Manager] So Ops Manager is reachable
    [External IP for PKS Service] [Internal IP for PKS Service] (obtain from Status tab of PKS in Ops Manager) So PKS Service (and UAA) is reachable
    SNAT [Internal IP for PKS Service] Any [External IP for PKS Service] Return Traffic for PKS Service
    2048 [PKS Management CIDR] [Infrastructure CIDR] (vCenter Server, NSX Manager, DNS Servers) [External IP for Ops Manager] So PKS Management can reach infrastructure
    [PKS Management CIDR] [Additional Infrastructure] (NTP in this case) [External IP for Ops Manager]
    Tier1 PKS Services 512 No NAT [PKS Service CIDR] [PKS Management CIDR] Any No NAT between management and services
    [PKS Management CIDR] [PKS Service CIDR]
    1024 SNAT [PKS Service CIDR] [Infrastructure CIDR] (vCenter Server, NSX Manager, DNS Servers) [External IP] (not the same as Ops Manager and PKS Service, but in the same L3 network) So PKS Services can reach infrastructure
    [PKS Service CIDR] [Additional Infrastructure] (NTP in this case) [External IP]

Replacing the self-signed Certificate on NSX-T

Ran into a difficulty trying to use the self-signed certificate that comes pre-configured on the manager for NSX-T. In my case, Pivotal Operations Manager refused to accept the self-signed certificate.

So, for NSX-T 2.1, it looks like the procedure is:

    1. Log on to the NSX Manager and navigate to System|Trust
    2. Click CSRs tab and then “Generate CSR”, populate the certificate request details and click Save
    3. Select the new CSR and click Actions|Download CSR PEM to save the exported CSR in PEM format
    4. Submit the CSR to your CA to get it signed and save the new certificate. Be sure to save the root CA and any subordinate CA certificates too<. In this example, certnew.cer is the signed NSX Manager certificate, sub-CA.cer is the subordinate CA certificate and root-CA.cer is the Root CA certificate
    5. Open the two (or three) cer files in notepad or notepad++ and concatenate them in order of leaf cert, (subordinate CA cert), root CA cert
    6. Back in NSX Manager, select the CSR and click Actions|Import Certificate for CSR. In the Window, paste in the concatenated certificates from above and click save
    7. Now you’ll have a new certificate and CA certs listed under Certificates. The GUI only shows a portion of the ID by default, click it to display the full ID and copy it to the clip board
    8. Launch RESTClient in Firefox.
      • Click Authentication|Basic Authentication and enter the NSX Manager credentials for Username and Password, click “Okay”
      • For the URL, enter https://<NSX Manager IP or FQDN>api/v1/node/services/http?action=apply_certificate&certificate_id=<certificate ID copied in previous step>
      • Set the method to POST and click SEND button
      • check the Headers to confirm that the status code is 200
    9. Refresh browser session to NSX Manager GUI to confirm new certificate is in use

Notes:
I was concerned that replacing the certificate would break the components registered via the certificate thumbprint; this process does not break those things. They remain registered and trust the new certificate

BOSH Stemcell 3541.2 breaks Concourse 3.9.0

Looks like there was a breaking change in stemcell v3541.2 where the default umask was set to 077.  If this stemcell is used with BOSH-deployed-Concourse.CI v3.9.0, resource checking fails with a “permission denied” error.

Note that Pivotal (as of Feb 22 2018)  has not updated their stemcells to 3541.x and their latest is still in the 3468 chain.