I’ve fought with this for an embarrassingly long time. I had a failed PAS (Pivotal Application Service) deployment (I’d missed several of the NSX configuration requirements), so I removed the cruft and tried again and again and again. In each case, PAS and NCP would deploy, but fail on the PAS smoke_test errand. The error message said more detail was in the log.
Which Log?!
I ssh’d into the clock_global VM and found the smoke_test logs. They stated that the container for instance {whatever} could not be created, with an error of NCP04004. That pointed me to the Diego Cells (where the containers would be created), so I poked around in the /var/vcap/sys/log/garden logs there. They stated that the interface for the instance could not be found. OK, this was sounding more like an NSX problem.
I ended up parsing through the NSX Manager event log and found this gem:
IP Block Error
Ah-ha! Yup, I’d apparently allocated a couple of /28 subnets from the IP Block. So when the smoke test tried to allocate a /24, the “fixed” subnet size had already been set to /28, causing the error.
Resolution was to simply remove all of the allocated subnets from the IP block. This could have been avoided by either not reusing an existing IP Block or using the settings in the NCP configuration to create a new IP Block with a given CIDR.
If I put it here, I’m much more likely to follow-through. Like many, I work best under some pressure. Here is a list of what I want to do differently (with regard to technology) next year.
Do more blogging. I can make a ton of excuses for not blogging as much this year. I love sharing what I’ve learned; the more new stuff I learn, the more I share. So….
Do more with NSX for vSphere and NSX-T. I feel strongly that SDN is critical to the future of how datacenters operate. NSX is the logical leader in this space and will only grow in interest. There is still a tendency to replicate what was done with pre-SDN technology and I’d like to see modern ways to solve problems while finding and pushing the limits of what can be done in SDN.
Do more with containers and PKS. The technologies that Pivotal provides are cutting edge. Containers and applications-as-code methods are growing and will define the datacenter of the future. Just as we stopped thinking of hardware servers as single-purpose a few years ago, we’ll embrace multiple workloads within a VM.
Do more coding. I love concourse and pipelines, but have a lot to learn. Let’s find the limits of BOSH and pipelines. Can we not only deploy, but automate the operation and maintenance of a PaaS solution?
Do more coding. I feel that as we move to “applications-as-code”, it’s important to understand what that means to developers and operators. What sort of problems become irrelevant in this approach? What molehills become mountains?
Starting with a working PCF 1.11 deployment, a random linux VM and the BOSH Backup and Restore bits, let’s try it out!
Background
We’ll perform two types of backup jobs using BBR; one against the BOSH director and one against the Elastic Runtime deployment. The command and parameters are different between the jobs.
BBR stores the backup data in subfolders below the location where the executable is run
Tiles other than Elastic Runtime (CF) may be backed up with BBR later, but as of late June 2017, they do not have the BBR scripts in place.
If you don’t turn on MySQL backups and the Backup Prepare Node in Elastic Runtime, the CF deployment backup job will fail because it cannot find the backup scripts for the MySQL database
I’m using a CentOS VM in the environment as the jumpbox to run BBR. You’ll want to make sure that the jumpbox is able to reach the BOSH director on TCP22 and TCP25555.
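A quick pre-flight check from the jumpbox (assuming nc is installed; any port-test tool will do, and the director IP is a placeholder to substitute):
nc -vz <BOSH director IP> 22
nc -vz <BOSH director IP> 25555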
Steps
Prepare PCF
Logon to Ops Manager
Click the “Pivotal Elastic Runtime” tile
Assuming you’re using the internal MySQL, click “Internal MySQL” on the Settings tab
Under Automated Backups Configuration, select “Enable automated backups from MySQL to an S3 bucket or other S3-compatible file store”. Right here, you’re thinking, “but I don’t have an S3 server or account or whatever”. That’s OK, just fake it. Put bogus values in the fields and an unreachable date (like February 31st). Click Save.
Bogus S3 info
Under Resource Config, make sure the Backup Prepare Node instance count is 1 (or more?). Click Save
Return to the Installation Dashboard and Apply Changes
Get the BBR credentials.
Logon to Ops Manager
Click the “Ops Manager Director” tile
Click the “Credentials” tab
Click the “Link to Credential” link beside “Bbr Ssh Credentials”
BBR Director Backup Credential
The page that loads will display a yml-type file with the PEM-encoded private and public keys. Select and copy from “-----BEGIN RSA PRIVATE KEY-----” through “-----END RSA PRIVATE KEY-----”.
Paste this into a text editor. In my case, on Windows, the content used a literal “\n” to indicate a new-line rather than an actual newline. So, to convert it, I used Notepad++ to replace “\\n” with “\n” in Extended Search Mode.
Using Notepad++
The username that BBR will use for the director job is “bbr”
Back on the “Credentials” tab of Ops Manager Director, click “Link to Credential” beside “Uaa Bbr Client Credentials”
On the page that loads, note that the identity is “bbr_client” and record the password value. This will be used for the BBR deployment job(s)
Back on the “Credentials” tab of Ops Manager Director, click “Link to Credential” beside “Director Credentials”
On the page that loads, note that the identity is “director” and record the password value. You’ll need this to log in to BOSH in order to get the deployment name next
Get the deployment name
Open an SSH session to the Ops Manager, logging on as ubuntu
bosh --ca-cert /var/tempest/workspaces/default/root_ca_certificate target DIRECTOR-IP-ADDRESS
Logon as “director” with the password saved earlier
Run this:
bosh deployments
In the results, copy the deployment name that begins with “cf-“. (eg: cf-67afe56410858743331)
Prepare the jumpbox
Logon with a privileged account
Using SCP or similar, copy “/var/tempest/workspaces/default/root_ca_certificate” from Ops Manager to the jump box
Copy the bbr-0.1.2.tar file to the jumpbox
Extract it: tar -xvf bbr-0.1.2.tar
Make sure you have plenty of space on the jumpbox. In my case, I mounted a NFS share and ran BBR from the mount point.
Copy <extracted files>/release/bbr to the root folder where you want the backups to reside.
Save the PEM-encoded RSA private key from above to the jumpbox, making a note of its path and filename. I just stuck it in the same folder as the bbr executable.
Make sure you can connect to the BOSH director via ssh
ssh -i ./private.key bbr@172.16.9.16
Director Backup
On the jumpbox, navigate to where you placed the bbr executable. Remember that it will create a time-stamped subfolder here and dump all the backups into it.
Run this, replacing the private key path and the BOSH Director IP address with the correct values for your environment:
./bbr director --private-key-path ./private.key --username bbr --host 172.16.9.16 pre-backup-check
Check that the pre-check results indicate that the director can be backed up
Run this to perform the backup (same as before, just passing the “backup” subcommand instead of the “pre-backup-check” subcommand):
./bbr director --private-key-path ./private.key --username bbr --host 172.16.9.16 backup
Wait a while for the backup to complete
What’d it do?
Backed up BOSH director database to bosh-0-director.tar
Dumped credhub database to bosh-0-credhub.tar
Dumped uaa database to bosh-0-uaa.tar
Backed up the BOSH director blobstore to bosh-0-blobstore.tar
Saved the blobstore metadata to a file named metadata
Elastic Runtime Backup
On the jumpbox, navigate to where you placed the bbr executable. Remember that it will create a time-stamped subfolder here and dump all the backups into it.
Run this, replacing the values with the IP/FQDN of your BOSH director, the password for the bbr_client account retrieved from Ops Manager, the Elastic Runtime deployment name, and the path to the root_ca_certificate copied from the Ops Manager:
Deployment Pre-check
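A sketch of the shape of the command (assuming the flag names --target, --username, --password, --deployment and --ca-cert; substitute the values gathered above):
./bbr deployment \
    --target 172.16.9.16 \
    --username bbr_client \
    --password <bbr_client password> \
    --deployment cf-67afe56410858743331 \
    --ca-cert ./root_ca_certificate \
    pre-backup-check
As with the director job, rerun the same command with “backup” in place of “pre-backup-check” to take the actual backup.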
Recently, I’ve found myself needing a Concourse CI system. I struggled with the documentation on concourse.ci and couldn’t find any comprehensive build guides, and I knew for certain I wasn’t going to use VirtualBox. So, having worked it out, I thought I’d share what I went through to get to a working system.
Starting Position
Discovered that the CentOS version I was using previously did not have a compatible Linux kernel version. CentOS 7.2 uses kernel 3.10, Concourse requires 3.19+. So, I’m starting with a freshly-deployed Ubuntu Server 16.04 LTS this time.
Prep Ubuntu
Not a lot we have to do, but still pretty important:
Make sure the port for Concourse is open
sudo ufw allow 8080
sudo ufw status
sudo ufw disable
I disabled the firewall on Ubuntu because it was preventing the concourse worker and concourse web processes from communicating.
Update and make sure wget is installed
apt-get update
apt-get install wget
Postgresql
Concourse expects a PostgreSQL database. I don’t have one standing by, so let’s install it.
Pretty straightforward on Ubuntu too:
apt-get install postgresql postgresql-contrib
Enter y to install the bits. On Ubuntu, we don’t have to take extra steps to configure the service.
Ok, now we have to create an account and a database for concourse. First, let’s create the Linux account. I’m calling mine “concourse” because I’m creative like that.
adduser concourse
passwd concourse
Next, we create the account (aka “role” or “user”) in postgres via the createuser command. In order to do this, we have to switch to the postgres account; do that with sudo:
sudo -i -u postgres
Now, while logged in as postgres, we can use the createuser command
createuser --interactive
You’ll enter the name of the account, and answer a couple of special permissions questions.
While still logged in as postgres, run this command to create a new database for concourse. I’m naming my database “concourse” – my creativity is legendary. Actually, I think it makes life easier if the role and database are named the same.
createdb concourse
Test by switching users to the concourse account and making sure it can run psql against the concourse database. While in psql, use this command to set the password for the account in postgres:
ALTER ROLE concourse WITH PASSWORD 'changeme';
Type \q to exit psql
Concourse
Ok, we have a running postgresql service and an account to be used for concourse. Let’s go.
Create a folder for concourse. I used /concourse, but you can use /var/lib/whatever/concourse if you feel like it.
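Inside that folder, create a start.sh along these lines (a sketch assuming the standalone Concourse binary of this era; the IP, credentials, and key paths are examples to replace, and the keys are generated once with ssh-keygen as shown in the comments):
#!/bin/bash
# One-time key setup (run these once before the first start):
#   ssh-keygen -t rsa -f /concourse/keys/tsa_host_key -N ''
#   ssh-keygen -t rsa -f /concourse/keys/worker_key -N ''
#   ssh-keygen -t rsa -f /concourse/keys/session_signing_key -N ''
#   cp /concourse/keys/worker_key.pub /concourse/keys/authorized_worker_keys

# Web/ATC node: serves the UI on 8080 and talks to postgres
/concourse/concourse web \
  --basic-auth-username admin \
  --basic-auth-password changeme \
  --session-signing-key /concourse/keys/session_signing_key \
  --tsa-host-key /concourse/keys/tsa_host_key \
  --tsa-authorized-keys /concourse/keys/authorized_worker_keys \
  --external-url http://192.168.100.50:8080 \
  --postgres-data-source postgres://concourse:changeme@127.0.0.1/concourse?sslmode=disable &

# Worker node: registers with the web node's TSA listener
/concourse/concourse worker \
  --work-dir /concourse/worker \
  --tsa-host 127.0.0.1 \
  --tsa-public-key /concourse/keys/tsa_host_key.pub \
  --tsa-worker-private-key /concourse/keys/worker_key &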
In the script, “external_url” should use the IP address of the VM it’s running on, and the username and password values in the postgres-data-source should reflect what you set up earlier. Save the file and be sure to set it as executable (chmod +x ./start.sh)
Run the script “./start.sh”. You should see several lines go by concerning worker-collectors and builder-reapers.
If you instead see a message about authentication, you’ll want to make sure that 1) the credentials in the script are correct and 2) the account’s password has been set in both Linux and postgres
If you instead see a message about the connection not accepting SSL, be sure that the connection string in the script includes “?sslmode=disable” after the database name
Test by pointing a browser at the value you assigned to the external_url. You should see “no pipelines configured”. You can login using the basic-auth username and password you specified in the startup script.
Success!
Back in your SSH session, you can kill it with <CTRL>+C
Finishing Up
Now we just have to make sure that concourse starts when the system reboots. I am certain that there are better/safer/more reliable ways to do this, but here’s what I did:
Use nano or your favorite text editor to add “/concourse/start.sh” to /etc/rc.local ABOVE the line that reads “exit 0”
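The resulting /etc/rc.local looks like this (stock comments trimmed; only the start.sh line is added):
#!/bin/sh -e
/concourse/start.sh
exit 0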
Now, reboot your VM and retest the connectivity to the concourse page.
Recently, I’ve found myself needing a Concourse CI system. I struggled with the documentation on concourse.ci and couldn’t find any comprehensive build guides, and I knew for certain I wasn’t going to use VirtualBox. So, having worked it out, I thought I’d share what I went through to get to a working system.
Starting Position
I’m starting with a freshly-deployed CentOS 7 VM. I use Simon’s template build, so it comes up quickly and reliably. Logged on as root.
Prep CentOS
Not a lot we have to do, but still pretty important:
Make sure the port for Concourse is open; optionally, you can open 5432 for postgres if you feel like it
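Assuming firewalld is active, opening the Concourse port (and optionally the postgres port) looks like this:
firewall-cmd --permanent --add-port=8080/tcp
firewall-cmd --permanent --add-port=5432/tcp
firewall-cmd --reload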
Update and make sure wget is installed
yum update
yum install wget
Postgresql
Concourse expects a PostgreSQL database. I don’t have one standing by, so let’s install it.
Pretty straightforward on CentOS:
yum install postgresql-server postgresql-contrib
Enter y to install the bits.
When that step is done, we’ll set it up with this command:
sudo postgresql-setup initdb
Next, we’ll update the postgresql config to allow passwords. Use your favorite editor to open /var/lib/pgsql/data/pg_hba.conf. We need to update the value in the method column for IPv4 and IPv6 connections from “ident” to “md5”, then save the file.
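Assuming the stock CentOS entries, the relevant lines go from this:
host    all             all             127.0.0.1/32            ident
host    all             all             ::1/128                 ident
to this:
host    all             all             127.0.0.1/32            md5
host    all             all             ::1/128                 md5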
Now, let’s start postgresql and set it to run automatically
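With systemd on CentOS 7, that’s:
sudo systemctl start postgresql
sudo systemctl enable postgresql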
Ok, now we have to create an account and a database for concourse. First, let’s create the Linux account. I’m calling mine “concourse” because I’m creative like that.
adduser concourse
passwd concourse
Next, we create the account (aka “role” or “user”) in postgres via the createuser command. In order to do this, we have to switch to the postgres account; do that with sudo:
sudo -i -u postgres
Now, while logged in as postgres, we can use the createuser command
createuser --interactive
You’ll enter the name of the account, and answer a couple of special permissions questions.
While still logged in as postgres, run this command to create a new database for concourse. I’m naming my database “concourse” – my creativity is legendary. Actually, I think it makes life easier if the role and database are named the same.
createdb concourse
Test by switching users to the concourse account and making sure it can run psql against the concourse database. While in psql, use this command to set the password for the account in postgres:
ALTER ROLE concourse WITH PASSWORD 'changeme';
Type \q to exit psql
Concourse
Ok, we have a running postgresql service and an account to be used for concourse. Let’s go.
Create a folder for concourse. I used /concourse, but you can use /var/lib/whatever/concourse if you feel like it.
In the script (the same start.sh shown in the Ubuntu walkthrough above), “external_url” should use the IP address of the VM it’s running on, and the username and password values in the postgres-data-source should reflect what you set up earlier. Save the file and be sure to set it as executable (chmod +x ./start.sh)
Run the script “./start.sh”. You should see several lines go by concerning worker-collectors and builder-reapers.
If you instead see a message about authentication, you’ll want to make sure that 1) the credentials in the script are correct, 2) the account’s password has been set in both Linux and postgres and 3) the pg_hba.conf file has been updated to use md5 instead of ident
If you instead see a message about the connection not accepting SSL, be sure that the connection string in the script includes “?sslmode=disable” after the database name
Test by pointing a browser at the value you assigned to the external_url. You should see “no pipelines configured”
Success!
Back in your SSH session, you can kill it with <CTRL>+C
Finishing Up
Now we just have to make sure that concourse starts when the system reboots. I am certain that there are better/safer/more reliable ways to do this, but here’s what I did: add “/concourse/start.sh” to /etc/rc.local above the “exit 0” line, just as in the Ubuntu walkthrough above. Note that on CentOS 7, rc.local must also be marked executable (chmod +x /etc/rc.d/rc.local) for it to run at boot.
Edit – 2/1/17 – Updated with OpenSSL configuration detail
Edit – 3/20/17 – Updated SubjectAltNames in config
Preparation
SSL Certificate. You’ll need the signed public cert for your URL (certnew.cer), the associated private key (pcf.key) and the public cert of the signing CA (root64.cer).
Download and install OpenSSL
Create a config file for your request – paste this into a text file:
[ req_distinguished_name ]
countryName = US
stateOrProvinceName = State
localityName = City
0.organizationName = Company Name
organizationalUnitName = PCF
commonName = *.pcf.domain.com
Replace the values with those appropriate for your environment. Be sure to specify the server name and IP address as the Virtual IP and its associated DNS record. Save the file as pcf.cfg. You’ll want to use the wildcard “base” name as the common name and the server name, as well as the *.system, *.apps, *.login.system and *.uaa.system Subject Alt Names.
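To carry those Subject Alt Names into the request, pcf.cfg needs a bit more than the distinguished-name section. A sketch of the full file (the [ req ] and [ v3_req ] plumbing is standard OpenSSL; the domain names are examples matching the common name above):
[ req ]
prompt             = no
default_bits       = 2048
distinguished_name = req_distinguished_name
req_extensions     = v3_req

[ v3_req ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = *.pcf.domain.com
DNS.2 = *.system.pcf.domain.com
DNS.3 = *.apps.pcf.domain.com
DNS.4 = *.login.system.pcf.domain.com
DNS.5 = *.uaa.system.pcf.domain.com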
Use OpenSSL to create the Certificate Signing Request (CSR) for the wildcard PCF domain.
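With prompting disabled in the config, one command generates both the private key and the CSR (file names here match the rest of this walkthrough):
openssl req -new -newkey rsa:2048 -nodes -keyout pcf.key -out pcf.csr -config pcf.cfg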
Use OpenSSL to convert the key to RSA (required for NSX to accept it)
openssl rsa -in pcf.key -out pcfrsa.key
Submit the CSR (pcf.csr) to your CA (Microsoft Certificate Services in my case), retrieve the certificate (certnew.cer) and certificate chain (certnew.p7b) base-64 encoded.
Double-click certnew.p7b to open certmgr. Export the CA certificate as Base-64 encoded X.509 to a file (root64.cer is the file name I use)
Networks. You’ll need to know what layer 3 networks the PCF components will use. In my case, I set up a logical switch in NSX and assigned the gateway address to the DLR. Probably should make this a 24-bit network, so there’s room to grow, but not reserving a ridiculous number of addresses. We’re going to carve up the address space a little, so make a note of the following:
Gateway and other addresses you typically reserve for network devices. (eg: first 9 addresses 1-9)
Address that will be assigned to the NSX load balancer. Just need one (eg: 10)
Addresses that will be used by the PCF Routers. At least two. These will be configured as members in the NSX Load Balancer Pool.
DNS, IP addresses. PCF will use “system” and “apps” subdomains, plus whatever names you give any apps deployed. This takes some getting used to – not your typical application. Based on the certificate we created earlier, I recommend just creating a “pcf” subdomain. In my case, the network domain (using AD-DNS) is ragazzilab.com and I’ve created the following:
pcf.ragazzilab.com subdomain
*.pcf.ragazzilab.com A record for the IP address I’m going to assign to the NSX Load-Balancer
NSX
Assuming NSX is already installed and configured, create or identify an existing NSX Edge that has an interface on the network where PCF will be / is deployed.
Assign the address we noted above to the interface under Settings|Interfaces
Under Settings|Certificates, add our SSL certificates
Click the Green Plus and select “CA Certificate”. Paste the content of the signing CA public certificate (root64.cer) into the Certificate Contents box. Click OK.
Click the Green Plus and select “Certificate”. Paste the content of the signed public cert (certnew.cer) into the Certificate Contents box and paste the content of the RSA private key (pcfrsa.key) into the Private Key box. Click OK.
Under Load Balancer, create an Application Profile. We need to ensure that NSX inserts the x-forwarded-for HTTP headers. To do that, we need to be able to decrypt the request and therefore must provide the certificate information. I found that Pool Side SSL had to be enabled and using the same Service and CA Certificates.
Router Application Profile
Create the Service Monitor. What worked for me is a little different from what is described in the GoRouter project page. The key points are that we want to specify the user agent and look for a response of “ok” with a header of “200 OK”.
Service Monitor for PCF Router
Create the Pool. Set it to ROUND-ROBIN using the Service Monitor you just created. When adding the routers as members, be sure to set the port to 443, but the Monitor Port to 80.
Router Pool
Create the Virtual Server. Specify the Application Profile and default Pool we just created. Obviously, specify the correct IP Address.
Virtual Server Configuration
PCF – Ops Manager
Assuming you’ve already deployed the Ops Manager OVF, use the installation dashboard to edit the configuration for Ops Manager Director. I’m just going to highlight the relevant areas of the configuration here:
Networks. Under “Create Networks”, be sure that the Subnet specified has the correct values. Pay special attention to the reserved IP ranges. These should be the addresses of the network devices and the IP address assigned to the load-balancer. Do not include the addresses we intend to use for the routers though. Based on the example values above, we’ll reserve the first 10 addresses.
Ops Manager Network Config
Ops Manager Director will probably use the first/lowest address in the range that is not reserved.
PCF – Elastic Runtime
Next, we’ll install Elastic Runtime. Again, I’ll highlight the relevant sections of the configuration.
Domains. In my case it’s System Domain = system.pcf.ragazzilab.com and Apps Domain = apps.pcf.ragazzilab.com
Networking.
Set the Router IPs to the addresses (comma-separated) you noted and added as members to the NSX load-balancer pool earlier.
Leave HAProxy IPs empty
For the point-of-entry option, select “external load balancer, and it can forward encrypted traffic”
Paste the content of the signed certificate (certnew.cer) into the Certificate PEM field. Paste the content of the CA public certificate (root64.cer) into the same field, directly under the certificate content.
Paste the content of the private key (pcf.key) into the Private Key PEM field.
Check “Disable SSL Certificate verification for this environment”.
Resource Config. Be sure that the number of Routers is at least 2 and equal to the number of IP addresses you reserved for them.
Troubleshooting
Help! The Pool Status is down when the Service Monitor is enabled.
This could occur if your routers are behaving differently from mine. Test the response by sending a request to one of the routers through curl and specifying the user agent as HTTP-Monitor/1.1
curl -v -A "HTTP-Monitor/1.1" "http://{IP of router}"
Testing router with curl
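Per the monitor described above, the interesting parts of the response look roughly like this (abbreviated curl output):
< HTTP/1.1 200 OK
...
ok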
The status line (“HTTP/1.1 200 OK”) should go into the “Expected” field of the Service Monitor and the response body (“ok”) should go into the “Receive” field. Note that you should not get a 404 response; if you do, check that the user agent is set correctly.
Notes
This works for me and I hope it works for you. If you have trouble or disagree, please let me know.
After installing Pivotal Cloud Foundry (PCF) on vSphere, you’ll have a collection of at least 21 (probably closer to 60!) VMs with names that probably don’t match anyone’s convention. Although, as noted in the PCF documentation, there is a correct order to starting up and shutting down the VMs in PCF, the installer does not configure a vApp so that we can control that order. So, I dragged all the PCF VMs into a vApp and started trying to determine which ones are in which role and quickly realized that it’s a pain.
Creating an AZ in Ops Manager on vSphere
As an aside, when you create your Availability Zone, you point it at a vSphere cluster and, optionally, a Resource Pool. Unfortunately, if you specify a vApp Name instead of a Resource Pool name, BOSH will fail to deploy the VMs. So, I typically leave the Resource Pool field blank and then drag the VMs into a vApp post-deployment.
I put together a workflow that will help place the PCF VMs into correct startup/shutdown groups for you.
If you haven’t already, create a new vApp in your cluster and drag the Ops Manager, Ops Manager Director and all of the Elastic Runtime VMs into the vApp
Run the “PCFvAppStartupOrder” workflow, select your new vApp as the input, click Submit
If the PCF installation is scaled out to more VMs, just drag them to the vApp and rerun the workflow
The deployment, job and director custom fields are read for each VM in the vApp to get the VM’s assigned role
For the Ops Manager, the Notes field is read and if found, it is placed at the top of the startup sequence
Unknown VMs are assigned a startup order higher than the last in the array. This way, they start last and power-off first
Unknown VMs are those where the “deployment” field does not start with “cf”; with exceptions for Ops Manager (Notes field) and Ops Manager Director (“director” field value is “bosh-init”)
Additional suggestions and notes
Adjust the resources for the vApp based on VMware best practices and what makes sense for your environment
Use this at your own risk, there is no implied warranty