Overview

PKS 1.1.5 was recently released and has a number of important bug fixes and improvements. These include:

Support for NSX-T 2.2
NCP 2.2.1
TLS support for Kubernetes Ingress
NCP is no longer a Kubernetes Pod and is now a Linux process running on the master nodes
NCP no longer creates duplicate virtual servers when it restarts, which previously happened whenever a master VM was restarted. This was a problem because the load balancer supports a maximum of 10 virtual servers. Once that limit was reached, you could no longer create Kubernetes load-balanced services or ingresses.
All virtual servers are now removed when deleting multi-port Kubernetes services. Previously virtual servers would be left behind, which again would cause the LB to hit the maximum of 10 virtual servers.
Running the pks delete-cluster command now cleans up NSX-T-related resources even if the cluster is in a bad state. Previously this required running the PKS NSX-T cleanup script.

See the release notes for more information.

Recovering from some of these issues required running multiple API calls and was kind of a pain. I’ve been putting the 1.1.5 release through its paces in multiple lab environments and it’s resolved all of the issues that I was running into.

This walkthrough shows how to upgrade from PKS 1.1.x to 1.1.5.

Upgrade Checklist

Read the Release Notes
Read the Documentation
Verify the health of the current environment:
- Run kubectl get nodes for every Kubernetes context and verify that all nodes are in the Ready state.
- Run kubectl get pods --all-namespaces for every Kubernetes context and verify that all pods are running.
- Run bosh -d service-instance_<UUID> instances --ps for each BOSH Kubernetes deployment and verify that all processes are in a running state.
- Make sure there are no issues at the IaaS layer. If using vSphere, verify that datastores have enough space, hosts have enough memory, there are no alarms, hosts are in a good state, etc.
Back up the environment
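
The node check is easy to script if you look after several clusters. Here is a minimal sketch, not part of the PKS tooling: not_ready_nodes and check_all_contexts are hypothetical helper names, and the loop assumes every context in your local kubeconfig points at a cluster you want to check.

```shell
# Print the name of every node whose STATUS column is not exactly "Ready".
# Input: the output of `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Hypothetical wrapper: run the check against every context in the kubeconfig.
# The parenthesised body runs in a subshell, so `ctx` does not leak out.
check_all_contexts() (
  for ctx in $(kubectl config get-contexts -o name); do
    echo "== ${ctx} =="
    kubectl --context "${ctx}" get nodes --no-headers | not_ready_nodes
  done
)
```

After sourcing the file, check_all_contexts prints one header per context followed by any node that is not Ready. A header with nothing under it means that cluster passed. A cordoned node reports Ready,SchedulingDisabled, so it is listed as well.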

Files Used

Upgrade the PKS Tile

In the Ops Manager portal, select Import a Product, browse to the PKS file and select it. When using Chrome, you can monitor the upload progress in the status bar:

It can take a while once it gets to Waiting for 10.40.14.3…

Once it’s finished you’ll need to select the + sign to add the product:

Import Stemcell

This release of PKS requires Stemcell 3586.36. After you download the stemcell, select Stemcell Library and then Import Stemcell.

Browse to where you downloaded the stemcell, select it and then select Apply Stemcell to Products.

Verify that the stemcell is applied to PKS:

Now the dashboard should be all green:

Upgrade the worker node size

Navigate to the Ops Manager Installation Dashboard.
Click the Pivotal Container Service tile.
Click Plan 1.
Under Worker VM Type, select a K8 worker VM type with a minimum disk size of 16 GB.

Verify the NSX-T Manager CA Cert settings

Navigate to the Ops Manager Installation Dashboard.
Click the Pivotal Container Service tile.
Click Networking.
Under NSX Manager CA Cert, make sure you either provide a valid NSX-T Manager certificate or check Disable SSL certificate verification, but not both.

Apply Changes

In the upper-right of the Ops Manager portal you should see the pending changes, which include updating PKS. Select Apply Changes to upgrade the environment.

Post Upgrade Checklist

Verify the health of the current environment:

Run kubectl get nodes for every Kubernetes context and verify that all nodes are in the Ready state.
Run kubectl get pods --all-namespaces for every Kubernetes context and verify that all pods are running.
Run bosh -d service-instance_<UUID> instances --ps for each BOSH Kubernetes deployment and verify that all processes are in a running state.
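
For the pod check, a small filter saves scanning a long listing by eye after the upgrade. This is only a sketch: unhealthy_pods is a hypothetical helper name, and treating Completed as healthy (pods left behind by finished jobs) is my own assumption, so drop that condition if you want a strict "everything is Running" check.

```shell
# Print "namespace/pod STATUS" for every pod that is neither Running nor Completed.
# Input: the output of `kubectl get pods --all-namespaces --no-headers` on stdin,
# where the columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Run it as kubectl get pods --all-namespaces --no-headers | unhealthy_pods for each context. No output means every pod passed the filter.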

NCP Changes

Starting in PKS 1.1.5, NCP runs as a BOSH host process. Each master VM runs one NCP process. One NCP process is active and the others are on standby.

Check NCP Process

Use BOSH to ssh into the master node and run monit summary

The Monit daemon 5.2.5 uptime: 8d 21h 33m

Process 'kube-apiserver'                  running
Process 'kube-controller-manager'         running
Process 'kube-scheduler'                  running
Process 'etcd'                            running
Process 'blackbox'                        running
Process 'ncp'                             running <<<<<<< this is the NCP process
Process 'bosh-dns'                        running
Process 'pks-helpers-bosh-dns-resolvconf' running
System 'system_localhost'                 running
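
To check a single process without reading the whole table, the summary can be filtered. This is a small sketch that assumes the Process 'name' state line layout shown in the output above; monit_state is a hypothetical helper name.

```shell
# Print the state that `monit summary` reports for one process.
# Usage: monit summary | monit_state <process-name>
monit_state() {
  # monit wraps the name in single quotes, so match on 'name' including the quotes.
  awk -v target="'$1'" '
    $1 == "Process" && $2 == target {
      $1 = ""; $2 = ""      # drop the "Process" and name fields
      sub(/^ +/, "")        # trim the separators left behind
      print                 # what remains is the state, e.g. "running"
    }'
}
```

On the master node, monit summary | monit_state ncp should print running when NCP is healthy. You may need to run monit as root.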

Check if the NCP process in this master is active or standby

Use BOSH to ssh into the master node and run /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status

This instance is the NCP master
Current NCP Master id is 03631258-f37d-41f7-8d78-9e4233995a23
Current NCP Instance id is 03631258-f37d-41f7-8d78-9e4233995a23
Last master update at Thu Aug 23 17:26:57 2018
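
If you want to check every master in turn, the role can be derived from the two id lines in that output. A hedged sketch: ncp_role is a hypothetical helper name, and it assumes a standby instance prints the same two lines with different ids. I have only shown the output of the active instance here, so verify that assumption in your own environment.

```shell
# Classify an NCP instance from `nsxcli -c get ncp-master status` output on stdin.
# Prints "active" when the master id equals this instance's id, "standby" when
# both ids are present but differ (assumption), and "unknown" otherwise.
ncp_role() {
  awk '
    /^Current NCP Master id is /   { master = $NF }
    /^Current NCP Instance id is / { instance = $NF }
    END {
      if (master == "" || instance == "") print "unknown"
      else if (master == instance)        print "active"
      else                                print "standby"
    }'
}
```

On a master node this would be used as /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status | ncp_role.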

Restart NCP service

Use BOSH to ssh into the master node and run monit restart ncp

Monitor the restart status with monit summary

The Monit daemon 5.2.5 uptime: 8d 21h 36m

Process 'kube-apiserver'                  running
Process 'kube-controller-manager'         running
Process 'kube-scheduler'                  running
Process 'etcd'                            running
Process 'blackbox'                        running
Process 'ncp'                             not monitored - restart pending <<<<<<< NCP is restarting
Process 'bosh-dns'                        running
Process 'pks-helpers-bosh-dns-resolvconf' running
System 'system_localhost'                 running

Note: Restarting the NCP process also triggers a cache rebuild.
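
Rather than re-running monit summary by hand, you can poll until NCP reports running again. This is a minimal sketch under a few assumptions: ncp_state and wait_for_ncp are hypothetical helper names, the defaults of 30 checks at 5 second intervals are arbitrary, and it only looks at the monit state, so it says nothing about whether the cache rebuild has finished.

```shell
# Print the state monit currently reports for the ncp process.
ncp_state() {
  monit summary | awk -v target="'ncp'" '
    $1 == "Process" && $2 == target { $1 = ""; $2 = ""; sub(/^ +/, ""); print }'
}

# Poll until ncp is "running" or the attempts run out.
# Usage: wait_for_ncp [attempts] [delay-seconds]
# The parenthesised body runs in a subshell, so its variables stay local.
wait_for_ncp() (
  attempts="${1:-30}"
  delay="${2:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if [ "$(ncp_state)" = "running" ]; then
      echo "ncp is running"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "ncp did not reach 'running' after ${attempts} checks" >&2
  return 1
)
```

On the master node, monit restart ncp && wait_for_ncp returns once the process is back, or exits non-zero if it never gets there.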