Adding a new project into kubectl/kubectx cli

I use two main command line tools when it comes to Kubernetes – kubectx and kubectl.

Kubectx is a great little utility for switching between Kubernetes clusters, and kubectl is the official utility for managing and deploying Kubernetes applications.

Adding a context into kubectx/kubectl

Go to the Google Cloud console, navigate to Kubernetes Engine, and click the Connect button on the cluster. This provides the gcloud command you need to run on your machine to configure access to the cluster locally.
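For reference, the command it gives you is a gcloud container clusters get-credentials call along these lines (the zone and project here are placeholders rather than our real values):

gcloud container clusters get-credentials gke-dbemohsin-cluster-sandbox \
    --zone europe-west4-a --project my-gcp-project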

As an example, let's say we have just added gke-dbemohsin-cluster-sandbox:

MYMAC:cloud-architecture dbamohsin$ kubectx
DEV
PREPROD
PROD
TESTING
gke-dbemohsin-cluster-sandbox

To rename a context in kubectx:

kubectx SANDBOX=gke-dbemohsin-cluster-sandbox
Context "gke-dbemohsin-cluster-sandbox" renamed to "SANDBOX".

To add a wrapper alias around context switching, especially if some contexts run under different gcloud accounts, you can add the commands to your bash_profile:

alias k8-sandbox='gcloud config configurations activate dbemohsin; kubectx SANDBOX'

and to call the alias:

MYMAC:~ dbamohsin$ k8-sandbox
Switched to context "SANDBOX".

 


Troubleshooting Kubernetes to GCP Mongo Atlas connectivity – so where do you start?

**Update – 03/09/2018**

Since I wrote the original post we have seen further connectivity issues getting Istio working with MongoDB replica sets. The error looks something like this:

pymongo.errors.ServerSelectionTimeoutError: No replica set members match selector “Primary()”

This has been raised as an issue with Istio 1.0.1 – https://github.com/istio/istio/issues/8410

We believe that Istio is not routing traffic correctly when it comes to replica set hosts, and is treating the host list in the service entry as a load-balancing pool rather than simply a list of nodes that can be reached externally. This is a big problem for a clustering-based data store such as MongoDB, where connection flow is incredibly important. This is a working theory rather than something that has been proven.

As a result of this issue we are no longer using the outbound port whitelisting feature in Istio, as it turned out this feature had been the cause of a few unrelated issues in our environment. Although the post below is still valid in terms of troubleshooting possible connectivity issues, we are no longer using Service Entries for outbound traffic interception.

************************************

Original Post

In the last few weeks, we have started to prove out running a latency-sensitive application in a k8s container against a database running on MongoDB Atlas, which is also in GCP.

Scenario

We currently have a service running in Kubernetes in europe-west4, but the metrics show that it runs around 5x slower on average when it connects to an on-premise MongoDB 3.2 database.

We wanted to get the database closer to the application, and since the general availability of GCP as a region in Mongo Atlas, this became a lot easier to try out. MongoDB kindly gave us some Atlas credits to try this out, and within 10-15 minutes, I had the following running in Mongo Atlas:

  • 3 node replica set running v3.6 with WiredTiger
    • 2 nodes in Europe-West4 (Netherlands)
    • 1 node in Europe-West1 (Belgium)
  • M30 Production Cluster
    • 7.5GB RAM
    • 2 vCPUs
    • Storage
      • 100GB Storage
      • 6000 IOPS
      • Auto Expand Storage
      • Encrypted Disk

“Only Available in AWS”

Get used to hearing and seeing this both in the documentation and when discussing with Support or Account Managers. As expected, not all features are available in GCP yet; the main ones frustrating us (as of August 2018) are VPC Peering and Encryption at Rest.

VPC Peering

Atlas supports VPC peering with other AWS VPCs in the same region. MongoDB does not support cross-region VPC peering. For multi-region clusters, you must create VPC peering connections per-region.

Depending on the volume we push to and from our databases, lack of support for VPC Peering may be cost-prohibitive, as you end up paying egress costs twice: once as your data leaves your K8s application project on data entry, and again when it leaves GCP Mongo Atlas on data retrieval. Mongo generally gives a cost estimate of 7-10% of cluster size for egress costs.

Encryption at Rest

The following restrictions apply to Encryption at Rest on an Atlas cluster:

  • Continuous Backups are not supported. When enabling backup for a cluster using Encryption at Rest, you must use Cloud Provider Snapshots from AWS or Azure to encrypt your backup snapshots.
  • You cannot enable Encryption at Rest for clusters running on GCP.

So, currently we cannot take advantage of this option because we would eventually want to run backups via Atlas.

Administrators who deploy clusters on GCP and want to enable backup should keep those clusters in a separate project from deployments that use Encryption at Rest or Cloud Provider Snapshots.

Problem

We changed the URI connection string for our test application running in k8s to point towards the Mongo Atlas cluster. Something like this:

uri = mongodb://node-shard-00-00.gcp.mongodb.net:27017,node-shard-00-01.gcp.mongodb.net:27017,node-shard-00-02.gcp.mongodb.net:27017/myDB?ssl=true&replicaSet=node-shard-0&authSource=admin&compressors=snappy

However, the application was failing its readiness check (healthcheck on pod startup could not connect to the mongo database).
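For context, the readiness check here is just a standard Kubernetes readinessProbe on the container – something along these lines (the path and timings are illustrative rather than our actual configuration; the port matches the 8080 the app serves on):

readinessProbe:
  httpGet:
    path: /healthcheck
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10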

Troubleshooting Avenues

Kubernetes is not easy to diagnose issues in – especially for novices like me!

Using kubectl logs on the pod was not a great help in this instance, but that might be down to application configuration and what it sends to stdout and stderr (which is what kubectl logs captures).

This doesn't tell me much in terms of error logging:

kubectl logs my-k8s-service-5f9b978fc8-d6ddc -n my-k8s-service -c master

Using config environment: PREPROD
[24/Aug/2018:13:29:17] ENGINE Bus STARTING
[24/Aug/2018:13:29:17] ENGINE Started monitor thread 'Autoreloader'.
[24/Aug/2018:13:29:17] ENGINE Started monitor thread '_TimeoutMonitor'.
[24/Aug/2018:13:29:17] ENGINE Serving on http://0.0.0.0:8080
[24/Aug/2018:13:29:17] ENGINE Bus STARTED

Replicate the connectivity issue locally on the pod

We knew our readiness check was failing on connectivity, so running an interactive bash terminal inside the pod and trying to make a connection to the Atlas replica set without the application was a fair shout.

kubectl exec -it my-k8s-service-5f9b978fc8-d6ddc -c master bash -n my-k8s-service

Then I generated a manual mongo URI connection using Python (the application runs in Python) and tried to retrieve the database collections.

# start a Python 3.5 shell inside the pod
/usr/bin/scl enable rh-python35 python

# connect using the uri and list the collections
from pymongo import MongoClient
uri = "mongodb://myuser:mypassword@node-shard-00-00.gcp.mongodb.net:27017,node-shard-00-01.gcp.mongodb.net:27017,node-shard-00-02.gcp.mongodb.net:27017/myDB?ssl=true&replicaSet=node-shard-0&authSource=admin&retryWrites=true&compressors=snappy"
client = MongoClient(uri)
db = client.myDB
collection = db.myCollection
db.collection_names(include_system_collections=False)

and this is what happened:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/atcloud/.local/lib/python3.5/site-packages/pymongo/database.py", line 715, in collection_names
    nameOnly=True, **kws)]
  File "/home/atcloud/.local/lib/python3.5/site-packages/pymongo/database.py", line 674, in list_collections
    read_pref) as (sock_info, slave_okay):
  File "/opt/rh/rh-python35/root/usr/lib64/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/home/.local/lib/python3.5/site-packages/pymongo/mongo_client.py", line 1099, in _socket_for_reads
    server = topology.select_server(read_preference)
  File "/home/.local/lib/python3.5/site-packages/pymongo/topology.py", line 224, in select_server
    address))
  File "/home/.local/lib/python3.5/site-packages/pymongo/topology.py", line 183, in select_servers
    selector, server_timeout, address)
  File "/home/.local/lib/python3.5/site-packages/pymongo/topology.py", line 199, in _select_servers_loop
    self._error_message(selector))
pymongo.errors.ServerSelectionTimeoutError: SSL handshake failed: node-shard-00-01.gcp.mongodb.net:27017: EOF occurred in violation of protocol (_ssl.c:645),SSL handshake failed: node-shard-00-02.gcp.mongodb.net:27017: EOF occurred in violation of protocol (_ssl.c:645),SSL handshake failed: node-shard-00-00.gcp.mongodb.net:27017: EOF occurred in violation of protocol (_ssl.c:645)

Failing without the application – that's a good sign, as it now allows us to do some quick fail-fast testing with the URI connection code above!

Let's copy the pod and start it in a testing environment – to start testing changes more aggressively.

Create a mytestpod.yaml file with a basic pod configuration with the same base image as the application:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    sidecar.istio.io/inject: "true"
  name: dbamohsin-istio-test-pod
spec:
  containers:
  - args:
    - "10000"
    image: eu.gcr.io/somewhere/my-k8s-service
    imagePullPolicy: Always
    name: master
    command: ["/bin/sleep"]

and create it in kubernetes:

kubectl apply -f mytestpod.yaml

Then go into an interactive terminal with bash

kubectl exec -it dbamohsin-istio-test-pod bash

and run the same URI mongo connection as before. Doing this gave the same Python connectivity error shown earlier in the post.

Again, this is a good sign as now we have moved our issue to a less strict environment where we can test changes better.

Can we test the same thing without the istio sidecar…sure!

Delete the pod:

# There are 2 ways of doing this:
kubectl delete -f mytestpod.yaml
kubectl delete pod dbamohsin-istio-test-pod

Modify mytestpod.yaml and comment out the istio sidecar annotation

...
metadata:
  annotations:
#    sidecar.istio.io/inject: "true"
...

Then re-create the pod and go into an interactive terminal with bash.

Retry the mongo URI… and it worked!

Python 3.5.1 (default, Oct 21 2016, 21:37:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pymongo import MongoClient
>>> uri = "mongodb://myuser:mypassword@node-shard-00-00.gcp.mongodb.net:27017,node-shard-00-01.gcp.mongodb.net:27017,node-shard-00-02.gcp.mongodb.net:27017/myDB?ssl=true&replicaSet=node-shard-0&authSource=admin&retryWrites=true"
>>> client = MongoClient(uri)
>>> db = client.myDB
>>> collection = db.myCollection
>>> db.collection_names(include_system_collections=False)
['col1', 'col2', 'col3', 'col4']

So this narrowed it down to something we potentially hadn't configured in Istio.

Connectivity from GCP K8s to GCP Mongo

We use Istio for service discovery and unifying traffic flow management (Auto Trader Case Study on Computer World UK). I didn't know beforehand, but we restrict egress traffic out of our GCP projects, and as a result we needed a service entry in Kubernetes to allow 27017. Some MESH_EXTERNAL services documentation – https://istio.io/blog/2018/egress-tcp/

The good thing is that there is a specific mongo protocol available in Istio. We added a service entry to our testing environment as shown below:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: egress-mongo-all
spec:
  hosts:
  - node-shard-00-00.gcp.mongodb.net
  - node-shard-00-01.gcp.mongodb.net
  - node-shard-00-02.gcp.mongodb.net
  ports:
  - number: 27017
    name: mongo
    protocol: MONGO
  resolution: DNS

and deployed:

kubectl apply -f egress-mongo-all.yaml

Changes from service entries are applied to pods immediately.

Check it's deployed:

kubectl get serviceentry -n core-namespace

and double check the config:

kubectl get serviceentry egress-mongo-all -n core-namespace -o yaml

All good.

Re-enable istio and retry

We re-added the istio sidecar annotation into our mytestpod.yaml file and redeployed. This time, even with Istio, the mongo connection worked fine and retrieved the data.

Other connectivity areas to check

Connectivity into GCP Mongo…Whitelists!

Defining a whitelist can be very tricky, as providing a Google IP range for our K8s applications is more or less impossible – or would cover most of the internet! Only the DNS is static for K8s services, so the IP for an application pod can change multiple times a day, meaning it is more or less impossible to maintain a tight IP whitelist bound to applications unless an automated check is implemented.
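If you do want something automated, one rough starting point is to check what source IP Atlas would actually see from a pod – for example (assuming the image has curl available and egress is not blocked; ifconfig.me is just one of many echo-your-IP services):

kubectl exec dbamohsin-istio-test-pod -c master -- curl -s https://ifconfig.me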

We have several different viewpoints here internally, ranging from "we should have a structured IP whitelist" to "why do we even need a whitelist? Our authentication and encryption should be at a level where we can run securely with a 0.0.0.0/0 IP whitelist".

Don't forget client connectivity…

Bear in mind that you may have an on-premise firewall between your laptop and Mongo Atlas, and may need your network administrator to enable outbound access on 27017 to your Atlas nodes.
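A quick way to check this from your own machine is to probe the port directly, e.g. with netcat (the hostname is one of the Atlas nodes from earlier in the post):

nc -vz node-shard-00-00.gcp.mongodb.net 27017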

Takeaway points

  • Isolate the issue quickly
    • We took the application out of the equation first
    • Then proved that it wasn't that specific Google project by moving environments
    • and finally isolated the networking sidecar (Istio)
  • Find an easy way to reproduce the issue
    • We created a simple bit of code to quickly test connectivity using the same language as the application (Python)
    • Created a dummy pod that we could quickly trash and recreate and test against BUT ensured the fundamentals of the application remained (Used the same image as the application for the dummy pod)
  • Have an environment where you can potentially break stuff. An infrastructure testing area of some kind.

How to list containers running in a kubernetes pod

Unfortunately, there isn't a simple get command that can be run to retrieve the containers running inside pods. However, kubectl's custom-columns output option can be used to get this information out.

For more info on custom-columns visit the kubectl docs – https://kubernetes.io/docs/reference/kubectl/overview/#custom-columns

The following command will list the pod name and container names for all pods in the default namespace:

kubectl get pods -o=custom-columns=NAME:.metadata.name,CONTAINERS:.spec.containers[*].name

The output is something like this:

[screenshot: kubectl_containers]

You can also get just the container names via the jsonpath option:

kubectl get pods mongod-1 -o jsonpath='{.spec.containers[*].name}'

The reason I did a bit of research (5 minutes!) into this was because the container name is required when running the exec command in kubectl, e.g.:

kubectl exec -ti mongod-1 -c mongod-container bash

The remote copy of database x has not been rolled forward to a point in time that is encompassed in the local copy

When setting up database mirroring, it is possible to receive the following error when starting the mirroring session between the Principal and Mirror databases:

The remote copy of database x has not been rolled forward to a point in time that is encompassed in the local copy

Assuming the steps taken follow a pattern similar to below:

  1. BACKUP database DBMTest on SQLINS01
  2. BACKUP log DBMTest on SQLINS01
  3. Copy db and log backup files to SQLINS02
  4. RESTORE DBMTest with norecovery
  5. RESTORE log DBMTest with norecovery
  6. create endpoints on both SQLINS01 and SQLINS02
    CREATE ENDPOINT [Mirroring]
    STATE=STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
    FOR DATA_MIRRORING (ROLE = PARTNER)
  7. Enable mirror on MIRROR Server SQLINS02
    :connect SQLINS02
    ALTER DATABASE DBMTest
    SET Partner = N'TCP://SQLINS01.mydomain.com:5022';
  8. Enable mirror on PRINCIPAL server SQLINS01
    :connect SQLINS01
    ALTER DATABASE DBMTest
    SET Partner = N'TCP://SQLINS02.mydomain.com:5022';

The error would appear at step 8.

Normally the error occurs because the backup file for the log or full backup has not been initialized and therefore contains more than one backup of the database.

Add the WITH INIT option to the backups to create a freshly initialized backup file, for example:
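A rough sketch using the database from the steps above (the backup paths are placeholders):

BACKUP DATABASE DBMTest
TO DISK = N'D:\Backups\DBMTest.bak'
WITH INIT;

BACKUP LOG DBMTest
TO DISK = N'D:\Backups\DBMTest.trn'
WITH INIT;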

ORA-00020: maximum number of processes exceeded

We got the following error on our Development 12.1.0.2 database

866322-Wed Mar 28 15:56:40 2018
866347:ORA-00020: maximum number of processes (1256) exceeded

An easy way to see how exhausted your processes/sessions are getting is by running:

SELECT 
inst_id,resource_name, 
current_utilization, 
max_utilization, 
limit_value 
FROM gv$resource_limit 
WHERE resource_name in ('processes','sessions')

Which will give something like this – max utilization is the high water mark level:

[screenshot: gv$resource_limit output – 2018-03-29_09-37-10]

 

From the screenshot above, you can see the processes on instance 1 have been fully exhausted at some point, but currently are fine.

You can also see which machines/schemas are causing any potential process exhaustion:

select distinct
        s.inst_id,
        s.username,
        s.machine,
        count(*)
from    gv$session s,
        gv$process p
where   s.paddr       =  p.addr
and     s.inst_id     =  p.inst_id
GROUP BY         s.inst_id,
        s.username,
        s.machine
ORDER BY 4 desc;

Nice summary of the difference between Processes, Sessions and Connections from AskTom:

A connection is a physical circuit between you and the database. A connection might be one of many types — most popular being DEDICATED server and SHARED server. Zero, one or more sessions may be established over a given connection to the database as shown above with sqlplus. A process will be used by a session to execute statements. Sometimes there is a one to one relationship between CONNECTION->SESSION->PROCESS (eg: a normal dedicated server connection). Sometimes there is a one to many from connection to sessions (eg: like autotrace, one connection, two sessions, one process). A process does not have to be dedicated to a specific connection or session however, for example when using shared server (MTS), your SESSION will grab a process from a pool of processes in order to execute a statement. When the call is over, that process is released back to the pool of processes.

Insufficient access rights to perform the operation when running setspn

When attempting to add an SPN to a service account for SQL Server, you may get the following error if you are not a domain admin:

setspn -S MSSQLSvc/VSQLDEV01.DOMAIN DOMAIN\SVCACCOUNT.SVC
Checking domain DC=..,DC=....,DC=..,DC=..

Registering ServicePrincipalNames for CN=vsqldev01 svc,OU=Service Accounts,OU=Shared Resources,OU=..,DC=..,DC=
...,DC=..,DC=..
 MSSQLSvc/VSQLDEV01.DOMAIN
Failed to assign SPN on account 'CN=vsqldev01 svc,OU=Service Accounts,OU=Shared Resources,OU=E..,DC=.,DC=...
,DC=..,DC=..', 
error 0x2098/8344 
-> Insufficient access rights to perform the operation.

If you're lucky enough, get your domain admin to give you the required permissions against the OU in Active Directory. They would need to do the following:

On a Domain Controller, run adsiedit.msc (doing this via the normal dsa.msc console will not expose the SPN permissions that need to be added).

Then run the following sequence of actions:

  1. Right-click on the OU and select Properties
  2. Select the "Security" tab
  3. Select the "Advanced" tab
  4. Select the "Add" button
  5. Enter the security principal name and click OK
  6. On the Properties tab, set "Apply to:" to "Descendant User objects"
  7. Under Permissions, tick:
    • Read servicePrincipalName - Allow
    • Write servicePrincipalName - Allow
  8. Click OK on each open dialog to save the changes

 

Set up Windows Auth to a SQL Instance with SQL Ops Studio on Mac OS

Microsoft released a preview of SQL Ops Studio in the last week or two, and as a Mac user I was interested to see how well the interface would work compared to SSMS.

Here is a quick intro to the product from Microsoft: https://www.youtube.com/watch?v=s5DopE7ktwo
More details about Ops Studio can be found here: https://docs.microsoft.com/en-us/sql/sql-operations-studio/what-is
Download – https://docs.microsoft.com/en-us/sql/sql-operations-studio/download

For reference, I am using macOS Sierra version 10.12.6, and this is what I did to get Windows authentication working properly.

If you try to get Windows AD Auth working, you might initially see this error:

[screenshot: dbamohsin-opsstudio-connectionerror]

The link in the message above takes you to: Connect SQL Operations Studio (preview) to your SQL Server using Windows authentication – Kerberos

There are 3 main areas to configure before Windows Auth to a SQL Instance works.

Service Principal Names (SPN)

A service principal name (SPN) is a unique identifier of a service instance. SPNs are used by Kerberos authentication to associate a service instance with a service logon account. This allows a client application to request that the service authenticate an account even if the client does not have the account name.

As an example, on a SQL clustered instance called VSQLDEV01, you can check if the SPN is set by running the following in a command prompt/PowerShell terminal:

setspn -L VSQLDEV01

If it doesn't return a row that contains the following:

MSSQLSvc/VSQLDEV01.<DOMAIN.COMPANY.COM>

MSSQLSvc/FQDN:port | MSSQLSvc/FQDN, where:

  • MSSQLSvc is the service that is being registered.
  • FQDN is the fully qualified domain name of the server.
  • port is the TCP port number.

You can add an SPN to register the service account by doing the following:

setspn -A MSSQLSvc/VSQLDEV01.DOMAIN.COMPANY.COM DOMAIN\SQLSERVICEACC

Get Key Distribution Center (KDC) and join Domain

The KDC is usually just the FQDN of your Domain Controller – fairly straightforward to find out via the nltest command on a Windows machine:

nltest /dsgetdc:DOMAIN.COMPANY.COM

Configure KDC in krb5.conf on your Mac

Edit /etc/krb5.conf in an editor of your choice and configure the following keys:

sudo vi /etc/krb5.conf

[libdefaults]
  default_realm = DOMAIN.COMPANY.COM

[realms]
DOMAIN.COMPANY.COM = {
   kdc = dc-33.domain.company.com
}

Then save the krb5.conf file and exit

Test Granting and Retrieving a ticket from the KDC

On your Mac, run the following in a terminal:

kinit username@DOMAIN.COMPANY.COM

Authenticate and then check the ticket has been granted:
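On macOS (which ships Heimdal Kerberos) you can list the cached tickets with klist – the credentials cache output below is the sort of thing it prints:

klist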

Credentials cache: API:9999E999-99CA-9999-AC9C-99A999999D99
        Principal: ADUSER@DOMAIN.COMPANY.COM
Issued                Expires               Principal
Mar  8 07:55:10 2018  Mar  8 17:55:01 2018  krbtgt/DOMAIN@DOMAIN

Hopefully, if the instructions are followed, then you should be ready to go!

Make a connection using Windows Auth via SQL Ops Studio.

To test that a connection has authenticated via KERBEROS, you can check in SQL once a connection is made:

SELECT 
session_ID, 
connect_time, 
net_transport, 
protocol_type, 
auth_scheme 
FROM sys.dm_exec_connections
WHERE auth_scheme = 'KERBEROS'

Should return your connected session:

[screenshot: Auth_scheme]