Scalable, Secure Application Load Balancing with VPC Native GKE and Istio

At the time of this writing, GCP does not have a generally available non-public facing Layer 7 load balancer. While this is sure to change in the future, this article outlines a design pattern which has been proven to provide scalable and extensible application load balancing services for multiple applications running in Kubernetes pods on GKE.

When you create a service of type LoadBalancer in GKE, Kubernetes hooks into the provider (GCP in this case) on your behalf to create a Google Load Balancer. While this may be specified as INTERNAL, there are two issues:

Issue #1:

The GCP load balancer created for you is a Layer 4 TCP load balancer.

Issue #2:

The normal behaviour is for Google to enumerate all of the node pools in your GKE cluster and “automagically” create a corresponding GCE instance group for each node pool in each zone the instances are deployed in. This means the entire surface area of your cluster is exposed to the external network, which may not be optimal for internal applications on a multi-tenanted cluster.

The Solution:

By combining Istio deployed on GKE, the Istio Ingress Gateway and an externally created load balancer, it is possible to get scalable HTTP load balancing with all the normal ALB goodness (stickiness, path-based routing, host-based routing, health checks, TLS offload, etc.).
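To make host-based and path-based routing concrete, here is a minimal Python sketch of the decision an application load balancer makes for each request. It is a conceptual model only, not Istio's implementation or API. The route table, hostnames and backend names are hypothetical, and longest-prefix matching is just one common strategy assumed for illustration.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    host: str         # exact hostname to match (hypothetical values below)
    path_prefix: str  # request path must start with this prefix
    backend: str      # name of the backend service to forward to


# Hypothetical route table for illustration only.
ROUTES = [
    Route("orders.team-a.example.internal", "/api/", "orders-api"),
    Route("orders.team-a.example.internal", "/", "orders-web"),
    Route("billing.team-b.example.internal", "/", "billing-web"),
]


def pick_backend(host: str, path: str) -> Optional[str]:
    """Return the backend with the longest matching path prefix for this host."""
    candidates = [
        r for r in ROUTES
        if r.host == host and path.startswith(r.path_prefix)
    ]
    if not candidates:
        return None  # no route: a real gateway would typically answer 404
    return max(candidates, key=lambda r: len(r.path_prefix)).backend


print(pick_backend("orders.team-a.example.internal", "/api/v1/orders"))  # orders-api
print(pick_backend("orders.team-a.example.internal", "/index.html"))     # orders-web
print(pick_backend("unknown.example.internal", "/"))                     # None
```

A Layer 4 load balancer cannot make this decision because it never sees the Host header or the request path, which is why the Ingress Gateway sits behind the TCP load balancer in this pattern.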

An abstract depiction of this architecture is shown here:

Istio Ingress Design Pattern for VPC Native GKE Clusters

This can be deployed with a combination of Terraform and kubectl. The steps to deploy at a high level are:

  1. Create a GKE cluster with at least two node pools: ingress-nodepool and service-nodepool. Ideally, create these node pools as multi-zonal for availability. You could also create additional node pools, for example one for your Egress Gateway or an operations-nodepool to host Istio.
  2. Deploy Istio.
  3. Deploy the Istio Ingress Gateway service on the ingress-nodepool using Service type NodePort.
  4. Create an associated Certificate Gateway using server certificates and private keys for TLS offload.
  5. Create a service in the service-nodepool.
  6. Reserve an unallocated static IP address from the node network range.
  7. Create an internal TCP load balancer:
    1. Specify the frontend as the IP address reserved in step 6.
    2. Specify the backend as the managed instance groups created during the node pool creation for the ingress-nodepool (ingress-nodepool-ig-a, ingress-nodepool-ig-b, ingress-nodepool-ig-c).
    3. Specify ports 80 and 443.
  8. Create a GCP Firewall Rule to allow traffic from authorized sources (network tags or CIDR ranges) to a target of the ingress-nodepool network tag.
  9. Create a Cloud DNS A Record for your managed zone as *.namespace.zone pointing to the IP Address assigned to the load balancer frontend in step 7.1.
  10. Allow GCP health checks through the firewall to reach, at a minimum, the ingress-nodepool network tag. There is no harm in allowing them to reach all node pools.

The service should then be resolvable and routable from authorized internal networks (peered private VPCs or internal networks connected via VPN or Dedicated Interconnect) as:

https://<service>.<namespace>.<zone>/<endpoint>
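As a small illustration of this naming scheme, the following Python sketch assembles the URL from its parts. The function name and the example service, namespace and zone values are hypothetical and not part of the original deployment.

```python
def service_url(service: str, namespace: str, zone: str, endpoint: str = "") -> str:
    """Build https://<service>.<namespace>.<zone>/<endpoint>."""
    # Strip stray dots so a zone written with a trailing dot still works.
    host = ".".join(part.strip(".") for part in (service, namespace, zone))
    return f"https://{host}/{endpoint.lstrip('/')}"


print(service_url("orders", "team-a", "example.internal", "healthz"))
# https://orders.team-a.example.internal/healthz
```

Because the DNS record is a wildcard for the namespace, new services in that namespace become resolvable without any further DNS changes.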

The advantages of this design pattern are:

  1. The Ingress Gateway provides fully functional application load balancing services.
  2. Istio provides service discovery and routing using names and namespaces.
  3. The Ingress Gateway service and ingress gateway node pool can be scaled as required to meet demand.
  4. The Ingress Gateway is multi-zonal for greater availability.

GCP Networking for AWS Professionals

GCP and AWS share many similarities: they both provide similar services and both leverage containerization, virtualization and software defined networking.

There are, however, some significant differences when it comes to their respective implementations, and networking is a key example of this.

Before we compare and contrast the two different approaches to networking, it is worthwhile noting the genesis of the two major cloud providers.

Google was born to be global, Amazon became global

By no means am I suggesting that Amazon didn’t have designs on going global from its beginnings, but AWS was driven (entirely at the beginning) by the needs of the Amazon eCommerce business. Amazon started in the US before expanding into other regions (such as Europe and Australia). In some cases the expansion took decades (Amazon only entered Australia as a retailer in 2018).

Google, by contrast, was providing application, search and marketing services worldwide from its very beginning. GCP, which was used as the vector to deliver these services and applications, was architected around this global model, even though its actual data centre expansion may not have been as rapid as AWS’s (for example, GCP opened its Australia region 5 years after AWS).

Their respective networking implementations reflect how their respective companies evolved.

AWS is a leader in IaaS, GCP is a leader in PaaS

This is only an opinion and may be argued, however if you look at the chronology of the two platforms, consider this:

  • The first services released by AWS (simultaneously for all intents and purposes) were S3, SQS and EC2
  • The first service released by Google was AppEngine (a pure PaaS offering)

Google has since launched and matured its IaaS offerings, just as AWS has done with its PaaS offerings, but the two started from very different places.

With all of that said, here are the key differences when it comes to networking between the two major cloud providers:

GCP VPCs are Global by default, AWS VPCs are Regional only

This is the first fundamental difference between the two providers. Each GCP project is allocated one VPC network with Subnets in each of the 18 GCP Regions, whereas each AWS Account is allocated one Default VPC in each AWS Region with a Subnet in each AWS Availability Zone for that Region. That is, each AWS account has 17 default VPCs, one in each of the 17 Regions (excluding GovCloud regions).

Default Global VPC Network in GCP

It is entirely possible to create VPCs in GCP which are Regional, but they are Global by default.

This global tenancy can be advantageous in many cases, but can be limiting in others. For instance, there is a limit of 25 peering connections to any one VPC in GCP, whereas the limit in AWS is 125.

GCP Subnets are Regional, AWS Subnets are Zonal

Subnets in GCP automatically span all Zones in a Region, whereas AWS VPC Subnets are each assigned to a single Availability Zone in a Region. This means you are abstracted from some of the networking and zonal complexity, but you have less control over specific network placement of instances and endpoints. You can infer from this design that Zones are replicated or synchronised within a Region, making them less of a direct consideration for High Availability (or at least not as much of your concern as they otherwise would be).
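The difference in subnet scope can be sketched with Python's standard ipaddress module. The CIDR ranges below are arbitrary examples, and the dictionaries are only an illustration of the addressing model, not a call to either provider's API.

```python
import ipaddress

vpc_range = ipaddress.ip_network("10.0.0.0/16")  # arbitrary example range

# AWS-style: each subnet lives in exactly one Availability Zone,
# so covering three zones needs at least three subnets.
aws_zones = ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"]
aws_subnets = dict(zip(aws_zones, vpc_range.subnets(new_prefix=24)))

# GCP-style: a single regional subnet spans every zone in the region.
gcp_zones = [
    "australia-southeast1-a",
    "australia-southeast1-b",
    "australia-southeast1-c",
]
gcp_regional_subnet = ipaddress.ip_network("10.0.0.0/20")
gcp_subnets = {zone: gcp_regional_subnet for zone in gcp_zones}

for zone, subnet in aws_subnets.items():
    print(f"AWS {zone}: {subnet}")
for zone, subnet in gcp_subnets.items():
    print(f"GCP {zone}: {subnet}")
```

In the AWS model, choosing a subnet also chooses a zone. In the GCP model, the zone is chosen per instance while the subnet stays the same.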

All GCP Firewall Rules are Stateful

AWS Security Groups are stateful firewall rules, meaning they maintain connection state for inbound connections. AWS also has Network ACLs (NACLs), which are stateless firewall rules. GCP has no direct equivalent of NACLs; however, GCP Firewall Rules are more configurable than their AWS counterparts. For instance, GCP Firewall Rules can include Deny actions, which is not an option with AWS Security Group Rules.
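To show why Deny actions add flexibility, here is a simplified Python sketch of priority-ordered allow/deny evaluation, the general model GCP firewall rules follow (a lower priority number wins, and nothing is admitted unless a rule allows it). The rule names and ranges are hypothetical. The sketch assumes ingress only with a single port per rule, and ignores protocols, tags and connection tracking.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    name: str
    priority: int      # lower number = higher priority
    action: str        # "allow" or "deny"
    source_range: str  # CIDR block the rule applies to
    port: int


def evaluate(rules, source_ip: str, port: int) -> str:
    """Return the action of the highest-priority matching rule.

    Ties on priority are broken in favour of "deny". If nothing matches,
    ingress is denied, mirroring an implied deny-all rule.
    """
    addr = ipaddress.ip_address(source_ip)
    matching = [
        r for r in rules
        if r.port == port and addr in ipaddress.ip_network(r.source_range)
    ]
    if not matching:
        return "deny"
    # Lowest priority number first; on equal priority "deny" sorts first.
    best = min(matching, key=lambda r: (r.priority, r.action != "deny"))
    return best.action


rules = [
    IngressRule("allow-internal-https", 1000, "allow", "10.0.0.0/8", 443),
    IngressRule("deny-untrusted-subnet", 900, "deny", "10.99.0.0/16", 443),
]

print(evaluate(rules, "10.1.2.3", 443))     # allow
print(evaluate(rules, "10.99.0.7", 443))    # deny
print(evaluate(rules, "192.168.1.1", 443))  # deny
```

With allow-only rules, as in AWS Security Groups, carving the untrusted subnet out of the broader range would mean rewriting the allow rule as several narrower ranges.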

Load Balancers in GCP are layer 4 (TCP/UDP) unless they are public facing

AWS Application Load Balancers can be deployed in private VPCs with no external IPs attached to them. GCP has Application Load Balancers (Layer 7 load balancers), but only for public facing applications; internal facing load balancers in GCP are Network Load Balancers. This presents some challenges with application level load balancing functionality such as stickiness. There are potential workarounds, however, such as NGINX in GKE behind

Firewall rules are at the Network Level not at the Instance or Service Level

There are simple firewall settings available at the instance level, but these are limited to allowing HTTP and HTTPS traffic to the instance and don’t allow you to specify sources. Detailed Firewall Rules are set at the GCP VPC Network level and are not attached to or associated with instances as they are in AWS.

Hopefully this is helpful for AWS engineers and architects being exposed to GCP for the first time!

Test Driven Infrastructure and Test Automation with Ansible, Molecule and Azure

A few years back, before the rise of the hyper-scalers, I had my first infracode ‘aha moment’ with OpenStack. The second came with Kitchen.

I had already been using test driven development for application code and configuration automation for infrastructure, but Kitchen brought the two together. Kitchen made it possible to write tests, spin up infrastructure, and then tear everything down again – the Red/Green/Refactor cycle for infrastructure. What made this even better was that it wasn’t a facsimile of a target environment, it was the same – same VMs, same OS, same network.

Coming from a Chef background for configuration automation, I found Kitchen a great fit for the Ruby ecosystem. Kitchen works with Ansible and Azure, but a Ruby environment and at least a smattering of Ruby coding skills are required.

Molecule provides a similar red-green development cycle to Kitchen, but without the need to step outside of the familiar Python environment.

Out of the box, Molecule supports development of Ansible roles using either a Docker or VirtualBox infrastructure provider. Molecule also leverages the Ansible drivers for private and public cloud platforms.

Molecule can be configured to test an individual role or collections of roles in Ansible playbooks.

This tutorial demonstrates how to use Molecule with Azure to develop and test an individual Ansible role following the red/green/refactor infracode workflow, which can be generalised as:

  • Red – write a failing infrastructure test
  • Green – write the Ansible tasks needed to pass the test
  • Refactor – repeat the process

The steps required for this tutorial are as follows:

Azure setup

Ensure there is an existing Azure Resource Group that will be used for infracode development and testing. Within the resource group, ensure there is a single virtual network (vnet) with a single subnet. Ansible will use these for the default network setup.

Set up a working environment

There are a number of options for setting up a Python environment for Ansible and Molecule, including Python virtualenv or a Docker container environment.

Create a Docker image for Ansible+Molecule+Azure

This tutorial uses a Docker container environment. A Dockerfile for the image can be found in ./molecule-azure-image/Dockerfile. The image sets up a sane Python3 environment with Ansible, Ansible[azure], and Molecule pip modules installed.

Create a Docker workspace

Set up a working environment using the Docker image with Ansible, Molecule, and the azure-cli installed.

This example assumes the following:

  • a resource group already exists with access rights to create virtual machines; and
  • the resource group contains a single vnet with a single subnet

Log into an Azure subscription

Ansible supports a number of different methods for authenticating with Azure. This example uses the azure-cli to log in interactively.

Create an empty Ansible role with Molecule

Molecule provides an init function with defaults for various providers. The molecule-azure-role-template creates an empty role with scaffolding for Azure.

Check that the environment is working by running the following code:

The output should look similar to…

Spin up an Azure VM

Spin up a fresh VM to be used for infra-code development.

Molecule provides a handy option for logging into the new VM:

There is now a fresh Ubuntu 18.04 virtual machine ready for infra-code development. For this example, a basic Nginx server will be installed and verified.

Write a failing test

Testinfra provides a pytest based framework for verifying server and infrastructure configuration. Molecule then manages the execution of those testinfra tests. The Molecule template provides a starting point for crafting tests of your own. For this tutorial, installation of the nginx service is verified. Modify the tests file using vi molecule/default/tests/test_default.py

Execute the failing test

The Ansible task needed to install and enable nginx has not yet been written, so the test should fail:

If the initial sample tests in test_default.py are kept, then 3 tests should fail and 2 tests should pass.

Write a task to install nginx

Add a task to install the nginx service using vi tasks/main.yml:

Apply the role

Apply the role to the instance created using Molecule.

The nginx package should now be installed, both enabled and started, and listening on port 80. Note that the nginx instance will not be accessible from the Internet due to the Azure network security rules. The nginx instance can be confirmed manually by logging into the instance and using curl to make a request to the nginx service.

Execute the passing test

After applying the Ansible task to the instance, the testinfra tests should now pass.

Cleanup

Now that the Ansible role works as defined in the test specification, the development environment can be cleaned up.

Molecule removes the Azure resources created to develop and test the configuration role. Note that deletion may take a few minutes.

Finally, once you are done, exit the container environment. If the container was started with the --rm switch, the container will also be removed, leaving you with a clean workspace and newly minted Ansible role with automated test cases.

Full source code can be found at: https://github.com/gamma-data/json-wrangling-with-golang

When Life Gives you Undrinkable Wine – Make Cognac…

Firstly, this is not another motivational talk or motherhood statement (there are plenty of those out there already), but a metaphor about resourcefulness.

We have all heard the adage “when life gives you lemons, make lemonade”. Allow me to present a slight twist (pardon the pun…) on this statement.

Cognac is a variety of brandy named after the town of Cognac, France. It is distilled from a white wine grown from grapes in the area. The wine used to make Cognac is characterised as “virtually undrinkable”. However, in its distilled form, Cognac is considered to be the world’s most refined spirit, with bottles of high end Cognac fetching as much as $200,000.

The area surrounding the town of Cognac was a recognised wine-growing area dating back to the third century, however when it was evident that the grapes produced in the town of Cognac itself were unsuitable for wine making, local producers developed the practice of double distillation in copper pot stills and ageing in oak barrels for at least two years. The product yielded was the spirit we now know as Cognac.

Fast forward 15 centuries to Scotland in the 1800s, where John Walker was a humble shopkeeper. Local whisky producers would bring John different varieties of single malt whisky, many of them below an acceptable standard. Instead of on-selling these varieties, he began blending them into made-to-order whiskies to meet specific customer requirements. From this idea came a product, and subsequently one of the most globally recognised brands, which has lasted for over 200 years.

Interesting, but what does this have to do with technology you might ask?

Ingenuity and resourcefulness have existed for as long as man has, but in the realm of technology, and in particular cloud computing and open source software, they have never been more imperative. We are continually faced with challenges on each assignment or project: technology often does not work as planned or as advertised, and we hit dead ends when trying to solve problems. This is particularly evident now as software and technology are moving at such an accelerated rate.

To be successful in this era you need creativity and lateral thinking. You need an ability not just to solve problems to which there are no documented solutions, but to create new products from existing ones which are not necessarily suitable for your specific objective, in many cases looking to add value in the process. That is, not just making lemonade from lemons (which anyone could think of), but making Cognac from sub-standard grapes or premium blended whisky from sub-standard single malt whiskies.

One of my favourite Amazon Leadership Principles is “Invent and Simplify”:

Leaders expect and require innovation and invention from their teams and always find ways to simplify. They are externally aware, look for new ideas from everywhere, and are not limited by “not invented here”. As we do new things, we accept that we may be misunderstood for long periods of time.

https://www.amazon.jobs/en/principles

The intent in this statement is not simply to “lift and shift” current on-premises systems and processes to the cloud, but to look for ways to streamline, rationalise and simplify processes in doing so. This mandates a combination of practicality and creativity.

The takeaway is this: when you are presented with a challenge, be creative and inventive, and don’t just solve the problem but look for ways to add value in the process.

Cheers!