Sick of hearing about Service Mesh? Here’s what you need to know…

So you’ve started delivering a new project and it’s all about this “Cloud Native” or “Microservices” thing. You’re a Delivery Manager or Software Engineer at some type of company and someone has lightly peppered a meeting with a term, ‘Mesh’.

They possibly said event mesh. Or better yet, they mentioned a service mesh. As time went on you kept hearing more and more about the service mesh. You’ve attempted to read up about it, digested a whole bunch of new terms and still didn’t completely understand what the Mesh even does, why you would need it or why the hype train around this technology shows no sign of stopping. This article is an attempt to provide a focused guide to the service mesh, and why it is so interesting.

Ok, so what is this thing?

Truth be told, the service mesh is actually pretty simple. It’s built around the idea of small, repeatable bits of software, in this case userspace proxies, stuck very close to your services. This is called the data plane. In addition to the userspace proxies, you also get a bunch of management processes, which is referred to as the control plane. Simply put, the data plane (userspace proxies) intercepts all calls between services and the control plane (management processes) coordinates the wholesale behaviour of those proxies. This allows you to perform sweeping changes across your service landscape via the control planes API’s, operators and provides the capability to measure your mesh as a whole.

Before we get into the engineering of what the proxies are, let’s go with an example.

  • The business has bought some software.
  • The engineers are tasked with deploying this software in their Kubernetes cluster.
  • The engineers first task is to containerise this application, expose its functionality to downstream applications and deploy it to the cluster in a repeatable, continuous fashion.
  • There’s a requirement in your organisation that says ‘I need all communications to this vendors software as TLS1.3’. Or, ‘I would like to measure all API latency from this application’.

The engineer replies ‘I can’t make changes to a third party application! What do I do?’. Service mesh to the rescue.

Using a service mesh, you can deploy a proxy right next to your vendor container and in effect, abstract away the complexities of measurement and data transport mechanisms, and allow the vendor software to concentrate on it’s business logic.

This vendor container is now part of the service mesh.

Proxies

When we talk about proxies, we usually discuss things in OSI model terminology, that is to say Layers 1 through 7. Most of the time when it comes to proxies, you’re comparing Layer 4 to Layer 7. Here’s a quick run-down:

Layer 4 (L4) -> operates with the delivery of messages with no regard to the content of the messages. They would simply forward network packets to and from the server without inspecting any part of the packets.

Layer 7 (L7) -> this is a higher level, application layer. This deals with the actual content of the message. If you were routing network traffic, you could do this at L7 in a much more sophisticated way because you can now make decisions based on the packets messages within.

Why pick between L4 and L7? Speed.

Back to the service mesh, these userspace proxies are L7-aware TCP proxies. Think NGINX or haproxy. There are different proxies; Linkerd is an ultralight service mesh for Kubernetes. The most popular is Envoy, which was created by the ride-share company Lyft. Above, I also mentioned NGINX and haproxy which are also quite popular. So what differentiates NGINX proxies from the service mesh? Their focus. You would implement NGINX as an Ingress proxy (traffic entering your network), but when it comes to proxies that focus on traffic between services, that’s when the service mesh proxy comes in to play.

Ok, probably time for a diagram now that we’ve explained the Data Plane.

Tune in for part 2 for when we discuss the Control Plane!

The Ultimate AWS to GCP Thesaurus

There are many posts available which map analogous services between the different cloud providers, but this post attempts to go a step further and map additional concepts, terms, and configuration options to be the definitive thesaurus for cloud practitioners familiar with AWS looking to fast track their familiarisation with GCP.

It should be noted that AWS and GCP are fundamentally different platforms, nowhere is this more apparent than in the way networking is implemented between the two providers, see:

This post is focused on the core infrastructure, networking and security services offered by the two major cloud providers, I will do a future post on higher level services such as the ML/AI offerings from the respective providers.

Furthermore this will be a living post which I will continue to update, I encourage comments from readers on additional mappings which I will incorporate into the post as well.

I have broken this down into sections based upon the layout of the AWS Console.

Compute

EC2 (Elastic Compute Cloud)GCE (Google Compute Engine)
Availability ZoneZone
InstanceVM Instance
Instance FamilyMachine Family
Instance TypeMachine Type
Amazon Machine Image (AMI)Image
IAM Role (for an EC2 Instance)Service Account
Security GroupsVPC Firewall Rules (ALLOW)
TagLabel
Termination ProtectionDeletion Protection
Reserved InstancesCommitted Use Discounts
Capacity ReservationReservation
User DataStartup Script
Spot InstancesPreemptible VMs
Dedicated InstancesSole Tenancy
EBS VolumePersistent Disk
Auto Scaling GroupManaged Instance Group
Launch ConfigurationInstance Template
ELB ListenerURL Map (Load Balancer)
ELB Target GroupBackend/ Instance Group
Instance Storage (ephemeral)Local SSDs
EBS SnapshotsSnapshots
KeypairSSH Keys
Elastic IPExternal IP
LambdaGoogle Cloud Functions
Elastic BeanstalkGoogle App Engine
Elastic Container Registry (ECR)Google Container Registry (GCR)
Elastic Container Service (ECS)Google Kubernetes Engine (GKE)
Elastic Kubernetes Service (EKS)Google Kubernetes Engine (GKE)
AWS FargateCloud Run
AWS Service QuotasAllocation Quotas
Account (within an Organisation)†Project
RegionRegion
AWS Cloud​FormationCloud Deployment Manager

Storage

Simple Storage Service (S3)Google Cloud Storage (GCS)
Standard Storage ClassStandard Storage Class
Infrequent Access Storage ClassNearline Storage Class
Amazon GlacierColdline Storage Class
Lifecycle PolicyRetention Policy
TagsLabels
SnowballTransfer Appliance
Requester PaysRequester Pays
RegionLocation Type/Location
Object LockHold
Vault Lock (Glacier)Bucket Lock
Multi Part UploadParallel Composite Transfer
Cross-Origin Resource Sharing (CORS)Cross-Origin Resource Sharing (CORS)
Static Website HostingBucket Website Configuration
S3 Access PointsVPC Service Controls
Object NotificationsPub/Sub Notifications for Cloud Storage
Presigned URLSigned URL
Transfer AccelerationStorage Transfer Service
Elastic File System (EFS)Cloud Filestore
AWS DataSyncTransfer Service for on-premises data
ETagETag
BucketBucket
aws s3gsutil

Database

Relational Database Service (RDS)Cloud SQL
DynamoDBCloud Datastore
ElastiCacheCloud Memorystore
Table (DynamoDB)Kind (Cloud Datastore)
Item (DynamoDB)Entity (Cloud Datastore)
Partition Key (DynamoDB)Key (Cloud Datastore)
Attributes (DynamoDB)Properties (Cloud Datastore)
Local Secondary Index (DynamoDB)Composite Index (Cloud Datastore)
Elastic Map Reduce (EMR)Cloud DataProc
AthenaBig Query
AWS GlueCloud DataFlow
Glue CatalogData Catalog
Amazon Simple Notification Service (SNS)Cloud PubSub (push subscription)
Amazon KinesisCloud PubSub
Amazon Simple Queue Service (SQS)Cloud PubSub (poll and pull mode)

Networking & Content Delivery

Virtual Private Cloud (VPC) (Regional)VPC Network (Global or Regional)
Subnet (Zonal)Subnet (Regional)
Route TablesRoutes
Network ACLs (NACLS)VPC Firewall Rules (ALLOW or DENY)
CloudFrontCloud CDN
Route 53Cloud DNS/Google Domains
Direct ConnectDedicated (or Partner) Interconnect
Virtual Private Network (VPN)Cloud VPN
AWS PrivateLinkGoogle Private Access
NAT GatewayCloud NAT
Elastic Load BalancerLoad Balancer
AWS WAFCloud Armour
VPC Peering ConnectionVPC Network Peering
Amazon API GatewayApigee API Gateway
Amazon API GatewayCloud Endpoints

Security, Identity, & Compliance

Root AccountSuper Admin
IAM UserMember
IAM PolicyRole (Collection of Permissions)
IAM Policy AttachmentIAM Role Binding (or IAM Binding)
Key Management Service (KMS)Cloud KMS
CloudHSMCloud HSM
Amazon Inspector (agent based)Cloud Security Scanner (scan based)
AWS Security HubCloud Security Command Center (SCC)
Secrets ManagerSecret Manager
Amazon MacieCloud Data Loss Prevention (DLP)
AWS WAFCloud Armour
AWS ShieldCloud Armour

† No direct equivalent, this is the closest equivalent

Google Cloud Storage Object Notifications using Slack

This article describes the steps to integrate Slack with Google Cloud Functions to get notified about object events within a specified Google Cloud Storage bucket.

Google Cloud Storage Object Notifications using Slack

Events could include the creation of new objects, as well as delete, archive or metadata operations performed on a given bucket.

This pattern could be easily extended to other event sources supported by Cloud Functions including:

  • Cloud Pub/Sub messages
  • Cloud Firestore and Firebase events
  • Stackdriver log entries

More information can be found at https://cloud.google.com/functions/docs/concepts/events-triggers.

The prerequisite steps to configure Slack are provided here:

1. First you will need to create a Slack app (assuming you have already set up an account and a workspace). The following screenshots demonstrate this process:

Create a Slack app
Give the app a name and associate it with an existing Slack workspace

2. Next you need to Enable and Activate Incoming Webhooks to your app and add this to your workspace. The following screenshots demonstrate this process:

Enable Incoming Web Hooks for the app
Activate incoming webhooks
Add the webhook to your workspace

3. Next you need to specify a channel for notifications generated from object events.

Select a channel for the webhook

4. Now you need to copy the Webhook url provided, you will use this later in your Cloud Function.

Copy the webhook URL to the clipboard

Treat your webhook url as a secret, do not upload this to a public source code repository

Next you need to create your Cloud Function, this example uses Python but you can use an alternative runtime including Node.js or Go.

This example templates the source code using the Terraform template_file data source. The function source code is shown here:

Within your Terraform code you need to render your Cloud Function code substituting the slack_webhook_url for it’s value which you will supply as a Terraform variable. The rendered template file is then placed in a local directory along with a requirements.txt file and zipped up. The resulting Zip archive is uploaded to a specified bucket where it will be sourced to create the Cloud Function.

Now you need to create the Cloud Function, the following HCL snippet demonstrates this:

The event_trigger block in particular specifies which GCS bucket to watch and what events will trigger invocation of the function. Bucket events include:

  • google.storage.object.finalize (the creation of a new object)
  • google.storage.object.delete
  • google.storage.object.archive
  • google.storage.object.metadataUpdate

You could add additional logic to the Cloud Function code to look for specific object names or naming patterns, but keep in mind the function will fire upon every event matching the event_type and resource criteria.

To deploy the function, you would simply run:

terraform apply -var="slack_webhook_url=https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX"

Now once you upload a file named test-object.txt, voilà!:

Slack notification for a new object created

Full source code can be found at:
https://github.com/gamma-data/gcs-object-notifications-using-slack

Map Reduce is Dead, Long Live Map Reduce

Firstly, this is not another Hadoop obituary, there are enough of those out there already.

The generalized title of this article has been used as an expression to convey the idea that something old has been replaced by something new. In the case of the expression “the King is dead, long live the King” the inference is that although one monarch has passed, another monarch instantly succeeds him.

In the age of instant gratification and hype cycle driven ‘pump and dump’ investment we are very quick to discard technologies that don’t realise overzealous targets for sales or market share. In our continuous attempts to find the next big thing, we are quick to throw out the last big thing and everything associated with it.

The Reports of My Death Have Been Greatly Exaggerated

A classic example of this is the notion that Map Reduce is dead. Largely proliferated by the Hadoop obituaries which seem to be growing exponentially with each day.

A common e-myth is that Google invented the Map Reduce pattern, which is completely incorrect. In 2004, Google described a framework distributed systems implementation of the Map Reduce pattern in a white paper named “MapReduce: Simplified Data Processing on Large Clusters.” – this would inspire the first-generation processing framework (MapReduce) in the Hadoop project. But neither Google nor Yahoo! nor contributors to the Hadoop project (which include the pure play vendors) created the Map Reduce algorithm or processing pattern and neither shall any one of these have the rights to kill it.

The origins of the Map Reduce pattern can be traced all the way back to the early foundations of functional programming beginning with Lambda Calculus in the 1930s to LISP in the 1960s. Map Reduce is an integral pattern in all of today’s functional and distributed systems programming. You only need to look at the support for map() and reduce() operators in some of the most popular languages today including Python, JavaScript, Scala, and many more languages that support functional programming.

As far as distributed processing frameworks go, the Map Reduce pattern and its map() and reduce() methods are very prominent as higher order functions in APIs such as Spark, Kafka Streams, Apache Samza and Apache Flink to name a few.

While the initial Hadoop adaptation of Map Reduce has been supplanted by superior approaches, the Map Reduce processing pattern is far from dead.

On the fall of Hadoop…

There is so much hysteria around the fall of Hadoop, we need to be careful not to toss the baby out with the bath water. Hadoop served a significant role in bringing open source, distributed systems from search engine providers to academia all the way to the mainstream, and still serves an important purpose in many organizations data ecosystems today and will continue to do so for some time.

OK, it wasn’t the panacea to everything, but who said it was supposed to be? The Hadoop movement was hijacked by hysteria, hype, venture capital, over ambitious sales targets and financial engineering – this does not mean the technology was bad.

Hadoop spawned many significant related projects such as Spark, Kafka and Presto to name a few. These projects paved the way for cloud integration, which is now the dominant vector for data storage, processing, and analysis.

While the quest for world domination by the Hadoop pure play vendors may be over, the Hadoop movement (and the impact it has had on the enterprise data landscape) will live on.

Ansible Tower for Continuous Infrastructure

As infrastructure and teams scale, effective and robust configuration management requires growing beyond manual processes and local conventions. Fortunately, Ansible Tower (or the upstream Open Source project Ansible AWX) provides a perfect platform for configuration management at scale.

The Ansible Tower/AWX documentation and tutorials provide comprehensive information about the individual components.  However, assembling all the moving pieces into a whole working solution can involve some trial and error and reverse engineering in order to understand how the components relate to one another.  Ansible Tower, like the core Ansible solution, offers flexibility in how features assembled to support different typed of workflows. The types of workflows can include once-off initial configurations, ad-hoc system maintenance, or continuous convergence.

Continuous convergence, also referred to as desired state, regularly re-applies the defined configuration to infrastructure. This tends to ‘correct the drift’ often encountered when only applying the configuration on infrastructure setup. For example, a continuous convergence approach to configuration management could apply the desired configuration on a recurring schedule of every 30 minutes.  

Some continuous convergence workflow characteristics can include:

  • Idempotent Ansible roles. If there are no required configuration deviations, run will report 0 changes.
  • A source code repository per Ansible role, similar to the Ansible Galaxy approach,
  • A source code repository for Ansible playbooks that include the individual Ansible roles,
  • A host configured to provide one unique service function only,
  • An Ansible playbook defined for each unique service function that gets applied to the host,
  • Playbooks applied to each host on a repeating schedule.

One way to achieve a continuous convergence workflow combines the Ansible Tower components according to the following conceptual model.

The Workflow Components

Playbook and Role Source Code

Ansible roles contain the individual tasks, handlers, and content with a role responsible for the installation and configuration of a particular software service.

Ansible playbooks configure a host for a particular service function in the environment acting as a wrapper for the individual role based configurations.  All the roles expected to be applied to a host must be defined in the playbook.

Source Code Repositories

Role git repositories contain the versioned definition of a role, e.g. one git repository per individual role.  The roles are pulled into the playbooks using the git reference and tags, which pegs the role version used within a playbook.

Project git repositories group the individual playbooks into single collection, e.g. one git repository per set of playbooks.  As with roles, specific versions of project repositories are also identified by version tags. 

Ansible Tower Server

Two foundational concepts in Ansible Tower are projects and inventories. Projects provide access to playbooks and roles. Inventories provide the connection to “real” infrastructure.  Inventories and projects also provide authorisation scope for activities in Ansible Tower. For example, a given group can use the playbooks in Project X and apply jobs to hosts in Inventory Y.

Each Ansible Tower Project is backed by a project git repository.  Each repository contains the playbooks and included roles that can be applied by a given job.  The Project is the glue between the Ansible configuration tasks and the plays that apply the configuration.

Ansible Tower Inventories are sets of hosts grouped for administration, similar to inventory sets used when applying playbooks manually.  One option is to group hosts into Inventories by environment.  For example, the hosts for development may be in one Inventory while the hosts for production may be in another Inventory.  User authorisation controls are applied at the Inventory level.

Ansible Tower Inventory Groups define sub-sets of hosts within the larger Inventory.  These subsets can then be used to limit the scope of a playbook job.  One option is to group hosts within an Inventory by function.  For example, the hosts for web servers may be in one Inventory Group and the hosts for databases may be in another Inventory Group.  This enables one playbook to target one inventory group.  Inventory groups effectively provide metadata labels for hosts in the Inventory.

An Ansible Job Template determines the configuration to be applied to hosts.  The Job Template links a playbook from a project to an inventory.   The inventory scope can be optionally further limited by specifying inventory group limits.  A Job Template can be invoked either on an ad-hoc basis or via a recurring schedule.

Ansible Job Schedules define the time and frequency at which the configuration specified in the Job Template is applied.  Each Job Template can be associated with one or more Job Schedules.  A schedule supports either once-off execution, for example during a defined change window, or regularly recurring execution.  A job schedule that applies the desired state configuration with a frequency of 30 minutes provides an example of a job schedule used for a continuous convergence workflow.

“Real” Infrastructure

An Ansible Job Instance defines a single invocation of an Ansible Job Template, both for scheduled and ad-hoc invocations of the job template.  Outside of Ansible Tower, the Job Instance is the equivalent of executing the ansible-playbook command using an inventory file.

Host is the actual target infrastructure resources configured by the job instance, applying an ansible playbook of included roles.

A note on Ansible Variables

As with other features of Ansible and Ansible Tower, variables also offer flexibility in defining parameters and context when applying a configuration.  In addition to declaring and defining variables in roles and playbooks, variable definitions can also be defined in Ansible Tower job templates, inventory and inventory groups, and individual hosts.  Given the plethora of options for variable definition locations, without a set of conventions for managing variable values, debugging runtime issues with roles and playbooks can become difficult.  E.g. which value defined at which location was used when applying the role?

One example of variable definitions conventions could include:

  • Variables shall be given default values in the role, .e.g. in the ../defaults/main.yml file.
  • If the variable must have a ‘real’ value supplied when applying the playbook, the variable shall be defined with an obvious placeholder value which will fail if not overridden.
  • Variables shall be described in the role README.md documentation
  • Do not apply variables at the host inventory level as host inventory can be transient.
  • Variables that select specific capabilities within a role shall be defined at the Ansible Tower Inventory Group.  For example, a role contains the configuration definition for both master and work nodes.  The Inventory Group variables are used to indicate which hosts must have the master configuration and applied and which must have the worker configuration applied.
  • Variables that define the environment context for configuration shall be defined in the Ansible Tower Job Template.

Following these conventions, each of the possible variable definition options serves a particular purpose.  When an issue with variable definition does arise, the source is easily identified.

Managing Secrets in CICD Pipelines

Overview

With the adoption automation for deploying and managing application environments, protecting privileged accounts and credential secrets in a consistent, secure, and scalable manner becomes critical.  Secrets can include account usernames, account passwords and API tokens.  Good credentials management and secrets automation practices reduce the risk of secrets escaping into the wild and being used either intentionally (hacked) or unintentionally (accident).

  • Reduce the likelihood of passwords slipping into source code commits and getting pushed to code repositories, especially public repositories such as github.
  • Minimise the secrets exposure surface area by reducing the number of people who require knowledge of credentials.  With an automated credentials management process that number can reach zero.
  • Limit the useful life of a secret by employing short expiry times and small time-to-live (TTL) values.  Automation enables reliable low-effort secret re-issue and rotation.

Objectives

The following objectives have been considered in designing a secrets automation solution that can be integrated into an existing CICD environment.

  • Integrate into an existing CICD environment without requiring an “all or nothing” implementation.  Allow existing jobs to operate alongside jobs that have been converted to the new secrets automation solution.
  • A single design that can be applied across different toolchains and deployment models.  For example, deployment to a Kubernetes environment can use the same secrets management process as an application installation on a virtual machine.  Similarly, the design can be used with different CICD tools, such as GitLab-CI, Travis-CI, or other build and deploy automation tool.
  • Multi-cloud capable by limiting coupling to a specific hosting environment or cloud services provider.
  • The use of secrets (or not) can be decided at any point in time, without requiring changes to the CICD job definition, similar to the use of feature flags in applications.
  • Enable changes to secrets, either due to rotation or revocation, to be maintained from a central service point.  Avoid storing the same secret multiple times in different locations.
  • Secrets organised in predictable locations in a “rest-ish” fashion by treating secrets and credentials as attributes of entities.
  • Use environment variables as the standard interface between deployment orchestration and deployed application, following the 12 Factor App approach.

Solution

  • Secrets stored centrally in Hashicorp Vault.
  • CICD jobs retrieve secrets from Vault and configure the application deployment environment.
  • Deployed applications use the secrets supplied by CICD job to access backend services.
CICD Secrets with Vault

Storing Secrets

Use Vault by Hashicorp as a centralised secrets storage service.  The CICD service retrieves secrets information for integration and deployment jobs.  Vault provides a flexible set of features to support numerous different workflows and available as either Vault Open Source or Vault Enterprise.  The secrets management pattern described uses the Vault Open Source version.  The workflow described here can be explored using Vault in the unsecured development mode, however, a properly configured and managed Vault service is required for production use.

Vault supports a number of secrets backends and access workflow models.  This solution makes use of the Vault AppRole method, which is designed to support machine-to-machine automated workflows.  With the AppRole workflow model human access to secrets is minimised through the use of access controls and temporary credentials with short TTL’s.  Within Vault, secrets are organised using an entity centric “rest-ish” style approach ensuring a given secret for a given service is stored in a single predictable location.

The use of Vault satisfies several of the design objectives:

  • enables single point management of secrets. The secrets content is stored in a single location referenced at CICD job runtime.  On the next invocation, the CICD job retrieves the latest version of the secrets content.
  • enables storing secrets in predictable locations with file system directory style path location.  The “rest-ish” approach to organising secret locations enables storing a given secret only once.  Access policies provide the mechanism to limit CICD  visibility to only those secrets required for the CICD job.

Passing Secrets to Applications

Use environment variables to pass secrets from the CICD service to the application environment.  

There are existing utilities available for populating a child process environment with Vault sourced secrets, such as vaultenv or envconsul.  This approach works well for running an application service.  However, with CICD, often there are often sets of tasks that require access to secrets information as opposed to a single command.  Using the child environment approach would require wrapping each command in a CICD job step with the env utility.  This works against the objective of introducing a secrets automation solution into existing CICD jobs without requiring substantial refactoring.  Similarly, some CICD solutions such as Jenkins provide Vault integration plugins which pre-populate the environment with secrets content.  This meets the objective of minimal CICD job refactoring, but closely couples the solution to a particular CICD service stack, reducing portability.  

With a job script oriented CICD automation stack like GitLab-CI or Travis-CI, an alternative is to insert a job step at the beginning of a series of CICD tasks that will populated the required secret values into expected environment variables.  Subsequent tasks in the job can then execute without requiring refactoring.  The decision on whether to source a particular environment variable’s content directly from the CICD job setup or from the Vault secrets store can be made by adding an optional prefix to environment variables to be sourced from the Vault secrets store.  The prefixed instance of the environment variable contains the location or path to the required secret.  Secret locations are identified using the convention /<vault-secret-path>/<secret-key>

  • enables progressive implementation due to transparency of secret sourcing. Subsequent steps continue to rely on expected environment vars
  • enables use in any toolchain that supports use of environment variables to pass information to application environment. 
  • CICD job steps not tied to a specific secrets store. An alternative secrets storage service could be supported by only requiring modification of the secret getter utility.
  • control of whether to source application environment variables from the CICD job directly or from the secrets engine is managed at the CICD job setup level as opposed to requiring CICD job refactoring to switch the content source.
  • continues the 12 Factor App approach of using environment variables to pass context to application environments.

Example Workflow

An example workflow for a CICD job designed to use environment variables for configuring an application.

Assumptions

The following are available in the CICD environment.

  • A job script oriented CICD automation stack that executes job tasks as a series of shell commands, such as GitLab-CI or Jenkins Pipelines.
  • A secrets storage engine with a python API, such as Hashicorp Vault.
  • CICD execution environment includes the get-vault-secrets-by-approle utility script.

Workflow Steps

Add a Vault secret

Add a secret to Vault at the location secret/fake-app/users/fake-users with a key/value entry of password=fake-password

Add a Vault access policy

Add a Vault policy for the CICD job (or set of CICD jobs) that includes ‘read’ access to the secret.

# cicd-fake-app-policy 
path "secret/data/fake-app/users/fake-user" {
    capabilities = ["read"]
}

path "secret/metadata/fake-app/users/fake-user" {
    capabilities = ["list"]
}

Add a Vault appRole

Add a Vault appRole linked to the new policy.  This example specifies a new appRole with an secret-id TTL of 60 days and non-renewable access tokens with a TTL of 5 minutes.  The CICD job uses the access token to read secrets.

vault write auth/approle/role/fake-role \
    secret_id_ttl=1440h \
    token_ttl=5m \
    token_max_ttl=5m \
    policies=cicd-fake-app-policy

Read the Vault approle-id

Retrieve the approle-id of the new appRole taking note of the returned approle-id.

vault read auth/approle/role/fake-role

Add a Vault appRole secret-id

Add a secret-id for the appRole, taking note of the returned secret-id

vault write -f auth/approle/role/fake-role/secret-id

Add CICD Job Steps

In the CICD job definition insert job steps to retrieve secrets values a set variables in the job execution environment. These are the steps to add in a gitlab-ci.yml CICD job.

...
script:
- get-vault-secrets-by-approle > ${VAULT_VAR_FILE}
- source ${VAULT_VAR_FILE} && rm ${VAULT_VAR_FILE}
...

The helper script get-vault-secrets-by-approle could be executed and sourced in a single step, e.g. source $(get-vault-secrets-by-approle).  However, when executed in a single statement all script output is processed by the source command and script error messages don’t get printed and captured in the job logs.  Splitting the read and environment var sourcing into 2 steps aids in troubleshooting.

Add CICD job vars for Vault access

In the CICD job configuration add Vault access environment variables.

VAULT_ADDR=https://vault.example.com:8200
VAULT_ROLE_ID=db02de05-fa39-4855-059b-67221c5c2f63
VAULT_SECRET_ID=6a174c20-f6de-a53c-74d2-6018fcceff64
VAULT_VAR_FILE=/var/tmp/vault-vars.sh

Add CICD job vars for Vault secrets

In the CICD job configuration add environment variables for the items to be sourced from vault secrets.  The secret path follows the convention of <secret-mount-path>/<secret-path>/<secret-key>

V_FAKE_PASSWORD=secret/fake-app/users/fake-user/password

Remove CICD job vars

In the CICD job configuration remove the previously used FAKE_APP_PASSWORD variable.

Execute the CICD job

Kick off the CICD job.  Any CICD job configuration variables prefixed with “V_” results in the addition of a corresponding environment variable in the job execution environment with content sourced from Vault.

Full source code can be found at:

https://github.com/datwiz/cicd-secrets-in-vault

Change Data Capture at Scale using Spark

Change Data Capture (CDC) is one of the most challenging processing patterns to implement at scale. I personally have had several cracks at this using various different frameworks and approaches, the most recent of which was implemented using Spark – and I think I have finally found the best approach. Even though the code examples referenced use Spark, the pattern is language agnostic – the focus is on the approach not the specific implementation (as this could be applied to any framework or runtime).

The first challenge you are faced with, is to compare a very large dataset (representing the current state of an object) with another potentially very large dataset (representing new or incoming data). Ideally, you would like the process to be configuration driven and accommodate such things as composite primary keys, or operational columns which you would like to restrict from change detection. You may also want to implement a pattern to segregate sensitive attributes from non-sensitive attributes.

Overview

This pattern (and all my other recent attempts) is fundamentally based upon calculating a deterministic hash of the key and non-key attribute(s), and then using this hash as the basis for comparison. The difference between this pattern and my other attempts is in the distillation and reconstitution of data during the process, as well as breaking the pattern into discrete stages (designed to minimize the impact to other applications). This pattern can be used to process delta or full datasets.

A high-level flowchart representing the basic pattern is shown here:

CDC Flowchart

The Example

The example provided uses the Synthetic CDC Data Generator application, configuring an incoming set with 5 uuid columns acting as a composite key, and 10 random number columns acting as non key values. The initial days payload consists of 10,000 records, the subsequent days payload consists of another 10,000 records. From the initial dataset, a DELETE operation was performed at the source system for 20% of records, an UPDATE was performed on 40% of the records and the remaining 40% of records were unchanged. In this case the 20% of records that were deleted at the source, were replaced by new INSERT operations creating new keys.

After creating the synthesized day 1 and day 2 datasets, the files are processed as follows:

$ spark-submit cdc.py config.yaml data/day1 2019-06-18
$ spark-submit cdc.py config.yaml data/day2 2019-06-19

Where config.yaml is the configuration for the dataset, data/day1 and data/day2 represent the different data files, and 2019-06-18 and 2019-06-19 represent a business effective date.

The Results

You should see the following output from running the preceding commands for day 1 and day 2 respectively:

Day 1:

Day 2:

A summary analysis of the resultant dataset should show:

Pattern Details

Details about the pattern and its implementation follow.

Current and Historical Datasets

The output of each operation will yield a current dataset (that is the current stateful representation of a give object) and a historical dataset partition (capturing the net changes from the previous state in an appended partition).

This is useful, because often consumers will primarily query the latest state of an object. The change sets (or historical dataset partitions) can be used for more advanced analysis by sophisticated users.

Type 2 SCDs (sort of)

Two operational columns are added to each current and historical object:

  • OPERATION : Represents the last known operation to the record, valid values include :
    • I (INSERT)
    • U (UPDATE)
    • D (DELETE – hard DELETEs, applies to full datasets only)
    • X (Not supplied, applies to delta processing only)
    • N (No change)
  • EFF_START_DATE

Since data structures on most big data or cloud storage platforms are immutable, we only store the effective start date for each record, this is changed as needed with each coarse-grained operation on the current object. The effective end date is inferred by the presence of a new effective start date (or change in the EFF_START_DATE value for a given record).

The Configuration

I am using a YAML document to store the configuration for the pattern. Important attributes to include in your configuration are a list of keys and non keys and their datatype (this implementation does type casting as well). Other important attributes include the table names and file paths for the current and historical data structures.

The configuration is read at the beginning of a routine as an input along with the path of an incoming data file (a CSV file in this case) and a business effective date (which will be used as the EFF_START_DATE for new or updated records).

Processing is performed using the specified key and non key attributes and the output datasets (current and historical) are written to columnar storage files (parquet in this case). This is designed to make subsequent access and processing more efficient.

The Algorithm

I have broken the process into stages as follows:

Stage 1 – Type Cast and Hash Incoming Data

The first step is to create deterministic hashes of the configured key and non key values for incoming data. The hashes are calculated based upon a list of elements representing the key and non key values using the MD5 algorithm. The hashes for each record are then stored with the respective record. Furthermore, the fields are casted their target datatype as specified in the configuration. Both of these operations can be performed in a single pass of each row using a map() operation.

Importantly we only calculate hashes once upon arrival of new data, as the hashes are persisted for the life of the data – and the data structures are immutable – the hashes should never change or be invalidated.

Stage 2 – Determine INSERTs

We now compare Incoming Hashes with previously calculated hash values for the (previous day’s) current object. If no current object exists for the dataset, then it can be assumed this is a first run. In this case every record is considered as an INSERT with an EFF_START_DATE of the business effective date supplied.

If there is a current object, then the key and non key hash values (only the hash values) are read from the current object. These are then compared to the respective hashes of the incoming data (which should still be in memory).

Given the full outer join:

incoming_data(keyhash, nonkeyhash) 
FULL OUTER JOIN  
current_data(keyhash, nonkeyhash) 
ON keyhash

Keys which exist in the left entity which do not exist in the right entity must be the results of an INSERT operation.

Tag these records with an operation of I with an EFF_START_DATE of the business effective date, then rejoin only these records with their full attribute payload from the incoming dataset. Finally, write out these records to the current and historical partition in overwrite mode.

Stage 3 – Determine DELETEs or Missing Records

Referring the previous full outer join operation, keys which exist in the right entity (current object) which do not appear in the left entity (incoming data) will be the result of a (hard) DELETE operation if you are processing full snapshots, otherwise if you are processing deltas these would be missing records (possibly because there were no changes at the source).

Tag these records as D or X respectively with an EFF_START_DATE of the business effective date, rejoin these records with their full attribute payload from the current dataset, then write out these records to the current and historical partition in append mode.

Stage 4 – Determine UPDATEs or Unchanged Records

Again, referring to the previous full outer join, keys which exist in both the incoming and current datasets must be either the result of an UPDATE or they could be unchanged. To determine which case they fall under, compare the non key hashes. If the non key hashes differ, it must have been a result of an UPDATE operation at the source, otherwise the record would be unchanged.

Tag these records as U or N respectively with an EFF_START_DATE of the business effective date (in the case of an update – otherwise maintain the current EFF_START_DATE), rejoin these records with their full attribute payload from the incoming dataset, then write out these records to the current and historical partition in append mode.

Key Callouts

A summary of the key callouts from this pattern are:

  • Use the RDD API for iterative record operations (such as type casting and hashing)
  • Persist hashes with the records
  • Use Dataframes for JOIN operations
  • Only perform JOINs with the keyhash and nonkeyhash columns – this minimizes the amount of data shuffled across the network
  • Write output data in columnar (Parquet) format
  • Break the routine into stages, covering each operation, culminating with a saveAsParquet() action – this may seem expensive but for large datsets it is more efficient to break down DAGs for each operation
  • Use caching for objects which will be reused between actions

Metastore Integration

Although I did not include this in my example, you could easily integrate this pattern with a metastore (such as a Hive metastore or AWS Glue Catalog), by using table objects and ALTER TABLE statements to add historical partitions.

Further optimisations

If the incoming data is known to be relatively small (in the case of delta processing for instance), you could consider a broadcast join where the smaller incoming data is distributed to all of the different Executors hosting partitions from the current dataset.

Also you could add a key to the column config to configure a column to be nullable or not.

Happy CDCing!

Full source code for this article can be found at:

https://github.com/avensolutions/cdc-at-scale-using-spark

Synthetic CDC Data Generator

This is a simple routine to generate random data with a configurable number or records, key fields and non key fields to be used to create synthetic data for source change data capture (CDC) processing. The output includes an initial directory containing CSV files representing an initial data load, and an incremental directory containing CSV files representing incremental data.

Arguments (by position) include:

  • no_init_recs : the number of initial records to generate
  • no_incr_recs : the number of incremental records on the second run – should be >= no_init_recs
  • no_keys : number of key columns in the dataset – keys are generated as UUIDs
  • no_nonkeys : number of non-key columns in the dataset – nonkey values are generated as random numbers
  • pct_del : percentage of initial records deleted on the second run – between 0.0 and 1.0
  • pct_upd : percentage of initial records updated on the second run – between 0.0 and 1.0
  • pct_unchanged : percentage of records unchanged on the second run – between 0.0 and 1.0
  • initial_output : folder for initial output in CSV format
  • incremental_output : folder for incremental output in CSV format

NOTE : pct_del + pct_upd + pct_unchanged must equal 1.0

Example usage:

$ spark-submit synthetic-cdc-data-generator.py 100000 100000 2 3 0.2 0.4 0.4 data/day1 data/day2

Example output from the day1 run for the above configuration would look like this:

Note that this routine can be run subsequent times producing different key and non key values each time, as the keys are UUIDs and the values are random numbers.

We will use this application to generate random input data to demonstrate CDC using Spark in a subsequent post, see you soon!

Full source code can be found at:

https://github.com/avensolutions/synthetic-cdc-data-generator

Scalable, Secure Application Load Balancing with VPC Native GKE and Istio

At the time of this writing, GCP does not have a generally available non-public facing Layer 7 load balancer. While this is sure to change in the future, this article outlines a design pattern which has been proven to provide scalable and extensible application load balancing services for multiple applications running in Kubernetes pods on GKE.

When you create a service of type LoadBalancer in GKE, Kubernetes hooks into the provider (GCP in this case) on your behalf to create a Google Load Balancer, while this may be specified as INTERNAL, there are two issues:

Issue #1:

The GCP load balancer created for you is a Layer 4 TCP load balancer.

Issue #2:

The normal behaviour is for Google to enumerate all of the node pools in your GKE cluster and “automagically” create mapping GCE instance groups for each node pool for each zone the instances are deployed in. This means the entire surface area of your cluster is exposed to the external network – which may not be optimal for internal applications on a multi tenanted cluster.

The Solution:

Using Istio deployed on GKE along with the Istio Ingress Gateway along with an externally created load balancer, it is possible to get scalable HTTP load balancing along with all the normal ALB goodness (stickiness, path-based routing, host-based routing, health checks, TLS offload, etc.).

An abstract depiction of this architecture is shown here:

Istio Ingress Design Pattern for VPC Native GKE Clusters

This can be deployed with a combination of Terraform and kubectl. The steps to deploy at a high level are:

  1. Create a GKE cluster with at least two node pools: ingress-nodepool and service-nodepool. Ideally create these node pools as multi-zonal for availability. You could create additional node pools for your Egress Gateway or an operations-nodepool to host Istio, etc as well.
  2. Deploy Istio.
  3. Deploy the Istio Ingress Gateway service on the ingress-nodepool using Service type NodePort.
  4. Create an associated Certificate Gateway using server certificates and private keys for TLS offload.
  5. Create a service in the service-nodepool.
  6. Reserve an unallocated static IP address from the node network range.
  7. Create an internal TCP load balancer:
    1. Specify the frontend as the IP address reserved in step 6.
    2. Specify the backend as the managed instance groups created during the node pool creation for the ingress-nodepool (ingress-nodepool-ig-a, ingress-nodepool-ig-b, ingress-nodepool-ig-c).
    3. Specify ports 80 and 443.
  8. Create a GCP Firewall Rule to allow traffic from authorized sources (network tags or CIDR ranges) to a target of the ingress-nodepool network tag.
  9. Create a Cloud DNS A Record for your managed zone as *.namespace.zone pointing to the IP Address assigned to the load balancer frontend in step 7.1.
  10. Enable Health Checks through the GCP firewall to reach the ingress-nodepool network tag at a minimum – however there is no harm in allowing these to all node pools.

The service should then be resolvable and routable from authorized internal networks (peered private VPCs or internal networks connected via VPN or Dedicated Interconnect) as:

https://<service>.<namespace>.<zone>/<endpoint>

The advantages of this design pattern are…

  1. The Ingress Gateway provides fully functional application load balancing services.
  2. Istio provides service discovery and routing using names and namespaces.
  3. The Ingress Gateway service and ingress gateway node pool can be scaled as required to meet demand.
  4. The Ingress Gateway is multi zonal for greater availability

AWS Professional and Speciality Exam Tips

One you get beyond the Associate level AWS certification exams into the Professional or Speciality track exams the degree of difficulty rises significantly. As a veteran of the Certified Solutions Architect Professional and Big Data Specialty exams, I thought I would share my experiences which I believe are applicable to all the certification streams and tracks in the AWS certification program.

First off let me say that I am a self-professed certification addict, having sat more than thirty technical certification exams over my thirty plus year career in technology including certification and re-certification exams. I would put the AWS professional and specialty exams right up there in terms of their level of difficulty.

The AWS Professional and Specialty exams are specifically designed to be challenging. Although they have removed the pre-requisites for these exams (much to my dismay…), you really need to be prepared for these exams otherwise you are throwing your hard-earned money away.

There are very few – if any – “easy” questions. All of the questions are scenario based and require you to design a solution to meet multiple requirements. The question and/or the correct answer will invariably involve the use of multiple AWS services (not just one). You will be tested on your reading comprehension, time management and ability to cope under pressure as well as being tested on your knowledge of the AWS platform.

The following sections provide some general tips which will help you approach the exam and give you the best chance of success on the day. This is not a brain dump or a substitute for the hard work and dedication required to ensure success on your exam day.

Time Management

Needless to say, your ability to manage time is critical, on average you will have approximately 2-3 minutes to answer each question. Reading the questions and answers carefully may take up 1-2 minutes on its own. If the answer is not apparent to you, you are best to mark the question and come back to it at the end of the exam.

In many cases there may be subsequent questions and answer sets which jog your memory or help you deduce the correct answers to the questions you initial passed on. For instance, you may see references in future questions which put context around services you may not be completely familiar with, this may enable you to answer flagged questions with more confidence.

Of course, you must answer all questions before completing the exam, there are no points for incomplete or unattempted answers.

Recommended Approach to each Question

Most of the questions on the Professional or Specialty certification exams fall into one of three categories:

  • Short-ish question, multiple long detailed answer options
  • Long-ish scenario question, multiple relatively short answer options
  • Long-ish question with multiple relatively long, detailed answers

The latter scenario is thankfully less common. However, in all cases it is important to read the last sentence in the question first, this will provide indicators to help you read through the question in its entirety and all of the possible answers with a clear understanding of what is “really” being asked. For instance, the operative phrase may be “highly available” or “most cost effective”.

Try to eliminate answers based on what you know, for instance answers with erroneous instance families can be eliminated immediately. This will give you a much better statistical chance of success, even if you have to venture an educated guess in the end.

The Most Complicated Solution is Probably Not the Correct One

In many answer sets to questions on the Professional or Specialty exams you will see some ridiculously complicated solution approaches, these are most often incorrect answers. Although there may be enough loosely relevant terminology or services to appear reasonable.

Note the following statement direct from the AWS Certified Solutions Architect Professional Exam Blueprint:


“Distractors, or incorrect answers, are response options that an examinee with incomplete knowledge or skill would likely choose. However, they are generally plausible responses that fit in the content area defined by the test objective.”

AWS wants professionals who design and implement solutions which are simple, sustainable, highly available, scalable and cost effective. One of the key Amazon Leadership Principles is “Invent and Simplify”, simplify is often the operative word.

Don’t spend time on dumps or practice exams (other than those from AWS)

The question pools for AWS exams are enormous, the chances of you getting the same questions and answer sets as someone else are slim. Furthermore, non-AWS sources may not be trustworthy. There is no substitute to AWS white papers, how to’s, and real-life application of your learnings.

Don’t focus on Service Limits or Calculations

In my experiences with AWS exams, they are not overly concerned with service limits, default values, formulas (e.g. the formula to calculate required partitions for a DynamoDB table) or syntax – so don’t waste time remembering them. You should however understand the 7 layer OSI model and be able to read and interpret CIDR notation.

Mainly, however, they want you to understand how services work together in an AWS solution to achieve an outcome for a customer.

Some Final Words of Advice

Always do what you think AWS would want you to do! 

It is worthwhile having a quick look at the AWS Leadership Principles (I have already referenced one of these in this article) as these are applied religiously in every aspect of the AWS business.  In particular, you should pay specific attention to the principals around simplicity and frugality.

Good luck!