Technical Deep Dive, Educational · Maksym Lushpenko

How We Reduced Our Google Cloud Bill by 65%

Learn how we reduced our Google Cloud costs by 65% using Kubernetes optimizations, workload consolidation, and smarter logging strategies. Perfect for startups aiming to extend their runway and save money.

Google Cloud Cost Reduction

Introduction

Whether you are running a startup or working at a big corporation, keeping infrastructure costs under control is always good practice. But it’s especially important for startups that want to extend their runway, and that was our goal.

We just got a bill from Google Cloud for the month of November and are happy to see that we reduced our costs by ~65%, from $687/month to $247/month.

Most of our infrastructure runs on Google Kubernetes Engine (GKE), so most of the savings tips are related to that. This is a story about optimizing at a small scale, but most of the techniques can be applied to big-scale setups as well.

TLDR

Here’s what we did, sorted from the biggest savings to the smallest:

  • Almost got rid of stable on-demand instances by moving part of the setup to spot instances and reducing the amount of time stable nodes have to be running to the bare minimum.

  • Consolidated dev and prod environments

  • Optimized logging

  • Optimized workload scheduling

Some of these steps are interrelated, but each has its own impact on your cloud bill. Let’s dive in.

Stable Instances

The biggest impact on our cloud costs was running stable servers. We needed them for several purposes:

  • some services didn’t have a highly available (HA) setup (multiple instances of the same service)

  • some of our skills assessments are running inside a single Kubernetes pod and we can’t allow pod restarts or the progress of the test will be lost

  • we weren’t sure if all of our backend services could handle a shutdown gracefully in case of a node restart

For services that didn’t have an HA setup, we had three options: build an HA setup where possible (this often requires installing additional infrastructure components, especially for stateful applications, which in turn drives infrastructure costs up); migrate the service to a managed solution (e.g. offload the Postgres setup to Google Cloud instead of managing it ourselves); or accept that the service may be down for 1-2 minutes a day if it’s not critical for the user experience.

For instance, we are running a small Postgres instance on Google Cloud and the load on this instance is very small. So, when some other backend component needs Postgres, we create a new database on the same instance instead of spinning up another instance on Google Cloud or running a Postgres pod on our Kubernetes cluster.

I know this approach is not for everyone, but it works for us because all of these Postgres databases have a very light load. And remember, it’s not only about cost savings: this also means we don’t have to think about node restarts or basic database management.

At the same time, we run a single instance of Grafana (our monitoring tool). It’s not a big deal if it goes down during a node restart, as it’s an internal tool and we can wait a few minutes for it to come back to life if we need to check some dashboards. We take a similar approach with the ArgoCD server that handles our deployments: it doesn’t have to be up all the time.

High Availability Setup

Here’s what we did to make our services on Kubernetes highly available so we could get rid of stable nodes; this can be applied to the majority of services:

  • created multiple replicas of our services (at least 2), so if one pod goes down, another one can serve traffic

  • configured pod anti-affinity based on the node name, so our service replicas are always running on different nodes:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app.kubernetes.io/name
          operator: In
          values:
          - pgbouncer
      topologyKey: kubernetes.io/hostname
  • added a PodDisruptionBudget with a minimum of 1 available pod (for services with 2 replicas). This doesn’t guarantee protection, but as we have automated node upgrades enabled, it can prevent GKE from killing our nodes when we don’t have a spare replica ready (see the sketch after this list)

  • reviewed terminationGracePeriodSeconds settings for each service to make sure applications have enough time to shut down properly

  • updated code in some apps to make sure they can handle an unexpected shutdown. This is a separate topic, but you need to make sure no critical data is lost and you can recover from whatever happens during node shutdown

  • moved these services to spot instances (the main cost-savings step; the other steps were just needed for reliable service operation)
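
For reference, here is roughly what the PodDisruptionBudget mentioned above looks like for a two-replica service. The pgbouncer name is just reused from the anti-affinity example; adjust the name and selector to your own labels:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pgbouncer
spec:
  minAvailable: 1   # keep at least one replica up during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: pgbouncer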

Experienced Kubernetes engineers can suggest a few more improvements, but this is enough for us right now.

Temporary Stable Instances

Now we come to the part about our skills assessments that need stable nodes. We can’t easily circumvent this requirement yet, but we have some ideas for the future.

We decided to try node auto-provisioning on GKE. Instead of having always available stable servers, we would dynamically create node pools with specific characteristics to run our skills assessments.

This comes with certain drawbacks - candidates who start our skills assessments have to wait an extra minute while the server is being provisioned compared to the past setup where stable servers were just waiting for Kubernetes pods to start. It’s not ideal, but considering it saves us a lot of money, it’s acceptable.

As we want to make sure no other workloads are running on those stable nodes, we use node taints and tolerations for our tests. Here’s what we add to our deployment spec:

nodeSelector:
  type: stable
tolerations:
  - effect: NoSchedule
    key: type
    operator: Equal
    value: stable

We also add resource requests (and limits, where needed), so auto-provisioning can select the right-sized node pool for our workloads. So, when there is a pending pod, auto-provisioning creates a new node pool of a specific size with the correct labels and taints:

GKE Node Taints and Labels from node auto-provisioning

Our skills assessments run for a maximum of 3 hours at a time and are then automatically removed, which allows the Kubernetes autoscaler to scale down our nodes.

There are a few more important things to mention. You need to actively manage resources for your workloads, or pods may get evicted by Kubernetes (kicked off the node because they are using more resources than they should).

In our case, we go through each skills assessment we develop and take note of its resource usage to define how much we need. If this were an always-on type of workload, we could have deployed the Vertical Pod Autoscaler, which provides automatic recommendations for how many resources you need based on resource usage metrics.
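
As an illustration, a test deployment’s container spec ends up carrying requests along these lines. The numbers below are placeholders; the real values come from the usage we observe for each assessment:

resources:
  requests:
    cpu: 500m        # observed typical CPU usage plus some headroom
    memory: 1Gi
  limits:
    memory: 2Gi      # hard cap so a runaway test cannot starve the node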

Another important point is that the autoscaler can sometimes kick in and remove a node if its usage is quite low, so we had to add the following annotation to our deployments to make sure we don’t get accidental pod restarts:

spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

All of this allows us to have temporary stable nodes for our workloads. We use a backend service to remove deployments after 3 hours at most, but GKE auto-provisioning has its own mechanism where you can define how long these nodes can stay alive.

Optimizations

While testing this setup, we noticed that auto-provisioning was not perfect - it was choosing nodes that were a little too big for our liking.

Another problem, as expected, is that creating a new node pool for every new workload takes some extra time, e.g. it takes 1m53s for a pending pod to start on an existing node pool vs 2m11s on a new one.

So, here’s what we did to save a bit more money:

  • pre-created node pools of multiple sizes with 0 nodes by default and autoscaling enabled. All of these have the same labels and taints, so the autoscaler chooses the most suitable one. This saves us a bit of money vs node auto-provisioning

  • chose older instance types, e.g. the N1 family vs the newer but slightly more expensive N2. This saved some more money

Plus, we got faster test provisioning since the node pools are already created, and we still have auto-provisioning as a backup option in case we forget to create a new node pool for future tests.

The last thing I wanted to mention here: we were considering 1-node-per-test semantics for resource-hungry tests, e.g. ReactJS environments. This can be achieved with additional labels and pod anti-affinity, as discussed previously. We might add this on a case-by-case basis.
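
If we go that route, the manifest would look much like the earlier anti-affinity block, keyed on a label that marks resource-hungry tests. The assessment-type label and reactjs value below are hypothetical, just to illustrate the idea:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: assessment-type   # hypothetical label applied to heavy test pods
          operator: In
          values:
          - reactjs
      topologyKey: kubernetes.io/hostname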

Consolidated Dev and Prod

We have a relatively simple setup for a small team: dev and prod. Each environment consists of a GKE cluster and a Postgres database (and some other things not related to cost savings).

I went to a Kubernetes meetup in San Francisco in September and discovered a cool tool called vcluster. It allows you to create virtual Kubernetes clusters within the same Kubernetes cluster, so developers can get access to fully isolated Kubernetes clusters and install whatever they want inside without messing up the main cluster.

They have nice documentation, so I will just share how it impacted our cost savings. We moved from a separate GKE cluster in another project for our dev environment to a virtual cluster inside our prod GKE cluster. What that means:

  • We got rid of a full GKE cluster. Even without counting the actual nodes, Google recently started charging a management fee per cluster.

  • We can share nodes between dev and prod clusters. Even empty nodes require around 0.5 CPU and 0.5 GB RAM to operate, so the fewer nodes, the better.

  • We save money on shared infrastructure, e.g. we don’t need two Grafana instances, Prometheus Operators, etc., because it is the same “physical” infrastructure and we can monitor it as one. The isolation between virtual clusters happens at the namespace level, with some smart renaming mechanics.

  • We save money by not paying for extra load balancers. Vcluster allows you to share ingress controllers (and other resources you’d like to share) between clusters, in a kind of parent-child relationship (see the sketch after this list).

  • We don’t need another cloud database: we moved our dev database to the prod database instance. You don’t have to do this step, but our goal was aggressive cost savings.
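
As a rough sketch of the ingress sharing: vcluster can sync Ingress objects from the virtual cluster to the host cluster, so the host’s ingress controller (and its single load balancer) serves both environments. The exact Helm values depend on the vcluster version, so treat the snippet below as an assumption based on the 0.x chart rather than our exact config:

sync:
  ingresses:
    enabled: true   # sync virtual-cluster Ingresses to the host cluster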

We had some struggles with the Identity and Access Management (IAM) setup during this migration, as some functionality required a vcluster subscription, but we found a workaround.

We understand that there are certain risks with such a setup, but we are small-scale for now and we can always address isolation and availability concerns as we grow.

Cloud Logging

I was reviewing our billing last month and noticed something strange - daily charges for Cloud Logging, even though I couldn’t remember enabling anything special like the Managed Prometheus service.

Google Cloud Logging Billing

I got worried, as this would mean spending almost $100/month on I-don’t-know-what. I was also baffled as to why it started in the middle of the month; I thought maybe one of the developers had enabled something and forgotten about it.

After some investigation, I found what it was:

Google Cloud Logging Volume

GKE control plane components were generating 100GB of logs every month. The reason I only saw charges starting in the middle of the month is that there is a free tier of 50GB: for the first two weeks there weren’t any charges, and once you crossed the threshold, it started showing up in billing.

We already had a somewhat optimized setup, having disabled logging for user workloads:

GKE Cloud Logging Setup

We want to have control plane logs in case there are some issues, but this was way too much. I started investigating deeper and found that the vast majority of logs are info-level logs from the API Server. Those are often very basic and don’t help much with troubleshooting.

To solve this, we added an exclusion rule to the _Default Log Router Sink to exclude info logs from the API server:

Log Router Sink Exclusion Filter
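
The exclusion filter itself is a short Logging query. GKE control plane logs carry the component name as a resource label, so something along these lines matches the info-level API server entries (treat the exact field names as an assumption; the filter we actually applied is shown in the screenshot above):

resource.type="k8s_control_plane_component"
resource.labels.component_name="apiserver"
severity=INFO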

As you can see in one of the previous images, log generation flattened out after applying this filter, and we now have GKE logging under control. I’ve also added a budget alert specifically for Cloud Logging to catch this kind of thing earlier in the future.

Conclusion & Next Steps

I wanted to see how much we could achieve without relying on any committed-use discounts or reserved instances, as those approaches still cost money and carry extra risk, depending on whether you buy 1- or 3-year commitments. Now that we have reduced our costs substantially, we can consider applying committed-use discounts, as those will be pretty low risk at this level of spend.

I hope this gives you a few fresh ideas on how to optimize your own infrastructure, as most of these decisions can be applied to all major cloud providers.
