How We Reduced Our Google Cloud Bill by 65%
Learn how we reduced our Google Cloud costs by 65% using Kubernetes optimizations, workload consolidation, and smarter logging strategies. Perfect for startups aiming to extend their runway and save money.
Introduction
Whether you're running a startup or working at a big corporation, keeping infrastructure costs under control is always good practice. But it's especially important for startups looking to extend their runway. That was our goal.
We just got a bill from Google Cloud for the month of November and are happy to see that we reduced our costs by ~65%, from $687/month to $247/month.
Most of our infrastructure runs on Google Kubernetes Engine (GKE), so most of the savings tips are related to it. This is a story about optimizing at a small scale, but most of the techniques can be applied to large-scale setups as well.
TLDR
Here's what we did, sorted from the biggest impact to the smallest:
Almost got rid of stable on-demand instances by moving part of the setup to spot instances and reducing the time stable nodes have to run to the bare minimum.
Consolidated dev and prod environments
Optimized logging
Optimized workload scheduling
Some of these steps are interrelated, but each has its own impact on your cloud bill. Let's dive in.
Stable Instances
The biggest impact on our cloud costs was running stable servers. We needed them for several purposes:
some services didn’t have a highly available (HA) setup (multiple instances of the same service)
some of our skills assessments run inside a single Kubernetes pod, and we can't allow pod restarts or the progress of the test will be lost
we weren’t sure if all of our backend services could handle a shutdown gracefully in case of a node restart
For services that didn't have an HA setup, we had a few options:
explore an HA setup where possible (this often requires installing additional infrastructure components, especially for stateful applications, which in turn drives infrastructure costs up)
migrate the service to a managed solution (e.g. offload the Postgres setup to Google Cloud instead of managing it ourselves)
accept that a service may be down for 1-2 minutes a day if it's not critical for the user experience
For instance, we run a small Postgres instance on Google Cloud, and the load on it is very light. So when some other backend component needs Postgres, we create a new database on the same instance instead of spinning up another instance on Google Cloud or running a Postgres pod in our Kubernetes cluster.
I know this approach is not for everyone, but it works for us, as all of these Postgres databases have a very light load. And remember, it's not only about cost savings; it also frees us from thinking about node restarts or basic database management.
At the same time, we run a single instance of Grafana (our monitoring tool). It's not a big deal if it goes down during a node restart: it's an internal tool, and we can wait a few minutes for it to come back to life if we need to check some dashboards. We take a similar approach with the ArgoCD server that handles our deployments - it doesn't have to be up all the time.
High Availability Setup
Here's what we did to make our services on Kubernetes highly available so we could get rid of stable nodes; this can be applied to the majority of services:
created multiple replicas of our services (at least 2), so if one pod goes down, another one can serve traffic
configured pod anti-affinity based on the node name, so our service replicas are always running on different nodes:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - pgbouncer
        topologyKey: kubernetes.io/hostname
added a PodDisruptionBudget with a minimum of 1 available pod for services with 2 replicas (see the example manifest after this list). This doesn't guarantee protection, but as we have automated node upgrades enabled, it can prevent GKE from killing our nodes when we don't have a spare replica ready
reviewed terminationGracePeriodSeconds settings for each service to make sure applications have enough time to shut down properly
updated code in some apps to make sure they can handle being shut down unexpectedly. This is a separate topic, but you need to make sure no critical data is lost and that you can recover from whatever happens during a node shutdown
moved these services to spot instances (the main cost-saving step; the other steps were just needed for reliable service operation)
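For reference, a minimal PodDisruptionBudget for a two-replica service could look like this (reusing the pgbouncer labels from the anti-affinity example above):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pgbouncer
spec:
  minAvailable: 1   # keep at least one pod running during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: pgbouncer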
Experienced Kubernetes engineers can suggest a few more improvements, but this is enough for us right now.
Temporary Stable Instances
Now we come to the part about our skills assessments that need stable nodes. We can't easily circumvent this requirement yet, though we have some ideas for the future.
We decided to try node auto-provisioning on GKE. Instead of having always available stable servers, we would dynamically create node pools with specific characteristics to run our skills assessments.
This comes with a certain drawback: candidates who start our skills assessments have to wait an extra minute while the server is being provisioned, compared to the past setup where stable servers were already waiting for Kubernetes pods to start. It's not ideal, but considering how much money it saves us, it's acceptable.
As we want to make sure no other workloads are running on those stable nodes, we use node taints and tolerations for our tests. Here’s what we add to our deployment spec:
nodeSelector:
  type: stable
tolerations:
  - effect: NoSchedule
    key: type
    operator: Equal
    value: stable
We also add resource requests (and limits, where needed) so auto-provisioning can select the right-sized node pool for our workloads. So, when there is a pending pod, auto-provisioning creates a new node pool of the right size with the correct labels and taints.
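For example, a test container's spec might declare something like this (the numbers are illustrative, not our actual values):

resources:
  requests:
    cpu: "1"        # steers auto-provisioning toward a node with at least 1 vCPU free
    memory: 2Gi
  limits:
    memory: 2Gi     # caps memory so a runaway test can't destabilize the node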
Our skills assessments run for a maximum of 3 hours and are then automatically removed, which allows the Kubernetes autoscaler to scale our nodes down.
There are a few more important things to mention. You need to actively manage resources for your workloads, or pods may get evicted by Kubernetes (kicked off the node because they are using more resources than they should).
In our case, we go through each skills assessment we develop and take note of its resource usage to define how much we need. If this were an always-on type of workload, we could have deployed the Vertical Pod Autoscaler, which can provide automatic recommendations for how many resources you need based on resource usage metrics.
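For reference, a recommendation-only VPA manifest looks roughly like this (the target deployment name is hypothetical):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: assessment-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: assessment-backend   # hypothetical deployment name
  updatePolicy:
    updateMode: "Off"          # only produce recommendations, never evict pods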
Another important point: sometimes the autoscaler can kick in and remove a node if its usage is quite low, so we had to add the following annotation to our deployments to make sure we don't get accidental pod restarts:
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
All of this allows us to have temporary stable nodes for our workloads. We use a backend service to remove deployments after a maximum of 3 hours, but GKE auto-provisioning has its own mechanism for defining how long these nodes can stay alive.
Optimizations
While testing this setup, we noticed that auto-provisioning was not perfect - it was choosing nodes a little too big for our liking.
Another problem, as expected: creating a new node pool for every new workload takes some extra time, e.g. it takes 1m53s for a pending pod to start on an existing node pool vs 2m11s on a new one.
So, here’s what we did to save a bit more money:
pre-created node pools of multiple sizes with 0 nodes by default and autoscaling enabled. All of these have the same labels and taints, so the autoscaler chooses the most optimal one. This saves us a bit of money vs node auto-provisioning
chose older instance types, e.g. the N1 family vs the newer but slightly more expensive N2, which saved some more money
Plus, we got faster test provisioning, as the node pools are already created, and we still have auto-provisioning as a backup option in case we forget to create a new node pool for future tests.
The last thing I wanted to mention here: we were considering 1-node-per-test semantics for resource-hungry tests, e.g. ReactJS environments. This can be achieved with additional labels and pod anti-affinity, as discussed previously and sketched below. We might add this on a case-by-case basis.
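Here's what that could look like, assuming every test pod carries a shared label such as brokee.io/workload: assessment (the label name is hypothetical):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            brokee.io/workload: assessment   # hypothetical label carried by every test pod
        topologyKey: kubernetes.io/hostname  # no two test pods may share a node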
Consolidated Dev and Prod
We have a relatively simple setup for a small team: dev and prod. Each environment consists of a GKE cluster and a Postgres database (and some other things not related to cost savings).
I went to a Kubernetes meetup in San Francisco in September and discovered a cool tool called vcluster. It allows you to create virtual Kubernetes clusters within the same Kubernetes cluster, so developers can get access to fully isolated Kubernetes clusters and install whatever they want inside without messing up the main cluster.
They have nice documentation, so I will just share how it impacted our cost savings. We moved our dev environment from a separate GKE cluster in another project to a virtual cluster inside our prod GKE cluster. Here's what that means:
We got rid of a full GKE cluster. Even without taking the actual nodes into account, Google recently started charging a fee for cluster management.
We can share nodes between dev and prod clusters. Even empty nodes require around 0.5 CPU and 0.5 GB RAM to operate, so the fewer nodes, the better.
We save money on shared infrastructure: we don't need two Grafana instances, two Prometheus Operators, etc., because it's the same “physical” infrastructure and we can monitor it together. Isolation between virtual clusters happens at the namespace level, with some smart renaming mechanics.
We save money by avoiding extra load balancers. Vcluster allows you to share ingress controllers (and other resources you'd like to share) between clusters in a kind of parent-child relationship (see the sketch after this list).
We don't need another cloud database; we moved our dev database to the prod database instance. You don't have to take this step, but our goal was aggressive cost savings.
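As an example, in older vcluster Helm values, sharing the host's ingress controller is toggled with a sync flag roughly like this (the exact key varies between vcluster versions, so treat this as a sketch and check their docs):

sync:
  ingresses:
    enabled: true   # sync ingresses to the host cluster so its ingress controller serves them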
We had some struggles with the Identity and Access Management (IAM) setup during this migration, as some functionality required a vcluster subscription, but we found a workaround.
We understand that there are certain risks with such a setup, but we are small-scale for now and we can always improve isolation and availability concerns as we grow.
Cloud Logging
I was reviewing our billing last month and noticed something strange: daily charges for Cloud Logging, even though I couldn't remember enabling anything special like the Managed Prometheus service.
I got worried, as this would mean spending almost $100/month on I-don't-know-what. I was also baffled that it started in the middle of the month; I thought maybe one of the developers had enabled something and forgotten about it.
After some investigation, I found what it was:
GKE control plane components were generating 100 GB of logs every month. The reason I saw charges starting in the middle of the month is the free tier of 50 GB: for the first two weeks there are no charges, and once you cross the threshold, it shows up in billing.
We already had a somewhat optimized setup, as we had disabled log collection for user workloads.
We want to have control plane logs in case there are issues, but this was way too much. I started investigating deeper and found that the vast majority of logs were info-level logs from the API server. Those are often very basic and don't help much with troubleshooting.
To solve this, we added an exclusion rule to the _Default Log Router Sink to drop info logs from the API server.
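The rule itself is a Cloud Logging exclusion filter. Assuming GKE's standard control-plane log labels, it looks something like this:

resource.type="k8s_control_plane_component"
resource.labels.component_name="apiserver"
severity=INFO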
After applying this filter, log generation flattened out, and we now have GKE logging under control. I've also added a budget alert specifically for Cloud Logging to catch issues like this earlier in the future.
Conclusion & Next Steps
I wanted to see how much we could achieve without relying on committed-use discounts or reserved instances, as those still cost money and carry extra risk, depending on whether you buy 1- or 3-year commitments. Now that we've reduced our costs significantly, we can consider committed-use discounts, as they will be pretty low-risk at this level of spending.
I hope this will give you a few fresh ideas on how to optimize your own infrastructure as most of these decisions can be applied to all major cloud providers.
How Hands-On Lab Training Accelerates Your DevOps Learning Curve
In this article, we’ll explore why hands-on labs are so effective and how they can drastically improve your DevOps skills.
DevOps is a fast-paced, dynamic field where theoretical knowledge alone is rarely enough to succeed. To truly master the skills needed in this industry, hands-on experience is essential.
Hands-on lab training offers a practical, immersive way for DevOps engineers to accelerate their learning curve and become job-ready faster.
1. Real-World Problem Solving
Learning by Doing
In DevOps, engineers face complex, real-world challenges daily. Hands-on labs simulate these real-life tasks, such as configuring a Kubernetes cluster, troubleshooting cloud infrastructure, or setting up CI/CD pipelines. This experience allows engineers to actively solve problems rather than passively learn concepts.
Why It Matters
Theoretical knowledge can only take you so far. Working on actual infrastructure and handling real problems solidifies what you’ve learned, ensuring you can apply those skills when it matters most—on the job.
Example: Many engineers use Brokee’s hands-on labs to practice AWS, Azure, and DevOps tasks that mirror real job environments.
Whether you’re an entry-level engineer or preparing for a new role, Brokee’s labs provide practical experience that accelerates your job readiness.
2. Builds Confidence for Day-One Readiness
Hands-On = Confidence
Many engineers struggle with confidence during their first few months on the job because they’ve never had the chance to apply what they learned in real scenarios. Hands-on labs give engineers the opportunity to practice these skills repeatedly until they are fully confident in their abilities.
Why It Matters
Confidence in your DevOps skills from day one can drastically shorten onboarding time and increase your productivity early in your career.
Companies often prefer candidates who have hands-on experience with the tools and technologies they use.
3. Mastering Tools and Platforms
Get Familiar with Industry-Standard Tools
Hands-on labs allow engineers to get comfortable using critical DevOps tools like Terraform, Ansible, Docker, Jenkins, and cloud platforms like AWS, Azure, and GCP.
Lab environments replicate real job tasks, so engineers can focus on mastering specific tools while understanding how they integrate into larger workflows.
Why It Matters
Becoming proficient with tools is crucial for DevOps roles. Hands-on labs provide the chance to not only learn new tools but to also understand how they function in complex environments.
Example: Engineers can practice setting up a continuous integration pipeline using Jenkins, deploy a containerized application with Kubernetes, or automate infrastructure with Terraform in a lab environment before applying these skills in production.
Read More: The Top DevOps Tools in 2024
4. Safe Environment to Make Mistakes
Learning Without the Pressure
One of the greatest advantages of hands-on lab training is the ability to make mistakes without real-world consequences.
In an actual job setting, errors can lead to downtime, security risks, or financial losses. In a lab, engineers can experiment, fail, and learn without the pressure of damaging live environments.
Why It Matters
The freedom to experiment helps engineers learn faster. They can try different approaches, discover what works, and learn from failures—all without impacting actual projects.
Read More: The Best DevOps Bootcamps in 2024
5. Speeds Up the Learning Curve
Accelerating Skill Development
Hands-on labs enable faster learning by giving engineers instant feedback. Instead of reading through documentation and theory, they can immediately see the results of their actions in the lab environment.
This kind of real-time feedback significantly speeds up the learning process, as engineers can adjust their approach on the fly.
Why It Matters
Learning by doing accelerates mastery of concepts and tools. Engineers gain a deep understanding of how different DevOps practices work together, which ultimately helps them become proficient more quickly than with theoretical learning alone.
6. Preparing for Certifications
Practical Experience for Exams
Certifications like AWS DevOps Engineer, Microsoft Azure DevOps, or Google Cloud Professional DevOps Engineer require not just theoretical knowledge, but also practical understanding. Hands-on labs prepare engineers for these exams by allowing them to practice the exact scenarios they’ll be tested on.
Why It Matters
While studying for certifications is important, real-world practice is what truly prepares you to pass the exams and apply the knowledge in the workplace. Hands-on labs give you the confidence and experience to tackle even the most challenging certification questions.
Example: In an AWS hands-on lab, engineers can set up auto-scaling groups, configure CloudWatch for monitoring, and use Lambda for automation—real-world tasks that they’ll likely face on the AWS DevOps Engineer certification exam.
Read More: AWS DevOps Interview Questions and Answers for 2024
7. Gaining Practical Job Experience
Simulate the Job Environment
Hands-on labs not only prepare engineers for exams but also simulate day-to-day job tasks.
These labs mirror the exact work you’ll do in a DevOps role, such as deploying cloud infrastructure, setting up monitoring systems, or configuring secure environments. The more practice you get, the more comfortable you’ll be when performing these tasks in a live environment.
Why It Matters
This kind of real-world experience is what hiring managers look for. By practicing in labs, engineers can demonstrate they are ready to step into a role without needing extensive on-the-job training.
Conclusion
Hands-on lab training is an invaluable tool for accelerating the DevOps learning curve.
Whether you're mastering tools, preparing for certifications, or gaining real-world job experience, these labs provide the perfect environment to learn by doing.
We currently offer 3 free labs for engineers (no credit card needed!), and after that, you can get access to our unlimited testing library for only $9 per month. Try Brokee risk-free today!
The practical experience gained from our labs will significantly boost your confidence, shorten onboarding time, and make you job-ready from day one.
Mastering Azure DevOps: Top Training Resources and Certifications to Kickstart Your Career
As businesses increasingly move to cloud-native solutions, mastering Azure DevOps has become essential for engineers aiming to boost their careers.
Whether you're starting your journey or looking to advance your skills, here’s a guide to the best Azure DevOps training resources and certifications that will help you stand out in this fast-growing field.
1. Microsoft Certified: DevOps Engineer Expert
What It Is
The Microsoft Certified: DevOps Engineer Expert certification is one of the most recognized credentials for Azure DevOps engineers. It validates your ability to combine people, processes, and technologies to deliver continuously improved products and services.
What You’ll Learn
How to design and implement DevOps processes
Using version control systems like Git
Implementing CI/CD pipelines
Managing infrastructure using Azure DevOps and tools like Terraform and Ansible
Why It’s Important
This certification proves you can create and implement strategies that improve software development lifecycles, a critical skill for Azure DevOps engineers.
Recommended Resources:
Microsoft Learning Path: Free modules on the official Microsoft site provide a structured learning path to pass the certification.
Whizlabs and Udemy Courses: These platforms offer in-depth preparation courses for this certification.
2. AZ-400: Designing and Implementing Microsoft DevOps Solutions
What It Is
AZ-400 is the exam required to earn the Microsoft Certified: DevOps Engineer Expert certification. It covers designing and implementing DevOps practices for infrastructure, CI/CD, security, and compliance.
What You’ll Learn
How to integrate source control and implement continuous integration
Strategies for automating deployments and scaling infrastructure
Monitoring cloud environments and managing incidents effectively
Why It’s Important
Passing this exam is crucial for anyone aiming to specialize in Azure DevOps. It showcases your ability to manage full lifecycle DevOps processes in Azure environments.
Recommended Resources:
Microsoft Learn: This free resource offers structured modules and practice tests.
Udemy: The AZ-400 Exam Preparation Course is a highly rated resource for detailed exam preparation.
3. LinkedIn Learning: Azure DevOps for Beginners
What It Is
This LinkedIn Learning course is an excellent introduction for beginners to Azure DevOps, covering the basics of using the platform for continuous delivery, infrastructure management, and monitoring.
What You’ll Learn
Setting up an Azure DevOps environment
Managing code repositories with Git
Implementing CI/CD pipelines using Azure Pipelines
Why It’s Important
If you’re new to DevOps or just getting started with Azure, this course provides a solid foundation for understanding the tools and practices needed to succeed.
Recommended Resources:
LinkedIn Learning Subscription: Offers access to this and thousands of other related courses.
4. Pluralsight: Azure DevOps Fundamentals
What It Is
Pluralsight offers an in-depth course that covers core Azure DevOps concepts, including project management, version control, and pipeline automation.
What You’ll Learn
How to manage Azure DevOps organizations, projects, and teams
Configuring CI/CD pipelines for automated builds and deployments
Automating infrastructure with Terraform and Azure Resource Manager
Why It’s Important
For those who already have a basic understanding of DevOps, this course dives deeper into Azure-specific functionalities, preparing you for hands-on work with Azure projects.
Recommended Resources:
Pluralsight Subscription: Provides unlimited access to this course and other DevOps-related content.
5. Azure DevOps Hands-On Labs
What It Is
Hands-on labs offer practical, real-world experience by simulating real tasks and challenges within Azure DevOps environments. Labs allow engineers to practice and test their knowledge in controlled scenarios that mirror actual job tasks.
Why It’s Important
Nothing beats hands-on experience when learning new tools. Labs allow engineers to practice and refine their skills by working on real-world problems, making them invaluable for both beginners and those preparing for certifications.
Recommended Resources:
Brokee DevOps Assessments: Brokee offers real-world cloud-based assessments that simulate job environments, helping engineers practice hands-on Azure DevOps tasks and allowing companies to assess candidates' proficiency in real-time.
6. GitHub Learning Lab: CI/CD with GitHub Actions and Azure
What It Is
GitHub Learning Lab provides an interactive guide to integrating GitHub Actions with Azure for CI/CD pipelines. It's a great way to learn how to automate workflows and deployments using GitHub alongside Azure DevOps.
What You’ll Learn
Automating code builds and deployments with GitHub Actions
Integrating GitHub repositories with Azure environments
Best practices for implementing automated workflows in cloud environments
Why It’s Important
With many organizations using GitHub for code management, this course equips you with the skills to merge GitHub's powerful automation tools with Azure's cloud infrastructure.
Recommended Resources:
GitHub Learning Lab: Free access to interactive, self-paced courses.
Conclusion
Azure DevOps is a critical skill set for anyone entering the cloud engineering space, and mastering it requires both theoretical knowledge and practical experience.
By leveraging the right training resources and certifications, you can position yourself for success in a competitive job market.
Top 10 SRE Tools Every DevOps Engineer Should Know About
As a DevOps engineer, knowing the right tools for the job is essential to managing and optimizing complex infrastructures.
Let's explore the top 10 SRE tools every DevOps engineer should be familiar with.
Site Reliability Engineering (SRE) plays a crucial role in ensuring systems are reliable, scalable, and performant.
Below are the top 10 SRE tools every DevOps engineer should be familiar with, whether they’re focused on monitoring, automation, or incident management.
1. Prometheus
What is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit designed for reliability. It collects metrics from various sources, stores them in a time-series database, and allows engineers to set up powerful alerting based on predefined thresholds.
Why You Need It
Prometheus is widely adopted for system monitoring due to its scalability and flexibility. It integrates seamlessly with Kubernetes and other cloud-native environments, making it an essential tool for SREs and DevOps engineers alike.
2. Grafana
What is Grafana?
Grafana is an open-source data visualization and analytics tool that integrates with Prometheus and other data sources to provide real-time dashboards.
Why You Need It
Grafana’s customizable dashboards give teams a clear visual overview of system health, performance metrics, and potential bottlenecks. This allows SREs to spot issues quickly and maintain system reliability.
3. Terraform
What is Terraform?
Terraform by HashiCorp is a powerful tool for Infrastructure as Code (IaC). It enables engineers to define cloud infrastructure resources using declarative code, which can be version-controlled and automated.
Why You Need It
Automating infrastructure provisioning with Terraform reduces human error and ensures consistency across environments. For SREs, this means more reliable deployments and faster recovery from incidents.
4. Kubernetes
What is Kubernetes?
Kubernetes is the most popular container orchestration platform, used to manage and scale containerized applications across clusters.
Why You Need It
Kubernetes automates the deployment, scaling, and management of containerized applications. Its self-healing capabilities, auto-scaling, and robust ecosystem make it an indispensable tool for any SRE or DevOps engineer focused on maintaining reliability.
5. PagerDuty
What is PagerDuty?
PagerDuty is an incident management platform designed to help DevOps and SRE teams respond to incidents in real-time.
Why You Need It
PagerDuty integrates with monitoring tools and alerts teams when something goes wrong. It helps organize and escalate incidents, ensuring that the right people respond promptly to minimize downtime and system impact.
6. Ansible
What is Ansible?
Ansible is an open-source tool for automation and configuration management. It allows for the automation of application deployment, cloud provisioning, and system configurations.
Why You Need It
SREs use Ansible to automate repetitive tasks, reducing manual intervention and minimizing configuration drift across environments. It’s essential for maintaining consistent and reliable infrastructure.
7. ELK Stack (Elasticsearch, Logstash, Kibana)
What is the ELK Stack?
The ELK Stack is a combination of three tools: Elasticsearch (search and analytics engine), Logstash (log pipeline), and Kibana (visualization).
Why You Need It
This stack is perfect for log management, allowing SREs to collect, analyze, and visualize logs in real-time. With ELK, you can identify and troubleshoot issues across distributed systems, improving reliability and system observability.
8. Jenkins
What is Jenkins?
Jenkins is a popular open-source automation server used to build and manage CI/CD pipelines.
Why You Need It
SREs rely on Jenkins to automate the building, testing, and deployment of code. With its broad plugin ecosystem, Jenkins integrates with many tools and platforms, making it a key player in ensuring smooth and reliable software delivery.
9. Datadog
What is Datadog?
Datadog is a monitoring and analytics platform for cloud applications, offering real-time insights into system performance.
Why You Need It
Datadog combines metrics, traces, and logs into a single platform, enabling SREs to monitor cloud infrastructures, troubleshoot issues quickly, and maintain system performance with greater clarity.
10. Runbook Automation (Rundeck)
What is Rundeck?
Rundeck is a runbook automation tool that helps SREs create and execute automated procedures to handle system operations and incidents.
Why You Need It
Automating routine tasks and operational procedures with Rundeck reduces human error, speeds up incident resolution, and allows SREs to focus on more strategic tasks, all while maintaining system reliability.
Conclusion
Mastering these tools will equip any DevOps engineer or SRE to manage and scale infrastructures with confidence.
From monitoring and observability with Prometheus and Grafana, to automating infrastructure and workflows with Terraform and Ansible, each tool plays a pivotal role in ensuring system reliability and efficiency.
Want to hone your ability to use SRE tools? Brokee’s assessments incorporate real-world tasks using these essential SRE tools, helping engineers hone their skills and allowing companies to evaluate candidates’ hands-on proficiency.