What is Platform Engineering?

Recently, someone inquired about my understanding of a Platform Engineer's role. Having experienced various engineering positions, I found this query quite intriguing. Throughout my tenure in management, my go-to interview question has always been, "What is DevOps?" I'm aware it's a broad question that invites various responses. However, when it comes to "What is Platform Engineering?" I believe the answer is much more clear-cut.

Introduction

Platform Engineering is an evolving field in software engineering that focuses on creating and maintaining scalable and efficient infrastructure platforms. These platforms are designed to support software development and operations processes, enabling teams to deliver high-quality software products and services more quickly and reliably.

At its core, platform engineering involves the development and management of internal tools, frameworks, and services that streamline various aspects of software development, deployment, and operations. This includes setting up and maintaining cloud infrastructure, implementing continuous integration and continuous deployment (CI/CD) pipelines, and ensuring the overall scalability, reliability, and security of the software infrastructure.

Platform engineers work to abstract the complexities of underlying infrastructure and operations from software developers, allowing them to focus more on writing code and developing features. This is achieved through automating various processes, providing self-service capabilities, and ensuring best practices in software deployment and operations are adhered to.

The importance of platform engineers in modern software development cannot be overstated. As software development practices and technologies have evolved, the role of platform engineering has become increasingly crucial in ensuring efficient, reliable, and scalable software delivery. Some key reasons for their importance include:

Facilitating DevOps Practices
Enabling Scalability and Reliability
Automating Infrastructure Management
Enhancing Security
Improving Developer Productivity
Cost Optimization
Supporting Microservices and Containerization
Driving Innovation
Ensuring Compliance and Governance
Enhancing Customer Experience

Platform engineers are integral to modern software development, providing the technological backbone for organizations to develop, deploy, and operate software efficiently and effectively in a rapidly changing digital landscape. Their role is critical in balancing operational demands with the need for innovation, security, and optimal performance.

Understanding Platform Engineering (Core Responsibilities and Key Objectives)

Platform engineers play a pivotal role in the software development and deployment process. Their core responsibilities revolve around creating and maintaining the infrastructure that supports software development, deployment, and operations. Here are the critical responsibilities of platform engineers:

Designing and Building Infrastructure: Platform engineers are responsible for designing and building robust, scalable, and efficient infrastructure systems. This includes setting up servers, networks, and cloud environments, ensuring they can handle the demands of the applications and services they support.
Implementing Automation: A significant part of their role involves automating repetitive tasks. This includes automating the deployment of servers, application code, and configurations using Infrastructure as Code (IaC) tools like Terraform, Ansible, or Chef.
Developing and Maintaining CI/CD Pipelines: They set up and maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines to automate the software release process. This ensures that new code changes are automatically tested and deployed efficiently and reliably.
Ensuring System Reliability and Availability: Platform engineers work to ensure that the systems are reliable and available. This includes monitoring system performance, troubleshooting issues, analysing root causes, and implementing resilience and failover strategies. SREs also cover this area; the overlap is discussed in the comparison section below.
Optimizing Resource Utilization and Cost: They are responsible for monitoring and optimizing the utilization of resources to ensure cost-effective operation, especially in cloud environments where resources are billed based on usage.
Ensuring Security and Compliance: Platform engineers are tasked with implementing security best practices in the infrastructure. This includes securing servers and networks, managing access controls, and ensuring compliance with relevant regulations and standards.
Supporting Microservices and Containerization: They often manage containerized environments and microservices architectures. This involves using tools like Docker and Kubernetes to deploy, scale, and manage containerized applications.
Collaborating with Development Teams: Platform engineers work closely with software developers to understand their needs and provide them with the necessary tools and environments. This collaboration is crucial for aligning development and operational goals.
Monitoring and Logging: They set up and manage monitoring and logging systems, such as Prometheus, Grafana, and the ELK stack, to track the health and performance of applications and infrastructure.
Capacity Planning and Scalability: Platform engineers are responsible for forecasting future infrastructure needs and scaling resources accordingly to handle growth and peak loads.
Documentation and Knowledge Sharing: Creating and maintaining thorough documentation for infrastructure setups, configurations, and best practices is also a key responsibility. This ensures knowledge sharing and consistency across teams.
Staying Updated with Emerging Technologies: They continuously learn and evaluate new technologies and methodologies to improve efficiency and stay ahead in the rapidly evolving tech landscape.

Platform engineers have a diverse and critical set of responsibilities spanning various aspects of software development and operations. Their role is integral to the smooth functioning and continual improvement of the technological infrastructure that supports modern software applications.

Tools Used in Platform Engineering

IaC

Infrastructure as Code (IaC) is a fundamental practice in modern platform engineering, enabling engineers to manage and provision infrastructure through code rather than through manual processes. The IaC toolset varies depending on the specific needs and preferences of an organization, but several tools are widely used in the industry. Here's an overview of some of them; there are many more, and this is only a small sample:

Terraform: Developed by HashiCorp, Terraform is one of the most popular IaC tools. It allows engineers to define and provision infrastructure across various cloud providers (like AWS, Azure, and GCP) using a declarative configuration language. Terraform is known for its ability to manage complex infrastructure setups with a focus on infrastructure state management.
Ansible: Ansible, now a part of Red Hat, is an open-source tool primarily used for configuration management and application deployment. It uses simple YAML syntax for its playbook scripts, making it relatively easy to learn and use. Ansible operates over SSH and doesn't require agent installation on the target nodes, simplifying the management of multiple servers.
AWS CloudFormation: Specifically for Amazon Web Services (AWS) infrastructure, CloudFormation is a tool that allows users to define and provision AWS resources with JSON or YAML templates. It's deeply integrated with AWS and supports most of its services and resources.
Azure Resource Manager (ARM): Similar to AWS CloudFormation but for Microsoft Azure, ARM templates are used to automate the deployment and management of Azure resources. These JSON templates enable complex deployments and provide a way to manage infrastructure and application resources together.
Google Cloud Deployment Manager: For those using Google Cloud Platform (GCP), Deployment Manager allows the creation and management of resources using declarative templates in YAML.
Pulumi: A newer entrant in the IaC space, Pulumi allows users to define infrastructure using general-purpose programming languages like JavaScript, TypeScript, Python, Go, and .NET languages. This approach can be more flexible and familiar for teams already proficient in these languages.
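
To make the declarative idea behind these tools a little more concrete, here is a minimal Python sketch of the "compare desired state with recorded state and compute a plan" step. It is not how Terraform or any other tool listed above is implemented; the names Plan and plan_changes and the resource dictionaries are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Plan:
    """Changes needed to move the recorded state to the desired state."""
    create: Dict[str, dict]
    update: Dict[str, dict]
    delete: List[str]


def plan_changes(desired: Dict[str, dict], current: Dict[str, dict]) -> Plan:
    """Diff two name -> spec mappings (hypothetical resource model)."""
    # Resources declared in code but not yet recorded must be created.
    create = {name: spec for name, spec in desired.items() if name not in current}
    # Resources present on both sides with a different spec must be updated.
    update = {
        name: spec
        for name, spec in desired.items()
        if name in current and current[name] != spec
    }
    # Resources recorded but no longer declared must be deleted.
    delete = sorted(name for name in current if name not in desired)
    return Plan(create=create, update=update, delete=delete)


if __name__ == "__main__":
    desired = {"web": {"size": "large"}, "db": {"size": "medium"}}
    current = {"web": {"size": "small"}, "cache": {"size": "small"}}
    print(plan_changes(desired, current))
```

The point of the sketch is that the engineer only edits the desired mapping; the tool works out which actions are needed, which is what makes runs repeatable.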

CI/CD

Continuous Integration/Continuous Deployment (CI/CD) tools are integral to the platform engineering toolkit, enabling automated testing and deployment of software. These tools streamline the software release process, ensuring code changes are integrated, tested, and deployed efficiently and reliably. Here's an overview of some essential CI/CD tools commonly used:

Jenkins: An open-source automation server, Jenkins is one of the most popular CI/CD tools. It offers a wide range of plugins to support building, deploying, and automating any project. Jenkins' flexibility and extensive community support make it a preferred choice for many organizations.
GitLab CI/CD: Integrated within the GitLab platform, GitLab CI/CD provides a streamlined solution for the entire code lifecycle, from code creation to deployment. Its integration with GitLab's source control features makes it a powerful and convenient option for many DevOps teams.
CircleCI: Known for its ease of use and configuration, CircleCI offers cloud-based CI/CD services that support rapid software development and deployment. It integrates seamlessly with GitHub and Bitbucket and supports containerized environments.
Azure DevOps (formerly VSTS): Azure DevOps provides a range of tools for software development, including CI/CD pipelines. It integrates well with Microsoft’s cloud services and offers comprehensive solutions for teams working on Microsoft platforms.
GitHub Actions: GitHub Actions enables automation of workflows, including CI/CD, directly in GitHub repositories. This makes it convenient for teams using GitHub for source control to implement CI/CD without leaving the platform.
Argo CD: Argo CD is a significant and increasingly popular tool in the realm of Continuous Deployment, particularly in Kubernetes environments. It's part of the Argo project, which is a set of tools for Kubernetes to run workflows, manage clusters, and do GitOps.
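
Whatever tool is chosen, the core behaviour of a pipeline is similar: stages run in order, and a failure stops everything that follows. The Python sketch below illustrates only that idea, under the assumption that each stage is a function returning True or False; run_pipeline and the stage names are hypothetical and do not correspond to any of the tools above.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; once one fails, mark the rest as skipped."""
    results = []
    failed = False
    for name, step in stages:
        if failed:
            # Never run later stages (e.g. deploy) after a failure.
            results.append((name, "skipped"))
            continue
        ok = step()
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results


if __name__ == "__main__":
    demo = [
        ("build", lambda: True),
        ("test", lambda: False),   # simulated failing test suite
        ("deploy", lambda: True),
    ]
    for name, status in run_pipeline(demo):
        print(f"{name}: {status}")
```

In the demo, the deploy stage is reported as skipped because the test stage failed, which is the safety property a CI/CD pipeline is meant to provide.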

Monitoring and logging

Monitoring and logging tools are essential in platform engineering for tracking the performance, health, and reliability of applications and infrastructure. They provide insights into system behaviour, help identify issues, and are crucial for maintaining system stability and efficiency. Here's an overview of some widely used monitoring and logging tools:

Prometheus: An open-source monitoring system and time series database. Prometheus is particularly well-suited for monitoring dynamic container environments like Kubernetes. It's known for its powerful querying language (PromQL) and easy integration with various ecosystems.
Grafana: Often used in conjunction with Prometheus, Grafana is an open-source analytics and monitoring solution. It provides advanced visualization features, allowing users to create informative and interactive dashboards based on data from multiple sources, including Prometheus, Elasticsearch, and more.
Elasticsearch, Logstash, and Kibana (ELK Stack): This trio forms a complete logging platform. Elasticsearch is a search and analytics engine, Logstash is used for log collection and processing, and Kibana is a visualization tool. Together, they provide a powerful solution for searching, analyzing, and visualizing log data in real time.
Splunk: A comprehensive tool for searching, monitoring, and analyzing machine-generated data. Splunk is widely used in large enterprises for its powerful data processing capabilities and user-friendly interface.
Datadog: A cloud-based monitoring service that provides monitoring of servers, databases, tools, and services through a SaaS-based data analytics platform. Datadog is known for its easy-to-use interface, comprehensive coverage of various technologies, and seamless integration with cloud services.
Zabbix: An enterprise-class open-source monitoring solution for networks and applications. Zabbix is known for its scalability, high performance, and comprehensive features.
New Relic: A cloud-based observability platform that offers real-time monitoring and analytics. New Relic is particularly popular for application performance monitoring (APM) and provides deep insights into software performance.

Cloud Services

Cloud Service Providers (CSPs) play a pivotal role in platform engineering, offering various services and resources that enable organizations to deploy, manage, and scale applications and infrastructure. Platform engineers often leverage these services to build and maintain robust, scalable, and efficient platforms. Here are some of the major Cloud Service Providers commonly used:

Amazon Web Services (AWS): As the largest and most comprehensive cloud provider, AWS offers an extensive range of services, including computing power (EC2), storage (S3), and databases (RDS). AWS is known for its scalability, reliability, and a wide array of tools and services that cater to various needs, from machine learning to IoT.
Microsoft Azure: Azure provides a broad set of cloud services, including solutions for AI, machine learning, IoT, and analytics. It’s particularly popular among enterprises due to its integration with Microsoft’s software products and, in my experience, Microsoft usually offers better discounts than other providers.
Google Cloud Platform (GCP): GCP is known for its high-performance computing services, big data and analytics capabilities, and machine learning services. That said, my own work with GCP has been quite challenging, because its documentation often did not seem to match what was actually in production, but that's just me.

Containerization and Orchestration

Containerization and orchestration tools are crucial in modern platform engineering, facilitating the deployment, scaling, and management of applications. These tools help in creating lightweight, portable, and consistent environments for applications, making them ideal for cloud-native development and microservices architectures. Here’s an overview of some key containerization and orchestration tools used:

Docker: Docker is the most widely used containerization platform. It allows developers to package applications into containers—standardized executable components combining application source code with the operating system (OS) libraries and dependencies required to run that code in any environment.
Kubernetes (K8s): Kubernetes is the leading container orchestration platform. It automates the deployment, scaling, and management of containerized applications. Kubernetes is highly scalable and supports a large ecosystem of add-ons and integrations, making it a go-to choice for managing complex containerized applications.
Docker Swarm: Docker Swarm is Docker's native clustering and orchestration tool. While it’s less complex and easier to set up than Kubernetes, it doesn’t offer as many features. Docker Swarm is suitable for smaller-scale operations or for those who prefer simplicity and are already invested in the Docker ecosystem.
OpenShift: Developed by Red Hat, OpenShift is a Kubernetes-based container platform that provides a comprehensive solution for development, deployment, and management of containerized applications. It extends Kubernetes with additional features and a user-friendly interface, making it suitable for enterprise environments.
Amazon Elastic Container Service (ECS): ECS is a container orchestration service provided by AWS, designed for Docker containers. It allows you to run applications on a managed cluster of AWS EC2 instances or on AWS Fargate, where AWS manages the underlying servers.
Amazon Elastic Kubernetes Service (EKS): EKS is AWS's managed Kubernetes service. It simplifies the process of running Kubernetes on AWS without needing to install and operate your own Kubernetes control plane. Worker nodes can be self-managed or handed off to AWS through managed node groups or Fargate.
Azure Kubernetes Service (AKS): AKS is Microsoft Azure's managed Kubernetes service. It handles critical tasks like health monitoring and maintenance for you, simplifying the Kubernetes management process.
Google Kubernetes Engine (GKE): GKE is Google Cloud's managed Kubernetes service. It offers integrated support for Google Cloud features and brings Google's expertise in container orchestration to the table.
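
One thing an orchestrator does on your behalf is decide where each container runs. The Python sketch below shows a deliberately simplified first-fit placement based on CPU requests. It is an illustration only: real schedulers such as the one in Kubernetes weigh many more factors, and place_containers, the node names, and the CPU numbers are all hypothetical.

```python
from typing import Dict, Optional


def place_containers(
    cpu_requests: Dict[str, int],
    node_capacity: Dict[str, int],
) -> Dict[str, Optional[str]]:
    """Assign each container to the first node with enough free CPU.

    Returns container -> node name, or None when nothing fits.
    """
    free = dict(node_capacity)  # copy so the caller's dict is untouched
    placement: Dict[str, Optional[str]] = {}
    for name, cpu in cpu_requests.items():
        # First-fit: take the first node whose remaining CPU covers the request.
        target = next((node for node, avail in free.items() if avail >= cpu), None)
        placement[name] = target
        if target is not None:
            free[target] -= cpu
    return placement


if __name__ == "__main__":
    requests = {"web": 2, "api": 3, "batch": 4}
    nodes = {"node-a": 4, "node-b": 4}
    print(place_containers(requests, nodes))
```

With these example numbers, the batch container cannot be placed because neither node has four free CPUs left, which is the kind of situation that capacity planning and autoscaling are meant to address.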

Best Practices in Platform Engineering

Emphasizing Automation

Automation is a cornerstone of practical platform engineering. It streamlines processes, reduces manual errors, and frees up valuable time for engineers to focus on more complex and creative tasks. Key aspects include:

Infrastructure as Code (IaC): Managing infrastructure through code allows for consistent, repeatable processes. Tools like Terraform or Ansible enable engineers to automate the setup and scaling of infrastructure.
Automated Testing and Deployment: Implementing Continuous Integration/Continuous Deployment (CI/CD) pipelines ensures that new code is automatically tested and deployed, speeding up the development cycle and reducing human error.
Monitoring and Alerts Automation: Automating monitoring and alerting systems helps in proactively identifying and addressing issues before they impact users.
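
As a small illustration of automated alerting, the Python sketch below fires only when a metric stays above a threshold for several consecutive samples, which avoids paging someone for a single spike. The function name should_alert, the threshold, and the sample values are assumptions made for this example, not the behaviour of any specific monitoring product.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if `min_consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it on a healthy sample.
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    error_rates = [0.1, 0.6, 0.7, 0.2, 0.8]  # hypothetical error-rate samples
    print(should_alert(error_rates, threshold=0.5, min_consecutive=2))
```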

Implementing Scalability and Reliability

Scalability and reliability are critical for platforms to handle growth and maintain performance under varying loads.

Elastic Scaling: Platforms should be designed to automatically scale resources based on demand, using cloud-native services or container orchestration tools like Kubernetes.
High Availability and Disaster Recovery: Implementing strategies for high availability, such as multi-region deployments, and having a robust disaster recovery plan ensures business continuity.
Load Balancing and Redundancy: Proper load balancing and redundancy prevent any single point of failure and distribute traffic efficiently across resources.
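
For elastic scaling, the Kubernetes Horizontal Pod Autoscaler documents its core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The Python sketch below applies that formula and clamps the result to minimum and maximum bounds. The function name, parameter names, and default bounds are assumptions for illustration, and the real autoscaler adds details such as tolerances and stabilization that are left out here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale the replica count in proportion to metric / target, within bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a metric spike or drop cannot scale past the configured limits.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU with a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

With four replicas running at 90% against a 60% target, the formula asks for six replicas; the bounds keep an extreme reading from scaling the service to zero or to an unaffordable size.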

Security Best Practices

Security is a top priority in platform engineering, involving protecting data, applications, and infrastructure from threats.

Secure Coding Practices: Ensuring that security is integrated into the development process with regular code audits and vulnerability assessments.
Access Control and Authentication: Implementing robust access control mechanisms and using identity and access management (IAM) services helps control who can access what resources.
Regular Patching and Updates: Keeping software and dependencies updated is crucial for protecting against known vulnerabilities.
Encryption and Data Protection: Implementing encryption for data at rest and in transit and adhering to data protection regulations.
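
The access control point can be illustrated with a tiny role-based check. This Python sketch is a simplification under stated assumptions: the roles, the permission strings, and the is_allowed function are hypothetical and far simpler than a real IAM service. What it does show is the deny-by-default rule, where anything not explicitly granted is refused.

```python
from typing import Dict, Iterable, Set

# Hypothetical role -> permissions mapping for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "developer": {"deploy:staging", "logs:read"},
    "platform-engineer": {
        "deploy:staging",
        "deploy:production",
        "logs:read",
        "infra:modify",
    },
}


def is_allowed(roles: Iterable[str], permission: str) -> bool:
    """Allow only if at least one of the user's roles grants the permission."""
    # Unknown roles map to an empty set, so they grant nothing.
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


if __name__ == "__main__":
    print(is_allowed(["developer"], "deploy:production"))
    print(is_allowed(["platform-engineer"], "deploy:production"))
```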

Collaboration and Communication Within Teams

Effective collaboration and communication are vital for the success of platform engineering teams.

Cross-Functional Teams: Encouraging a culture where development, operations, and security teams work closely together improves understanding and efficiency.
Agile Methodologies: Adopting agile practices and tools facilitates better communication, faster decision-making, and more adaptive planning.
Documentation and Knowledge Sharing: Maintaining comprehensive documentation and encouraging knowledge sharing ensures that the entire team understands the systems and processes.
Regular Reviews and Feedback: Conducting regular reviews and encouraging open feedback helps in identifying areas for improvement and fostering innovation.

Best practices in platform engineering involve a blend of technical strategies and team management practices. Emphasizing automation, scalability, reliability, and security, while fostering a collaborative and communicative team environment, is essential for creating and maintaining robust, efficient, and secure platforms. These practices not only optimize platform performance but also enhance the overall productivity and innovation capacity of the engineering teams.

Comparing SRE, DevOps, and Platform Engineering

So I am going to steal directly from another blog post that covers the same subject, as I don't think I could put this any better:

Platform engineering shouldn’t be viewed as an alternative to DevOps. It’s more accurate to treat it as an implementation of DevOps concepts and philosophies. The overarching aim of DevOps is to simultaneously improve software quality and throughput using new tools, processes, and collaboration frameworks. Platform engineering is an example of what this looks like in practice.

Platform engineering neighbors Site Reliability Engineering (SRE) too. The main purpose of SRE is to preserve the stability of your production environments. These teams use objective data-driven targets such as SLAs and SLOs to identify when incidents materially affect your customers or your business. SRE then manages the incident resolution, analyzes what went wrong, and implements changes to prevent the problem from recurring.

Because platform engineering looks at internal systems, it doesn’t directly overlap with SRE. SRE produces infrastructure that’s optimized for highly reliable operations. Platform engineering creates assets that facilitate high-velocity development.

Information should be shared between the disciplines, though, as insights from one field are often valuable to the other.
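
To make the SLO targets mentioned above a little more tangible, SRE teams commonly translate an availability SLO into an "error budget": the amount of downtime the target still permits over a given window. The Python sketch below does that arithmetic. The function names and the 30-day window are my own assumptions for illustration, not something taken from the quoted post.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability SLO over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Remaining budget; a negative value means the SLO has been missed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
    print(round(budget_remaining(99.9, downtime_minutes=30), 1))
```

A number like this gives both disciplines a shared, objective signal: while budget remains, the platform can favor development velocity, and when it runs out, reliability work takes priority.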