SRE Teams #1: Hash
How culture enables a two-person team to help scale the product organization of a disruptive fintech.
There is an interesting discussion of this topic on Hacker News.
This is the first issue of SRE Teams. Here I share how interesting companies are implementing Site Reliability Engineering practices.
The goal is to show what Site Reliability Engineering looks like in the real world, at companies that are not FAANG scale: startups and traditional businesses of all sizes. Each issue covers what is working, what challenges the team faces, and any advice for other companies building SRE teams.
The next 3 issues are ready and will be released in the coming days. Make sure you subscribe to receive them directly in your inbox.
I got the chance to speak with Jhonn, head of SRE at Hash. We talked about how they are structuring their SRE organization and what advice they have for new teams.
Hash is a Brazilian fintech building the next generation of payments infrastructure. They currently provide terminal solutions that let merchants accept physical card payments. It’s like Square, but in Brazil. And they have a lot more coming up.
Hash uses a very advanced stack for the type of business it runs. Most fintechs are conservative on the tech side, but that is not the case for Hash. They run containers on Kubernetes with Istio, use Prometheus, Alertmanager, and Grafana for monitoring and alerting, and use GitLab for CI and CD.
The team at Hash follows the Google SRE principles, and the SRE team has autonomy in the company. They have maturity policies in place and operate mission-critical applications, but only those that pass the readiness checklist. The SRE team provides standard tooling that product teams can use, and it owns the CI/CD pipelines and infrastructure code, among other things. Developers are free to contribute features or write their own customized versions, but custom versions lose the SRE team's support.
Product teams that opt out, or whose application is not yet ready for SRE, run on their own. They have full access to the resources they need, and the interface with the SRE team is minimal in these cases. The team in charge of the product gets its own namespace in Kubernetes (k8s), limited by network policies, resource quotas, and other controls. They can iterate fast without putting other products' containers in the cluster at risk.
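To make the per-team namespace idea concrete, here is a minimal sketch in Python that builds the three Kubernetes objects such a setup typically involves: a Namespace, a ResourceQuota, and a NetworkPolicy that only admits traffic from pods in the same namespace. The function name, the team name, and the quota values are assumptions made up for illustration. The article does not describe Hash's actual manifests or limits.

```python
import json


def namespace_manifests(team, cpu="4", memory="8Gi", pods="20"):
    """Build Namespace, ResourceQuota and NetworkPolicy manifests for one team.

    The default quota values are illustrative assumptions, not Hash's settings.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": team, "labels": {"team": team}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{team}-quota", "namespace": team},
        # Caps the total resources all pods in the namespace may request.
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory, "pods": pods}},
    }
    policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "same-namespace-only", "namespace": team},
        "spec": {
            "podSelector": {},  # applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            # A bare podSelector under "from" matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [namespace, quota, policy]


if __name__ == "__main__":
    for manifest in namespace_manifests("payments"):
        print(json.dumps(manifest, indent=2))
```

Kubernetes accepts JSON manifests as well as YAML, so output like this could be piped to `kubectl apply -f -` in a test cluster. In practice a team would more likely keep such definitions as templates in the infrastructure repository.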
Everything is code managed in Git, from infrastructure resources to monitoring alerts. This makes any change extremely fast, including changes to infrastructure and alerts. They also run tests on infrastructure resources to make sure new changes don’t break existing behavior.
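The article does not say which tools Hash uses to test infrastructure, so the following is only a sketch of what such a test can look like: a small policy check, in Python, that could run in CI against every Deployment manifest before it is merged. The rule (every container must declare CPU and memory limits) and the sample manifests are hypothetical.

```python
def check_deployment(manifest):
    """Return a list of human-readable policy violations for a Deployment manifest."""
    problems = []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", [])
    if not containers:
        problems.append(f"{name}: no containers defined")
    for container in containers:
        limits = container.get("resources", {}).get("limits", {})
        for key in ("cpu", "memory"):
            if key not in limits:
                problems.append(f"{name}/{container.get('name')}: missing {key} limit")
    return problems


# Hypothetical sample manifests, trimmed to the fields the check reads.
GOOD_DEPLOYMENT = {
    "metadata": {"name": "ledger"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
    ]}}},
}
BAD_DEPLOYMENT = {
    "metadata": {"name": "checkout"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {"limits": {"cpu": "500m"}}},
    ]}}},
}

if __name__ == "__main__":
    for sample in (GOOD_DEPLOYMENT, BAD_DEPLOYMENT):
        print(sample["metadata"]["name"], check_deployment(sample) or "ok")
```

A check like this fails the pipeline with a readable message instead of letting an unbounded container reach the shared cluster, which is the kind of regression infrastructure tests are meant to catch.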
The SRE team is in charge of the SLOs and SLAs of the products it runs. Well-defined alerts route product-related issues to product teams. The SRE team is paged only for critical events related to something it is in charge of.
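The routing rule described above can be sketched as a small decision function. This is an illustration in Python, not Hash's Alertmanager configuration: the `owner` and `severity` fields, the `"sre"` team name, and the `page:`/`notify:` destinations are all assumptions. How product teams handle their own critical alerts is not covered in the article, so the sketch simply notifies them.

```python
from dataclasses import dataclass

SRE_TEAM = "sre"  # hypothetical label value for SRE-owned services


@dataclass(frozen=True)
class Alert:
    name: str
    owner: str     # team that owns the affected service (assumed label)
    severity: str  # e.g. "critical" or "warning" (assumed label)


def route(alert):
    """Decide where an alert goes: product issues to product teams, pages only for SRE-owned criticals."""
    if alert.owner != SRE_TEAM:
        # Product-related issue: hand it to the owning product team.
        return f"notify:{alert.owner}"
    if alert.severity == "critical":
        # Critical event on something SRE owns: wake someone up.
        return "page:sre"
    # Non-critical SRE-owned alert: no page.
    return "notify:sre"


if __name__ == "__main__":
    for alert in (
        Alert("CheckoutErrorRate", owner="checkout", severity="critical"),
        Alert("IngressDown", owner="sre", severity="critical"),
        Alert("DiskFillingUp", owner="sre", severity="warning"),
    ):
        print(alert.name, "->", route(alert))
```

In a real Prometheus setup the same decision would usually live in the Alertmanager routing tree, keyed on alert labels, rather than in application code.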
The Hash team has a healthy DevOps culture. Product teams contribute to infrastructure code and have a good understanding of the platform. This allowed the SRE team at Hash to do everything with only 2 people: running the platform applications, tracking indicators, creating processes, and much more. They are a two-person team in a company with 22 product engineers, a 1:11 SRE to product engineer ratio. They are growing fast and expect to reach 40 people in engineering by year-end.
The rapid growth of the business has had some side effects. Significant changes to platform services now require better communication with product teams, and building those processes is challenging. The maturity model spec is evolving a lot and receives contributions from other areas of the company. Maintaining the culture is also challenging as the teams grow rapidly.
One of the critical things allowing Hash to scale fast with quality is culture. A key lesson from the team is not to compromise on culture fit when hiring. It’s tempting to bring in new people to lower the pressure on a two-person team, but someone with the wrong fit could undo years of work.
The SRE Team @ Hash is hiring! Reach out to Jhonn if you want to join this great team and project.
Make sure to subscribe to receive the next SRE Teams issue in your inbox.
Then spread the word to your friends at work, so they also get updates on what is happening with other SRE teams.