SRE Teams #8: Loggi
Challenges of hypergrowth with more than 200 engineers joining the company in one year. Monorepos, Release Managers, and more.
👋 Hello and welcome to this week’s edition of SRE Teams — a newsletter where I share how interesting companies are implementing Site Reliability Engineering and DevOps practices.
I got the chance to speak with Italo. He is the SRE Manager at Loggi. We talked about how they structured SRE during hyper-growth, their focus on building talent inside the company, Monorepos, and a lot more. Let's dive in.
Company
Loggi is a logistics company with the mission of connecting Brazil. They recently raised USD 212 million to connect 100% of the Brazilian population. They ended last year reaching 54% of people in Brazil, up from 43% the year before. They grew 390% in 2020.
Team
Three hundred people work in technology, divided among engineering, design, product, and data. They organize teams in vertical squads grouped into tribes. Chapters are horizontal teams that connect with all squads. SRE is one of these chapters, along with Platform and Data Analytics. With 10 people, they have a 1:15 SRE-to-engineer ratio. Their goal is to reach 1:20 through automation, even with their exponential growth.
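To make the ratio concrete, here is a small back-of-the-envelope sketch. It takes the figures above at face value (10 SREs, 1:15 today, 1:20 as the goal); the function names and the 350-engineer scenario are illustrative assumptions, not numbers from Loggi.

```python
import math


def engineers_covered(sre_count: int, engineers_per_sre: int) -> int:
    """How many engineers a team of SREs supports at a given ratio."""
    return sre_count * engineers_per_sre


def sres_needed(engineer_count: int, engineers_per_sre: int) -> int:
    """How many SREs are needed to support a given number of engineers."""
    return math.ceil(engineer_count / engineers_per_sre)


# Figures from the text: 10 SREs at 1:15 today, 1:20 as the target.
covered_today = engineers_covered(10, 15)
covered_at_target = engineers_covered(10, 20)

# Hypothetical growth scenario (350 engineers is an assumed number).
needed_at_current_ratio = sres_needed(350, 15)
needed_at_target_ratio = sres_needed(350, 20)

print(covered_today, covered_at_target)
print(needed_at_current_ratio, needed_at_target_ratio)
```

Read this way, moving from 1:15 to 1:20 lets the same 10 people support roughly a third more engineers, which is why the target depends on automation and not on hiring alone.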
Stack
The SRE team's goal is to deliver automation without magic by introducing the Infrastructure Platform. Their focus is on training engineers to understand what happens under the hood, so they can troubleshoot and also contribute to automation and reliability. The platform's goal is to increase productivity through automation without hiding too much of the underlying infrastructure. They believe that exposing engineers to some level of the infrastructure is healthy.
They use a single Git repository for multiple services of the same language -- a Monorepo. Deployment pipelines, application architecture, and library standards all became easier to standardize. One of the best results was the standardization of libraries. When they had scattered repositories, it was common for different teams to build libraries to solve the same problem. Today, finding out whether a library already exists and how it is used is as easy as searching a single codebase.
This article explains in more detail how they migrated from using a broad set of technologies to standardizing on two main stacks: 1) Kotlin with Micronaut, MongoDB, and Kafka. 2) Python with Django, PostgreSQL, and RabbitMQ. They leverage managed services as much as possible so they can focus on their core business: connecting Brazil. They use both Azure and AWS as cloud providers. Synchronous communication between services uses gRPC. Asynchronous communication uses Kafka, and they use Elastic Cloud to manage logs and application search.
Delivery
The Monorepo simplifies the delivery infrastructure. All applications follow the same format, and the deployment pipeline has few exceptions despite the high number of services. Squad members trigger deployments by creating GitHub tags. The company deals with mission-critical infrastructure in the real world, so the teams have recommended times and procedures for deployments, which also protect people's work-life balance. However, these are recommendations, not rules. The Monorepo, together with the infrastructure platform built with Istio + Kubernetes, also makes it easier to standardize logs, metrics, and traces across services. They use Feature Flags with Unleash for gradual rollouts.
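A gradual rollout behind a feature flag is often implemented by hashing a stable identifier into a bucket and comparing it with the rollout percentage. The sketch below illustrates that general idea in plain Python. It is not Unleash's API or its actual algorithm, and the flag name and user IDs are made up for the example.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for users whose bucket falls below the percentage."""
    return bucket(flag_name, user_id) < rollout_percent


# Hypothetical flag and users: widen the rollout in steps.
users = [f"user-{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    enabled = sum(is_enabled("new-routing", u, percent) for u in users)
    print(f"{percent}% rollout -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag name and the user ID, a user who gets the feature at 10% keeps it at 50%, so widening the rollout never flips anyone back.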
Ops
The SRE team is in charge of the on-call rotation. They always have two people on call: a primary and a fallback. The people on call for the week also handle other day-to-day tasks, such as incident response, helping with support tickets, and cherry-picking updates to mission-critical systems. During on-call shifts, there are no expectations in terms of feature delivery.
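A weekly primary/fallback rotation like the one described can be sketched in a few lines. The names below are placeholders, and the rule that this week's fallback becomes next week's primary is an assumption for illustration; the text does not say how Loggi orders its shifts.

```python
from typing import List, Tuple


def on_call_for_week(engineers: List[str], week_index: int) -> Tuple[str, str]:
    """Return (primary, fallback) for a given week in a round-robin rotation.

    Assumed policy: the fallback of one week is the primary of the next.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two people for primary and fallback")
    primary = engineers[week_index % len(engineers)]
    fallback = engineers[(week_index + 1) % len(engineers)]
    return primary, fallback


# Placeholder names, not real team members.
team = ["ana", "bruno", "carla"]
for week in range(4):
    primary, fallback = on_call_for_week(team, week)
    print(f"week {week}: primary={primary}, fallback={fallback}")
```

Rotating the fallback into the primary slot means the incoming primary has already seen the previous week's incidents.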
Recent Success
Their recent success is the Release Manager delivery model. After many iterations, they arrived at a workflow that enabled continuous delivery even after adding more than 200 engineers in 2020. The delivery workflow prioritizes stability and people. Engineers rotate through the role of Release Manager and help coordinate changes made to the main application. More than 40 people contribute to the process as Release Managers, working alongside SRE to improve the system’s reliability.
Recent Challenge
Their large Django application uses a single database with more than 5TB of data. Managing this database is hard. They are hitting all sorts of limits in the database engine. It's also challenging to scale the application, as it's limited to vertical scaling. To deal with this, they are working to split the application into smaller infrastructure pieces, including smaller databases. This is ongoing work, but it has already proved effective: they recently managed to split the whole stack for one service.
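One common way to move part of a Django application onto its own database is a database router. The sketch below follows Django's documented router interface (`db_for_read`, `db_for_write`, `allow_relation`, `allow_migrate`), but the app label `tracking` and the alias `tracking_db` are hypothetical, and nothing here is confirmed as the approach Loggi took.

```python
from types import SimpleNamespace


class PerAppRouter:
    """Send models of selected apps to their own database alias."""

    # Hypothetical mapping of Django app label -> database alias.
    app_to_db = {"tracking": "tracking_db"}

    def _db_for(self, app_label):
        return self.app_to_db.get(app_label)

    def db_for_read(self, model, **hints):
        # Returning None means "no opinion", so Django falls back to default.
        return self._db_for(model._meta.app_label)

    def db_for_write(self, model, **hints):
        return self._db_for(model._meta.app_label)

    def allow_relation(self, obj1, obj2, **hints):
        db1 = self._db_for(obj1._meta.app_label)
        db2 = self._db_for(obj2._meta.app_label)
        if db1 is None and db2 is None:
            return None
        # Only allow relations between objects living in the same database.
        return db1 == db2

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        target = self._db_for(app_label)
        if target is not None:
            return db == target
        # Keep unrouted apps out of the split-out databases.
        if db in self.app_to_db.values():
            return False
        return None


# Stand-ins for model classes so the sketch runs without Django installed.
tracking_model = SimpleNamespace(_meta=SimpleNamespace(app_label="tracking"))
orders_model = SimpleNamespace(_meta=SimpleNamespace(app_label="orders"))

router = PerAppRouter()
print(router.db_for_read(tracking_model), router.db_for_read(orders_model))
```

In a real project the router would be listed in the `DATABASE_ROUTERS` setting, with `tracking_db` defined in `DATABASES`. Cross-database foreign keys are not supported by Django, which is why `allow_relation` rejects relations that span the split.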
Advice
Decoupling deploys from releases was a game-changer for them. Using Feature Flags increased the number of deployments and reduced the number of incidents in production. Engineers who were somewhat afraid to make changes to production felt more confident. It's now part of the culture. Making it extremely easy to revert changes actually resulted in fewer rollbacks, because more frequent deployments raised quality. Split deployment and release if you haven't already.
If these challenges appeal to you, Loggi has open positions for SREs!
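The difference between deploying and releasing can be shown with a tiny example: the new code path ships to production switched off, and flipping a flag releases or reverts it without another deployment. Everything below (the in-memory `FlagStore`, the flag name, and the ETA formulas) is invented for illustration and is not Loggi's code or Unleash's API.

```python
class FlagStore:
    """Minimal in-memory flag store; unknown flags default to off."""

    def __init__(self):
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)


def legacy_eta_minutes(distance_km: float) -> float:
    return distance_km * 3.0


def new_eta_minutes(distance_km: float) -> float:
    return distance_km * 2.0 + 5.0


def quote_eta(flags: FlagStore, distance_km: float) -> float:
    # Both code paths are deployed; the flag decides which one is released.
    if flags.is_enabled("new_eta_model"):
        return new_eta_minutes(distance_km)
    return legacy_eta_minutes(distance_km)


flags = FlagStore()
deployed_but_dark = quote_eta(flags, 10.0)   # new code shipped, flag off

flags.set("new_eta_model", True)
released = quote_eta(flags, 10.0)            # release = flip the flag

flags.set("new_eta_model", False)
reverted = quote_eta(flags, 10.0)            # revert without a redeploy

print(deployed_but_dark, released, reverted)
```

Since reverting is only a flag change, a bad release no longer requires rolling back a deployment.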
Thanks
If you're enjoying SRE Teams, I'd love it if you shared it with a friend or two. I try to make it one of the best emails you get in the week, and I hope you're enjoying it.
That’s it for this week! Hit me up if you have any thoughts, feedback, or insights to share. Otherwise, see you next week!