SRE Teams #10: Quintoandar

Sep 28, 2021

👋 Hello and welcome to this week’s edition of SRE Teams — a newsletter where I share how interesting companies are implementing Site Reliability Engineering and DevOps practices.

I got the chance to speak with Edson Marquezani. He is the lead of the SRE Platform squad at Quintoandar.

Company

Quintoandar is an end-to-end solution for long-term rentals that, among other things, connects potential tenants to landlords and vice versa. Last year, they also expanded into connecting home buyers to sellers. Their long-term plan is to evolve into a one-stop real estate shop offering mortgage, title insurance, and escrow services. They raised more than $700 million and are valued at $5.1 billion.

Team

SRE at Quintoandar is an independent engineering team. It creates and maintains the tools that give product teams access to infrastructure. These tools support developers in deploying and maintaining their applications, and they give the whole Engineering org a standard way to interact with infrastructure. Every squad has autonomy over everything from creating a simple cloud resource or alarm to deploying a full micro-service. SRE is a multidisciplinary team of people with different skills and backgrounds: networking, security, system architecture and development, databases, and operations, to name a few. With more than 30 people, the team is split into squads whose scopes include security, observability, container orchestration, and CI/CD.

Stack

They have applications written in Java, Python, NodeJS, Golang, and Clojure. Older backend apps use Java. Frontends are all Progressive Web Apps using ReactJS.

Micro-services run 100% in Kubernetes, on clusters managed with kOps across different AWS accounts and regions. They also have a small EKS cluster. AWS is their primary provider for services like storage, messaging, and big data. A few workloads run on GCP; most of them are unrelated to the core business and are not closely managed by the SRE team.

The Prometheus stack (Alertmanager, Grafana) is the main monitoring system, running in a clustered architecture with Thanos. APM runs on Instana. Logs go to Elasticsearch and Kibana. Other services also use Elasticsearch as an indexing engine, including their main listing service.

They use MySQL, Postgres, MongoDB, and Cassandra as database engines. HashiCorp Vault is their secrets manager.

They built a few in-house custom tools, including:

Kubernetes injectors
Kubernetes operators
Credentials management systems
A multi-purpose CLI to interact with the platform
Plugins for many platforms

Engineering teams have autonomy to propose and adopt new technologies. The level of engagement of the SRE team depends on whether the platform currently supports the technology. They work with product teams to settle on what is best for the product rather than for an individual team's needs.

Delivery

Autonomy is a crucial value for them as a company, and this includes developers. The SRE team is always thinking about how to make developers more independent while still establishing a standardized way of doing things.

They deliver software with Drone CI, Helm charts, and ArgoCD. Developers define application parameters as Helm values files. A custom plugin generates Kubernetes manifests and applies them to Kubernetes in GitOps fashion: their automations commit the manifests to Git, and this triggers ArgoCD to synchronize them.
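To make the flow above concrete, here is a minimal sketch of the "parameters in, manifest out" step. It is not Quintoandar's plugin. The function names, the values keys (`name`, `image`, `tag`, `replicas`), and the directory layout are all assumptions for illustration, and the manifest is written as JSON only to keep the example free of third-party dependencies.

```python
import json
from pathlib import Path


def render_deployment(values: dict) -> dict:
    """Turn a small set of app parameters into a Kubernetes Deployment manifest.

    The keys read from `values` are hypothetical stand-ins for a Helm values file.
    """
    name = values["name"]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": values.get("replicas", 1),
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": name, "image": f'{values["image"]}:{values["tag"]}'}
                    ]
                },
            },
        },
    }


def write_manifest(repo_dir: Path, manifest: dict) -> Path:
    """Write the manifest into a checkout of the config repository.

    Committing and pushing the file is a separate step; in a GitOps setup
    that commit is what the sync tool reacts to.
    """
    target = repo_dir / manifest["metadata"]["name"] / "deployment.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the output stable, so Git diffs only show real changes.
    target.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    return target
```

The point of the split is that rendering is a pure function of the developer's parameters, while the only side effect is a file change in Git, which keeps the repository as the single record of what should be running.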

They have a metadata file in every repository. It provides data about the application that maps every service to its owners, ecosystems, etc. As a result, new services can go live without the SRE team having to be aware or involved.
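A rough sketch of how such per-repository metadata could be collected into a service-to-owner index follows. The file name `service-metadata.json`, the JSON format, and the field names are assumptions; the article does not describe the real schema.

```python
import json
from pathlib import Path

# Hypothetical required fields; the real metadata schema is not described.
REQUIRED_FIELDS = ("service", "owner", "ecosystem")


def load_metadata(path: Path) -> dict:
    """Read one repository's metadata file and check the required fields."""
    data = json.loads(path.read_text())
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"{path}: missing fields {missing}")
    return data


def build_ownership_index(repos_root: Path, filename: str = "service-metadata.json") -> dict:
    """Map each service name to its owner and ecosystem.

    Assumes every repository is checked out as a direct child of repos_root.
    """
    index = {}
    for path in sorted(repos_root.glob(f"*/{filename}")):
        meta = load_metadata(path)
        index[meta["service"]] = {
            "owner": meta["owner"],
            "ecosystem": meta["ecosystem"],
        }
    return index
```

Because the index is derived from files that live next to the code, adding a service only requires committing its metadata file; nobody has to register it by hand.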

This architecture is under revision right now. They want to reduce technology coupling and support different use cases. The company is expanding fast, and many new needs are expected to arise.

Operations

They have a rotating schedule where engineers stay on-call for 24 hours, including working hours. Anyone from the Engineering team can apply for it. On-call engineers handle alarms by applying documented solutions (runbooks) or by getting in touch with the team responsible for the service. They call the SRE team only if one of the SRE team's own systems has an alarm; otherwise, SRE doesn't need to take part in the resolution. However, SRE may engage in troubleshooting and solution discussion if the problem escalates and they realize they can help.
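The triage rules in this section can be written down as a small decision function. This is only an illustration of the described policy, not their tooling: the `Alarm` type, the lookup tables, and the alarm and team names in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    service: str
    name: str


def route_alarm(alarm: Alarm, owners: dict, runbooks: dict, sre_services: set) -> dict:
    """Decide what the on-call engineer does with an alarm.

    owners:       service name -> owning team (hypothetical lookup)
    runbooks:     (service, alarm name) -> runbook title (hypothetical lookup)
    sre_services: names of the systems the SRE team itself owns
    """
    runbook = runbooks.get((alarm.service, alarm.name))
    if runbook is not None:
        # A documented solution exists, so the on-call engineer applies it.
        action = f"apply runbook: {runbook}"
    else:
        # No runbook: hand over to the team responsible for the service.
        action = f"contact owning team: {owners.get(alarm.service, 'unknown')}"
    # SRE is called only when the alarm is on one of SRE's own systems.
    return {"action": action, "page_sre": alarm.service in sre_services}
```

For example, `route_alarm(Alarm("listings", "HighLatency"), owners, runbooks, {"vault"})` would never set `page_sre`, because `listings` is not in the set of SRE-owned systems. Escalation by judgment, which the section also mentions, is deliberately left out of the function.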

Recent success

The team has reduced SRE overhead by providing tools and evolving the platform, increasing developers' autonomy so they can focus on their tasks. This strategy has been successful for the last few years.

They are also growing the team without losing quality or control. They split the team into different squads, each with a dedicated scope to keep focus, but the squads collaborate a lot. This is not easy to achieve, but it is one of the critical factors behind the culture's success and good leadership.

Recent challenge

Scaling databases has been one of the primary sources of production issues in the past years. Databases tend to get a careful look only once load scales. They are working on it and getting one step ahead in detecting problems: queries that go to production without proper optimization now generate alerts. They have good instrumentation in place, and things are largely under control.

Advice

Collaboration is the critical success factor. There’s no way to ensure all decisions will survive the test of time and still prove to be the best ones. Things change all the time, especially in fast-growing companies like ours. The best we can do is work as a team, refactor or rebuild things and fix problems as we realize it is needed, and not let egos interfere in this process. If something can’t be maintained when a single person leaves the team, you have a severe problem. Documentation is fundamental. We have to think as an engineering team, not a supporting unit. Like any team, we have our systems, and we have to evolve them so that people may come and go without much turbulence.

They have open positions for the SRE and other teams. Check them out.

Thanks

If you're enjoying SRE Teams, I'd love it if you shared it with a friend or two. I try to make it one of the best emails you get in the week, and I hope you're enjoying it.

That’s it for this week! Hit me up if you have any thoughts, feedback, or insights to share. Otherwise, see you next week!
