SRE Teams #5: Empiricus Research

How culture enabled a technology org to scale without a dedicated platform team.

Dec 07, 2020

👋 Hello and welcome to this week’s edition of SRE Teams — a weekly email where I share how interesting companies are implementing Site Reliability Engineering and DevOps practices.

If you’re not a subscriber, you’re missing the bonus content I share every week: the top two pieces on SRE and DevOps I found that week. It goes out only in the newsletter email.

This week I’m sharing:

The shortest summary of Kubernetes’ move from Docker.
The best structure for having engineers on-call.

I got the chance to speak with Rodrigo Gianotto, one of the heads of technology at Empiricus Research. We talked about structuring a tech organization with no dedicated platform team, and about why a DevOps culture is critical for teams with a complex technology stack.

Empiricus started as an investment research company and now has 370 thousand active subscribers. They just announced a merger and an acquisition: they are merging with Vitreo, an investment platform, and acquiring Real Valor, an app for tracking investments. Vitreo has more than 1 billion dollars under management. Together they will become Universa Holding. Their goal is to transform investments in Brazil.

Team

Empiricus has 60 people working in technology, including 50 engineers distributed across 7 squads, each focused on a business domain. No central team takes care of infrastructure or platform. They get some help from a third party, but the company’s product engineers are in charge of setting up and using cloud services APIs.

Teams are free to try and adopt different technologies. Critical projects use stable stacks. Experienced engineers provide some guides to help with decisions, but squads have the final call. This model enabled two platforms to emerge. They work side by side, each solving its specific problems. One uses serverless functions on AWS Lambda. The other runs on Kubernetes.

Stack

The main CMS component of their system runs on Lambda, with more than 150 functions split into different repositories and organized by business domain. This system supports mobile apps, e-learning platforms, and research/publication tooling. It serves half a million daily page views and more than 50k daily unique visitors.

Transactional applications, including e-commerce, authentication, ERP, anti-fraud, and others, use Kubernetes. Their clusters run on EKS. They started using the Istio service mesh from its earliest versions to help with visibility and routing. The Kubernetes clusters run 30 applications, with more than 400 containers in production. The cluster ingress is Ambassador, and they are planning a migration to Istio's ingress.

Prometheus, Grafana, and Elasticsearch also run in the clusters. They are now moving to Datadog to centralize logging, metrics, and alerting. The goal is to simplify the stack. They want to use the HPA with custom metrics to scale applications in k8s.
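To make the HPA idea concrete, here is a minimal sketch of the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The metric (queue messages per pod) and all the numbers are assumptions for illustration, not Empiricus's configuration, and the real controller also handles details such as unready pods and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    wanted = math.ceil(current_replicas * ratio)

    # Clamp to the bounds declared on the autoscaler (hypothetical values).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Hypothetical custom metric: queue messages per pod, with a target of 100.
    print(desired_replicas(4, 150, 100))  # load above target: scale up
    print(desired_replicas(4, 105, 100))  # within tolerance: no change
    print(desired_replicas(4, 40, 100))   # load below target: scale down
```

A custom metric like this one would have to be exposed to the cluster through a metrics adapter, which is part of what a centralized monitoring stack can provide.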

Applications are event-driven, and they use AWS SNS + SQS for messaging. They are now studying Apache Kafka to speed up AI/ML workloads and improve robustness. Most applications use Java and Python, but they have a few apps using Kotlin, Go, and PHP.
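One practical detail of the SNS + SQS pattern: when a queue is subscribed to a topic without raw message delivery, each SQS message body is a JSON envelope whose "Message" field holds the published payload as a string. Here is a minimal sketch of a consumer-side helper that unwraps it. The event name and fields are hypothetical, and it assumes publishers send JSON payloads.

```python
import json


def unwrap_sns_envelope(sqs_body):
    """Return the publisher's payload from an SNS notification read off SQS."""
    envelope = json.loads(sqs_body)

    # SNS marks regular deliveries with Type == "Notification".
    if envelope.get("Type") != "Notification":
        raise ValueError("not an SNS notification envelope")

    # The payload is a string inside the envelope, so it is decoded again.
    return json.loads(envelope["Message"])


if __name__ == "__main__":
    # Hypothetical event, shaped like an SQS message body produced by SNS.
    body = json.dumps({
        "Type": "Notification",
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:orders",
        "Message": json.dumps({"event": "order_paid", "order_id": 42}),
    })
    print(unwrap_sns_envelope(body))
```

Keeping this step in one small function means each consumer only has to deal with its own domain events, whichever queue they arrive on.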

CI/CD & Ops

Automated pipelines run tests, deployments, rollouts, and other steps for every code change. Developers have the autonomy to change production environments without restrictions. Product Owners have a software engineering background and take part in the delivery lifecycle, sometimes even making the deployments themselves. The automation and the close collaboration between teams shorten delivery time.

A key motivation for using Istio was its routing features. The team wanted to deliver changes to production with a smaller impact on customers. It was hard to roll out a change to services with many consumers deep down in the request chain.

Istio solved this problem by enabling different routing decisions for each service. They are refactoring a payment application to a more modern stack. For each new REST endpoint, they can roll out and watch the results without impacting other parts of the system. They create routing rules for each REST route and HTTP method using Virtual Services. They use Helm charts to deploy k8s resources.
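As an illustration of this kind of per-route rule, here is a minimal sketch that builds an Istio VirtualService manifest matching one path prefix and one HTTP method, and splits that traffic by weight between a legacy service and its replacement. Every name in it (the services, the path, the 10% weight) is hypothetical, and it is not Empiricus's actual configuration. The script prints JSON, a manifest format that kubectl also accepts.

```python
import json


def build_virtual_service(name, host, path_prefix, http_method,
                          legacy_host, new_host, new_weight):
    """Build a VirtualService that shifts one route + method to a new backend."""
    if not 0 <= new_weight <= 100:
        raise ValueError("new_weight must be between 0 and 100")

    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Only requests with this path prefix AND method match.
                    "match": [{
                        "uri": {"prefix": path_prefix},
                        "method": {"exact": http_method},
                    }],
                    # Weights must add up to 100.
                    "route": [
                        {"destination": {"host": legacy_host},
                         "weight": 100 - new_weight},
                        {"destination": {"host": new_host},
                         "weight": new_weight},
                    ],
                },
                # Everything else keeps going to the legacy service.
                {"route": [{"destination": {"host": legacy_host}}]},
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_virtual_service(
        name="payments-charges",       # hypothetical names throughout
        host="payments",
        path_prefix="/v2/charges",
        http_method="POST",
        legacy_host="payments-legacy",
        new_host="payments-v2",
        new_weight=10,
    )
    print(json.dumps(manifest, indent=2))
```

Raising new_weight in small steps while watching the results gives a gradual rollout for a single endpoint, with every other route untouched.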

Developers are in charge of production applications. They use PagerDuty for incident notifications and APM with Elasticsearch to troubleshoot problems. Experienced engineers in each squad have access to production and can run emergency changes.

Recent success

Using the Istio service mesh to make safe releases of services deep in the request chain. Before Istio, it was hard to reliably change services far from the frontend. Now they use Virtual Services to create fine-grained rules. Each new or changed REST route and method gets a gradual rollout. This makes releases faster and more reliable.

Recent challenge

The process for merging three companies. Each company has its own engineering culture, technology stack, and organization. The recent business change brought a significant challenge to the technology team. They are working on how to make this move in the best way possible. The goal is to take the best pieces of each org and create a better one.

Advice

One key area that you can't compromise is the automation of software delivery. You may not have the perfect tech stack, but you need to iterate fast. A fast and reliable pipeline is critical. Make sure engineers can iterate quickly in production. Iteration is the most vital metric that ensures systems will get better over time.

Thanks

Thank you to everyone who reached out after I announced RunOps last week. If you are having challenges or have a good solution for giving autonomy to developers, it’d help us a lot if we could chat about them. Please reach out.

—

If you're enjoying SRE Teams, I'd love it if you shared it with a friend or two. I try to make it one of the best emails you get each week, and I hope you're enjoying it.

That’s it for this week! Hit me up if you have any thoughts, feedback, or insights to share. Otherwise, see you next week!
