SRE Teams
SRE Teams #12: PicPay

Developing the SRE culture inside product teams by fostering technical discussions and overseeing major architectural decisions.

Hey everyone 👋

As you can probably guess, Runops is keeping me busy. Before we get to today’s interview, I wanted to share an updated demo with a bunch of new features we released over the past few months. Feel free to get in touch if you want to learn more. Thanks!

Runops quick demo

Watch the 4-minute video


I got the chance to speak with Guilherme Oliveira. He is the Site Reliability Engineering Skill Lead at PicPay. We talked about developing the SRE culture inside product teams by fostering technical discussions and overseeing major architectural decisions.

Company

PicPay is a financial services platform that includes credit cards, digital wallets, p2p payments, e-commerce payments, p2p lending, messaging, and many more. They have more than 50 million users on the platform and are one of the biggest Brazilian super-apps. Their goal is to be present in all moments of the daily lives of their users.

Team

They split the SRE Skill into vertical teams. This model enables them to cover the highest possible number of product squads and technologies. Being agile at the scale of PicPay means new squads appear and disappear all the time. SRE needs a flexible enough structure to support this.

Their focus is on delivery speed. They do this by reducing dependencies between teams and increasing autonomy. The decision about new technologies happens in a weekly meeting with the engineering team. Each group shares what they are doing in new tech and how others can contribute or benefit. They also have dedicated Skills for databases, developer experience, and monitoring.

With the autonomy of product teams, they test many new technologies. Teams present to the rest of the org as designs mature. They believe that having each group with autonomy to bring new technologies to the stack increases their innovation power and creates a competitive advantage.

Stack

Java and PHP are their primary languages. But they also have applications using Swift, Kotlin, Ruby, Go, Python, JS, Lua, and others. Due to the exponential growth over the last two years, they had to make critical decisions on the infrastructure side. They had to create a platform to push autonomy to developers. All while ensuring scalability, resiliency, and control over applications.

They run on AWS, focusing on services that deliver performance with low management overhead. Applications run on Kubernetes on EKS, supported by some Lambda functions. Databases use RDS. Kafka runs on MSK, and they also use SQS/SNS for messaging. Caching uses ElastiCache. They also use EMR, CloudFront, and many other AWS services.

They have logs and APM with two of the largest players in the space. Filebeat and Logstash stream logs to Kafka. Kafka sends logs to the logging solution and stores them as backups using S3 Glacier. The platform also automates the configuration of alerts in the APM tools.
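The fan-out step in that log flow, where each record goes both to the logging solution and to a backup archive, can be sketched as follows. This is a minimal illustration, not PicPay's implementation: the `fan_out` helper and the two in-memory lists standing in for the logging solution and the S3 Glacier archive are hypothetical, and a real setup would consume from Kafka rather than from a Python list.

```python
from typing import Callable, Iterable, List

# A sink is anything that accepts one log record (hypothetical type alias).
Sink = Callable[[str], None]


def fan_out(records: Iterable[str], sinks: List[Sink]) -> int:
    """Deliver every record to every sink and return how many records were read."""
    count = 0
    for record in records:
        for sink in sinks:
            sink(record)
        count += 1
    return count


# In-memory stand-ins for the two destinations (assumptions for illustration).
logging_solution: List[str] = []
backup_archive: List[str] = []

processed = fan_out(
    ["GET /pay 200", "POST /transfer 500"],
    [logging_solution.append, backup_archive.append],
)
```

Both destinations end up with the same records, which is the property that makes the archive usable as a backup.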

Delivery

The platform supports a well-established development workflow. Autonomy is the focus in the delivery flow of creating and updating applications. SRE works in a consulting model, helping teams with performance, architecture, and scale.

Their delivery pipeline starts with code pushed to GitHub. Next, CodeBuild runs tests, builds Docker images, and ships them to ECR, with Harness updating the images in their EKS clusters. Automated workflows enable developers to set up new applications using the standard pipelines and project scaffold. As a result, a developer gets everything the platform offers and can start shipping a production-ready application from scratch in a few clicks.

Ops

Product teams receive alerts configured based on the business rules and impact. The volume of incidents is high, but this is inherent to the size and scale of the business. They have many services and complex financial infrastructure rules to follow. Product teams get paged first during business hours.

Product engineers are in charge of identifying and fixing problems related to their services, products, or workflows. Teams are also in charge of building automations to handle future incidents in their services. They write a postmortem to analyze the impact level, recurrence, and other aspects. The SRE team is in charge of pages outside working hours. They use OpsGenie to automate the paging based on monitoring alerts.
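The split between product teams during business hours and SRE outside them can be sketched as a simple routing rule. This is only an illustration under stated assumptions: the 09:00 to 18:00 weekday window, the function names, and the team labels are hypothetical, and the interview does not describe how the rule is configured in OpsGenie.

```python
from datetime import datetime, time

# Assumed window for illustration; the actual working hours are not stated.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def is_business_hours(ts: datetime) -> bool:
    """Return True on weekdays inside the assumed business-hours window."""
    is_weekday = ts.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= ts.time() < BUSINESS_END


def first_responder(ts: datetime) -> str:
    """Pick who gets paged first for an alert raised at ts."""
    return "product-team" if is_business_hours(ts) else "sre"
```

For example, `first_responder(datetime(2024, 1, 2, 10, 0))` falls on a Tuesday morning and routes to the product team, while the same call at 22:00 or on a Saturday routes to SRE.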

Recent Success

They went through a reorganization of the SRE team. SRE got split into verticals focused on specific areas. With more specialized groups, they increased coverage, and the reorg improved support for product teams. Different teams need different types of support in a given period, and the vertical approach adapts to this scenario. The specialized SRE approach was a huge success and is how they operate today. The SRE team got more organized, and product teams saw a massive increase in the quality of the platform and support.

Recent Challenge

A big challenge is understanding and gaining visibility into their cloud costs. They are discussing and evaluating this problem inside the SRE Skill. The challenge resulted from their exponential growth, which made it hard to track all resources. They didn't use the same standard for setting up infrastructure during the growth phase, and the resulting technical debt is hard to fix at the scale they have today. This is one of the team's main areas of focus. Infracost, a tool they use to help with the problem, is generating promising results.

Advice

Push autonomy to engineering teams and keep them close to the business. This structure enables PicPay to keep evolving its platform. New solutions come up from different groups all the time. Fostering the use of PoCs, RFCs, presentations, and similar practices ensures everyone is on board with designs and avoids duplicate work. They encourage teams to quickly present what they are working on. The more they discuss designs, the less time they spend building the wrong solutions. In the end, they become more resilient to market and technology shifts.

You can find Guilherme on LinkedIn here.


Thanks!

If you're enjoying SRE Teams, I'd love it if you shared it with a friend or two. I try to make it one of the best emails you get in the week, and I hope you're enjoying it.


That’s it for this week! Hit me up if you have any thoughts, feedback, or insights to share. Otherwise, see you later!

Appears in episode
Andrios