SRE Teams #2: Dafiti
The wins and challenges of an SRE group with the mission of bridging teams across the company.
I got the chance to speak with Yuri, a member of the SRE team at Dafiti. We talked about how SRE is changing the company, and about the wins and challenges of an SRE group with the mission of bridging teams across the company.
Dafiti Group is the starting point of fashion in Latin America. They operate in Brazil, Argentina, Chile, and Colombia. Kanui and Tricae are also part of the group with operations in Brazil. More than 6,000 brands and 700,000 products are in their portfolios. The group is part of Global Fashion Group, the leading fashion and lifestyle retail destination in LatAm, CIS, Southeast Asia, and ANZ.
SRE
The SRE team has 15 people split across engineering productivity, observability, and platform apps. The productivity team builds automation to increase developers' speed. Observability focuses on reliability tooling. Platform takes care of Kubernetes and other base services. The product engineering org has around 60 people.
Stack
The tech stack has 50 applications. A few apps use PHP, and new services use Golang, Scala, and Java. These systems control everything from ordering, pricing, and invoicing to integrations with warehouses, distribution centers, and logistics.
Applications run on Kubernetes. They are migrating from a kops cluster to AWS EKS. In the process, they are also making a few changes to the cluster setup. The kops cluster uses the Nginx ingress controller and Envoy sidecars with Helm-managed configs. EKS will use Istio for ingress and for managing Envoy configs.
They didn't enable all Istio features from day 0. The team decided to study each component from all perspectives, weighing the benefits it brings against the cost of operating it reliably. Visibility is among the most wanted features. Some of Kiali's auto-documentation features will help with their complex architecture.
They use New Relic for monitoring and Graylog for logging. These are the key interfaces for product engineering teams to operate the apps. Product teams are in charge of running their apps in production. The company has a 24/7 team dealing with first-level support. Problems escalate to the on-call member of the product team, and to SRE if the issue is platform-related.
Continuous Integration & Delivery
CI/CD is completely automated. They use CircleCI for both, and the new cluster is getting a delivery upgrade. The pipeline deploying to the kops cluster uses Helm. ArgoCD takes care of deployment in the new cluster setup. A Git repo has all Kubernetes definitions, and CircleCI updates it after each build. The updated Git definition then triggers an ArgoCD sync with Kubernetes, which deploys the new container image.
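To make the "CircleCI updates the Git repo" step concrete, here is a minimal sketch in Python of what such an update could look like. It is an illustration only, not Dafiti's pipeline: the function name bump_image_tag, the registry URL, the app name, and the tags are all hypothetical, and a real setup may well use a templating or manifest tool instead of a regular expression.

```python
import re


def bump_image_tag(manifest: str, repository: str, new_tag: str) -> str:
    """Return the manifest text with every `image: <repository>:<tag>` line set to new_tag.

    Hypothetical helper: a CI job could run something like this after a build,
    then commit the result to the Git repo that the GitOps tool watches.
    """
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]+)?image:[ \t]*)" + re.escape(repository) + r":\S+[ \t]*$",
        flags=re.MULTILINE,
    )
    # Keep the original indentation and key, swap only the tag.
    return pattern.sub(lambda m: f"{m.group(1)}{repository}:{new_tag}", manifest)


# Illustrative manifest; the names and registry are made up for this example.
EXAMPLE_MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  template:
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.0
"""

updated_manifest = bump_image_tag(
    EXAMPLE_MANIFEST, "registry.example.com/checkout", "1.4.1"
)
```

Once a change like this is committed, the Git repo is the only thing that has changed. The sync with the cluster is left to ArgoCD, as described above.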
They use Terraform to create every infrastructure resource. A central Git repository keeps all definitions, and teams submit Pull Requests to create or change resources. Both the product teams and the SRE team commit to the repo.
Communication
With the pandemic, the engineering org started using Discord. They also use Slack and Zoom for company-wide communications. But Discord, with its rooms and easy sharing features, made it easier for the product and SRE teams to collaborate.
Another interesting practice is discussing solutions asynchronously on GitHub using RFCs, inspired by the Kubernetes SIGs model. Anybody with an issue or a proposal submits it to the repository. Teams contribute with comments, answers, and research. After settling on a topic, they write a document to serve as documentation. This is a great alternative to Slack thread discussions, because future team members can find and understand past decisions.
Recent Success
The migration to ArgoCD is making a big impact. In the kops cluster, the k8s definitions of the apps stay in Tiller, which generates some problems. Rolling back versions using Helm has known issues, and Tiller got deprecated. ArgoCD solves these problems and also brings definitions to Git. It gives developers better visibility into k8s resources without making them learn kubectl, and it avoids hard-to-maintain RBAC rules.
Recent Challenge
Deciding which Istio components to enable was challenging. Running these components requires deep skills, and the project has known issues due to its young age. Weighing these against the problems it solves for the company is a big challenge.
Advice
Connected to the challenge with Istio: be patient. Some teams hurry to start using new technology without considering all sides of the problem, only to end up with reliability problems in the product. Making the best decision requires patience to align ideas across all areas of the business. Dafiti has open positions; check them out if you liked the challenges.
Thanks!
The number of subscribers is growing, which means you are sharing SRE Teams. Thank you! I’m glad it’s been helpful to other teams. As always, feel free to reach out and send any feedback. The easiest way is to reply to this e-mail if it is in your inbox, or to leave a comment to share with everyone.
Also, I started a personal newsletter called Reducer. It's one email a week with everything interesting I’ve read or found, plus new articles and books. Check it out.
See you next week!