SRE Teams #4: TOTVS
The benefits and challenges of creating an internal Kubernetes provider, and how it enables leveraging public and private cloud infrastructure.
Hi! This week I have something very special to share with you at the end. I've been working hard these past few weeks to release it today.
If you’re new here, welcome! Make sure you check out the previous editions:
How culture is enabling a two-person team to help scale the product org of a disruptive fintech @ Hash
How tooling enabled a company to scale fast without breaking things @ Creditas
Now, let’s learn about one more interesting SRE team.
I got the chance to speak with Andre, Lead SRE at TOTVS. His team enables product teams to leverage public and private cloud infrastructure. We talked about the benefits and challenges of creating an internal Kubernetes provider.
TOTVS is the largest technology company in Brazil, with a portfolio of systems and platforms that can manage companies in 12 segments. They have 40 thousand companies using their software solutions, from SMBs to large enterprises, in all regions of Brazil.
Team
The platform team is organized into two groups: TCloud Kubernetes Service and TOTVS Apps. TCloud is a team of 3 that built a Kubernetes service provider from the ground up. TOTVS Apps is a group of 4 people who create automations and best practices. Product teams use the tools created by TOTVS Apps to run applications in the TCloud service.
A suite of cloud-native applications runs on the platform. TOTVS Apps uses RFCs to collaborate with product engineers. They help with application design and build automations to ship applications to TCloud.
Kubernetes Service
TCloud Kubernetes abstracts public and private cloud services, supporting Amazon, Google, and TOTVS's own edge data centers. The team also created a suite of add-ons that get installed in the clusters through a YAML configuration. Engineers store YAML cluster definitions in Git. Automations using Prow trigger the provisioning of clusters and add-ons from GitHub. Kubernetes add-ons include Elasticsearch, Prometheus, Grafana, HashiCorp Vault, controllers, and CRDs for cloud services. They use CRDs for databases, object storage, queues, and other infrastructure components.
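To make the idea of a declarative cluster definition concrete, here is a minimal sketch of how one could be checked before provisioning starts. The article doesn't describe TCloud's actual schema, so the field names (name, provider, addons), the provider keys, and the add-on identifiers below are all assumptions for illustration. The input is the plain dict a YAML parser would produce from a file in Git.

```python
from dataclasses import dataclass, field

# Hypothetical values: the article names the clouds and some add-ons,
# but not the identifiers or schema TCloud really uses.
SUPPORTED_PROVIDERS = {"aws", "gcp", "private"}
KNOWN_ADDONS = {"elasticsearch", "prometheus", "grafana", "vault"}


@dataclass
class ClusterDefinition:
    name: str
    provider: str
    addons: list = field(default_factory=list)

    def validate(self) -> list:
        """Return a list of problems; an empty list means the definition is valid."""
        problems = []
        if self.provider not in SUPPORTED_PROVIDERS:
            problems.append(f"unsupported provider: {self.provider}")
        for addon in self.addons:
            if addon not in KNOWN_ADDONS:
                problems.append(f"unknown add-on: {addon}")
        return problems


def from_parsed_yaml(doc: dict) -> ClusterDefinition:
    """Build a definition from the dict a YAML parser would produce."""
    return ClusterDefinition(
        name=doc["name"],
        provider=doc["provider"],
        addons=list(doc.get("addons", [])),
    )
```

A check like this could run in the automation that reacts to a Git change, so a typo in a definition gets rejected before any infrastructure is touched.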
Organization
TOTVS Apps manages product teams' access to the clusters. They use Rancher, which is deployed in the TCloud clusters. Product teams can create their own clusters or run apps in a shared, more affordable one. On a shared cluster, users get an isolated namespace. TCloud supports 20 clusters deployed across AWS, GCP, and the private cloud. More than 200 developers use the clusters to run applications for 30 products.
Stack
TCloud provisioning automations use Go, Bash, Packer, and Terraform. Prow events trigger commands in a container inside a dedicated cluster. This cluster has the provisioning apps and the required permissions. They built internal Terraform providers to support the provisioning of custom resources. The goal is to abstract the private cloud infrastructure using APIs. Teams should spin up a cluster using the same interface, no matter where it gets provisioned. Product teams use .NET Core, Java, Node.js, and Go to create applications.
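The "same interface, no matter where" goal is a classic provider abstraction. The sketch below shows the shape of that idea only. The class and function names are hypothetical, and the bodies return placeholder data where a real implementation would call Terraform or a cloud API. TCloud's actual tooling is built with Go and Terraform, not Python.

```python
from abc import ABC, abstractmethod


class ClusterProvider(ABC):
    """Common interface; each backend hides its own provisioning details."""

    @abstractmethod
    def create_cluster(self, name: str, nodes: int) -> dict:
        ...


class AWSProvider(ClusterProvider):
    def create_cluster(self, name: str, nodes: int) -> dict:
        # A real implementation would drive Terraform or the AWS API here.
        return {"name": name, "nodes": nodes, "backend": "aws"}


class PrivateCloudProvider(ClusterProvider):
    def create_cluster(self, name: str, nodes: int) -> dict:
        # A real implementation would call the internal data center APIs here.
        return {"name": name, "nodes": nodes, "backend": "private"}


# Hypothetical registry keyed by the backend named in a cluster definition.
PROVIDERS = {"aws": AWSProvider(), "private": PrivateCloudProvider()}


def provision(backend: str, name: str, nodes: int) -> dict:
    """Callers make the same call regardless of where the cluster lands."""
    return PROVIDERS[backend].create_cluster(name, nodes)
```

Calling provision("aws", "demo", 3) and provision("private", "demo", 3) looks identical from the product team's side; only the backend key changes.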
Controller Extension
To achieve transparent provisioning, they had to go beyond Terraform and provision custom private cloud infrastructure from within Kubernetes. The goal was to bring up components like load balancers using the Kubernetes API. For that, they had to extend the Kubernetes Cloud Controller Manager and Container Storage Interface. These are the components in charge of interacting with infrastructure APIs.
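A cloud controller typically works by reconciling: given a Service that needs a load balancer, it makes sure one exists with the right backends, and it does nothing if that is already true. The real Cloud Controller Manager is written in Go, where the upstream provider interface includes a method called EnsureLoadBalancer. The Python below is only a sketch of that idempotent "ensure" pattern, and FakePrivateCloudAPI is an invented in-memory stand-in, not TOTVS's data center API.

```python
class FakePrivateCloudAPI:
    """Hypothetical in-memory stand-in for a private cloud load balancer API."""

    def __init__(self):
        self.load_balancers = {}
        self.writes = 0  # counts mutating calls, to show idempotency

    def get(self, name):
        return self.load_balancers.get(name)

    def put(self, name, backends):
        self.writes += 1
        self.load_balancers[name] = {"name": name, "backends": sorted(backends)}
        return self.load_balancers[name]


def ensure_load_balancer(api, service_name, node_ips):
    """Create or update the load balancer only when it differs from the desired state."""
    desired = sorted(node_ips)
    existing = api.get(service_name)
    if existing is not None and existing["backends"] == desired:
        return existing  # already converged, no API write needed
    return api.put(service_name, desired)
```

Because the function compares desired and actual state first, the controller can call it on every sync loop without creating duplicates or generating needless writes.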
Delivery & Ops
The TOTVS Apps team creates the delivery and deployment pipelines. They use TCloud APIs with Azure DevOps. Product teams are on call to support their apps. TOTVS Apps handles questions from product teams about the cluster APIs and automations. The Apps team interacts with TCloud for issues with the Kubernetes Service APIs.
TCloud supports more than 50 Kubernetes add-ons, controllers, and extensions. Keeping track of all the components, their versions, new features, and patches is challenging. They are creating a process with automations to track the components’ versions and required patches.
Teams use the Prometheus and Grafana add-ons for monitoring and alerting, and Elasticsearch for logging. These are the primary interfaces developers use to operate apps. Alert definitions live in Git, and ArgoCD deploys them to Kubernetes. Prometheus Operator configures Alertmanager.
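The article doesn't say how that tracking process works, so here is just one way the core comparison could look: take the versions installed in a cluster, compare them with the minimum versions that carry required patches, and report what is behind. The function names and the simple dotted numeric version format are assumptions for this sketch.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.12.3' or 'v1.12.3' into (1, 12, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def outdated(installed: dict, minimum: dict) -> dict:
    """Return {component: (installed, required)} for components below the minimum."""
    report = {}
    for name, required in minimum.items():
        current = installed.get(name)
        if current is None:
            continue  # component is not installed in this cluster
        if parse_version(current) < parse_version(required):
            report[name] = (current, required)
    return report
```

Comparing tuples of integers avoids the classic string comparison trap where "1.9" sorts after "1.10". Versions with pre-release suffixes would need a proper version library instead.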
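With Prometheus Operator, alerting rules are usually expressed as PrometheusRule resources, which makes them easy to keep in Git and sync with a tool like ArgoCD. The article doesn't show TOTVS's alert files, so the sketch below only builds a generic PrometheusRule manifest as a Python dict. The helper name and all argument values are hypothetical.

```python
import json


def prometheus_rule(name, namespace, alert, expr, duration, severity):
    """Build a PrometheusRule manifest (Prometheus Operator CRD) as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": f"{name}.rules",
                    "rules": [
                        {
                            "alert": alert,
                            "expr": expr,      # PromQL expression that fires the alert
                            "for": duration,   # how long it must hold before firing
                            "labels": {"severity": severity},
                        }
                    ],
                }
            ]
        },
    }


def to_manifest(rule: dict) -> str:
    """Serialize to JSON, which is also valid YAML and can be committed to Git."""
    return json.dumps(rule, indent=2)
```

For example, prometheus_rule("checkout", "shop", "CheckoutDown", 'up{job="checkout"} == 0', "5m", "critical") describes an alert that fires when a made-up checkout job has been down for five minutes.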
Recent Success
The team recently released a zero-trust network model. A sealed network, using Cilium, is the standard for new clusters. Product teams create Network Policies to define the communication needs of their apps. This is a significant improvement in security and observability for the platform's networking.
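In a sealed network, nothing talks to anything until a policy allows it. A common way to express that with standard Kubernetes NetworkPolicy resources is a default-deny policy per namespace plus narrow allow rules. Whether TCloud implements its sealed network exactly this way isn't stated, so treat the sketch below as an assumption. The helper names, labels, and ports are hypothetical.

```python
def default_deny(namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace: str, app: str, from_app: str, port: int) -> dict:
    """Let pods labeled app=from_app reach pods labeled app=app on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }
```

A product team would then declare only what its app needs, for instance allow_ingress("shop", "api", "frontend", 8080), and everything else stays blocked. Note that with egress denied by default, the calling app also needs a matching egress rule.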
Recent Challenge
Creating the custom Cloud Controller Manager for Kubernetes was challenging. They worked closely with the teams managing the data centers and helped them build APIs. Geography was also a challenge. For a better customer experience, they have data centers spread across Brazil. Networks and other infrastructure resources are not uniform across regions, which made things even more challenging.
Advice
Treat everything as a product. TCloud is a literal product within the company. It motivates them to keep pushing towards quality. Other areas are not obligated to use the internal Kubernetes service, so TCloud must deliver a better outcome than cloud providers’ alternatives. Instead of focusing on product teams’ isolated feature requests, they have a clear roadmap. Alignment with business needs is constant, but the team has a clear long-term vision.
Special Announcement
I’ve worked for companies in regulated industries for my whole career. It’s hard to do DevOps when you have to follow some rules created before DevOps was even a thing. So I had to do a lot of automation to avoid removing ownership from developers.
But I was never happy with the existing tools. A lot is missing in tools like GitLab, Jenkins, Jira, and others to solve compliance-related problems. As a result, once they reach a certain scale, even modern companies resort to removing access from developers and centralizing it with a group of experts.
Meet RunOps
So, I created a company to solve this problem. I’m really excited to announce that we are launching today! 🚀
Check out runops.io. It’s a tool that makes compliance transparent to developers and platform teams’ lives easier. We have been working with a few teams while building it. Some are readers of the newsletter, so thank you guys for the support!
—
If you're enjoying SRE Teams, I'd love it if you shared it with a friend or two. You can send them here to sign up. I try to make it one of the best emails you get each week, and I hope you're enjoying it.
That’s it for this week! Hit me up if you have any thoughts, feedback, or insights to share. Otherwise, see you next week!