Role Overview
We are looking for an experienced Senior Site Reliability Engineer to join our Professional Services team and deliver Software and DevSecOps projects. You will report to a Site Reliability Engineering Manager.
SRE / DevOps is one of our core competencies. You will be part of a highly skilled team that continuously innovates and delivers high-value solutions to clients across various industries on all major public clouds (AWS, Azure, GCP, etc.). Technologies we work with daily include Kubernetes, Helm, Terraform, and GitOps, to name a few.
What You Will Be Doing
Enablement & RelOps Culture
- Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build dashboards that track specific error budgets.
- Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.
- Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable reliability metrics such as MTBF and MTTR.
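To give a flavour of the error-budget work described above, here is a minimal Python sketch of how a deployment gate might decide on a rollback. The function names, the request-based SLI, and the 99.9% target are illustrative assumptions, not a description of our actual tooling.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent in a window.

    1.0 means the budget is untouched; 0.0 or below means it is depleted.
    Assumes a simple request-based SLI (good requests / total requests).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic in the window, so no budget has been spent.
        return 1.0
    # The SLO allows this many failed requests in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_roll_back(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical gate: roll back once the error budget is used up."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) <= 0.0


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
    print(should_roll_back(0.999, 1_000_000, 500))
    print(should_roll_back(0.999, 1_000_000, 1_200))
```

In practice the request counts would come from the observability stack rather than hard-coded values, and the rollback itself would be triggered by the CI/CD pipeline.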
Frameworks & Automation
- Standardised Alerting & On-Call: Continuously improve company-wide alerting and on-call frameworks to reduce alert fatigue, ensuring alerts are highly actionable and symptom-based.
- Disaster Recovery: Drive the evolution of DR strategies from manual processes into fully automated runbooks-as-code, allowing teams to prove and improve service recoverability through automated, evidence-based testing.
- Eliminate Toil: Develop systems, automations, and tooling in Python (or a similar language) for pre- and post-deployment verification, so that our hands-off reliability vision becomes a production reality.
- Reliability-as-Code: Lead the drive to manage our entire reliability suite through IaC. Use Terraform to architect, deploy, and configure our observability stack, including ELK, Grafana, Loki, Prometheus, and distributed tracing.
Expected Output for the Role
- Automate Azure infrastructure provisioning and configuration using PowerShell, YAML, and Bicep.
- Monitor and troubleshoot issues in the Azure environment, including network, storage, and compute resources.
- Deploy and manage Azure Databricks infrastructure for data processing and analytics.
- Attend to support tickets that arise when product components do not function as expected.
- Develop and maintain technical support documentation for the product.
- Promote innovations to support business requirements through activities that test, pilot, and implement innovative concepts.
- Support and troubleshoot DevOps tools and processes for stakeholders.
About You
For us to achieve our ambitious vision together as a team, it is important for our Martians to lead at all levels, be self-starters who take initiative, and put their hands up for challenging tasks. A growth mindset is important to us and we encourage all our Martians to openly share knowledge, support and help each other, ask questions, get creative with new technologies, and learn from setbacks.
Becoming a Martian Means:
- Comfortably working with and learning from a fully remote, culturally diverse team based predominantly in South Africa, Kenya, Nigeria, and Ghana.
- Being an open, honest, and respectful communicator.
- Asking questions, identifying areas for improvement, and proposing solutions, no matter your job title or whether you have been with us for a day, a month, or years!
- Taking initiative and operating independently.
- Thriving in a fast-paced environment where change is constant.
- Enjoying work with a variety of clients from different industries, each with a different problem for you and your team to solve.
- Intentionally sharing tech and industry trends that excite you with your peers.
- Seeking continuous feedback and actively taking steps to grow personally and professionally.
What You Get By Joining Us
- Become a member of a team where we value each individual's contribution from day 1 and empower you to make suggestions, get involved, and do what you love most!
- Flexibility and the freedom to work remotely.
- Work-life balance where you are not expected to work over weekends or after hours.
- A forward-thinking remote company that knows how important it is to stay connected as one team, by providing virtual social platforms for employee engagement.
- A monthly work-from-home allowance that you can use to set yourself up to work comfortably, whether that means pens, notebooks, new headphones, or work snacks!
- A MacBook or Windows laptop for you to do your best work on.
- Become part of a team of exceptionally clever and talented people who like to share their knowledge and learnings.
- Support for your career growth, from a team that loves to celebrate your successes and advancement!