Software Engineer - Resilience
We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale—trillions of data points per day—allowing for seamless collaboration and problem-solving among Dev, Ops and Security teams globally for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.
The Core Resilience team is part of our SRE organization and is responsible for partnering with groups across Datadog to improve our technical and organizational resiliency. We steward the post-mortem and incident response processes across the company, constantly iterating and seeking improvements through the lessons we learn from production. We run training sessions for on-call and incident management and occasionally embed in product groups to ensure we remain aligned and can offer practical solutions to reliability problems. We value:
- Blamelessness in our processes. Our primary goal in incident reviews is to learn from and adapt our mental models of how our systems run in production. As Nabokov said, complacency is a state of mind that exists only in retrospective.
- A people-centered approach: ensuring that automation and systems support engineers doing work, not vice versa.
- An understanding that systems are inherently complex and failure is inevitable. What we can control is how resilient our systems and organization are when responding to these inevitable events.
- The idea that safety and risk are emergent properties in a socio-technical system and that they arise from a complex interaction of factors that constitute normal work. Resilience is a dynamic process of steering rather than a static quality.
What You’ll Do
- Help run the post-mortem process for the company and partner with teams on writing post-mortems, while identifying and implementing opportunities to reduce friction and maximize learning value to the organization.
- Define how we respond to incidents as a company and write software to streamline that process, partnering with our product teams where necessary. Our goal is to support our incident responders as much as possible to deal with complexity.
- Train our on-callers in our incident and post-mortem processes. This involves both introducing newcomers to on-call responsibilities and refreshing the knowledge of existing engineers.
- Perform cross-functional engagements with different teams across the organization, embedding in their group for a few weeks either to learn how work is performed or to solve a specific reliability problem.
- Facilitate incident reviews in a way that emphasizes learning and blamelessness.
- Write reliability bulletins, blog posts, and other forms of documentation that identify systemic risks to the company, provide actionable remediations, and promote best reliability practices.
Who You Are:
Somebody who has experience with, or is interested in, the following:
- Writing software that solves real user problems, as well as reviewing others’ code in an empathetic and collaborative way. We mainly use Go and Python.
- Analyzing incidents, identifying broader risk patterns, and sharing your findings in an engaging way that other people can understand and learn from.
- Responding to incidents as an incident commander or responder (preferably high-impact ones), and iteratively improving incident response processes.
- Teaching and training other engineers on best practices.
- Working with Kubernetes and distributed systems, and understanding their potential failure scenarios.
Datadog values people from all walks of life. We understand not everyone will meet all the above qualifications on day one. That's okay. If you’re passionate about technology and want to grow your skills, we encourage you to apply.
The reasonably estimated salary for this role at Datadog ranges from $130,000 to $300,000, plus a competitive equity package, and may include variable compensation. Actual compensation is based on factors such as the candidate's skills, qualifications, and experience. In addition, Datadog offers a wide range of best-in-class, comprehensive, and inclusive employee benefits for this role, including healthcare, dental, parental planning, and mental health benefits, a 401(k) plan and match, paid time off, fitness reimbursements, and a discounted employee stock purchase plan.
Datadog (NASDAQ: DDOG) is a global SaaS business, delivering a rare combination of growth and profitability. We are on a mission to break down silos and solve complexity in the cloud age by enabling digital transformation, cloud migration, and infrastructure monitoring of our customers’ entire technology stacks. Built by engineers, for engineers, Datadog is used by organizations of all sizes across a wide range of industries. Together, we champion professional development, diversity of thought, innovation, and work excellence to empower continuous growth. Join the pack and become part of a collaborative, pragmatic, and thoughtful people-first community where we solve tough problems, take smart risks, and celebrate one another. Learn more about #DatadogLife on Instagram, LinkedIn, and the Datadog Learning Center.
Equal Opportunity at Datadog:
Datadog is an Affirmative Action and Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.
Any information you submit to Datadog as part of your application will be processed in accordance with Datadog’s Applicant and Candidate Privacy Notice.