Posted 17 June, 2026
Software Development Engineer, AWS Incident Tooling & Response
Amazon
IE, D, Dublin
Full Time
Reference: 71_457722_2604dc4a-c3a2-47e3-99e1-5c65f5d28983
AWS Infrastructure Services owns the design, planning, delivery, and operation of all AWS global infrastructure. In other words, we're the people who keep the cloud running. We support all AWS data centers and all of the servers, storage, networking, power, and cooling equipment that ensure our customers have continual access to the innovation they rely on. We work on the most challenging problems, with thousands of variables impacting the supply chain, and we're looking for talented people who want to help.
You'll join a diverse team of software, hardware, and network engineers, supply chain specialists, security experts, operations managers, and other vital roles. You'll collaborate with people across AWS to help us deliver the highest standards for safety and security while providing seemingly infinite capacity at the lowest possible cost for our customers. And you'll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.
The Incident Management Service (IMS) team is building the platform that AWS uses to coordinate response during high-severity incidents. When AWS services are degraded, incident responders use IMS to detect, triage, mitigate, and resolve issues, coordinating across dozens of service teams in real time. We are building the next generation of incident management tooling: a unified platform, deployed across three AWS regions with automated failover, that must remain available and performant precisely when AWS infrastructure is unhealthy. You will own significant portions of the service architecture: the data layer, authorization system, API model, and integrations with incident automation systems. You will design and deliver components across the stack, drive cross-team technical alignment, and mentor other engineers. You need to be a strong software developer with a track record of delivering production services, and you must also excel in communication and technical leadership. You'll use agentic AI development to move fast from concept to production. This is an opportunity to own architecture on a high-visibility platform that is used during AWS's most critical moments.
Key job responsibilities
Design and implement service components for a multi-region, multi-tenant incident management platform.
Own subsystems including the data layer, authorization, and API surface.
Build integrations with incident automation systems, conference bridge providers, and downstream event consumers.
Drive technical design decisions, balancing reliability, performance, and delivery speed.
Participate in operational support and ensure the service is resilient during the incidents it is designed to manage.
Mentor other engineers and lead technical design reviews.
Use agentic AI development practices to move quickly from concept to tested, production-ready code.
About the team
We are a high-performing team building the incident management platform for all of AWS. Our software is used during AWS's worst moments, so reliability is not optional. We operate what we build, and every engineer has direct visibility into how their code performs during real incidents. We value high delivery velocity, pragmatic architecture decisions, and engineers who take ownership beyond their assigned scope. We use agentic AI development practices and invest in tooling that lets engineers and agents validate changes locally before they reach the pipeline.