Senior Software Engineer - Core Team
About Userpilot
Userpilot is a leading product analytics and engagement platform. Hundreds of product teams use us to understand, segment, and activate their users in real time. Under the hood, that's a distributed Elixir/Phoenix backend sustaining hundreds of thousands of concurrent WebSocket connections, high-throughput Kafka event ingestion, ClickHouse analytics at scale, and always-on content delivery.
We move fast, we ship often, and we believe the best engineers care as much about how the whole system holds together as about the feature in front of them.
The Role
This is the most senior individual-contributor engineering role at Userpilot, and it is a different kind of role. Core Team engineers are the closest thing we have to software architects. They don't own a single feature area; they own how the system fits together, how it behaves under load, and how it recovers when something breaks.
They are a rare breed: equally at home in a Terraform module, an application lifecycle, a high-volume database query plan, and an architecture review. They set the technical direction the rest of engineering builds on, they are the first responders when production is on fire, and they design the guardrails that stop a class of problem from ever happening twice. Application squads move fast on features precisely because the Core Team keeps the ground underneath them solid.
And they do all of this in an AI-native way. Coding agents extend their reach across the stack, but the judgment about what is safe, what will scale, and what must never break stays with them.
Where You'll Have Impact
- Technical direction and system design. Decide how non-trivial work should be built before a squad writes the first line. Write the ADRs, choose the patterns, and make durability, extensibility, robustness, observability, and scalability properties of the system rather than afterthoughts bolted on later.
- Scale and reliability. Keep a distributed, real-time system healthy as traffic grows: event pipelines from Kafka into ClickHouse, real-time delivery over hundreds of thousands of connections, caching, backpressure, and the failure modes that only appear at scale or during a deploy.
- Firefighting and incident response. Be the first call when production breaks. Diagnose under pressure, restore service, find the real root cause, and then turn that incident into a guardrail so the squads don't keep hitting it.
- Infrastructure and foundations. Own infrastructure provisioning end to end: AWS (EKS, EC2, S3, RDS) and the Terraform and Kubernetes that tie it together. This is one of the things you do, not the whole job.
- Enabling the squads. Raise the architectural bar across teams you don't manage. Review for architectural consistency, drive adoption of patterns that actually stick, and keep application engineers focused on shipping product.
- Agentic engineering infrastructure. Make the system safe for a team that ships with AI agents: CI/CD quality gates every PR must pass regardless of author, AGENTS.md and runbooks that teach agents the topology and operational constraints, and Infrastructure as Code clean enough that an agent's change proposal is safe to reason about.
What You'll Do
- Lead system design for cross-cutting and high-risk work, and write and shepherd ADRs the org actually follows.
- Partner with application squads to turn product requirements into designs that hold up under load and over time, then get out of their way.
- Own production reliability: monitoring, alerting, and on-call practices that surface real problems without drowning the team in noise (Grafana, Prometheus, CloudWatch).
- Be first-in on incidents: run the diagnosis, coordinate the fix, write the postmortem, and ship the change that prevents a recurrence.
- Design, provision, and operate infrastructure on AWS with Terraform and Kubernetes, with high availability and cost both in mind.
- Build and improve CI/CD pipelines and validation gates that make every change trustworthy, whether a human or an agent wrote it.
- Write the technical context (ADRs, runbooks, AGENTS.md) that makes the system understandable to new engineers and safe for AI tools.
- Keep an eye on infrastructure cost and find the optimizations that actually matter.
- Provide technical direction and mentorship across the engineering org.
What We're Looking For
Required
- Senior experience designing and operating distributed systems in production, with a track record of being the person who owns how the whole system fits together.
- Strong software-engineering and CS fundamentals (data structures, algorithms, system design). You can go deep in application and backend code, not just infrastructure.
- Architectural judgment: you reason explicitly about durability, extensibility, robustness, observability, and scalability and the tradeoffs between them, and can write an ADR others can follow.
- Distributed-systems instincts: you can break down a complex system to find its failure modes, bottlenecks, and the one change that actually moves the needle.
- Calm, methodical incident response: you root-cause under pressure and instinctively turn an incident into prevention.
- Hands-on infrastructure: AWS (EKS, EC2, S3, RDS) and the networking that connects them, production Kubernetes and Docker (operating clusters, not just deploying to them), and solid Terraform / Infrastructure as Code.
- Observability in practice: Grafana, Prometheus, CloudWatch, and alerting that signals real problems.
- Strong communication and influence: this role touches every team, and you drive adoption of patterns across people who don't report to you.
- An AI-native workflow: you use AI coding agents (Claude Code, Cursor) as a real part of how you work, and you have a point of view on how to review and trust their output.
Bonus Points
- Elixir, Erlang, or BEAM systems (our backend runs on them) and OTP patterns: supervision trees, GenServers, distribution.
- Scaling highly available distributed systems in a fast-moving product environment.
- Kafka and RabbitMQ (we use both brokers), ClickHouse, Broadway, or similar high-throughput data tooling.
- Building and operating CI/CD that supports high-frequency deployments.
- Cloud cost optimization through caching, right-sizing, or more efficient data processing.
- Experience as a tech lead, staff engineer, or architect setting direction for an engineering org.
- A point of view on the trust model for automated and agent-generated change: automated PRs, agent-triggered deploys, and the gates that make them safe.
- Interest in AI-powered observability: anomaly detection, automated runbook execution, or self-healing infrastructure.
- Writing technical context documentation (runbooks, ADRs, AGENTS.md-style files) that makes systems understandable to the people and agents joining them.
Our Stack
- Cloud: AWS (EKS, EC2, S3, RDS, CloudFront)
- Orchestration: Kubernetes, Docker, Terraform
- Backend: Elixir / Phoenix, OTP
- Data: ClickHouse (analytics), MySQL (primary)
- Messaging: Kafka, RabbitMQ, Broadway
- Observability: Grafana, Prometheus, CloudWatch
- CI/CD: GitHub Actions
- AI: Claude Code / Cursor for agentic development; AGENTS.md, CLAUDE.md, and Infrastructure as Code as shared context for humans and agents alike
What "Agentic Engineering" Means Here
We are shifting toward spec-driven, AI-assisted development, and the Core Team is what makes that safe.
- Every PR, human or agent, passes the same quality gates. Our CI/CD has to be reliable, fast, and unambiguous in its feedback, regardless of who (or what) wrote the change.
- Agents need to understand where they're operating. We maintain AGENTS.md and operational context so an agent doesn't make a dangerous assumption about topology, service contracts, or operational constraints.
- Infrastructure as Code is the single source of truth, for humans and for agents proposing changes. The cleaner and more expressive it is, the safer agent-assisted work becomes.
- Agents do a lot of the typing; the Core Team owns the architecture, the judgment, and the boundaries that keep fast-moving, non-deterministic development from compounding into risk.
You don't need to have built agentic infrastructure before. But you should find the challenge genuinely interesting.
Right to Work
Candidates must have the right to work in Ireland. We are not in a position to offer visa sponsorship for this role.
Equal Opportunities Statement
Userpilot is an equal opportunity employer. We are committed to creating an inclusive environment for all employees and applicants. We do not discriminate on the basis of gender, civil status, family status, age, disability, race, religion, sexual orientation, or membership of the Traveller community, in accordance with the Employment Equality Acts 1998-2015.
Data Privacy Notice
When you apply for this role, Userpilot will process your personal data for the purposes of recruitment and candidate evaluation. We will retain your information for no longer than is necessary for this purpose.