HashiCorp

SRE Manager - Incident Excellence (Hybrid - Bangalore)

India - Bengaluru

Job Description

Hybrid

The Role
As an Engineering Manager for the Resilience Engineering team, you will lead a group focused on ensuring the reliability, scalability, and disaster recovery of HashiCorp’s cloud and enterprise products. Your team will play a critical role in strengthening fault tolerance, optimizing failover strategies, and automating recovery processes to enhance operational resilience across our platform.

With experience in managing engineering teams, incident response, and distributed systems, you will drive technical strategy, mentor engineers, and collaborate across teams to improve disaster recovery, testing, automation, and system reliability. Your leadership will be instrumental in advancing incident excellence, ensuring HashiCorp’s products meet the highest standards of availability, performance, and compliance across the Infrastructure Cloud.

What you’ll do (responsibilities)

We’re looking for an Engineering Manager to lead our Resilience Engineering team, driving strategic initiatives around incident excellence, disaster recovery, and system reliability across HashiCorp’s cloud and enterprise products. You’ll be responsible for refining our incident response strategy, ensuring rapid and effective resolution of operational disruptions, and strengthening overall platform resilience.

Your leadership will be instrumental in scaling HashiCorp’s infrastructure while embedding a culture of operational excellence, ensuring our customers can rely on a highly available and resilient platform.

In this role, you can expect to:

  • Define and implement a comprehensive incident response framework, ensuring coordination across development, operations, and security teams.
  • Analyze incident trends and root causes to drive continuous improvements in reliability, post-incident processes, and automation.
  • Develop and enhance tooling that minimizes manual intervention and accelerates recovery.
  • Establish best practices for disaster recovery and system reliability, proactively identifying failure points and implementing automated mitigations.
  • Conduct post-incident reviews and foster a culture of learning, driving accountability and systemic improvements across teams.
  • Lead cross-functional collaboration on operational readiness, ensuring HashiCorp’s products meet the highest standards of availability, fault tolerance, and compliance.
  • Mentor and guide engineers, fostering professional growth, best practices, and a proactive approach to resilience engineering.

What you’ll need (basic qualifications)

  • You have 8+ years of experience in site reliability engineering, systems administration, or software engineering, with a strong focus on incident response and operational reliability.
  • You have 1+ years of leadership experience managing and delivering large-scale incident response initiatives, driving operational excellence and system resilience.
  • You have a proven track record of managing and resolving incidents in cloud-based environments, with expertise in major public cloud platforms such as AWS, GCP, and Azure.
  • You possess deep knowledge of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.
  • You have hands-on experience with incident management tools and best practices, including post-mortem analysis, root cause investigation, and proactive risk mitigation strategies.
  • You are an effective communicator and collaborator, adept at working across engineering, operations, and leadership teams to drive alignment and ensure rapid incident resolution.
  • You have familiarity with HashiCorp’s product suite and infrastructure automation tools, leveraging them to enhance system reliability and efficiency.
  • You are passionate about fostering a culture of reliability and continuous improvement, mentoring teams, optimizing processes, and leveraging automation to enhance system performance and resilience.

What's nice to have (preferred qualifications)

  • You have experience using HashiCorp products (Terraform, Packer, Waypoint, Nomad, Vault, Boundary, Consul).
  • You have prior experience working in cloud platform engineering teams.

“HashiCorp is an IBM subsidiary that has been acquired by IBM and will be integrated into the IBM organization. HashiCorp will be the hiring entity. By proceeding with this application, you understand that HashiCorp will share your personal information with other IBM subsidiaries involved in your recruitment process, wherever these are located. More information on how IBM protects your personal information, including the safeguards in case of cross-border data transfer, is available here: link to IBM privacy statement.”