Job Details
SiteOps Global Incident Management Lead
Meta is seeking a strategic technical leader to define and implement a scalable incident and event management strategy across our global fleet of data centers. Our data centers are the foundation upon which our rapidly scaling infrastructure efficiently operates and upon which our innovative services are delivered. The Incident Management Lead will define and deliver a robust strategy that includes incident management, event management, and communications that drive efficient planning and align mitigation efforts. We seek a strong leader and subject matter expert who can continue to drive innovation in this space, spanning people, processes, infrastructure, tooling, automation, cost, and quality. The successful candidate is someone who can quickly understand and respond to the technical needs of subject matter experts, local site leadership, and our cross-functional teams in a rapidly evolving technical environment. Moreover, the candidate will gain alignment across these globally distributed teams and partner organizations, defining and executing strategies and driving initiatives that deliver the most impact by prioritizing resources and focus areas.
Required Skills
SiteOps Global Incident Management Lead Responsibilities:
- Define the incident and event management strategy and implement a consistent framework to minimize impact and ensure business continuity for the Site Operations organization
- Standardize incident response across the fleet of data centers for efficiency
- Develop crisis and incident response plans, providing oversight and identifying gaps in crisis response assumptions and plans with the first line of defense
- Identify training needs and knowledge gaps, and define the training strategy for incident response and management
- Integrate the incident management strategy with other processes and systems
- Work closely with other cross-functional partners, such as facilities, logistics, and software teams, to ensure SiteOps incident management plans are aligned with those teams
- Monitor the current incident management process and identify areas for improvement
- Build trusted relationships within the team, to understand the biggest challenges and opportunities, and to advocate effectively for the right incident management initiatives
- Drive a singular operations strategy, goals, and priorities for the global Incident Management function within Site Operations
- Develop scaling strategies and plans, staying forward-thinking by understanding infrastructure growth, identifying scaling issues before they occur, and contributing to solutions
- Ensure robust, timely communications across a globally distributed team, and give the team clear visibility into progress and strategy
- Rally the team around plans and decisions, and adjust based on feedback
- Deliver timely, accurate, and complete incident reporting for relevant stakeholders
- Document results and lessons learned from crisis response exercises and live events
- Lead a program team consisting of project managers and engineers to execute on roadmaps that work towards our long-term strategies
- Drive formal root cause analysis and follow-ups, and implement lessons learned
- Provide Incident Manager On-Call support for high-severity disasters impacting our global fleet of data centers
- Develop metrics and leverage data to understand the gaps and opportunities in Event Management
- Drive a training program with our global team to ensure teams are prepared for any event in our data centers
- Align and collaborate with partner teams across Meta working on Incident Management
- Collaborate with partners to understand the risks that come with growth and new technology, and proactively work with engineers to mitigate these risks
- 30% travel required
Minimum Qualification
Minimum Qualifications:
- Proven experience as an Engineering or Operations Director, or in a Technical, Operations, or Engineering Lead role
- BS, BA, or BEng in a technical field, or commensurate experience
- Working knowledge of IT/Operations Infrastructure
- Experience influencing effectively, working with cross-functional teams to advance the needs of the team, and adapting as needs change
- Understanding of server and service setup, and experience working with them to produce monitoring solutions based on best practices
- Proven experience as an Engineering or Operations leader, or in a relevant Senior Technical, Operations, or Engineering Lead role
- Prioritization skills and proven experience leading tooling, systems, automation, and process
- Experience analyzing a high volume of technical data and working in a fast-paced environment
- Problem solving, analytical, and time management skills
- Experience leading teams or initiatives focused on data acquisition, security, or eradication of storage media
Preferred Qualification
Preferred Qualifications:
- BS in Incident Management, Operations Management, Cybersecurity, or a related field
- Linux/Unix Administration experience