SiteOps Global Infrastructure Services Engineer
The Site Operations team is responsible for the delivery of data center compute and storage at Meta, enabling our family of apps and services to support a growing global community. We are seeking a forward-thinking individual skilled across multiple disciplines to lead global initiatives on this team. The Infrastructure Services Engineer will take on complex technical problems and deliver effective, impactful solutions while working and communicating with distributed teams and key stakeholders across multiple disciplines. This individual will partner with AI teams across Meta and influence complex AI infrastructure technical strategy across the globe, spanning disciplines such as Hardware, Software/Firmware, Networking and Power & Cooling. This role will also be responsible for evaluating the AI infrastructure strategy from an operational perspective and providing guidance and direction. The individual will be able to distill technical AI details into a clear, high-level message that is understood at all levels. Although this engineer will focus on AI infrastructure, they will also be expected to apply their skills across other infrastructure domains. The person should enjoy working in a complex, highly technical environment where innovative design, planning, execution and communication are key to success. The candidate must be able to work collaboratively with cross-functional teams to bring innovative infrastructure designs and initiatives from engineering concept to solution, implementing them in new and operational data centers across the globe.
SiteOps Global Infrastructure Services Engineer Responsibilities:
- Serve as a critical member of the global infrastructure engineering team supporting and driving the operations of the AI infrastructure/hardware platforms and associated new technologies across Site Operations.
- Drive complex AI/ML technical solutions globally across multiple disciplines such as Hardware, Software/Firmware, Networking and Power & Cooling (all aspects of cooling solutions).
- Work closely with other Engineering team members to share best practices and ensure appropriate feedback is given to cross-functional teams in support of AI deployment and operations.
- Work with the AI cluster management team to provide serviceability feedback on AI/ML production hardware, network, storage, and DC design impacts.
- Influence the higher-stack requirements and translate those requirements into impacts for AI zones (DC planning, buffer management, regional fluidity IaaS, workload requirements, capacity management and AI lifecycle).
- Represent Site Operations in leading work to define and architect new solutions on global initiatives by working with key partner teams across multiple disciplines.
- Assemble and lead cognitively diverse teams to address complex engineering challenges that require deep technical expertise as well as a broad understanding of Meta’s overall infrastructure.
- Act as a key Subject Matter Expert and mentor in the design, operation, and troubleshooting of tools, technologies, and processes utilized within the Site Operations environment.
- Understand and assess risks and challenges associated with emerging hardware, data center and software technologies, and define plans to address and mitigate them.
- Effectively bridge the logical and physical worlds, ensuring a holistic understanding of the full infrastructure stack.
- Act as a global communication and advisory point of contact for the design, implementation and delivery of projects that affect our global data center and server fleet, and facilitate resolution of issues by drawing on local expertise and global support partners.
- Address issues that are often ambiguous and global in nature, requiring leadership and collaboration across time zones, teams and technical domains.
- Leverage data-driven methodologies to understand a problem at the outset, define a plan and measure progress throughout a project. Provide data-supported narratives and ensure a strong focus on continuous improvement.
- Build and support strong cross-functional connections with teams across the globe and serve as an advocate for the Site Operations Team with key partners, influencing policies and procedures to improve global data center operations.
SiteOps Global Infrastructure Services Engineer Qualifications:
- Ability to travel 20% to 30% of the time is required.
- Experience building globally scalable solutions and translating global strategic initiatives into local executable projects.
- Experience building, operating and scaling with Linux or Unix operating systems.
- BS, BEng or BA in a technical field or commensurate experience.
- Understanding of the full stack of infrastructure, with experience building or operating logical infrastructure on top of a complex, distributed physical infrastructure.
- Knowledge of storage and AI/ML related services and general knowledge of the hardware that supports them. Experience with GPU/TPU-based platform hardware that operates in AI/ML computing clusters and workloads. Experience with AI algorithms and knowledge of systems that can exploit them. Understanding of the workload characteristics of training and inference engines.
- 10+ years of technical experience in a large-scale data center or IT infrastructure environment.
- Strong knowledge of storage and AI/ML related services and the hardware that supports them.
- Coding or scripting experience in languages such as Go, Bash, PHP, Python, or SQL.
- Strong communication skills and experience working in a highly distributed environment, across team and department boundaries.
- Data center design and expansion: experience with high-level data center design, operations, basic electrical/mechanical infrastructure, and scaling physical infrastructure.
- Knowledge of and experience with virtualization, containerization, distributed systems, fault tolerance, and incident management.
- Knowledge of the interdependencies of data center functions and technologies including electrical, cooling, structured cabling, security, network, server and storage systems.
- Experience in providing technical guidance to external vendors and partners.
- Experience communicating the results of analysis and insights to cross-functional teams and influencing the strategy of these teams.