Meta (Facebook)

Sr. Technical Lead Manager - AI/HPC Systems Performance

🌎

Menlo Park, CA

4d ago

Job Description

Meta's AI Training and Inference Infrastructure is growing exponentially to support ever-increasing use cases of AI. This results in dramatic scaling challenges that our engineers have to deal with on a daily basis. We need to build and evolve our network infrastructure and the related software that connects a myriad of training accelerators, such as GPUs, together. In addition, we need to ensure that the system runs smoothly and meets the stringent performance and availability requirements of large-scale training and inference workloads. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric, host networking, communication libraries, and scheduling infrastructure.