The company
Nebius AI is an AI cloud platform with one of the largest GPU capacities in Europe. Launched in November 2023, the Nebius AI platform provides high-end, training-optimized infrastructure for AI practitioners. As an NVIDIA preferred cloud service provider, Nebius AI offers a variety of NVIDIA GPUs for training and inference, as well as a set of tools for efficient multi-node training.
Nebius AI owns a data center in Finland, built from the ground up by the company's R&D team and showcasing its commitment to sustainability. The data center is home to ISEG, the most powerful commercially available supercomputer in Europe and the 16th most powerful globally (TOP500 list, November 2023).
Nebius AI's headquarters are in Amsterdam, the Netherlands, with teams working out of R&D hubs across Europe and the Middle East.
Nebius AI is built by more than 500 highly skilled engineers with a proven track record in developing sophisticated cloud and ML solutions and designing cutting-edge hardware. This allows every layer of the Nebius AI cloud, from hardware to UI, to be built in-house, clearly differentiating Nebius AI from the majority of specialized clouds: Nebius customers get a true hyperscaler-cloud experience tailored for AI practitioners.
The role
We are seeking a Senior Technical Product Manager, ML/AI Lifecycle Services to join our team. In this role, you will oversee the planning and prioritization of services across the ML/AI lifecycle, including training, fine-tuning, experiment tracking, monitoring, and inference. You will deliver products for leading AI companies that use thousands of GPUs within a single cluster built on cutting-edge hardware. We also provide room for creativity, empowering you to take the initiative and build what you believe is best.
Responsibilities:
Requirements:
Distributed training across dozens of hosts or more, using Slurm, Ray, or MosaicML (see the Ray sketch after this list)
Organizing ML infrastructure according to MLOps best practices, with tools such as MLflow, W&B, MosaicML, Kubeflow, ClearML, Azure ML, SageMaker, or Vertex AI (see the MLflow sketch below)
Maintaining and optimizing a large inference cluster with KServe, vLLM, Triton, RunAI, or Seldon (see the vLLM sketch below)
Building a product on top of LLMs that leverages techniques such as RAG, fine-tuning, and function calling, with an understanding of continuous quality evaluation (see the RAG sketch below)
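To make the first area concrete, here is a minimal sketch of fanning training work out across workers with Ray's task API. The train_shard function, the dummy loss, and the worker count are illustrative assumptions for this sketch, not a description of Nebius infrastructure.

    import ray

    # Start Ray locally for this sketch; on a real multi-host cluster you
    # would join it with ray.init(address="auto") and request GPUs per task
    # via @ray.remote(num_gpus=1).
    ray.init()

    @ray.remote
    def train_shard(shard_id: int) -> float:
        # Stand-in for one worker's training step; returns a dummy loss.
        return 1.0 / (shard_id + 1)

    # Fan one task out per worker and gather the results.
    losses = ray.get([train_shard.remote(i) for i in range(8)])
    print(f"mean loss: {sum(losses) / len(losses):.3f}")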
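For the MLOps item, a minimal MLflow experiment-tracking sketch: parameters and metrics logged per run. The experiment name, parameter values, and loss curve are placeholders.

    import mlflow

    # Logs go to a local ./mlruns store by default; in production you would
    # point at a shared server with mlflow.set_tracking_uri(...) (any URI
    # shown here would be an assumption, so it is omitted).
    mlflow.set_experiment("llm-finetune-demo")

    with mlflow.start_run():
        mlflow.log_param("learning_rate", 3e-4)
        mlflow.log_param("num_hosts", 32)
        for step in range(5):
            # Dummy decreasing loss, standing in for real training metrics.
            mlflow.log_metric("train_loss", 2.0 - 0.3 * step, step=step)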
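For the inference item, a minimal vLLM offline-generation sketch. The model name is an assumption; any model vLLM supports can be substituted, and a GPU host is required.

    from vllm import LLM, SamplingParams

    # Model name is an assumption for this sketch.
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    params = SamplingParams(temperature=0.7, max_tokens=64)

    outputs = llm.generate(
        ["Summarize why multi-node training needs fast interconnects."],
        params,
    )
    print(outputs[0].outputs[0].text)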
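And for the LLM-product item, a toy retrieval-augmented-generation flow: embed documents, retrieve the best match for a query by cosine similarity, and assemble a grounded prompt. The embed() helper is a hypothetical stand-in for a real embedding model, and the final LLM call is omitted.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical embedding helper: a deterministic random unit vector.
        # A real system would call an embedding model here.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(8)
        return v / np.linalg.norm(v)

    docs = ["Slurm schedules batch jobs.", "vLLM serves LLMs efficiently."]
    doc_vecs = np.stack([embed(d) for d in docs])

    query = "How do we serve models?"
    scores = doc_vecs @ embed(query)         # cosine similarity (unit vectors)
    context = docs[int(np.argmax(scores))]   # retrieve the best-matching doc

    # A real system would send this prompt to an LLM; here we just print it.
    print(f"Context: {context}\nQuestion: {query}\nAnswer:")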
Ideal Candidate:
Does all that sound like your kind of challenge? Then join us!