Posted 8h ago

Machine Learning Engineer – World Model

@ Institute of Foundation Models
Sunnyvale, California, United States
$150k-$450k/yr | Onsite | Full Time
Responsibilities: Design infrastructure, Build pipelines, Manage systems
Requirements Summary: 3+ years in ML infrastructure/MLOps or related backend/platform engineering; strong cloud skills; experience with scalable ML systems.
Technical Tools Mentioned: AWS, Docker, Python, Git, Kubernetes, Ray, Kafka, Spark, Gradio, OpenWebUI
Job Description

About the Institute of Foundation Models 
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy. 

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers. 

The Team 

We are the AllWorld Team under the Institute of Foundation Models (IFM) at MBZUAI. At AllWorld, we are pioneering the development of PAN (Physical, Agentic, and Networked) world models, next-generation foundation models intended to unlock machine intelligence beyond language.

Our mission is to tackle the fundamental challenges of world modeling and establish a new paradigm for next-generation machine reasoning. We are looking for passionate individuals who share our vision and are eager to push the boundaries of AI together. 

 

Role Overview 

  • We’re looking for a Machine Learning Engineer focused on ML infrastructure and MLOps to design and operate the systems that power our research environment. You’ll build scalable, reliable, and observable cloud infrastructure, working closely with researchers to support data pipelines, experimentation, and evaluation workflows. 

  • This role balances fast-moving research needs with production-grade systems, ensuring that experimental work can scale reliably when needed. 

Key Responsibilities 

  • Design, build, and operate scalable ML infrastructure on AWS (e.g., compute, storage, networking, access control).  

  • Develop and maintain MLOps workflows for data versioning.

  • Build and manage distributed systems for large-scale data processing (filtering, captioning, etc.) and model evaluation.  

  • Own architecture decisions for ML infrastructure and drive best practices in reliability, scalability, and cost efficiency.  

  • Implement observability across systems, including monitoring, logging, and alerting.  

  • Integrate OpenWebUI, Gradio, or similar UIs for data quality assurance.

  • Build and maintain dashboards for experiment tracking and system health.  

  • Partner closely with researchers to translate experimental workflows into robust, scalable systems.  

Qualifications 

Must-Haves 

  • 3+ years of experience in MLOps, ML infrastructure, or related backend/platform engineering roles.  

  • Strong experience with cloud platforms (preferably AWS) and core services for compute, storage, and access control.  

  • Experience designing and operating distributed systems (e.g., Kubernetes, Ray, or similar frameworks).  

  • Solid software engineering skills, including system design, debugging, and testing (Python, Docker, Git).  

  • Familiarity with data processing and pipeline orchestration tools (e.g., Spark, Kafka, or similar).  

  • Experience with observability practices (monitoring, logging, alerting).  

  • Ability to work closely with researchers and translate ambiguous requirements into production-ready systems.  

Nice-to-Haves 

  • Experience in fast-paced or research-driven environments.  

  • Experience with large-scale video or multimodal data pipelines.  

  • Experience building automated model evaluation or benchmarking systems.  

  • Knowledge of cost optimization, security, and networking in multi-tenant environments.  

  • Familiarity with modern developer tools and AI-assisted coding tools (e.g., Codex, Cursor, Claude Code).



Visa Sponsorship
This position is eligible for visa sponsorship.
 
Benefits Include
  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401(k) plan
  • Generous paid time off, sick leave, and holidays
  • Paid parental leave
  • Employee Assistance Program
  • Life and disability insurance