About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll work on the core of cutting-edge foundation model training alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Team

We are the AllWorld Team under the Institute of Foundation Models (IFM) at MBZUAI. At AllWorld, we are pioneering the development of the PAN (Physical, Agentic, and Networked) world models, next-generation foundation models designed to unlock machine intelligence beyond language.

Our mission is to tackle the fundamental challenges of world modeling and establish a new paradigm for next-generation machine reasoning. We are looking for passionate individuals who share our vision and are eager to push the boundaries of AI together.

Role Overview

We’re looking for a Machine Learning Engineer focused on ML infrastructure and MLOps to design and operate the systems that power our research environment. You’ll build scalable, reliable, and observable cloud infrastructure, working closely with researchers to support data pipelines, experimentation, and evaluation workflows.

This role balances fast-moving research needs with production-grade systems, ensuring that experimental work can scale reliably when needed.

Key Responsibilities

Design, build, and operate scalable ML infrastructure on AWS (e.g., compute, storage, networking, access control).

Develop and maintain MLOps workflows for data versioning.

Build and manage distributed systems for large-scale data processing (filtering, captioning, etc.) and model evaluation.

Own architecture decisions for ML infrastructure and drive best practices in reliability, scalability, and cost efficiency.

Implement observability across systems, including monitoring, logging, and alerting.

Integrate OpenWebUI, Gradio, or similar UIs for data quality assurance.

Build and maintain dashboards for experiment tracking and system health.

Partner closely with researchers to translate experimental workflows into robust, scalable systems.

Qualifications

Must-Haves

3+ years of experience in MLOps, ML infrastructure, or related backend/platform engineering roles.

Strong experience with cloud platforms (preferably AWS) and core services for compute, storage, and access control.

Experience designing and operating distributed systems (e.g., Kubernetes, Ray, or similar frameworks).

Solid software engineering skills, including system design, debugging, and testing (Python, Docker, Git).

Familiarity with data processing and pipeline orchestration tools (e.g., Spark, Kafka, or similar).

Experience with observability practices (monitoring, logging, alerting).

Ability to work closely with researchers and translate ambiguous requirements into production-ready systems.

Nice-to-Haves

Experience in fast-paced or research-driven environments.

Experience with large-scale video or multimodal data pipelines.

Experience building automated model evaluation or benchmarking systems.

Knowledge of cost optimization, security, and networking in multi-tenant environments.

Familiarity with modern developer tools and AI-assisted coding (e.g., Codex, Cursor, Claude Code).

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

Comprehensive medical, dental, and vision benefits

Bonus

401(k) plan

Generous paid time off, sick leave, and holidays

Paid parental leave

Employee Assistance Program

Life insurance and disability

Machine Learning Engineer – World Model