Posted 3mo ago

Software Engineer, ML Infrastructure

@ Cursor
San Francisco or New York City
Onsite, Full Time
Responsibilities: Collaborate with researchers, build infrastructure, automate deployment
Requirements Summary: Strong background in systems and infrastructure-focused software engineering (Python, TypeScript, Rust, Golang); distributed storage and networking on Linux in cloud and bare metal; large-scale systems across thousands of nodes; infrastructure-as-code and Kubernetes.
Technical Tools Mentioned: Python, TypeScript, Rust, Golang, Linux, Kubernetes, Ray, Slurm, NVIDIA GPUs, InfiniBand, RoCE
Job Description

ML Infrastructure Engineer

The ML Infrastructure team builds large-scale compute, storage, and software infrastructure to support Cursor’s work building the world’s best agentic coding model. We’re looking for strong engineers who are interested in building high-performance infrastructure and the software to support it. This role works closely with ML researchers and engineers to enable their work through improvements to our training framework, systems reliability/performance, and developer experience.

What you might do

  • Collaborate with ML researchers to improve the throughput and reliability of training

  • Work with OEMs, cloud service providers, and others to plan and build cutting-edge GPU infrastructure

  • Improve the density and scalability of compute environments to enable increasingly large RL workloads

  • Create software and systems to automate building, monitoring, and running GPU clusters

  • Build workload scheduling and data movement systems to support Cursor’s growing training footprint

What we’re looking for

  • A strong background in systems and infrastructure-focused software engineering, particularly in Python, TypeScript, Rust, and Golang

  • Experience with distributed storage and networking infrastructure, particularly on Linux systems across cloud and bare metal environments

  • Exposure to large-scale systems and their unique challenges, ideally across thousands of nodes with significant resource footprints

  • Production use of infrastructure-as-code and configuration management, across hosts and Kubernetes

Nice to have

  • Operational exposure to NVIDIA GPUs with InfiniBand or RoCE, particularly with Blackwell and Hopper-class hardware

  • Exposure to Ray, Slurm, or other common compute and runtime schedulers