ML Engineer - Inference & Model Deployment
Job discovery is broken. Indeed and LinkedIn want to keep it that way. HiringCafe is building a 100x better job search engine: fast, comprehensive, honest, and actually useful. We index millions of jobs, remove noise, rank what matters, and help people find real opportunities without dark patterns, ads, or pay-to-win placement.
We are looking for a founding ML engineer who can help us turn powerful AI and ML models into fast, reliable production systems. You will own the bridge between model development and real user-facing infrastructure: deploying models, optimizing inference latency and throughput, scaling serving systems, and making sure our models run efficiently in production.
This is a hands-on engineering role for someone who loves the details of model performance, GPU utilization, inference architecture, and production reliability.
What You’ll Do
- Deploy and integrate researcher-trained model checkpoints into our cloud infrastructure and production pipelines.
- Profile and benchmark model performance to identify latency, throughput, memory, and compute bottlenecks.
- Implement optimization techniques such as quantization, pruning, batching, caching, efficient attention, and precision trade-offs while preserving model quality.
- Build scalable multi-GPU inference systems for search, ranking, recommendations, agents, and other AI-powered product experiences.
- Design reliable model-serving architecture that can support millions of users.
- Develop efficient training and fine-tuning workflows where needed, including distributed training, mixed precision, and parallelism strategies.
- Work closely with our search and engineering teams to make model deployment a smooth part of our development workflow.
You May Be a Strong Fit If You
- Have deployed and optimized deep learning models in production environments.
- Have experience with large-scale model serving, multi-GPU inference, or high-throughput inference systems.
- Understand inference optimization techniques such as quantization, pruning, compilation, batching, caching, and memory optimization.
- Have strong instincts for profiling, benchmarking, and debugging model performance.
- Are familiar with efficient attention mechanisms, transformer optimization, or modern LLM/embedding/ranking model infrastructure.
- Have worked with inference frameworks or serving stacks such as SGLang, vLLM, TensorRT, or an equivalent.
- Can write clean, production-quality code and integrate ML systems into backend infrastructure.
- Are comfortable with cloud platforms, distributed systems, storage systems, and modern ML training or serving workflows.
- Want ownership, leverage, and responsibility from day one.
Logistics
This role is based in Cupertino, where we work in person. We believe the best ideas come from being in the same room.
We offer generous health, dental, and vision coverage, paid parental leave, and relocation support.
Don’t meet every single qualification? That’s okay. We care more about your trajectory than checking every box. If the role excites you and the mission resonates, we’d love to hear from you.
Salary Band
- Base salary: $250k - $310k, plus equity
Tell us how you can help us #BeatIndeed and #BeatLinkedIn.