We are hiring a Senior DevOps Engineer to join our US team to lead the deployment and scaling of AI solutions for financial workflows. As our company delivers full-stack AI applications to some of the world’s leading financial institutions, your work will be critical in ensuring these solutions run flawlessly in the most demanding environments.
You will provide technical leadership and hands-on expertise to keep our infrastructure secure, reliable, and high-performing. This role requires optimizing deployments for massive scale across both cloud and on-prem environments, implementing rigorous security controls, and managing the unique challenges of AI/ML workloads.
You’ll collaborate closely with our engineering, product, and customer teams to streamline continuous delivery of software and AI applications. As we scale deployments across Google Cloud Platform, Microsoft Azure, Amazon Web Services, and on-premises environments, you’ll play a central role in maintaining, optimizing, and supporting the systems that make it all possible.
This is an opportunity to work at the forefront of AI adoption in financial services, solving complex challenges and shaping the infrastructure behind innovative enterprise solutions.
Key Responsibilities
Infrastructure Leadership: Architect and maintain scalable, secure software stacks and infrastructure on major cloud providers (AWS, GCP, Azure) and in on-premises environments.
AI Operations (AIOps): Build and support high-performance computing clusters for model training and inference.
Continuous Delivery: Design and implement robust CI/CD pipelines (Argo CD, GitHub Actions) to streamline the delivery of AI models and applications.
Reliability Engineering: Define SLOs/SLIs and implement comprehensive monitoring and alerting systems (Datadog, Prometheus, Grafana) to ensure high availability.
Security & Compliance: Enforce DevSecOps best practices, managing IAM policies, network security, and compliance automation for regulated financial environments.
Database Management: Oversee the deployment and maintenance of production databases, including Postgres, vector stores, and graph databases.
What You Have
6+ years of DevOps or SRE experience, with a strong background in supporting distributed systems at scale.
Expert-level knowledge of Kubernetes (EKS, GKE, AKS) and Docker. Deep proficiency with at least one major cloud provider (AWS, GCP, or Azure) and hybrid/on-prem deployments.
Advanced skills in Terraform or Ansible for reproducible infrastructure.
Experience building complex pipelines with tools like Jenkins, Argo CD, or GitLab CI. Strong scripting skills in Bash and Python.
Hands-on experience with modern monitoring stacks (Datadog, ELK, Prometheus/Grafana) and distributed tracing.
Solid understanding of network security, IAM, VPC peering, and encryption standards.
Excellent communication and collaboration abilities.
What Would Be Nice to Have
Experience deploying and scaling ML models (vLLM, Ray, Kubeflow) or managing GPU clusters.
Prior experience working in highly regulated industries (investment banks, fund managers, custodian banks, etc.).
Knowledge of managing stateful workloads on K8s or optimizing PostgreSQL/Vector DB/Graph DB performance.
Ability to interact with clients as needed.
Benefits
Compensation
Domyn offers a competitive compensation structure, including salary, performance-based bonuses, and additional components based on experience. All roles include comprehensive benefits as part of the total compensation package.
About Domyn
Domyn is a company specializing in the research and development of Responsible AI for regulated industries, including financial services, government, and heavy industry. It supports enterprises with proprietary, fully governable solutions based on a composable AI architecture — including LLMs, AI agents, and one of the world’s largest supercomputers.
Please review our Privacy Policy here: https://bit.ly/4tndszN.