Position Summary
Support enterprise AI mission systems by designing, developing, and optimizing GPU clusters, with deep focus on operating systems, hardware, GPU platforms, and high-speed networking in a secure customer environment.
Essential Duties and Responsibilities
- Design, configure, and maintain GPU clusters.
- Collaborate with a multidisciplinary team to define and optimize architectures for performance, power efficiency, and required features.
- Work closely with AI/ML engineers to integrate GPUs with Linux-based systems.
- Optimize GPU drivers for compatibility, reliability, and performance.
- Analyze GPU performance, identify bottlenecks, and develop strategies to improve efficiency across hardware and software layers.
- Build and maintain debugging tools, profiling utilities, and performance analysis software for Linux environments.
- Leverage Bash, Python, Ansible, Puppet, and Salt for tooling and automation.
- Maintain technical documentation, architectural specifications, and Linux best practices.
- Support ATO activities and ensure compliance with federal security standards.
Required Qualifications
- Active TS/SCI with ability to obtain a CI Polygraph.
- Bachelor's degree with a minimum of ten years of experience in the category field.
- Experience managing NVIDIA GPU data center platforms, including DGX, HGX, H200, H100, and L4s.
- Knowledge of enterprise server components, including storage/network controllers, HBAs, and SSDs.
- Strong expertise with Linux distributions, including RHEL, Ubuntu, Oracle, and Rocky.
- Excellent problem-solving skills and the ability to collaborate within a team.
- Meet DoD 8570.11 IAT Level II certification requirements at a minimum; IAT Level III is also acceptable.
- U.S. citizenship is required due to the nature of the government contracts supported.
Preferred Qualifications
- Experience with Kubernetes cluster management and AI/ML workflow orchestration, including Argo, Airflow, and Kubeflow.
- Familiarity with GPU virtualization and cloud computing.
- Experience with Prometheus and Grafana for monitoring.
- Knowledge of distributed resource scheduling systems such as Slurm, LSF, or similar tools.
Required Education and Experience Equivalency
| Education | Years of Experience |
| High School Diploma/GED | Not Applicable |
| Associates Degree | Not Applicable |
| Bachelors’ Degree | 10 |
| Masters’ Degree | 10 |
| PhD | 10 |
Required Certifications
- DoD 8570.11 IAT Level II certification: Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP.
Required Security Clearance
- Active TS/SCI with ability to obtain a CI Polygraph.