Posted 2w ago

GPU Systems Engineer 4

@ Base-2 Solutions
Bethesda, Maryland, United States
Onsite, Full Time
Responsibilities: Design GPU clusters, collaborate with team, integrate GPUs with Linux
Requirements Summary: Active TS/SCI with ability to obtain a CI Polygraph; 10+ years in field; experience with NVIDIA GPU data center platforms; strong Linux expertise; DoD 8570.11 IAT Level II or III certification; U.S. citizenship required.
Technical Tools Mentioned: Bash, Python, Ansible, Puppet, Salt, Linux, RHEL, Ubuntu, Oracle Linux, Rocky Linux
Job Description

Position Summary

Support enterprise AI mission systems by designing, developing, and optimizing GPU clusters, with deep focus on operating systems, hardware, GPU platforms, and high-speed networking in a secure customer environment.

Essential Duties and Responsibilities

  • Design, configure, and maintain GPU clusters. 
  • Collaborate with a multidisciplinary team to define and optimize architectures for performance, power efficiency, and required features. 
  • Work closely with AI/ML engineers to integrate GPUs with Linux-based systems. 
  • Optimize GPU drivers for compatibility, reliability, and performance. 
  • Analyze GPU performance, identify bottlenecks, and develop strategies to improve efficiency across hardware and software layers. 
  • Build and maintain debugging tools, profiling utilities, and performance analysis software for Linux environments. 
  • Leverage Bash, Python, Ansible, Puppet, and Salt for tooling and automation. 
  • Maintain technical documentation, architectural specifications, and Linux best practices. 
  • Support ATO activities and ensure compliance with federal security standards.

Required Qualifications

  • Active TS/SCI with ability to obtain a CI Polygraph.
  • Bachelor's degree with a minimum of ten years of experience in the relevant field.
  • Experience managing NVIDIA GPU data center platforms, including DGX, HGX, H200, H100, and L4s. 
  • Knowledge of enterprise server components, including storage/network controllers, HBAs, and SSDs. 
  • Strong expertise with Linux distributions, including RHEL, Ubuntu, Oracle Linux, and Rocky Linux. 
  • Excellent problem-solving skills and the ability to collaborate within a team. 
  • Meet DoD 8570.11 IAT Level II certification requirements at a minimum; IAT Level III is also acceptable. 
  • U.S. citizenship is required due to the nature of the government contracts supported.

Preferred Qualifications

  • Experience with Kubernetes cluster management and AI/ML workflow orchestration, including Argo, Airflow, and Kubeflow. 
  • Familiarity with GPU virtualization and cloud computing. 
  • Experience with Prometheus and Grafana for monitoring. 
  • Knowledge of distributed resource scheduling systems such as Slurm, LSF, or similar tools.

Required Education and Experience Equivalency

Education                  Years of Experience
High School Diploma/GED    Not Applicable
Associate's Degree         Not Applicable
Bachelor's Degree          10
Master's Degree            10
PhD                        10

Required Certifications

  • DoD 8570.11 IAT Level II certification: Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP.

Required Security Clearance

  • Active TS/SCI with ability to obtain a CI Polygraph.