AI Research Computing Infrastructure Engineer
Company: Frederick National Laboratory for Cancer Research
Location: Frederick
Posted on: February 22, 2026
Job Description:
AI Research Computing Infrastructure Engineer
Job ID: req4426
Employee Type: exempt full-time
Division: Enterprise Information Technology
Facility: Frederick: Ft Detrick
Location: PO Box B, Frederick, MD 21702 USA

The Frederick National Laboratory is
operated by Leidos Biomedical Research, Inc. The lab addresses some
of the most urgent and intractable problems in the biomedical
sciences in cancer and AIDS, drug development and first-in-human
clinical trials, applications of nanotechnology in medicine, and
rapid response to emerging threats of infectious diseases.
Accountability, Compassion, Collaboration, Dedication, Integrity
and Versatility; it's the FNL way.

PROGRAM DESCRIPTION
The mission
of Enterprise Information Technology (EIT) is to develop an
enterprise-level, consolidated information technology
infrastructure that provides exceptional IT capabilities to the
Frederick National Laboratory for Cancer Research (NCI-Frederick/FNLCR)
in support of basic, translational, and clinical cancer and AIDS
research. The IT Operations Group (ITOG) is a part of Enterprise
Information Technology (EIT) within Leidos Biomedical Research,
Inc. ITOG is responsible for computational servers, storage
servers, virtual machine infrastructure, and the FNLCR network.
ITOG focuses on implementing enterprise IT best practices in the
areas of computational services, storage, backup, and archiving;
batch and application support; server consolidation and
virtualization; network infrastructure; unification of voice,
teleconferencing, and video communication technologies; and
improved infrastructure for collocation of dedicated servers.

KEY ROLES/RESPONSIBILITIES:
The Research Computing Infrastructure
Engineer will design, build, and operate next-generation
high-performance computing (HPC) environments that support
container-based workflows and GPU-accelerated research computing.
The position will play a key role in evaluating, implementing, and
maintaining scalable and secure computing architectures for
advanced data analysis, AI/ML model training, and simulation
workloads. The engineer will collaborate closely with researchers,
IT professionals, and external partners to translate scientific
requirements into reliable, high-performance computing solutions.

Design and implement next-generation high-performance computing (HPC) environments that leverage container-driven workflows for GPU-accelerated research.
Build and maintain container orchestration systems for batch and distributed workloads.
Integrate containerized job workflows with existing HPC schedulers and storage systems.
Develop and maintain job templates for batch GPU training and multi-node distributed computing.
Automate deployment, configuration, and scaling through infrastructure-as-code and CI/CD practices.
Monitor, benchmark, and optimize system performance, reliability, and resource utilization.
Collaborate with researchers to containerize and optimize legacy workflows for scalable execution.
Lead evaluation of emerging tools (e.g., Prefect, Ray, Airflow, Dagster) for workflow orchestration and distributed computing.
Contribute to the development of tools and bridges between orchestration frameworks and traditional HPC environments.

BASIC QUALIFICATIONS
To be considered for this
position, you must minimally meet the knowledge, skills, and abilities listed below:
Possession of a Bachelor’s degree from an accredited college/university according to the Council for Higher Education Accreditation (CHEA) or four (4) years of relevant experience in lieu of degree. Foreign degrees must be evaluated for U.S. equivalency.
In addition to the education requirement, a minimum of eight (8) years of related experience.
Strong Linux systems engineering and administration experience.
Hands-on experience with container orchestration tools such as Kubernetes, Nomad, or Run:AI.
Hands-on scripting/programming experience (Python, Bash, or Go) for automation, monitoring, and job orchestration.
Experience with infrastructure-as-code / automation tooling (Terraform, Ansible, Packer, or equivalent).
Familiarity with system performance analysis, monitoring, and tuning.
Comfortable with small-team environments and taking end-to-end ownership of compute infrastructure.
Ability to obtain and maintain a security clearance.

PREFERRED QUALIFICATIONS
Candidates with these desired skills will be given preferential consideration:
Experience with multi-node distributed ML frameworks (PyTorch DDP, Ray, Horovod, TensorFlow, etc.).
Familiarity with pipeline orchestration tools (Prefect, Airflow, Dagster, Kubeflow).
Understanding of resource management and scheduling concepts (queues, allocations, GPU device plugins, gang scheduling, multi-node coordination).
Understanding of storage integration with high-performance clusters (POSIX, object storage, VAST or similar).
Familiarity with cloud GPU environments (AWS, GCP, Azure) and hybrid workflows.
Familiarity with workflow orchestration/pipeline tools (Argo, Kubeflow, Ray, MLflow).
Good communication and documentation skills, including the ability to make complex infrastructure understandable to researchers and other engineers.

EXPECTED COMPETENCIES:
Expertise in Kubernetes, Nomad, or equivalent container orchestration systems for large-scale computing.
Deep knowledge of Linux systems administration, performance tuning, and automation.
Ability to translate research computing needs into scalable, reliable infrastructure designs.
Commitment to documentation, reproducibility, and open science principles.
Collaborative mindset and willingness to mentor peers in containerization and HPC best practices.

Commitment to Non-Discrimination
All qualified applicants will receive
consideration for employment without regard to sex, race,
ethnicity, color, age, national origin, citizenship, religion,
physical or mental disability, medical condition, genetic
information, pregnancy, family structure, marital status, ancestry,
domestic partner status, sexual orientation, gender identity or
expression, veteran or military status, or any other basis
prohibited by law. Leidos will also consider for employment
qualified applicants with criminal histories consistent with
relevant laws.

Pay and Benefits
Pay and benefits are fundamental to
any career decision. That's why we craft compensation packages that
reflect the importance of the work we do for our customers.
Employment benefits include competitive compensation, Health and
Wellness programs, Income Protection, Paid Leave and Retirement.
Pay range: 123,800.00 - 207,125.00 USD
The
posted pay range for this job is a general guideline and not a
guarantee of compensation or salary. Additional factors considered
in extending an offer include, but are not limited to,
responsibilities of the job, education, experience, knowledge,
skills, and abilities as well as internal equity, and alignment
with market data. The salary range posted is a full-time equivalent
salary and will vary depending on scheduled hours for part-time
positions.