Location: Remote (LATAM, South Africa or PH)

Contract: Minimum 6-month contract with the potential for an indefinite extension based on performance.

Schedule: Full-Time, Monday-Friday, PST or PH timezone.

Reports to: Head of Infrastructure / SRE

About the Company

We operate state-of-the-art AI Factories across Europe and the US, running large-scale NVIDIA GPU clusters (H100, H200, B200, B300) on bare metal for frontier AI workloads. We design, build, and operate the full stack: datacenter power and cooling, InfiniBand fabrics, SLURM and Kubernetes orchestration, storage, and the control plane that turns raw iron into reliable compute for our customers.

Role Overview

We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for US hours, drive incident response on multi-thousand-GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.

Key Responsibilities

I. Cluster Operations & Hardening

Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.

Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.

Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.

Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.

II. Automation & Observability

Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.

Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node exporter, and custom exporters.

Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.

III. Collaboration & Incident Response

Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare.

Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.

On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.

Who You Are

Required Experience & Skills

Deep SRE/HPC Background: 5+ years in SRE, systems engineering, or HPC operations.

SLURM Expertise: Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology awareness).

NVIDIA Stack Mastery: Hands-on experience with NVIDIA datacenter GPUs, driver stacks, CUDA runtime, Fabric Manager, nvidia-smi, DCGM, and GPUDirect RDMA.

Networking: Operational experience with InfiniBand fabrics at 100G or higher (OpenSM/UFM, ibdiagnet, perfquery, and fabric troubleshooting).

Linux Administration: Expert-level Linux admin skills (Ubuntu/RHEL family), including kernel tuning, systemd, networking, and PXE provisioning.

Automation: Solid scripting skills in Python and Bash, plus working knowledge of Ansible or Terraform.

Nice to Have

Distributed Training: Experience with NCCL internals, PyTorch distributed, or Megatron-style training stacks.

Cluster Management: Familiarity with BCM (Base Command Manager), Run:ai, or similar managers.

Kubernetes: Experience running Kubernetes on bare metal with GPU, Network, and MPI Operators.

Storage: Exposure to high-performance storage like Lustre, WEKA, VAST, or BeeGFS.

Infrastructure Context: Prior work in an AI cloud, neocloud, HPC center, or hyperscaler environment.

Why Join Us?

Frontier Infrastructure: You will touch clusters that train world-class models, working with the most advanced hardware available.

Engineering Culture: We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.

Remote-First: Full remote flexibility with occasional travel for team summits and datacenter site visits.

Site Reliability Engineer, AI Infrastructure (Remote)
