Site Reliability Engineer, AI Infrastructure (Remote)

Somewhere

Fully Remote
πŸ“ Remote
πŸ‡ΏπŸ‡¦ SA Friendly: 1.0/1.0

🇿🇦 Hirezar Summary for South African Applicants

This fully remote, full-time position at Somewhere is open to applicants from South Africa. The estimated monthly salary is R92,500 – R203,500 ZAR. This role is suited for senior-level professionals. As a remote position, you can work from anywhere in South Africa, whether you're based in Johannesburg, Cape Town, Durban, or a smaller town.

Job Description

Location: Remote (LATAM, South Africa or PH)

Contract: Minimum 6-month contract with the potential for an indefinite extension based on performance.

Schedule: Full-Time, Monday-Friday, PST or PH timezone.

Reports to: Head of Infrastructure / SRE

About the Company

We operate state-of-the-art AI Factories across Europe and the US, running large-scale NVIDIA GPU clusters (H100, H200, B200, B300) on bare metal for frontier AI workloads. We design, build, and operate the full stack: datacenter power and cooling, InfiniBand fabrics, SLURM and Kubernetes orchestration, storage, and the control plane that turns raw iron into reliable compute for our customers.

Role Overview

We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for US hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.

Key Responsibilities

I. Cluster Operations & Hardening

Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.

Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.

Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.

Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.

II. Automation & Observability

Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.

Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node exporter, and custom exporters.

Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.

III. Collaboration & Incident Response

Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare.

Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.

On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.

Who You Are

Required Experience & Skills

Deep SRE/HPC Background: 5+ years in SRE, systems engineering, or HPC operations.

SLURM Expertise: Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology awareness).

NVIDIA Stack Mastery: Hands-on experience with NVIDIA datacenter GPUs, driver stacks, CUDA runtime, Fabric Manager, nvidia-smi, DCGM, and GPU Direct RDMA.

Networking: Operational experience with InfiniBand fabrics at 100G or higher (OpenSM/UFM, ibdiagnet, perfquery, and fabric troubleshooting).

Linux Administration: Expert-level Linux admin skills (Ubuntu/RHEL family), including kernel tuning, systemd, networking, and PXE provisioning.

Automation: Solid scripting skills in Python and Bash, plus working knowledge of Ansible or Terraform.

Nice to Have

Distributed Training: Experience with NCCL internals, PyTorch distributed, or Megatron-style training stacks.

Cluster Management: Familiarity with BCM (Base Command Manager), Run:ai, or similar managers.

Kubernetes: Experience running Kubernetes on bare metal with GPU, Network, and MPI Operators.

Storage: Exposure to high-performance storage like Lustre, WEKA, VAST, or BeeGFS.

Infrastructure Context: Prior work in an AI cloud, neocloud, HPC center, or hyperscaler environment.

Why Join Us?

Frontier Infrastructure: You will touch clusters that train world-class models, working with the most advanced hardware available.

Engineering Culture: We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.

Remote-First: Full remote flexibility with occasional travel for team summits and datacenter site visits.

Tips for South African Applicants

⏰

Timezone Advantage

South Africa (SAST, UTC+2) overlaps well with European business hours and has a few hours of overlap with the US East Coast. Mention your timezone flexibility in your application.

💰

Salary in Context

At R92,500/month, this role pays well above the average South African remote salary. The USD equivalent ($5,000/mo) benefits from the favourable exchange rate.

📋

Application Tips

Tailor your CV to international standards: use a clean format, highlight remote work experience, and include your English proficiency. Many SA applicants succeed by emphasising their strong work ethic and cultural adaptability.

🔌

Load Shedding Preparedness

If you're applying for a remote role, having a backup power solution (UPS, inverter, or generator) and mobile data as a backup internet connection shows employers you're prepared for South Africa's infrastructure challenges.

About Somewhere

Somewhere is a company in the Recruitment & Staffing industry that hires remote workers from South Africa. They currently have 685 open positions on Hirezar.