DevOps Engineer: Build Secure, Resilient Infrastructure for Adversarial AI Testing

Ezkl

Software Engineering, Other Engineering, Data Science
New York, NY, USA
Posted on Sep 3, 2025

We're building infrastructure to test and secure AI systems under extreme, adversarial, and failure-prone conditions. From custom CUDA kernels and trusted execution environments (TEEs) to zero-knowledge proof systems, we design for correctness and resilience even when things go catastrophically wrong.

We want to simulate what happens when systems face the worst: thermal stress on GPUs, cosmic-ray-like memory faults, and root-level adversaries targeting cryptographic protocols. If a failure mode could take down an AI system in production, we aim to surface it first under controlled, repeatable conditions.

We're looking for a DevOps Engineer to help us build and operate high-integrity, security-conscious testing infrastructure that directly informs production hardening. You'll work across hardware and software layers to design systems that remain observable, testable, and verifiable while under attack or extreme stress, and you'll carry findings from our test setups directly into the production deployments that secure statistical models and AI systems for our users.

What You'll Do

  • Architect and maintain automated adversarial testing environments where attackers have full host or root access, then translate findings into updates to our products.

  • Build and manage hardware-in-the-loop test setups, including environmental chambers and stress rigs for GPUs, to validate the production resilience of our fault-tolerant CUDA kernels.

  • Develop fault injection frameworks simulating everything from bit flips and power loss to protocol-level faults that could occur in production.

  • Implement CI/CD pipelines for our core products, ensuring their integrity with every new code change.

  • Operate the infrastructure that handles TEE attestations, zero-knowledge proof creation and delivery, and fault-tolerant AI inference for our users.

  • Design observability, monitoring, and alerting systems that work in both intentionally unstable test environments and hardened production systems.

What We're Looking For

Mindset & Approach

  • A security-first mentality: you think like an attacker to build better defenses.

  • Strong bias toward reproducibility, security, and traceability in complex environments.

  • Comfort working in ambiguous, high-failure environments where resilience matters.

Core Skills

  • Proficiency with CI/CD systems (GitLab CI, Jenkins, Buildkite) and Infrastructure as Code tools (Terraform, Ansible, Pulumi).

  • Experience with container orchestration (Docker, Kubernetes) and building reproducible environments

  • Strong Linux systems knowledge, especially around debugging, performance, and kernel behavior

  • Expertise with observability and monitoring tools like Prometheus, Grafana, ELK, or OpenTelemetry in both test and production contexts

Systems & Hardware

  • Experience with GPU-based compute (NVIDIA stack, CUDA, thermals, memory behavior)

  • Comfort working with bare-metal or lab hardware (rack-mounted systems, thermal chambers, environmental sensors)

  • Background in systems-level engineering or reliability, especially at hardware/software fault boundaries

Security & Adversarial Thinking

  • Experience with secure systems design and threat modeling at the infrastructure level

  • Ability to simulate adversarial scenarios

Fault Injection

  • Knowledge of chaos engineering practices, fault injection frameworks (Chaos Mesh, Gremlin), or fuzzing tools

  • Ability to design hostile test conditions that replicate real-world production failures (power fluctuations, silent data corruption)

Bonus Skills

  • Familiarity with AI/ML infrastructure (model serving, distributed training, inference under load)

  • Comfort working with ECC error simulation or other low-level hardware errors

  • Experience securing production AI/ML systems against adversarial attacks

  • Familiarity with zero-knowledge proof systems, cryptographic verification, or TEEs (SGX, SEV, TrustZone).

This role is perfect for someone who loves breaking things to make them unbreakable, and who understands that the best production security comes from testing systems to their absolute limits.