evidence layer for GPU renters

GPU clusters fail.
Your invoice never does.

Free tools for teams renting GPU clusters: a goodput calculator, the SLA claim-window table nobody publishes, and an XID field guide written for renters — not operators.

9 contracts read · defaults from Meta's Llama-3 run · no signup, no tracking
clusterproof-agent · node gpu-07 · chain verified ✓
Contracts read: CoreWeave, Lambda, Nebius, Crusoe, RunPod, Together AI, AWS, Google Cloud, Azure. 0 auto-credit. All claim-required.
9
provider T&Cs fetched, read, quoted & linked
10 days
shortest credit claim window. Miss it and you forfeit.
3 of 9
providers with no public uptime SLA at all
0
providers that auto-apply credits. You must claim.
01 — Calculator

What is downtime actually costing you?

Failure-rate defaults come from Meta's Llama 3 report — 419 unexpected interruptions in 54 days on 16,384 H100s (arXiv:2407.21783). Every number is editable. Runs entirely in your browser.

Configure cluster

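Failure rate, in unexpected interruptions per 1,000 GPU-days: Meta's run ≈ 0.47 (419 / (54 × 16,384) × 1,000). On GB200 NVL72 the whole rack is the failure domain, so consider a higher rate.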

Your numbers

Estimated goodput
GPU-hours lost / yr
Billed-but-unusable / yr
Cluster stalls / week
Hours lost per stall
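
The arithmetic behind outputs like these can be sketched in a few lines. This is a minimal illustration, not the calculator's actual formula: it assumes a synchronous training job where any single interruption stalls every GPU in the cluster, and the function names, the 512-GPU cluster, the 2-hour recovery time, and the $2.50 per GPU-hour price are all hypothetical inputs.

```python
HOURS_PER_YEAR = 365 * 24


def interruptions_per_1000_gpu_days(interruptions: int, days: float, gpus: int) -> float:
    """Normalise an observed interruption count to 1,000 GPU-days."""
    return interruptions / (days * gpus) * 1000.0


def estimate(gpus: int, rate_per_1000_gpu_days: float,
             hours_lost_per_stall: float, price_per_gpu_hour: float) -> dict:
    """Rough yearly loss estimate.

    Assumption (not from the source): every interruption stalls the whole
    cluster for `hours_lost_per_stall` wall-clock hours.
    """
    stalls_per_day = rate_per_1000_gpu_days / 1000.0 * gpus
    stalls_per_year = stalls_per_day * 365
    # A cluster cannot be stalled for more hours than the year contains.
    stalled_hours = min(stalls_per_year * hours_lost_per_stall, HOURS_PER_YEAR)
    gpu_hours_lost = stalled_hours * gpus
    return {
        "goodput": 1.0 - stalled_hours / HOURS_PER_YEAR,
        "gpu_hours_lost_per_year": gpu_hours_lost,
        "billed_but_unusable_per_year": gpu_hours_lost * price_per_gpu_hour,
        "stalls_per_week": stalls_per_day * 7,
        "hours_lost_per_stall": hours_lost_per_stall,
    }


# Default rate derived from the Llama 3 figures quoted above.
meta_rate = interruptions_per_1000_gpu_days(419, 54, 16_384)
print(f"Meta-derived rate: {meta_rate:.2f} per 1,000 GPU-days")

# Hypothetical renter: 512 GPUs, 2 h lost per stall, $2.50 per GPU-hour.
for name, value in estimate(512, meta_rate, 2.0, 2.50).items():
    print(f"{name}: {value:,.3f}")
```

Checkpoint frequency and restart time dominate `hours_lost_per_stall`, so that is the input most worth measuring on your own jobs.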
02 — Contracts

The claim-window table nobody publishes

We fetched the public SLA and ToS texts of 9 GPU providers (2026-07-03). Three findings: nothing is auto-credited. Several providers make your own logs the mandatory evidence — miss the window, you forfeit. And a degraded-but-reachable GPU generally isn't "downtime" at all.

| Provider | Public uptime SLA | Claim window | Evidence required from you | Degraded GPU covered? | Credit cap | Source |
| --- | --- | --- | --- | --- | --- | --- |
| CoreWeave | 99.9% multi-region / 99% single (container-era, 2022) | 10 days or forfeit | Server log files showing connectivity loss + exact date/time | No (connectivity only) | 50% of month | ToS |
| Nebius | 99.5% per VM | 14 days after month end, or waived | Timestamped notice; your notice starts their downtime clock | No (connectivity / boot disk) | 10–30% tiers | SLA |
| Crusoe | Exists; full text not publicly crawlable | 30 days (reported) | Evidence of missed target + date/time, via "SLA Refund Request" ticket | Unclear | Credits only | Legal |
| AWS P4/P5 | 99.99% region / 99.5% instance | End of 2nd billing cycle | "Your request logs that document the errors and corroborate your claimed outage" | No (connectivity only) | 10/30/100%; Capacity Blocks effectively creditless | SLA |
| Google Cloud A3/A4 | 99.9% single instance (GA GPUs only) | 60 days | "Log files showing Downtime Periods", or forfeit | No (A4X/GB200 excluded entirely) | 10/25/100% tiers | SLA |
| Azure ND | 99.9% (premium disk, single VM) | 2 months from billing month end | Incident description, time & duration, your resolution attempts | No (connectivity only) | 10/25/100% tiers | SLA |
| Lambda | None in public ToS | n/a | (only a 4-hour Sev-1 support response commitment) | | | ToS |
| RunPod | None public; Community Cloud expressly no uptime warranty | n/a | (reserved-cluster SLA is sales-only, terms unpublished) | | | ToS |
| Together AI | None public: "no guarantees… unless otherwise agreed in an order form" | n/a | (provider decides what counts as a "verified breach") | | | ToS |

Read from public documents on 2026-07-03. Your negotiated MSA or order form may differ substantially — for large clusters it almost certainly does, and that private contract is where node/rack-level terms live. Not legal advice; read your own contract.
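
Turning a claim window into a calendar deadline is simple date arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and for providers whose public text does not say when the clock starts it assumes the window runs from the incident date. Check your own contract for the real starting point.

```python
import calendar
from datetime import date, timedelta


def month_end(d: date) -> date:
    """Last calendar day of the month containing `d`."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def claim_deadline(incident: date, window_days: int, from_month_end: bool = False) -> date:
    """Last day to file, counting from the incident or from the end of its month."""
    start = month_end(incident) if from_month_end else incident
    return start + timedelta(days=window_days)


def days_left(deadline: date, today: date) -> int:
    """Negative once the window has closed."""
    return (deadline - today).days


# (window in days, counted from month end?) taken from the table above.
# ASSUMPTION: where the table gives only a day count, the clock starts at the incident.
RULES = {
    "CoreWeave": (10, False),
    "Nebius": (14, True),
    "Google Cloud": (60, False),
}

incident = date(2026, 7, 3)  # hypothetical incident date
for provider, (window, from_end) in RULES.items():
    print(provider, claim_deadline(incident, window, from_end).isoformat())
```

Billing-cycle windows (AWS, Azure) depend on your own billing dates, so they are left out of this sketch.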

03 — Field guide

XID errors, for people who rent

XIDs are the NVIDIA driver's error codes, surfaced in dmesg and nvidia-smi. Operators have runbooks. Renters need two different answers: whose problem is it — and what evidence do you capture before the node gets recycled?

| XID | What it means | Severity | Whose problem | Capture before it's gone |
| --- | --- | --- | --- | --- |
| 13 | Graphics engine exception | TRANSIENT | Usually your code; recurring on one GPU → hardware | dmesg line + GPU UUID; note if it follows the GPU or the job |
| 31 | GPU memory page fault | TRANSIENT | Usually app bug; pattern on one node → escalate | dmesg + app stack trace + GPU UUID |
| 48 | Double-bit ECC error | NODE-KILLER | Provider. Memory hardware fault; demand a node swap | nvidia-smi -q ECC section, dmesg, UTC timestamp, serial |
| 63 / 64 | ECC row-remap recording (63 success / 64 failure) | WARN → KILLER | Provider. 64, or 63 repeating = dying HBM | Row-remap counters (pending/failed), trend over days |
| 74 | NVLink error | NODE-KILLER | Provider. Interconnect fault; kills collective ops | dmesg, nvidia-smi nvlink -e, NCCL logs at same timestamp |
| 79 | GPU fell off the bus | NODE-KILLER | Provider. The classic. PCIe/power/thermal; node needs intervention | Full dmesg, exact UTC time, wall-clock gap in job logs (billable-downtime evidence) |
| 92 | High single-bit ECC rate | WARNING | Provider. Early HBM degradation signal | Volatile vs aggregate ECC counts, daily trend |
| 94 / 95 | Contained / uncontained ECC error | WARN / KILLER | Provider. 95 = node must drain | dmesg + which processes were killed + timestamp |
| 119 / 120 | GSP timeout / error | NODE-KILLER | Provider (firmware/driver stack on their image) | dmesg, driver + GSP firmware versions, timestamp |
| 140 | Unrecovered ECC (aggregate) | NODE-KILLER | Provider. RMA-class memory fault | Full nvidia-smi -q dump before the node is swapped |
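
A small parser can apply this table to kernel log lines automatically. The sketch below assumes the driver's usual `NVRM: Xid (PCI:...): <code>, ...` line shape; the sample line is made up for illustration, and the severity and owner labels are copied from the table, not from NVIDIA.

```python
import re

# Matches the PCI address and numeric code in an NVRM Xid kernel log line.
XID_LINE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+)")

# Severity and owner as given in the field guide table.
FIELD_GUIDE = {
    13: ("TRANSIENT", "usually your code"),
    31: ("TRANSIENT", "usually app bug"),
    48: ("NODE-KILLER", "provider"),
    63: ("WARN", "provider"),
    64: ("KILLER", "provider"),
    74: ("NODE-KILLER", "provider"),
    79: ("NODE-KILLER", "provider"),
    92: ("WARNING", "provider"),
    94: ("WARN", "provider"),
    95: ("KILLER", "provider"),
    119: ("NODE-KILLER", "provider"),
    120: ("NODE-KILLER", "provider"),
    140: ("NODE-KILLER", "provider"),
}


def classify(line: str):
    """Return a dict for an Xid log line, or None if the line is not an Xid."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    xid = int(match.group("xid"))
    severity, owner = FIELD_GUIDE.get(xid, ("UNKNOWN", "unknown"))
    return {"pci": match.group("pci"), "xid": xid, "severity": severity, "owner": owner}


# Illustrative line only; real messages carry more fields after the code.
sample = "[12345.678901] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
print(classify(sample))
```

Codes missing from the table come back as `UNKNOWN` so they get looked up by hand instead of being silently ignored.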
silent killer 01

Thermal & power throttling

A GPU running at 70% clocks bills at 100%. Poll nvidia-smi -q -d PERFORMANCE throttle reasons and log clock residency vs. rated. Sustained HW Slowdown on rented hardware is a conversation with your provider.
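
Clock residency can be logged with very little code. The sketch below assumes samples collected with `nvidia-smi --query-gpu=timestamp,uuid,clocks.sm,clocks.max.sm --format=csv,noheader,nounits` while the job is under load (an idle GPU also runs below its maximum clock). The 90% threshold and the function names are arbitrary choices for illustration.

```python
import csv
import io


def parse_samples(csv_text: str):
    """Parse 'timestamp, uuid, sm_clock_mhz, max_sm_clock_mhz' rows."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        stamp, uuid, sm, sm_max = (cell.strip() for cell in row)
        try:
            samples.append((stamp, uuid, float(sm), float(sm_max)))
        except ValueError:
            continue  # e.g. a field reported as not available
    return samples


def throttled_fraction(samples, threshold: float = 0.9):
    """Per-GPU share of samples where the SM clock sat below threshold * max."""
    counts = {}
    for _, uuid, sm, sm_max in samples:
        below, total = counts.get(uuid, (0, 0))
        counts[uuid] = (below + (sm < threshold * sm_max), total + 1)
    return {uuid: below / total for uuid, (below, total) in counts.items()}


# Made-up samples for two GPUs, five seconds apart.
log = """2026/07/03 12:00:00.000, GPU-aaa, 1980, 1980
2026/07/03 12:00:05.000, GPU-aaa, 1380, 1980
2026/07/03 12:00:00.000, GPU-bbb, 1980, 1980
"""
print(throttled_fraction(parse_samples(log)))
```

Keeping the raw timestamped samples alongside the summary matters more than the summary itself, since the raw log is what a provider can check.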

silent killer 02

NCCL timeouts

A "Watchdog caught collective operation timeout" message usually means one rank's GPU or link died, and the XID is on some other node than the one reporting the timeout. Correlate NCCL timestamps across nodes with each node's dmesg; the one with the XID at that moment is your evidence.
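
The correlation step can be sketched as a time-window join. This example assumes you have already parsed each node's dmesg into `(node, utc_time, xid)` tuples; the function name, the node names, the 30-minute lookback, and the 30-second clock-skew allowance are all hypothetical and should be set at least as wide as your configured collective timeout.

```python
from datetime import datetime, timedelta, timezone


def suspects(timeout_utc, xid_events,
             lookback=timedelta(minutes=30), skew=timedelta(seconds=30)):
    """XID events near an NCCL timeout, closest in time first.

    The watchdog fires some time after the underlying failure, so the
    search window extends mostly backwards from the timeout.
    """
    earliest = timeout_utc - lookback
    latest = timeout_utc + skew
    hits = [event for event in xid_events if earliest <= event[1] <= latest]
    return sorted(hits, key=lambda event: abs(timeout_utc - event[1]))


utc = timezone.utc
timeout = datetime(2026, 7, 3, 12, 30, tzinfo=utc)  # made-up timeout timestamp
events = [
    ("gpu-02", datetime(2026, 7, 3, 9, 0, tzinfo=utc), 13),
    ("gpu-07", datetime(2026, 7, 3, 12, 21, tzinfo=utc), 79),
    ("gpu-11", datetime(2026, 7, 3, 12, 5, tzinfo=utc), 74),
]
for node, when, xid in suspects(timeout, events):
    print(node, when.isoformat(), f"XID {xid}")
```

All timestamps must be in UTC and the nodes' clocks reasonably synchronised, otherwise the window has to be widened.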

the general rule

Capture at the moment of failure

dmesg, nvidia-smi -q, UTC timestamps, GPU UUIDs/serials, your job logs. Under most SLAs above, your logs are the required proof, the window is days not months, and a recycled node takes its dmesg with it.
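
A capture script for this rule can be very small. The sketch below is one possible shape, not the agent described further down: the directory layout and function name are hypothetical, `dmesg` may need elevated privileges on your image, and a failing command is recorded in its output file so one missing tool does not lose the rest of the snapshot.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

# Output file name -> command to run. Adjust to what your image provides.
COMMANDS = {
    "dmesg.txt": ["dmesg"],
    "nvidia-smi-q.txt": ["nvidia-smi", "-q"],
    "gpu-ids.csv": ["nvidia-smi", "--query-gpu=index,uuid,serial", "--format=csv"],
}


def capture(out_root: Path, commands=None) -> Path:
    """Run each command and save its output under a UTC-stamped directory."""
    commands = COMMANDS if commands is None else commands
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = out_root / f"incident-{stamp}"
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
            body = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Keep going: a partial snapshot beats no snapshot.
            body = f"capture failed: {exc!r}\n"
        (out_dir / filename).write_text(body)
    return out_dir


# Usage (writes to ./evidence/incident-<UTC stamp>/):
#   capture(Path("evidence"))
```

Copy the resulting directory off the node immediately, since local disk may not survive the node being recycled either.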

04 — Open source

clusterproof-agent
the evidence recorder

A single static binary that watches dmesg, NVML/DCGM, and your NCCL logs, and writes a hash-chained, tamper-evident record of what your rented cluster actually delivered — goodput vs. billed hours, incident by incident. Free, Apache-2.0, no phone-home.
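
The idea behind a hash-chained record can be shown in a few lines. This is a generic sketch of the technique, not the agent's actual format: the entry layout, the all-zero genesis value, the sample payloads, and the choice of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> dict:
    """Add an entry whose hash commits to everything recorded before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    chain.append(entry)
    return entry


def verify(chain: list) -> bool:
    """Recompute every link; any edited, dropped, or reordered entry fails."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"utc": "2026-07-03T12:21:00Z", "node": "gpu-07", "event": "xid", "code": 79})
append(log, {"utc": "2026-07-03T12:30:00Z", "node": "gpu-07", "event": "heartbeat_missed"})
print(verify(log))

log[0]["payload"]["code"] = 13  # tamper with the first record
print(verify(log))
```

A chain like this only shows that records were not altered after the fact if the latest hash is also stored somewhere the record keeper cannot rewrite, such as a periodic email to a third party.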

→ repo opening soon · first release: XID + ECC + throttle + heartbeat

Renting ≥64 GPUs and disputing with your provider?

I'm building the evidence layer for GPU renters — independent goodput records, claim-window tracking, and evidence packets your provider's own T&Cs ask for. If you're in (or heading into) a dispute or a renewal, I'd like to hear what happened. Written correspondence only — I'm an engineer, not a salesperson.