Clusterproof — Know what your GPU cluster actually delivered

01 — Calculator

What is downtime actually costing you?

Failure-rate defaults come from Meta's Llama 3 report — 419 unexpected interruptions in 54 days on 16,384 H100s (arXiv:2407.21783). Every number is editable. Runs entirely in your browser.

Configure cluster

GPUs in cluster

GPU type

$ / GPU-hour

Utilization %

Hardware failures per 1,000 GPU-days

Meta's run ≈ 0.47. On GB200 NVL72 the whole rack is the failure domain — consider higher.
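The 0.47 default can be reproduced from the figures quoted above. This is a back-of-the-envelope sketch that assumes all 16,384 GPUs were busy for the full 54 days and counts every unexpected interruption as one failure; the variable names are illustrative.

```python
# Figures quoted above from the Llama 3 report (arXiv:2407.21783)
interruptions = 419
gpus = 16_384
days = 54

# Assumption: every GPU was in use for the whole window
gpu_days = gpus * days
rate_per_1000_gpu_days = interruptions / gpu_days * 1000

# Cluster-wide view: average hours between unexpected interruptions
hours_between_interruptions = days * 24 / interruptions

print(round(rate_per_1000_gpu_days, 2))       # 0.47
print(round(hours_between_interruptions, 1))  # 3.1
```

At that scale the per-GPU rate looks small, but the cluster as a whole was interrupted roughly every three hours.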

MTTR per failure (min)

Checkpoint interval (min)

Restart overhead (min)

Provider

Your numbers

Estimated goodput

—

GPU-hours lost / yr

—

Billed-but-unusable / yr

—

Cluster stalls / week

—

Hours lost per stall

—
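The page does not publish the calculator's formula, so the following is only one plausible way to turn the inputs above into these outputs. It assumes a single synchronous job spans the whole cluster (any GPU failure stalls every GPU), that a failure lands on average halfway between checkpoints, and it leaves utilization and provider out for brevity. The `ClusterInputs` class, the `estimate` function, and the example price are hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24


@dataclass
class ClusterInputs:
    gpus: int
    price_per_gpu_hour: float
    failures_per_1000_gpu_days: float
    mttr_min: float
    checkpoint_interval_min: float
    restart_overhead_min: float


def estimate(c: ClusterInputs) -> dict:
    # Assumption: one failure anywhere stalls the entire cluster.
    stalls_per_day = c.gpus * c.failures_per_1000_gpu_days / 1000
    # Assumption: on average half a checkpoint interval of work is redone.
    hours_per_stall = (
        c.mttr_min + c.restart_overhead_min + c.checkpoint_interval_min / 2
    ) / 60
    # Lost time cannot exceed the year itself.
    lost_cluster_hours = min(stalls_per_day * 365 * hours_per_stall, HOURS_PER_YEAR)
    gpu_hours_lost = lost_cluster_hours * c.gpus
    return {
        "goodput": 1 - lost_cluster_hours / HOURS_PER_YEAR,
        "gpu_hours_lost_per_year": gpu_hours_lost,
        "billed_but_unusable_per_year": gpu_hours_lost * c.price_per_gpu_hour,
        "stalls_per_week": stalls_per_day * 7,
        "hours_lost_per_stall": hours_per_stall,
    }


# Hypothetical cluster: 1,024 GPUs at $2.50/GPU-hour, Meta's failure rate
example = estimate(ClusterInputs(1024, 2.50, 0.47, 60, 30, 15))
for name, value in example.items():
    print(f"{name}: {value:,.3f}")
```

Under these assumptions the stall cost grows with the square of cluster size: more GPUs means more failures, and each failure idles more GPUs.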

02 — Contracts

The claim-window table nobody publishes

We fetched the public SLA and ToS texts of 9 GPU providers on 2026-07-03. Three findings stand out. Nothing is auto-credited. Several providers make your own logs the mandatory evidence, and if you miss the window you forfeit. And a degraded-but-reachable GPU generally isn't "downtime" at all.

Provider	Public uptime SLA	Claim window	Evidence required from you	Degraded GPU covered?	Credit cap	Source
CoreWeave	99.9% multi-region / 99% single (container-era, 2022)	10 days or forfeit	Server log files showing connectivity loss + exact date/time	No — connectivity only	50% of month	ToS
Nebius	99.5% per VM	14 days after month end, or waived	Timestamped notice — your notice starts their downtime clock	No — connectivity / boot disk	10–30% tiers	SLA
Crusoe	Exists; full text not publicly crawlable	30 days (reported)	Evidence of missed target + date/time, via "SLA Refund Request" ticket	Unclear	Credits only	Legal
AWS P4/P5	99.99% region / 99.5% instance	End of 2nd billing cycle	"Your request logs that document the errors and corroborate your claimed outage"	No — connectivity only	10/30/100%; Capacity Blocks effectively creditless	SLA
Google Cloud A3/A4	99.9% single instance (GA GPUs only)	60 days	"Log files showing Downtime Periods" — or forfeit	No — A4X/GB200 excluded entirely	10/25/100% tiers	SLA
Azure ND	99.9% (premium disk, single VM)	2 months from billing month end	Incident description, time & duration, your resolution attempts	No — connectivity only	10/25/100% tiers	SLA
Lambda	None in public ToS	—	— (only a 4-hour Sev-1 support response commitment)	—	—	ToS
RunPod	None public — Community Cloud expressly no uptime warranty	—	— (reserved-cluster SLA is sales-only, terms unpublished)	—	—	ToS
Together AI	None public — "no guarantees… unless otherwise agreed in an order form"	—	— (provider decides what counts as a "verified breach")	—	—	ToS

Read from public documents on 2026-07-03. Your negotiated MSA or order form may differ substantially — for large clusters it almost certainly does, and that private contract is where node/rack-level terms live. Not legal advice; read your own contract.
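To make the claim windows concrete, here is a small date calculation for two of the window shapes in the table: a fixed number of days, and a number of days after the end of the month. It assumes the fixed window starts counting on the incident date, which the table does not state; your contract defines the real starting point. Both function names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def deadline_from_incident(incident: date, window_days: int) -> date:
    # Assumption: the clock starts on the day of the incident.
    return incident + timedelta(days=window_days)


def deadline_after_month_end(incident: date, window_days: int) -> date:
    # Window counted from the last day of the month containing the incident.
    last_day = calendar.monthrange(incident.year, incident.month)[1]
    month_end = date(incident.year, incident.month, last_day)
    return month_end + timedelta(days=window_days)


incident = date(2026, 3, 5)  # illustrative outage date
print(deadline_from_incident(incident, 10))    # 2026-03-15
print(deadline_after_month_end(incident, 14))  # 2026-04-14
```

The same outage leaves ten days under the first rule and almost six weeks under the second, which is why the window is worth reading before the incident, not after.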

03 — Field guide

XID errors, for people who rent

XIDs are the NVIDIA driver's error codes, surfaced in dmesg and nvidia-smi. Operators have runbooks. Renters need two different answers: whose problem is it — and what evidence do you capture before the node gets recycled?

XID	What it means	Severity	Whose problem	Capture before it's gone
13	Graphics engine exception	TRANSIENT	Usually your code; recurring on one GPU → hardware	dmesg line + GPU UUID; note if it follows the GPU or the job
31	GPU memory page fault	TRANSIENT	Usually app bug; pattern on one node → escalate	dmesg + app stack trace + GPU UUID
48	Double-bit ECC error	NODE-KILLER	Provider. Memory hardware fault — demand a node swap	nvidia-smi -q ECC section, dmesg, UTC timestamp, serial
63 / 64	ECC row-remap recording (63 success / 64 failure)	WARN → KILLER	Provider. 64, or 63 repeating = dying HBM	Row-remap counters (pending/failed), trend over days
74	NVLink error	NODE-KILLER	Provider. Interconnect fault; kills collective ops	dmesg, nvidia-smi nvlink -e, NCCL logs at same timestamp
79	GPU fell off the bus	NODE-KILLER	Provider. The classic. PCIe/power/thermal; node needs intervention	Full dmesg, exact UTC time, wall-clock gap in job logs — billable-downtime evidence
92	High single-bit ECC rate	WARNING	Provider. Early HBM degradation signal	Volatile vs aggregate ECC counts, daily trend
94 / 95	Contained / uncontained ECC error	WARN / KILLER	Provider. 95 = node must drain	dmesg + which processes were killed + timestamp
119 / 120	GSP timeout / error	NODE-KILLER	Provider (firmware/driver stack on their image)	dmesg, driver + GSP firmware versions, timestamp
140	Unrecovered ECC (aggregate)	NODE-KILLER	Provider. RMA-class memory fault	Full nvidia-smi -q dump before the node is swapped
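A first pass over the table can be automated by scanning kernel log text for XID lines and tagging each with the severity listed above. This sketch assumes the usual `NVRM: Xid (PCI:...): <code>, ...` line shape; the sample lines are illustrative, not captured from real hardware, and the split severities for 63/64 and 94/95 are one reading of the table.

```python
import re

# Matches lines such as: NVRM: Xid (PCI:0000:3b:00): 79, ...
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]*)\): (?P<xid>\d+)")

# Severity labels taken from the table above
SEVERITY = {
    13: "TRANSIENT", 31: "TRANSIENT", 48: "NODE-KILLER",
    63: "WARN", 64: "NODE-KILLER", 74: "NODE-KILLER", 79: "NODE-KILLER",
    92: "WARNING", 94: "WARN", 95: "NODE-KILLER",
    119: "NODE-KILLER", 120: "NODE-KILLER", 140: "NODE-KILLER",
}


def parse_xids(log_text: str) -> list[dict]:
    """Return one record per XID line found in the kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue
        xid = int(match.group("xid"))
        events.append({
            "pci": match.group("pci"),
            "xid": xid,
            "severity": SEVERITY.get(xid, "UNKNOWN"),
            "raw": line.strip(),  # keep the original line as evidence
        })
    return events


# Illustrative sample lines only
sample = """\
[1234567.890] NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.
[1234570.001] usb 1-1: new high-speed USB device
[1234571.500] NVRM: Xid (PCI:0000:86:00): 31, pid=4242, name=python
"""
for event in parse_xids(sample):
    print(event["pci"], event["xid"], event["severity"])
```

Keeping the raw line alongside the parsed fields matters here, since the provider will want the original log text rather than your summary of it.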

silent killer 01

Thermal & power throttling

A GPU running at 70% clocks bills at 100%. Poll the throttle reasons reported by nvidia-smi -q -d PERFORMANCE and log actual clock residency against rated clocks. Sustained HW Slowdown on rented hardware is a conversation with your provider.
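One way to log clock residency is to poll nvidia-smi's CSV query mode and compare the current SM clock with the maximum. The sketch below assumes the query fields `uuid`, `clocks.sm`, `clocks.max.sm`, and `clocks_throttle_reasons.hw_slowdown` are accepted by your driver (check `nvidia-smi --help-query-gpu`; names have shifted across releases). The 0.9 threshold and the sample rows are arbitrary illustrations.

```python
import csv
import io
import subprocess

QUERY = "uuid,clocks.sm,clocks.max.sm,clocks_throttle_reasons.hw_slowdown"


def read_nvidia_smi() -> str:
    """Run nvidia-smi once and return its CSV output (needs an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_rows(csv_text: str, min_ratio: float = 0.9) -> list[dict]:
    """Parse CSV rows and flag GPUs below min_ratio or reporting HW slowdown."""
    findings = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue
        uuid, sm, sm_max, hw_slowdown = (field.strip() for field in row)
        try:
            ratio = float(sm) / float(sm_max)
        except (ValueError, ZeroDivisionError):
            continue  # e.g. "[N/A]" on unsupported fields
        findings.append({
            "uuid": uuid,
            "clock_ratio": ratio,
            "hw_slowdown": hw_slowdown == "Active",
            # Idle GPUs clock down on their own; only trust this under load.
            "flagged": ratio < min_ratio or hw_slowdown == "Active",
        })
    return findings


# Illustrative rows, not real output
sample_csv = "GPU-aaaa, 1386, 1980, Active\nGPU-bbbb, 1980, 1980, Not Active\n"
for finding in parse_rows(sample_csv):
    print(finding)
```

In practice you would call `parse_rows(read_nvidia_smi())` on a timer while a job is running and append each result, with a UTC timestamp, to a log you control.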

silent killer 02

NCCL timeouts

"Watchdog caught collective operation timeout" usually means one rank's GPU or link died, and the XID is on some other node than the one reporting the timeout. Correlate NCCL timestamps across nodes with each node's dmesg; the node with the XID at that moment is your evidence.
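The correlation step can be sketched as a simple time-window join between the timeout and the XID events collected from every node. The function name, the 30-minute lookback, and the 1-minute clock-skew allowance are all assumptions for illustration; tune them to your own collective timeout and clock sync.

```python
from datetime import datetime, timedelta, timezone


def suspect_nodes(
    timeout_at: datetime,
    xid_events: list[tuple[str, datetime, int]],
    lookback: timedelta = timedelta(minutes=30),
    skew: timedelta = timedelta(minutes=1),
) -> list[tuple[str, datetime, int]]:
    """Return (node, utc_time, xid) events that fall just before the timeout.

    The fault normally precedes the timeout, so the window looks backward
    and allows a little forward slack for clock skew between nodes.
    """
    hits = [
        event for event in xid_events
        if timeout_at - lookback <= event[1] <= timeout_at + skew
    ]
    return sorted(hits, key=lambda event: event[1])


# Illustrative data: timestamps must already be converted to UTC
utc = timezone.utc
timeout = datetime(2026, 7, 3, 12, 40, tzinfo=utc)
events = [
    ("node-07", datetime(2026, 7, 3, 12, 29, tzinfo=utc), 79),
    ("node-12", datetime(2026, 7, 3, 8, 2, tzinfo=utc), 31),
]
print(suspect_nodes(timeout, events))
```

Here only node-07 falls inside the window; the morning XID 31 on node-12 is unrelated to this timeout.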

the general rule

Capture at the moment of failure

dmesg, nvidia-smi -q, UTC timestamps, GPU UUIDs/serials, your job logs. Under most SLAs above, your logs are the required proof, the window is days not months, and a recycled node takes its dmesg with it.
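A capture step along these lines can be scripted so it runs the moment a job fails. This is a minimal sketch, not the clusterproof-agent: the `capture` function and the output layout are hypothetical, and `dmesg` may need elevated privileges on some systems.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

# Commands named in the text above; extend as needed
DEFAULT_COMMANDS = {
    "dmesg": ["dmesg"],
    "nvidia-smi-q": ["nvidia-smi", "-q"],
}


def capture(out_dir, commands=None) -> Path:
    """Run each command and save its output under a UTC-stamped directory."""
    commands = DEFAULT_COMMANDS if commands is None else commands
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(out_dir) / stamp
    target.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=60
            )
            body = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure itself; a missing tool is also evidence.
            body = f"capture failed: {exc!r}\n"
        (target / f"{name}.txt").write_text(body)
    return target
```

Calling `capture("evidence")` from a job's failure handler writes one file per command into a directory named after the UTC capture time. Copy that directory off the node promptly, since local disk goes away with the node.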

GPU clusters fail.
Your invoice never does.

clusterproof-agent
the evidence recorder

Renting ≥64 GPUs and disputing
with your provider?
