Free tools for teams renting GPU clusters: a goodput calculator, the SLA claim-window table nobody publishes, and an XID field guide written for renters — not operators.
Failure-rate defaults come from Meta's Llama 3 report — 419 unexpected interruptions in 54 days on 16,384 H100s (arXiv:2407.21783). Every number is editable. Runs entirely in your browser.
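To make the default concrete, here is a minimal sketch of the arithmetic behind a goodput estimate. It is not the calculator's actual code. The function names, the linear scaling with GPU count, and the "half a checkpoint interval lost per interruption" model are all assumptions made for illustration.

```python
# Defaults from Meta's Llama 3 report (arXiv:2407.21783), as quoted above.
INTERRUPTIONS = 419
DAYS = 54
GPUS = 16_384


def failures_per_gpu_hour(interruptions=INTERRUPTIONS, days=DAYS, gpus=GPUS):
    """Observed unexpected interruptions per GPU-hour."""
    return interruptions / (days * 24 * gpus)


def cluster_mtbf_hours(n_gpus, rate=None):
    """Mean hours between interruptions for a cluster of n_gpus.

    Assumption: failures are independent and scale linearly with GPU count.
    """
    rate = failures_per_gpu_hour() if rate is None else rate
    return 1.0 / (rate * n_gpus)


def estimated_goodput(n_gpus, checkpoint_interval_h, restart_h, rate=None):
    """Rough fraction of billed time that ends up as retained progress.

    Assumption: each interruption costs the restart time plus, on average,
    half a checkpoint interval of lost work. Checkpoint write cost is ignored.
    """
    mtbf = cluster_mtbf_hours(n_gpus, rate)
    lost_per_failure = restart_h + checkpoint_interval_h / 2
    return max(0.0, 1.0 - lost_per_failure / mtbf)


if __name__ == "__main__":
    for n in (1_024, 16_384):
        print(n, round(cluster_mtbf_hours(n), 1), round(estimated_goodput(n, 1.0, 0.25), 3))
```

With these defaults, the full 16,384-GPU cluster sees an interruption roughly every 3.1 hours, and a 1,024-GPU slice roughly every 49.5 hours under the linear-scaling assumption. Swap in your own rate if your provider or hardware generation differs.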
We fetched the public SLA and ToS texts of 9 GPU providers (2026-07-03). Three findings: nothing is auto-credited; several providers make your own logs the mandatory evidence, and if you miss the window you forfeit; and a degraded-but-reachable GPU generally isn't "downtime" at all.
| Provider | Public uptime SLA | Claim window | Evidence required from you | Degraded GPU covered? | Credit cap | Source |
|---|---|---|---|---|---|---|
| CoreWeave | 99.9% multi-region / 99% single (container-era, 2022) | 10 days or forfeit | Server log files showing connectivity loss + exact date/time | No — connectivity only | 50% of month | ToS |
| Nebius | 99.5% per VM | 14 days after month end, or waived | Timestamped notice — your notice starts their downtime clock | No — connectivity / boot disk | 10–30% tiers | SLA |
| Crusoe | Exists; full text not publicly crawlable | 30 days (reported) | Evidence of missed target + date/time, via "SLA Refund Request" ticket | Unclear | Credits only | Legal |
| AWS P4/P5 | 99.99% region / 99.5% instance | End of 2nd billing cycle | "Your request logs that document the errors and corroborate your claimed outage" | No — connectivity only | 10/30/100%; Capacity Blocks effectively creditless | SLA |
| Google Cloud A3/A4 | 99.9% single instance (GA GPUs only) | 60 days | "Log files showing Downtime Periods" — or forfeit | No — A4X/GB200 excluded entirely | 10/25/100% tiers | SLA |
| Azure ND | 99.9% (premium disk, single VM) | 2 months from billing month end | Incident description, time & duration, your resolution attempts | No — connectivity only | 10/25/100% tiers | SLA |
| Lambda | None in public ToS | — | — (only a 4-hour Sev-1 support response commitment) | — | — | ToS |
| RunPod | None public — Community Cloud expressly no uptime warranty | — | — (reserved-cluster SLA is sales-only, terms unpublished) | — | — | ToS |
| Together AI | None public — "no guarantees… unless otherwise agreed in an order form" | — | — (provider decides what counts as a "verified breach") | — | — | ToS |
Read from public documents on 2026-07-03. Your negotiated MSA or order form may differ substantially — for large clusters it almost certainly does, and that private contract is where node/rack-level terms live. Not legal advice; read your own contract.
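A claim window is only useful if you know the date it closes. The sketch below turns three rows of the table into deadlines. The `CLAIM_WINDOWS` mapping and the function names are hypothetical, and the choice of anchor date (incident date versus end of the calendar month) is my reading of the table, not of any contract. Treat the output as a reminder, not a legal deadline.

```python
import calendar
from datetime import date, timedelta

# (anchor, days) per provider, transcribed from the table above.
# Assumption: "incident" windows start counting on the incident date itself,
# and "month_end" windows start on the last day of the incident's calendar month.
CLAIM_WINDOWS = {
    "CoreWeave": ("incident", 10),
    "Nebius": ("month_end", 14),
    "Google Cloud A3/A4": ("incident", 60),
}


def month_end(d: date) -> date:
    """Last calendar day of the month containing d."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=last_day)


def claim_deadline(provider: str, incident: date) -> date:
    """Last day to file a claim under the assumptions stated above."""
    anchor, days = CLAIM_WINDOWS[provider]
    start = month_end(incident) if anchor == "month_end" else incident
    return start + timedelta(days=days)


if __name__ == "__main__":
    incident = date(2026, 7, 3)
    for provider in CLAIM_WINDOWS:
        print(provider, claim_deadline(provider, incident).isoformat())
```

AWS and Azure are left out on purpose: their windows depend on billing-cycle boundaries that vary by account, so a fixed day count would be a guess.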
XIDs are the NVIDIA driver's error codes, surfaced in the kernel log (dmesg) and through NVML/DCGM events. Operators have runbooks. Renters need two different answers: whose problem is it, and what evidence do you capture before the node gets recycled?
| XID | What it means | Severity | Whose problem | Capture before it's gone |
|---|---|---|---|---|
| 13 | Graphics engine exception | TRANSIENT | Usually your code; recurring on one GPU → hardware | dmesg line + GPU UUID; note if it follows the GPU or the job |
| 31 | GPU memory page fault | TRANSIENT | Usually app bug; pattern on one node → escalate | dmesg + app stack trace + GPU UUID |
| 48 | Double-bit ECC error | NODE-KILLER | Provider. Memory hardware fault — demand a node swap | nvidia-smi -q ECC section, dmesg, UTC timestamp, serial |
| 63 / 64 | ECC row-remap recording (63 success / 64 failure) | WARN → KILLER | Provider. 64, or 63 repeating = dying HBM | Row-remap counters (pending/failed), trend over days |
| 74 | NVLink error | NODE-KILLER | Provider. Interconnect fault; kills collective ops | dmesg, nvidia-smi nvlink -e, NCCL logs at same timestamp |
| 79 | GPU fell off the bus | NODE-KILLER | Provider. The classic. PCIe/power/thermal; node needs intervention | Full dmesg, exact UTC time, wall-clock gap in job logs — billable-downtime evidence |
| 92 | High single-bit ECC rate | WARNING | Provider. Early HBM degradation signal | Volatile vs aggregate ECC counts, daily trend |
| 94 / 95 | Contained / uncontained ECC error | WARN / KILLER | Provider. 95 = node must drain | dmesg + which processes were killed + timestamp |
| 119 / 120 | GSP timeout / error | NODE-KILLER | Provider (firmware/driver stack on their image) | dmesg, driver + GSP firmware versions, timestamp |
| 140 | Unrecovered ECC (aggregate) | NODE-KILLER | Provider. RMA-class memory fault | Full nvidia-smi -q dump before the node is swapped |
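The table can be applied mechanically to a saved dmesg dump. The following is a minimal sketch: the severity map is transcribed from the table above, while the regular expression assumes the common `NVRM: Xid (PCI:...): <code>, ...` kernel log format, which may differ across driver versions. `parse_xids` is a hypothetical helper name.

```python
import re

# Severity labels transcribed from the field guide above.
XID_SEVERITY = {
    13: "TRANSIENT", 31: "TRANSIENT",
    48: "NODE-KILLER",
    63: "WARN", 64: "KILLER",
    74: "NODE-KILLER", 79: "NODE-KILLER",
    92: "WARNING",
    94: "WARN", 95: "KILLER",
    119: "NODE-KILLER", 120: "NODE-KILLER",
    140: "NODE-KILLER",
}

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
XID_LINE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def parse_xids(dmesg_text):
    """Return one dict per XID line found in a dmesg dump."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if not match:
            continue
        xid = int(match.group(2))
        events.append({
            "pci": match.group(1),
            "xid": xid,
            "severity": XID_SEVERITY.get(xid, "UNKNOWN"),
            "raw": line.strip(),  # keep the original line as evidence
        })
    return events


if __name__ == "__main__":
    sample = "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus."
    for event in parse_xids(sample):
        print(event["pci"], event["xid"], event["severity"])
```

Keeping the raw line alongside the parsed fields matters: the raw text is what a provider's support team will ask to see.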
A GPU running at 70% clocks bills at 100%. Poll nvidia-smi -q -d PERFORMANCE throttle reasons and log clock residency vs. rated. Sustained HW Slowdown on rented hardware is a conversation with your provider.
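One way to log clock residency is to poll nvidia-smi's CSV query mode and record the ratio of current to maximum SM clock. This is a sketch under assumptions: the field names below are the ones I know from `nvidia-smi --help-query-gpu`, newer drivers may expose the throttle fields under a `clocks_event_reasons` prefix instead, and `parse_clock_row` and `poll_clocks` are hypothetical helper names.

```python
import subprocess

# Assumed query fields; check `nvidia-smi --help-query-gpu` on your driver.
QUERY = "uuid,clocks.sm,clocks.max.sm,clocks_throttle_reasons.hw_slowdown"


def parse_clock_row(row):
    """Parse one CSV row produced with --format=csv,noheader,nounits."""
    uuid, sm, sm_max, hw_slowdown = [field.strip() for field in row.split(",")]
    return {
        "uuid": uuid,
        "clock_ratio": int(sm) / int(sm_max),  # 1.0 means running at max SM clock
        "hw_slowdown": hw_slowdown == "Active",
    }


def poll_clocks():
    """Query every GPU on this node once. Requires nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_clock_row(r) for r in result.stdout.splitlines() if r.strip()]


if __name__ == "__main__":
    for gpu in poll_clocks():
        print(gpu)
```

A single low reading proves little, since clocks drop at idle. What supports a conversation with your provider is a timestamped series showing a low ratio together with an active HW Slowdown flag while the job was under load.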
"Watchdog caught collective operation timeout" usually means one rank's GPU or link died, and the XID is often on a different node from the one that logged the timeout. Correlate NCCL timestamps across nodes with each node's dmesg; the node with an XID at that moment is your evidence.
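The correlation step can be sketched as a time-window filter. It assumes you have already collected XID events from every node as `(node, utc_datetime, xid)` tuples and that node clocks are synchronized. Because the watchdog fires only after the collective timeout elapses, the XID typically precedes the timeout message, so the window looks backwards. `lookback_s` and `slack_s` are hypothetical parameters; set the lookback to at least your configured collective timeout.

```python
from datetime import datetime, timedelta


def suspect_nodes(timeout_at, xid_events, lookback_s=900, slack_s=30):
    """Return XID events close enough to an NCCL timeout to be its likely cause.

    timeout_at:  UTC datetime of the watchdog timeout message.
    xid_events:  iterable of (node, utc_datetime, xid) tuples from all nodes.
    Events are returned oldest first, since the earliest fault is the most
    likely root cause.
    """
    earliest = timeout_at - timedelta(seconds=lookback_s)
    latest = timeout_at + timedelta(seconds=slack_s)
    hits = [e for e in xid_events if earliest <= e[1] <= latest]
    return sorted(hits, key=lambda e: e[1])


if __name__ == "__main__":
    timeout = datetime(2026, 7, 3, 12, 10, 0)
    events = [
        ("node-a", datetime(2026, 7, 3, 12, 0, 5), 79),
        ("node-b", datetime(2026, 7, 3, 9, 0, 0), 31),
        ("node-c", datetime(2026, 7, 3, 12, 9, 0), 74),
    ]
    for node, when, xid in suspect_nodes(timeout, events):
        print(node, when.isoformat(), xid)
```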
Capture dmesg, nvidia-smi -q output, UTC timestamps, GPU UUIDs/serials, and your job logs. Under most SLAs above, your logs are the required proof, the window is days not months, and a recycled node takes its dmesg with it.
A single static binary that watches dmesg, NVML/DCGM, and your NCCL logs, and writes a hash-chained, tamper-evident record of what your rented cluster actually delivered — goodput vs. billed hours, incident by incident. Free, Apache-2.0, no phone-home.
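For readers unfamiliar with the term, a hash-chained record can be sketched in a few lines. This illustrates the general technique only; it is not the tool's format, which has not been published. The record layout and the names `append_record` and `verify` are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain, payload):
    """Append payload to the chain, linking it to the previous record's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    record = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    chain.append(record)
    return record


def verify(chain):
    """Return True if no record was altered, reordered, or removed mid-chain."""
    prev = GENESIS
    for record in chain:
        if record["prev"] != prev or record["hash"] != _digest(prev, record["payload"]):
            return False
        prev = record["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_record(log, {"utc": "2026-07-03T12:00:05Z", "node": "node-a", "xid": 79})
    append_record(log, {"utc": "2026-07-03T12:10:00Z", "event": "nccl_timeout"})
    print(verify(log))
    log[0]["payload"]["xid"] = 31  # tamper with an earlier record
    print(verify(log))
```

A chain like this detects edits to earlier records, but only if the latest hash is periodically stored somewhere the log's owner cannot rewrite. Otherwise the whole chain can simply be regenerated.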
→ repo opening soon · first release: XID + ECC + throttle + heartbeat
I'm building the evidence layer for GPU renters — independent goodput records, claim-window tracking, and evidence packets your provider's own T&Cs ask for. If you're in (or heading into) a dispute or a renewal, I'd like to hear what happened. Written correspondence only — I'm an engineer, not a salesperson.