Designing a Production-Grade RPC Failover Layer
Designing a production-grade RPC failover layer is not about adding more endpoints.
It is about engineering for degraded behavior.
In real-world environments, RPC nodes rarely fail catastrophically. They drift behind the network. They respond with valid JSON while operating on stale state. They exhibit elevated p95/p99 latency under load. They partially fail specific methods while remaining “online.”
From the outside, everything appears functional.
From a systems perspective, reliability has already been compromised.
A production-grade failover layer must therefore evaluate quality, consistency, and performance — not just availability.
This article walks through what it actually takes to design such a layer using practical TypeScript examples, and why naive fallback logic collapses under real production traffic.
1. The Naive Approach
Most implementations start like this:
const endpoints = [
  "https://rpc1.example.com",
  "https://rpc2.example.com"
];

async function rpcCall(method: string, params: unknown[]) {
  for (const endpoint of endpoints) {
    try {
      const response = await fetch(endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          jsonrpc: "2.0",
          id: 1,
          method,
          params
        })
      });

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      return await response.json();
    } catch {
      // Any error simply means "try the next endpoint"; nothing is recorded.
      continue;
    }
  }

  throw new Error("All RPC endpoints failed");
}
This logic assumes failure is binary:
- The endpoint works.
- Or it throws.
But production failures are rarely binary.
A node can:
- Respond successfully
- Return valid JSON
- Still be unusable
The above code cannot detect degraded performance, stale state, or method-specific instability.
It only detects catastrophic failure.
2. Degradation Is Not Failure
An RPC node can:
- Respond successfully
- Return valid JSON
- Be several blocks behind
- Have high p95/p99 latency under load
- Randomly fail specific methods
- Silently rate-limit heavy calls
From your application's perspective, these are all failures.
But HTTP status codes won't show it.
A real failover layer must evaluate quality, not just availability.
That means:
- Measuring latency
- Tracking error ratios
- Sampling block height
- Monitoring method-specific behavior
Without this, you are routing traffic without visibility.
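As a rough sketch of what that visibility can look like, the structure below keeps per-endpoint and per-method counters. The shape and field names are illustrative assumptions, not a prescribed schema.

// Illustrative per-endpoint, per-method counters (shape is an assumption).
type MethodStats = {
  calls: number;
  errors: number;
  totalLatencyMs: number;
};

type EndpointMetrics = {
  url: string;
  lastBlockHeight: number; // sampled in the background
  perMethod: Map<string, MethodStats>;
};

function recordCall(
  metrics: EndpointMetrics,
  method: string,
  latencyMs: number,
  ok: boolean
): void {
  const stats =
    metrics.perMethod.get(method) ??
    { calls: 0, errors: 0, totalLatencyMs: 0 };

  stats.calls += 1;
  stats.totalLatencyMs += latencyMs;
  if (!ok) stats.errors += 1;

  metrics.perMethod.set(method, stats);
}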
3. Adding Timeouts
The first production requirement is time control.
Never allow remote calls to hang indefinitely.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("Timeout")), ms)
    )
  ]);
}
Why this matters:
- TCP sockets can stall.
- Providers can throttle without closing connections.
- Network paths can degrade without failing.
Timeouts convert silent hangs into measurable failures.
But timeouts alone do not make a system reliable.
They only surface slow behavior.
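One caveat: Promise.race only stops the caller from waiting; the underlying request keeps running. If you also want the request itself cancelled, a sketch using AbortController (available in modern browsers and Node 18+) looks like this:

async function fetchWithTimeout(
  url: string,
  init: RequestInit,
  ms: number
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);

  try {
    // The request is actually aborted when the timer fires,
    // not merely abandoned by the caller.
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}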
4. Exponential Backoff with Jitter
When errors occur, retrying immediately can amplify the problem.
If 1,000 instances retry at the same interval, you create synchronized traffic spikes, known as the thundering herd problem.
function sleep(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  retries = 5
): Promise<T> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) throw err;

      // Exponential backoff capped at 8s, plus random jitter to avoid
      // synchronized retries across many clients.
      const baseDelay = Math.min(1000 * 2 ** attempt, 8000);
      const jitter = Math.random() * 300;
      await sleep(baseDelay + jitter);
    }
  }

  throw new Error("Unreachable");
}
Backoff reduces pressure on unstable endpoints.
Jitter prevents coordinated retry spikes.
But retries hide symptoms.
They do not solve systemic degradation.
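In practice the two helpers compose. A sketch of a call site, reusing rpcCall from the naive example and the helpers above, might look like this (inside an async context):

// Each attempt is time-boxed; failed attempts back off with jitter.
const response = await retryWithBackoff(
  () => withTimeout(rpcCall("eth_blockNumber", []), 5_000),
  3
);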
5. Endpoint Health Scoring
Static fallback order is dangerous.
If endpoint A is slightly degraded, you'll still hit it first every time.
A better approach is dynamic scoring.
type EndpointHealth = {
  url: string;
  successCount: number;
  errorCount: number;
  totalLatencyMs: number;
  lastBlockHeight: number;
  lastChecked: number;
};

function calculateScore(health: EndpointHealth): number {
  const successRate =
    health.successCount /
    Math.max(health.successCount + health.errorCount, 1);

  const avgLatency =
    health.totalLatencyMs /
    Math.max(health.successCount, 1);

  // Higher is better: reward success rate, penalize average latency.
  return successRate * 100 - avgLatency * 0.1;
}
This introduces:
- Success ratio tracking
- Latency aggregation
- Performance-based routing
Now endpoints are ranked by quality.
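Routing then becomes a matter of updating each endpoint's counters after every call and picking the highest-scoring one. A minimal sketch, assuming one EndpointHealth record is kept per endpoint:

// Pick the healthiest endpoint by score; undefined if nothing is tracked.
function selectEndpoint(records: EndpointHealth[]): EndpointHealth | undefined {
  return records
    .slice()
    .sort((a, b) => calculateScore(b) - calculateScore(a))[0];
}

// Update counters after each call so scores reflect recent behavior.
// Latency is only accumulated on success, matching calculateScore above.
function recordResult(
  health: EndpointHealth,
  latencyMs: number,
  ok: boolean
): void {
  if (ok) {
    health.successCount += 1;
    health.totalLatencyMs += latencyMs;
  } else {
    health.errorCount += 1;
  }
  health.lastChecked = Date.now();
}

In a real system these counters would decay or live in a sliding window, so that old behavior does not dominate the score.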
But there is a deeper issue still unsolved.
6. Detecting Stale Nodes
A node can respond quickly and still be behind.
If it lags several blocks:
- Trading systems may execute on stale state.
- Indexers may miss recent events.
- Wallet backends may misreport balances.
To detect staleness, you must compare block heights.
async function getBlockHeight(endpoint: string): Promise<number> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_blockNumber",
      params: []
    })
  });

  const data = await res.json();

  // eth_blockNumber returns a hex-encoded block number, e.g. "0x12a05f2".
  return parseInt(data.result, 16);
}

async function isStale(
  endpoint: string,
  referenceHeight: number,
  tolerance = 3
): Promise<boolean> {
  const height = await getBlockHeight(endpoint);
  return referenceHeight - height > tolerance;
}
But now you need:
- A trusted reference height
- Cross-endpoint comparison
- Background sampling
- Per-chain tolerance logic
Your failover layer is no longer a simple client wrapper. It is becoming a monitoring system.
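One pragmatic way to approximate a trusted reference height is to sample every endpoint and take the highest value observed. The sketch below assumes the getBlockHeight helper above and simply ignores endpoints that fail to answer:

// Sample all endpoints in parallel and use the highest observed height
// as the reference. Unreachable endpoints do not contribute.
async function getReferenceHeight(endpoints: string[]): Promise<number> {
  const heights = await Promise.all(
    endpoints.map(url => getBlockHeight(url).catch(() => 0))
  );
  return Math.max(0, ...heights);
}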
7. Circuit Breaking
If an endpoint fails repeatedly, you must temporarily remove it from rotation.
Otherwise, you keep sending traffic into a degraded system.
type CircuitState = "closed" | "open" | "half-open";

type Circuit = {
  state: CircuitState;
  failureCount: number;
  lastFailureTime: number;
};

function shouldAllowRequest(circuit: Circuit): boolean {
  if (circuit.state === "closed") return true;

  if (circuit.state === "open") {
    // After a cooldown window, let a trial request through
    // (a fuller implementation would move the circuit to "half-open" here).
    const cooldown = 10_000;
    return Date.now() - circuit.lastFailureTime > cooldown;
  }

  // "half-open": allow probes so the endpoint can prove it has recovered.
  return true;
}
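shouldAllowRequest only decides whether to try. The circuit also has to move between states as results come in. A sketch of those transitions, with an assumed failure threshold:

const FAILURE_THRESHOLD = 5; // assumed value; tune per deployment

function onSuccess(circuit: Circuit): void {
  // Any success, including a half-open probe, resets the circuit.
  circuit.state = "closed";
  circuit.failureCount = 0;
}

function onFailure(circuit: Circuit): void {
  circuit.failureCount += 1;
  circuit.lastFailureTime = Date.now();

  if (circuit.failureCount >= FAILURE_THRESHOLD) {
    circuit.state = "open";
  }
}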
Circuit breaking introduces:
- Failure thresholds
- Cooldown windows
- Recovery validation
- State transitions
At this point, you are building distributed systems infrastructure.
Not simple fallback logic.
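Put together, the routing decision ends up combining the health score from section 5 with the circuit state from this section. A simplified sketch, assuming both structures are tracked per endpoint:

type TrackedEndpoint = {
  health: EndpointHealth;
  circuit: Circuit;
};

// Choose the best-scoring endpoint whose circuit currently admits traffic.
function chooseEndpoint(tracked: TrackedEndpoint[]): TrackedEndpoint | undefined {
  return tracked
    .filter(t => shouldAllowRequest(t.circuit))
    .sort((a, b) => calculateScore(b.health) - calculateScore(a.health))[0];
}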
What This Actually Means
To operate a reliable RPC layer yourself, you need:
- Dynamic endpoint scoring
- Latency percentile tracking
- Error rate monitoring
- Stale state detection
- Circuit breaking
- Retry control
- Continuous background health probes
- Observability and metrics
And this must run continuously, not just when requests fail.
You are no longer consuming infrastructure.
You are operating it.
Closing Thought
Adding multiple RPC endpoints is easy.
Maintaining:
- Active routing
- Dynamic health scoring
- Latency percentile tracking
- State consistency validation
- Circuit breaking
- Background health probes
- Method-level error monitoring
- Continuous observability
is not.
At some point, your “simple failover layer” has become:
- A routing engine
- A monitoring system
- A metrics pipeline
- A consistency validator
- An operational burden
And it must run 24/7, under load, across regions.
Reliable RPC is not redundancy.
It is an infrastructure discipline.
If your application depends on accurate on-chain data — trading systems, indexers, analytics pipelines, wallet backends — then reliability must be engineered deliberately.
You can build and maintain this entire layer yourself.
Or you can rely on infrastructure that already implements:
- Active health scoring
- Degradation detection
- Stale node filtering
- Intelligent routing
- Observability-first design
That is precisely the problem RVO is built to solve.
Instead of operating your own failover and monitoring stack, you can integrate a production-grade routing layer in minutes — without building a routing engine, monitoring system, and metrics pipeline yourself.
If you want to get started, see:
https://docs.rvo.network/getting-started/
Reliable RPC should not require you to become an infrastructure operator.
It should just work.
See also
Tracing a Web3 Request End-to-End: Where Latency and Failure Actually Come From
RPC performance issues rarely originate at the node itself. Latency, inconsistency, and failure are introduced across a chain of systems long before a request reaches a validator. This article traces a Web3 request end-to-end to show where delays accumulate, errors are masked, and reliability quietly degrades.
How to Benchmark RPC Providers Correctly
Most RPC benchmarks measure the wrong things. Average latency and request rates often hide degradation, throttling, and stale state that only appear under real load. This article explains how to benchmark RPC providers correctly—focusing on reliability, consistency, and behavior under stress, not just speed.
What Happens When an RPC Node Degrades (And Why It’s Worse Than Failure)
Most RPC outages don’t start with a clean failure. They begin with silent degradation—slower responses, stale data, and hidden latency spikes that traditional monitoring fails to detect. This article explains why degradation is more dangerous than downtime and how to recognize it before users feel the impact.
