
Designing a Production-Grade RPC Failover Layer

#distributed-systems #failover #infrastructure #rpc #typescript #web3

Designing a production-grade RPC failover layer is not about adding more endpoints.

It is about engineering for degraded behavior.

In real-world environments, RPC nodes rarely fail catastrophically. They drift behind the network. They respond with valid JSON while operating on stale state. They exhibit elevated p95/p99 latency under load. They partially fail specific methods while remaining “online.”

From the outside, everything appears functional.

From a systems perspective, reliability has already been compromised.

A production-grade failover layer must therefore evaluate quality, consistency, and performance — not just availability.

This article walks through what it actually takes to design such a layer using practical TypeScript examples, and why naive fallback logic collapses under real production traffic.


1. The Naive Approach

Most implementations start like this:

```typescript
const endpoints = [
  "https://rpc1.example.com",
  "https://rpc2.example.com"
];

async function rpcCall(method: string, params: unknown[]) {
  for (const endpoint of endpoints) {
    try {
      const response = await fetch(endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          jsonrpc: "2.0",
          id: 1,
          method,
          params
        })
      });

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      return await response.json();
    } catch {
      continue;
    }
  }

  throw new Error("All RPC endpoints failed");
}
```

This logic assumes failure is binary:

  • The endpoint works.
  • Or it throws.

But production failures are rarely binary.

A node can:

  • Respond successfully
  • Return valid JSON
  • Still be unusable

The above code cannot detect degraded performance, stale state, or method-specific instability.

It only detects catastrophic failure.
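It misses even the failures that are sitting in the response body. A rate-limited or misbehaving node often answers HTTP 200 with a JSON-RPC error object, which the loop above returns as a "successful" result. A minimal guard might look like this (the helper name and error codes are illustrative, not part of the code above):

```typescript
// A JSON-RPC response can carry an application-level error even when
// the HTTP layer reports success. The naive loop never inspects it.
type JsonRpcResponse = {
  jsonrpc: "2.0";
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
};

// Hypothetical guard: treat a JSON-RPC error object as a failure,
// so the caller can fall through to the next endpoint.
function assertRpcSuccess(data: JsonRpcResponse): unknown {
  if (data.error) {
    throw new Error(`RPC ${data.error.code}: ${data.error.message}`);
  }
  return data.result;
}
```

Calling `assertRpcSuccess` on the parsed body turns rate-limit errors (for example, a provider-specific "limit exceeded" code) into thrown failures instead of silently returned garbage.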


2. Degradation Is Not Failure

An RPC node can:

  • Respond successfully
  • Return valid JSON
  • Be several blocks behind
  • Have high p95/p99 latency under load
  • Randomly fail specific methods
  • Silently rate-limit heavy calls

From your application's perspective, these are all failures.

But HTTP status codes won't show it.

A real failover layer must evaluate quality, not just availability.

That means:

  • Measuring latency
  • Tracking error ratios
  • Sampling block height
  • Monitoring method-specific behavior

Without this, you are routing traffic without visibility.


3. Adding Timeouts

The first production requirement is time control.

Never allow remote calls to hang indefinitely.

```typescript
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Timeout")), ms);
  });
  // Clear the timer once either side settles, so fast calls
  // don't leave a stray pending timeout behind.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Why this matters:

  • TCP sockets can stall.
  • Providers can throttle without closing connections.
  • Network paths can degrade without failing.

Timeouts convert silent hangs into measurable failures.

But timeouts alone do not make a system reliable.

They only surface slow behavior.
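One caveat: racing a promise abandons the result, but it does not cancel the underlying request — the socket stays open and the node keeps working. To actually cancel, the deadline has to reach the request itself via the standard AbortController API. A sketch (the helper name is my own):

```typescript
// Tie a deadline to an AbortSignal so the operation itself is
// cancelled, not just abandoned. fetch, for example, rejects with
// an AbortError once the signal fires.
async function withDeadline<T>(
  ms: number,
  run: (signal: AbortSignal) => Promise<T>
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

Usage with fetch would be `withDeadline(2000, signal => fetch(endpoint, { ...init, signal }))`, which frees the connection as soon as the deadline passes.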


4. Exponential Backoff with Jitter

When errors occur, retrying immediately can amplify the problem.

If 1,000 instances retry at the same interval, you create synchronized traffic spikes — known as the thundering herd problem.

```typescript
function sleep(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  retries = 5
): Promise<T> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) throw err;

      const baseDelay = Math.min(1000 * 2 ** attempt, 8000);
      const jitter = Math.random() * 300;
      await sleep(baseDelay + jitter);
    }
  }

  throw new Error("Unreachable");
}
```

Backoff reduces pressure on unstable endpoints.

Jitter prevents coordinated retry spikes.

But retries hide symptoms.

They do not solve systemic degradation.
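To see why the cap matters, here is the delay schedule in isolation (a hypothetical helper extracted from the loop above): without the `Math.min`, attempt 10 would wait roughly 17 minutes; with it, the wait plateaus at 8 seconds.

```typescript
// The backoff schedule used above: exponential growth capped at 8 s,
// plus up to 300 ms of random jitter to de-synchronize clients.
function backoffDelay(attempt: number): number {
  const base = Math.min(1000 * 2 ** attempt, 8000);
  return base + Math.random() * 300;
}

// Deterministic part of the schedule: 1s, 2s, 4s, 8s, 8s, 8s, ...
```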


5. Endpoint Health Scoring

Static fallback order is dangerous.

If endpoint A is slightly degraded, you'll still hit it first every time.

A better approach is dynamic scoring.

```typescript
type EndpointHealth = {
  url: string;
  successCount: number;
  errorCount: number;
  totalLatencyMs: number;
  lastBlockHeight: number;
  lastChecked: number;
};

function calculateScore(health: EndpointHealth): number {
  const successRate =
    health.successCount /
    Math.max(health.successCount + health.errorCount, 1);

  const avgLatency =
    health.totalLatencyMs /
    Math.max(health.successCount, 1);

  return successRate * 100 - avgLatency * 0.1;
}
```

This introduces:

  • Success ratio tracking
  • Latency aggregation
  • Performance-based routing

Now endpoints are ranked by quality.
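With a score per endpoint, routing becomes a ranking problem: send the next request to the highest-scoring endpoint, then fold the observed outcome back into its counters so the ranking adapts. A self-contained sketch (the type is restated here, trimmed to the scored fields; `recordCall` and `pickEndpoint` are illustrative names):

```typescript
type EndpointHealth = {
  url: string;
  successCount: number;
  errorCount: number;
  totalLatencyMs: number;
};

function score(h: EndpointHealth): number {
  const successRate =
    h.successCount / Math.max(h.successCount + h.errorCount, 1);
  const avgLatency = h.totalLatencyMs / Math.max(h.successCount, 1);
  return successRate * 100 - avgLatency * 0.1;
}

// Fold one observed call into the counters the score is derived from.
function recordCall(h: EndpointHealth, ok: boolean, latencyMs: number): void {
  if (ok) {
    h.successCount += 1;
    h.totalLatencyMs += latencyMs;
  } else {
    h.errorCount += 1;
  }
}

// Route to the highest-scoring endpoint; ties keep the earlier one.
function pickEndpoint(all: EndpointHealth[]): EndpointHealth {
  return all.reduce((best, h) => (score(h) > score(best) ? h : best));
}
```

The key property is the feedback loop: a degrading endpoint loses score with every failed or slow call, so traffic drains away from it without any manual reordering.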

But there is a deeper issue still unsolved.


6. Detecting Stale Nodes

A node can respond quickly — and still be behind.

If it lags several blocks:

  • Trading systems may execute on stale state.
  • Indexers may miss recent events.
  • Wallet backends may misreport balances.

To detect staleness, you must compare block heights.

```typescript
async function getBlockHeight(endpoint: string): Promise<number> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_blockNumber",
      params: []
    })
  });

  const data = await res.json();
  return parseInt(data.result, 16);
}

async function isStale(
  endpoint: string,
  referenceHeight: number,
  tolerance = 3
): Promise<boolean> {
  const height = await getBlockHeight(endpoint);
  return referenceHeight - height > tolerance;
}
```

But now you need:

  • A trusted reference height
  • Cross-endpoint comparison
  • Background sampling
  • Per-chain tolerance logic
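One practical choice for the reference is the maximum height observed across a recent sample of your own endpoints — any node materially below it is lagging. A sketch of the comparison step, assuming the heights have already been fetched (the helper name is my own):

```typescript
// Given block heights sampled from several endpoints, take the
// highest as the reference and keep only nodes within `tolerance`
// blocks of it.
function filterFresh(
  heights: Map<string, number>,
  tolerance = 3
): string[] {
  const reference = Math.max(...heights.values());
  return [...heights.entries()]
    .filter(([, height]) => reference - height <= tolerance)
    .map(([url]) => url);
}
```

The tolerance is chain-specific: a fast chain with sub-second blocks can tolerate more blocks of lag in wall-clock terms than a chain with 12-second blocks.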

Your failover layer is no longer a simple client wrapper. It is becoming a monitoring system.


7. Circuit Breaking

If an endpoint fails repeatedly, you must temporarily remove it from rotation.

Otherwise, you keep sending traffic into a degraded system.

```typescript
type CircuitState = "closed" | "open" | "half-open";

type Circuit = {
  state: CircuitState;
  failureCount: number;
  lastFailureTime: number;
};

function shouldAllowRequest(circuit: Circuit): boolean {
  if (circuit.state === "closed") return true;

  if (circuit.state === "open") {
    // After the cooldown expires, let a probe request through
    // (effectively a half-open trial).
    const cooldown = 10_000;
    return Date.now() - circuit.lastFailureTime > cooldown;
  }

  // Half-open: allow the probe and judge the endpoint by its outcome.
  return true;
}
```

Circuit breaking introduces:

  • Failure thresholds
  • Cooldown windows
  • Recovery validation
  • State transitions

At this point, you are building distributed systems infrastructure.

Not simple fallback logic.


What This Actually Means

To operate a reliable RPC layer yourself, you need:

  • Dynamic endpoint scoring
  • Latency percentile tracking
  • Error rate monitoring
  • Stale state detection
  • Circuit breaking
  • Retry control
  • Continuous background health probes
  • Observability and metrics

And this must run continuously — not just when requests fail.
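Concretely, "continuously" means a background probe loop that measures every endpoint on a fixed interval, independent of request traffic, and feeds the results into the health store. A sketch (all names here are hypothetical; the actual check would be something like an eth_blockNumber call with a timeout):

```typescript
type ProbeResult = { url: string; ok: boolean; latencyMs: number };

// One probe round: measure every endpoint concurrently and report
// outcome plus latency, never throwing for an individual failure.
async function probeOnce(
  urls: string[],
  check: (url: string) => Promise<void>
): Promise<ProbeResult[]> {
  return Promise.all(
    urls.map(async url => {
      const start = Date.now();
      try {
        await check(url);
        return { url, ok: true, latencyMs: Date.now() - start };
      } catch {
        return { url, ok: false, latencyMs: Date.now() - start };
      }
    })
  );
}

// In production this would run on a timer, e.g.:
// setInterval(() => probeOnce(urls, check).then(updateHealth), 15_000);
```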

You are no longer consuming infrastructure.

You are operating it.


Closing Thought

Adding multiple RPC endpoints is easy.

Maintaining:

  • Active routing
  • Dynamic health scoring
  • Latency percentile tracking
  • State consistency validation
  • Circuit breaking
  • Background health probes
  • Method-level error monitoring
  • Continuous observability

is not.

At some point, your “simple failover layer” has become:

  • A routing engine
  • A monitoring system
  • A metrics pipeline
  • A consistency validator
  • An operational burden

And it must run 24/7, under load, across regions.

Reliable RPC is not redundancy.

It is an infrastructure discipline.

If your application depends on accurate on-chain data — trading systems, indexers, analytics pipelines, wallet backends — then reliability must be engineered deliberately.

You can build and maintain this entire layer yourself.

Or you can rely on infrastructure that already implements:

  • Active health scoring
  • Degradation detection
  • Stale node filtering
  • Intelligent routing
  • Observability-first design

That is precisely the problem RVO is built to solve.

Instead of operating your own failover and monitoring stack, you can integrate a production-grade routing layer in minutes — without building a routing engine, monitoring system, and metrics pipeline yourself.

If you want to get started, see:

https://docs.rvo.network/getting-started/

Reliable RPC should not require you to become an infrastructure operator.

It should just work.
