Designing a Production-Grade RPC Failover Layer
Designing a production-grade RPC failover layer is not about adding more endpoints.
It is about engineering for degraded behavior.
In real-world environments, RPC nodes rarely fail catastrophically. They drift behind the network. They respond with valid JSON while operating on stale state. They exhibit elevated p95/p99 latency under load. They partially fail specific methods while remaining “online.”
From the outside, everything appears functional.
From a systems perspective, reliability has already been compromised.
A production-grade failover layer must therefore evaluate quality, consistency, and performance — not just availability.
This article walks through what it actually takes to design such a layer using practical TypeScript examples, and why naive fallback logic collapses under real production traffic.
1. The Naive Approach
Most implementations start like this:
const endpoints = [
  "https://rpc1.example.com",
  "https://rpc2.example.com"
];

async function rpcCall(method: string, params: unknown[]) {
  for (const endpoint of endpoints) {
    try {
      const response = await fetch(endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          jsonrpc: "2.0",
          id: 1,
          method,
          params
        })
      });

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      return await response.json();
    } catch {
      // Any error simply means "try the next endpoint"; nothing is recorded.
      continue;
    }
  }

  throw new Error("All RPC endpoints failed");
}
This logic assumes failure is binary:
- The endpoint works.
- Or it throws.
But production failures are rarely binary.
A node can:
- Respond successfully
- Return valid JSON
- Still be unusable
The above code cannot detect degraded performance, stale state, or method-specific instability.
It only detects catastrophic failure.
2. Degradation Is Not Failure
An RPC node can:
- Respond successfully
- Return valid JSON
- Be several blocks behind
- Have high p95/p99 latency under load
- Randomly fail specific methods
- Silently rate-limit heavy calls
From your application's perspective, these are all failures.
But HTTP status codes won't show it.
A real failover layer must evaluate quality, not just availability.
That means:
- Measuring latency
- Tracking error ratios
- Sampling block height
- Monitoring method-specific behavior
Without this, you are routing traffic without visibility.
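As a rough sketch of what that visibility can look like, the structure below keeps per-endpoint and per-method counters. The shape and field names are illustrative assumptions, not a prescribed schema.

// Illustrative per-endpoint, per-method counters (shape is an assumption).
type MethodStats = {
  calls: number;
  errors: number;
  totalLatencyMs: number;
};

type EndpointMetrics = {
  url: string;
  lastBlockHeight: number; // sampled in the background
  perMethod: Map<string, MethodStats>;
};

function recordCall(
  metrics: EndpointMetrics,
  method: string,
  latencyMs: number,
  ok: boolean
): void {
  const stats =
    metrics.perMethod.get(method) ??
    { calls: 0, errors: 0, totalLatencyMs: 0 };

  stats.calls += 1;
  stats.totalLatencyMs += latencyMs;
  if (!ok) stats.errors += 1;

  metrics.perMethod.set(method, stats);
}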
3. Adding Timeouts
The first production requirement is time control.
Never allow remote calls to hang indefinitely.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("Timeout")), ms)
    )
  ]);
}
Why this matters:
- TCP sockets can stall.
- Providers can throttle without closing connections.
- Network paths can degrade without failing.
Timeouts convert silent hangs into measurable failures.
But timeouts alone do not make a system reliable.
They only surface slow behavior.
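One caveat: Promise.race only stops the caller from waiting; the underlying request keeps running. If you also want the request itself cancelled, a sketch using AbortController (available in modern browsers and Node 18+) looks like this:

async function fetchWithTimeout(
  url: string,
  init: RequestInit,
  ms: number
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);

  try {
    // The request is actually aborted when the timer fires,
    // not merely abandoned by the caller.
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}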
4. Exponential Backoff with Jitter
When errors occur, retrying immediately can amplify the problem.
If 1,000 instances retry at the same interval, you create synchronized traffic spikes, known as the thundering herd problem.
function sleep(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  retries = 5
): Promise<T> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) throw err;

      // Exponential backoff capped at 8s, plus random jitter to avoid
      // synchronized retries across many clients.
      const baseDelay = Math.min(1000 * 2 ** attempt, 8000);
      const jitter = Math.random() * 300;
      await sleep(baseDelay + jitter);
    }
  }

  throw new Error("Unreachable");
}
Backoff reduces pressure on unstable endpoints.
Jitter prevents coordinated retry spikes.
But retries hide symptoms.
They do not solve systemic degradation.
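In practice the two helpers compose. A sketch of a call site, reusing rpcCall from the naive example and the helpers above, might look like this (inside an async context):

// Each attempt is time-boxed; failed attempts back off with jitter.
const response = await retryWithBackoff(
  () => withTimeout(rpcCall("eth_blockNumber", []), 5_000),
  3
);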
5. Endpoint Health Scoring
Static fallback order is dangerous.
If endpoint A is slightly degraded, you'll still hit it first every time.
A better approach is dynamic scoring.
type EndpointHealth = {
  url: string;
  successCount: number;
  errorCount: number;
  totalLatencyMs: number;
  lastBlockHeight: number;
  lastChecked: number;
};

function calculateScore(health: EndpointHealth): number {
  const successRate =
    health.successCount /
    Math.max(health.successCount + health.errorCount, 1);

  const avgLatency =
    health.totalLatencyMs /
    Math.max(health.successCount, 1);

  // Higher is better: reward success rate, penalize average latency.
  return successRate * 100 - avgLatency * 0.1;
}
This introduces:
- Success ratio tracking
- Latency aggregation
- Performance-based routing
Now endpoints are ranked by quality.
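Routing then becomes a matter of updating each endpoint's counters after every call and picking the highest-scoring one. A minimal sketch, assuming one EndpointHealth record is kept per endpoint:

// Pick the healthiest endpoint by score; undefined if nothing is tracked.
function selectEndpoint(records: EndpointHealth[]): EndpointHealth | undefined {
  return records
    .slice()
    .sort((a, b) => calculateScore(b) - calculateScore(a))[0];
}

// Update counters after each call so scores reflect recent behavior.
// Latency is only accumulated on success, matching calculateScore above.
function recordResult(
  health: EndpointHealth,
  latencyMs: number,
  ok: boolean
): void {
  if (ok) {
    health.successCount += 1;
    health.totalLatencyMs += latencyMs;
  } else {
    health.errorCount += 1;
  }
  health.lastChecked = Date.now();
}

In a real system these counters would decay or live in a sliding window, so that old behavior does not dominate the score.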
But there is a deeper issue still unsolved.
6. Detecting Stale Nodes
A node can respond quickly and still be behind.
If it lags several blocks:
- Trading systems may execute on stale state.
- Indexers may miss recent events.
- Wallet backends may misreport balances.
To detect staleness, you must compare block heights.
async function getBlockHeight(endpoint: string): Promise<number> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_blockNumber",
      params: []
    })
  });

  const data = await res.json();

  // eth_blockNumber returns a hex-encoded block number, e.g. "0x12a05f2".
  return parseInt(data.result, 16);
}

async function isStale(
  endpoint: string,
  referenceHeight: number,
  tolerance = 3
): Promise<boolean> {
  const height = await getBlockHeight(endpoint);
  return referenceHeight - height > tolerance;
}
But now you need:
- A trusted reference height
- Cross-endpoint comparison
- Background sampling
- Per-chain tolerance logic
Your failover layer is no longer a simple client wrapper. It is becoming a monitoring system.
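One pragmatic way to approximate a trusted reference height is to sample every endpoint and take the highest value observed. The sketch below assumes the getBlockHeight helper above and simply ignores endpoints that fail to answer:

// Sample all endpoints in parallel and use the highest observed height
// as the reference. Unreachable endpoints do not contribute.
async function getReferenceHeight(endpoints: string[]): Promise<number> {
  const heights = await Promise.all(
    endpoints.map(url => getBlockHeight(url).catch(() => 0))
  );
  return Math.max(0, ...heights);
}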
7. Circuit Breaking
If an endpoint fails repeatedly, you must temporarily remove it from rotation.
Otherwise, you keep sending traffic into a degraded system.
type CircuitState = "closed" | "open" | "half-open";

type Circuit = {
  state: CircuitState;
  failureCount: number;
  lastFailureTime: number;
};

function shouldAllowRequest(circuit: Circuit): boolean {
  if (circuit.state === "closed") return true;

  if (circuit.state === "open") {
    // After a cooldown window, let a trial request through
    // (a fuller implementation would move the circuit to "half-open" here).
    const cooldown = 10_000;
    return Date.now() - circuit.lastFailureTime > cooldown;
  }

  // "half-open": allow probes so the endpoint can prove it has recovered.
  return true;
}
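shouldAllowRequest only decides whether to try. The circuit also has to move between states as results come in. A sketch of those transitions, with an assumed failure threshold:

const FAILURE_THRESHOLD = 5; // assumed value; tune per deployment

function onSuccess(circuit: Circuit): void {
  // Any success, including a half-open probe, resets the circuit.
  circuit.state = "closed";
  circuit.failureCount = 0;
}

function onFailure(circuit: Circuit): void {
  circuit.failureCount += 1;
  circuit.lastFailureTime = Date.now();

  if (circuit.failureCount >= FAILURE_THRESHOLD) {
    circuit.state = "open";
  }
}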
Circuit breaking introduces:
- Failure thresholds
- Cooldown windows
- Recovery validation
- State transitions
At this point, you are building distributed systems infrastructure.
Not simple fallback logic.
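Put together, the routing decision ends up combining the health score from section 5 with the circuit state from this section. A simplified sketch, assuming both structures are tracked per endpoint:

type TrackedEndpoint = {
  health: EndpointHealth;
  circuit: Circuit;
};

// Choose the best-scoring endpoint whose circuit currently admits traffic.
function chooseEndpoint(tracked: TrackedEndpoint[]): TrackedEndpoint | undefined {
  return tracked
    .filter(t => shouldAllowRequest(t.circuit))
    .sort((a, b) => calculateScore(b.health) - calculateScore(a.health))[0];
}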
What This Actually Means
To operate a reliable RPC layer yourself, you need:
- Dynamic endpoint scoring
- Latency percentile tracking
- Error rate monitoring
- Stale state detection
- Circuit breaking
- Retry control
- Continuous background health probes
- Observability and metrics
And this must run continuously, not just when requests fail.
You are no longer consuming infrastructure.
You are operating it.
Closing Thought
Adding multiple RPC endpoints is easy.
Maintaining:
- Active routing
- Dynamic health scoring
- Latency percentile tracking
- State consistency validation
- Circuit breaking
- Background health probes
- Method-level error monitoring
- Continuous observability
is not.
At some point, your “simple failover layer” has become:
- A routing engine
- A monitoring system
- A metrics pipeline
- A consistency validator
- An operational burden
And it must run 24/7, under load, across regions.
Reliable RPC is not redundancy.
It is an infrastructure discipline.
If your application depends on accurate on-chain data — trading systems, indexers, analytics pipelines, wallet backends — then reliability must be engineered deliberately.
You can build and maintain this entire layer yourself.
Or you can rely on infrastructure that already implements:
- Active health scoring
- Degradation detection
- Stale node filtering
- Intelligent routing
- Observability-first design
That is precisely the problem RVO is built to solve.
Instead of operating your own failover and monitoring stack, you can integrate a production-grade routing layer in minutes — without building a routing engine, monitoring system, and metrics pipeline yourself.
If you want to get started, see:
https://docs.rvo.network/getting-started/
Reliable RPC should not require you to become an infrastructure operator.
It should just work.
See also
Tracing a Web3 Request End-to-End: Where Latency and Failure Actually Come From
RPC performance issues rarely originate at the node itself. Latency, inconsistency, and failure are introduced across a chain of systems long before a request reaches a validator. This article traces a Web3 request end-to-end to show where delays accumulate, errors are masked, and reliability quietly degrades.
How to Benchmark RPC Providers Correctly
Most RPC benchmarks measure the wrong things. Average latency and request rates often hide degradation, throttling, and stale state that only appear under real load. This article explains how to benchmark RPC providers correctly—focusing on reliability, consistency, and behavior under stress, not just speed.
What Happens When an RPC Node Degrades (And Why It’s Worse Than Failure)
Most RPC outages don’t start with a clean failure. They begin with silent degradation—slower responses, stale data, and hidden latency spikes that traditional monitoring fails to detect. This article explains why degradation is more dangerous than downtime and how to recognize it before users feel the impact.
