Scaling & Performance in Backend Systems

Introduction

  • Scaling and Performance are core concepts in backend systems and infrastructure.

  • Their meaning varies across domains (frontend vs backend), but this discussion focuses on backend engineering.

  • Goal:

    • Build intuition for:

      • How systems behave under load
      • Where bottlenecks occur
      • How to think during system failures
    • Learn concepts that apply universally across systems.


What is Performance?

  • A system is considered fast based on user experience.

  • Example flow:

    • User clicks a button → Browser sends request → Server processes → Database/API calls → Response → UI renders.
  • The total time taken for this flow is called:

Latency

  • Latency = Time from user action → final response rendering.

  • It is the primary metric of performance.

  • When users say:

    • “App is slow” → They are referring to high latency.

Latency Characteristics

  • Latency is not constant:

    • One request: 50 ms
    • Another: 200 ms
  • Reasons for variation:

    • Cache hits (CDN, Redis)
    • Server load (concurrent requests)
    • Network variability
    • Database query complexity

Why Average Latency is Misleading

  • Example:

    • 99% requests → 50 ms
    • 1% requests → 5 seconds
    • Average ≈ 100 ms → Looks good, but misleading
  • Real impact:

    • At 1M requests/day → 10,000 users experience 5s delay
    • These users have terrible experience, hidden by averages
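The arithmetic above is easy to reproduce. A minimal sketch, with sample counts chosen to match the 99%/1% split from the example:

```python
# 99% of requests at 50 ms, 1% stuck at 5000 ms (1,000 sample requests).
latencies_ms = [50] * 990 + [5000] * 10

average = sum(latencies_ms) / len(latencies_ms)
print(average)  # 99.5 -> looks healthy, yet 10 of these users waited 5 seconds
```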

Percentiles (Better Metric for Latency)

  • Instead of averages, use percentiles:

Key Percentiles

  • P50 (Median):

    • 50% of requests complete at or below this latency
  • P90:

    • 90% of requests complete at or below this latency
    • The slowest 10% take longer
  • P99:

    • 99% of requests complete at or below this latency
    • The slowest 1% take longer still
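A percentile is simple to compute by hand. The sketch below uses the nearest-rank convention (one common choice; NumPy and most monitoring tools interpolate instead), with an assumed distribution where 2% of requests are slow:

```python
import math

def percentile(samples, p):
    # Nearest-rank method: smallest value such that at least p% of
    # samples are at or below it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 98% of requests at 50 ms, 2% stuck at 5000 ms
latencies_ms = [50] * 980 + [5000] * 20

print(percentile(latencies_ms, 50))  # 50   -> median looks healthy
print(percentile(latencies_ms, 90))  # 50   -> still healthy
print(percentile(latencies_ms, 99))  # 5000 -> the tail is exposed
```

This is exactly why P99 catches problems that the median (and the average) hide.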

Importance of P95 / P99

  • Backend engineers focus heavily on P95 / P99 because:

    • Represent worst-performing requests

    • Often involve:

      • Complex business logic
      • Heavy DB queries
      • External API calls
  • These requests often belong to:

    • High-value users (e.g., payments, purchases)

Throughput

  • Latency → Time per request

  • Throughput → Number of requests handled per unit time

  • Measured as:

    • Requests per second (RPS)
    • Requests per minute

Latency vs Throughput Relationship

  • At low load:

    • Low throughput → Low latency
  • As load increases:

    • Throughput ↑ → Latency ↑
  • After a threshold:

    • Latency increases dramatically (non-linear)

Real-World Questions Answered by Throughput

  • Can system handle:

    • Black Friday traffic?
    • Sudden spikes (email campaigns)?
    • Viral traffic?
  • Helps determine:

    • Maximum concurrent users
    • When scaling is required

Utilization

  • Utilization = % of system capacity being used

  • Examples:

    • 0% → Idle system
    • 100% → Fully saturated (risk of collapse)

Utilization vs Latency (Critical Concept)

  • Expected (wrong intuition):

    • Linear increase in latency
  • Reality:

    • Latency rises sharply and non-linearly as utilization approaches 100% (in simple queueing models it scales like 1 / (1 − utilization))

Ice Cream Shop Analogy

  • Low utilization:

    • No queue → Instant service → Low latency
  • High utilization:

    • Long queue → Wait time increases → High latency
  • Key insight:

    • Processing speed same, but waiting time increases

Highway Analogy

  • 50% capacity → Smooth traffic

  • 80% → Slower, constrained

  • 90% → Unpredictable

  • 100% → Traffic jam

  • Insight:

    • Small increases near capacity → Huge latency spikes
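Both analogies match what basic queueing theory predicts. In an M/M/1 model, the mean time a request spends in the system is the service time divided by (1 − utilization). A sketch with an assumed 10 ms of actual work per request:

```python
def mm1_latency_ms(service_ms, utilization):
    # M/M/1 queue: mean time in system W = S / (1 - rho).
    # Only meaningful while utilization < 1; at 1 the queue grows without bound.
    assert 0 <= utilization < 1
    return service_ms / (1 - utilization)

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{rho:.0%} utilization -> {mm1_latency_ms(10, rho):.0f} ms")
```

Going from 90% to 99% utilization multiplies latency by 10, even though the server is doing the same 10 ms of work per request. The extra time is pure queueing.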

Key Takeaway on Utilization

  • Never run systems at 100% utilization

  • Ideal range:

    • 60–80% utilization
    • Keep 20–40% headroom for spikes

Traffic Behavior

  • Traffic is bursty, not uniform:

    • Sudden spikes → Overload system
  • Even if average load is low:

    • Spikes can exceed capacity

Bottlenecks

  • Bottleneck = Specific component causing slowness

  • Common mistake:

    • Jumping to solutions without identifying bottleneck

Wrong Approaches

  • Add caching blindly
  • Upgrade database
  • Add more servers (horizontal scaling)

Example: Misidentified Bottleneck

  • API appears slow → Assumption: Database is slow

  • Action:

    • Added caching (Redis)
  • Result:

    • No improvement

Actual Issue

  • Logging function:

    • Synchronous remote logging
    • Took ~500 ms
  • DB query:

    • Only ~10 ms

Lesson

  • Never guess → Always measure

Measurement & Debugging

Key Principle

  • Always measure each component:

    • DB queries
    • API calls
    • Serialization
    • Network latency

Profiling

  • Profiling = Measuring where application spends time

  • Features:

    • Tracks function execution
    • Measures CPU usage
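In Python, for instance, the standard-library cProfile module does this (the workload function here is invented for illustration):

```python
import cProfile
import io
import pstats

def cpu_heavy():
    # A CPU-bound loop the profiler can attribute time to.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
cpu_heavy()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # table of call counts and per-function time
```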

Flame Graph

  • Visual representation of profiling:

    • Wide blocks → More time-consuming functions
    • Stacked blocks → Call hierarchy

CPU-bound vs IO-bound Tasks

  • CPU-bound:

    • Computation-heavy (good for profilers)
  • IO-bound:

    • DB queries
    • API calls
    • File operations
    • Network latency
  • Backend systems are mostly IO-bound, and CPU profilers reveal little there: the thread is waiting on IO, not computing.


Distributed Tracing

  • Tracks a request across system components:

    • API → DB → External service → Response
  • Helps identify:

    • Where time is spent
    • Exact bottleneck location
  • Example insight:

    • Business logic: 2 ms
    • DB query: 800 ms → Actual bottleneck
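Production systems use real tracers such as OpenTelemetry or Jaeger, but the core idea is just timing named spans along the request path. A toy sketch (span names and sleep durations are made up; the sleeps stand in for real work):

```python
import time
from contextlib import contextmanager

spans = {}  # span name -> elapsed seconds

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans[name] = time.perf_counter() - start

with span("handle_request"):
    with span("business_logic"):
        time.sleep(0.002)   # stand-in for 2 ms of logic
    with span("db_query"):
        time.sleep(0.05)    # stand-in for the slow query

# The widest child span points at the bottleneck.
bottleneck = max(("business_logic", "db_query"), key=lambda n: spans[n])
print(bottleneck)  # db_query
```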

Database as Bottleneck

  • Databases are often bottlenecks because they:

    • Persist data on disk
    • Handle concurrency (reads/writes)
    • Execute complex queries
    • Ensure consistency

N + 1 Query Problem

Definition

  • 1 query to fetch a list of N items
  • N additional queries, one per item, to fetch each item’s details

Example (Frontend Perspective)

  • Fetch 20 blog posts:

    • 1 API call → get posts
    • 20 API calls → get authors
  • Total = 21 API calls


Problem with N + 1

  • Linear growth:

    • N items → N+1 queries
  • High overhead:

    • Network latency
    • Query parsing & execution
    • Connection setup
  • Example:

    • 1000 queries × 5 ms = 5 seconds latency

Solution to N + 1

  • Batch fetching:

    • Collect all IDs
    • Fetch in a single query
  • Result:

    • 2 queries total:

      • Fetch posts
      • Fetch all authors
  • Complexity:

    • Constant (O(1) queries), not linear

Server-Side Perspective

  • N + 1 occurs at:

    • Backend → Database level
  • Example:

    • Looping over posts and querying author per post
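A minimal sketch of both versions using Python's built-in sqlite3 (schema and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER)")
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, f"post {i}", 1 + i % 2) for i in range(20)])

posts = conn.execute("SELECT id, title, author_id FROM posts").fetchall()

# N+1: one query per post's author -> 1 + 20 = 21 round trips.
slow = {aid: conn.execute("SELECT name FROM authors WHERE id = ?", (aid,)).fetchone()[0]
        for _, _, aid in posts}

# Batched: collect the IDs, fetch them in a single IN (...) query -> 2 round trips.
ids = sorted({aid for _, _, aid in posts})
marks = ",".join("?" * len(ids))
fast = dict(conn.execute(f"SELECT id, name FROM authors WHERE id IN ({marks})", ids))

print(slow == fast)  # True: same data, 21 queries vs 2
```

ORMs expose the same fix as eager loading (e.g. join or prefetch options); the mechanism underneath is this batch query.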

Key Takeaways

  • Latency is the core performance metric

  • Averages are misleading → Use percentiles (P50, P90, P99)

  • Latency increases non-linearly with load

  • Maintain headroom (20–40%)

  • Always identify bottlenecks before optimizing

  • Use:

    • Profiling → CPU issues
    • Distributed tracing → IO issues
  • Avoid:

    • Blind optimizations
    • N + 1 query patterns