Scaling & Performance in Backend Systems

Introduction

  • Scaling and Performance are core concepts in backend systems and infrastructure.

  • Their meaning varies across domains (frontend vs backend), but this discussion focuses on backend engineering.

  • Goal:

    • Build intuition for:

      • How systems behave under load
      • Where bottlenecks occur
      • How to think during system failures
    • Learn concepts that apply universally across systems.


What is Performance?

  • A system is considered fast based on user experience.

  • Example flow:

    • User clicks a button → Browser sends request → Server processes → Database/API calls → Response → UI renders.
  • The total time taken for this flow is called:

Latency

  • Latency = Time from user action → final response rendering.

  • It is the primary metric of performance.

  • When users say:

    • “App is slow” → They are referring to high latency.

Latency Characteristics

  • Latency is not constant:

    • One request: 50 ms
    • Another: 200 ms
  • Reasons for variation:

    • Cache hits (CDN, Redis)
    • Server load (concurrent requests)
    • Network variability
    • Database query complexity

Why Average Latency is Misleading

  • Example:

    • 99% requests → 50 ms
    • 1% requests → 5 seconds
    • Average ≈ 100 ms → Looks good, but misleading
  • Real impact:

    • At 1M requests/day → 10,000 users experience 5s delay
    • These users have terrible experience, hidden by averages
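The arithmetic above is easy to reproduce. A minimal sketch, with sample counts chosen to match the 99%/1% split from the example:

```python
# 99% of requests at 50 ms, 1% stuck at 5000 ms (1,000 sample requests).
latencies_ms = [50] * 990 + [5000] * 10

average = sum(latencies_ms) / len(latencies_ms)
print(average)  # 99.5 -> looks healthy, yet 10 of these users waited 5 seconds
```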

Percentiles (Better Metric for Latency)

  • Instead of averages, use percentiles:

Key Percentiles

  • P50 (Median):

    • 50% of requests complete at or below this latency
  • P90:

    • 90% of requests complete at or below this latency
    • The slowest 10% take longer
  • P99:

    • 99% of requests complete at or below this latency
    • The slowest 1% take longer still
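A percentile is simple to compute by hand. The sketch below uses the nearest-rank convention (one common choice; NumPy and most monitoring tools interpolate instead), with an assumed distribution where 2% of requests are slow:

```python
import math

def percentile(samples, p):
    # Nearest-rank method: smallest value such that at least p% of
    # samples are at or below it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 98% of requests at 50 ms, 2% stuck at 5000 ms
latencies_ms = [50] * 980 + [5000] * 20

print(percentile(latencies_ms, 50))  # 50   -> median looks healthy
print(percentile(latencies_ms, 90))  # 50   -> still healthy
print(percentile(latencies_ms, 99))  # 5000 -> the tail is exposed
```

This is exactly why P99 catches problems that the median (and the average) hide.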

Importance of P95 / P99

  • Backend engineers focus heavily on P95 / P99 because:

    • Represent worst-performing requests

    • Often involve:

      • Complex business logic
      • Heavy DB queries
      • External API calls
  • These requests often belong to:

    • High-value users (e.g., payments, purchases)

Throughput

  • Latency → Time per request

  • Throughput → Number of requests handled per unit time

  • Measured as:

    • Requests per second (RPS)
    • Requests per minute

Latency vs Throughput Relationship

  • At low load:

    • Low throughput → Low latency
  • As load increases:

    • Throughput ↑ → Latency ↑
  • After a threshold:

    • Latency increases dramatically (non-linear)

Real-World Questions Answered by Throughput

  • Can system handle:

    • Black Friday traffic?
    • Sudden spikes (email campaigns)?
    • Viral traffic?
  • Helps determine:

    • Maximum concurrent users
    • When scaling is required

Utilization

  • Utilization = % of system capacity being used

  • Examples:

    • 0% → Idle system
    • 100% → Fully saturated (risk of collapse)

Utilization vs Latency (Critical Concept)

  • Expected (wrong intuition):

    • Linear increase in latency
  • Reality:

    • Latency rises sharply and non-linearly as utilization approaches 100% (in simple queueing models it scales like 1 / (1 − utilization))

Ice Cream Shop Analogy

  • Low utilization:

    • No queue → Instant service → Low latency
  • High utilization:

    • Long queue → Wait time increases → High latency
  • Key insight:

    • Processing speed same, but waiting time increases

Highway Analogy

  • 50% capacity → Smooth traffic

  • 80% → Slower, constrained

  • 90% → Unpredictable

  • 100% → Traffic jam

  • Insight:

    • Small increases near capacity → Huge latency spikes
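Both analogies match what basic queueing theory predicts. In an M/M/1 model, the mean time a request spends in the system is the service time divided by (1 − utilization). A sketch with an assumed 10 ms of actual work per request:

```python
def mm1_latency_ms(service_ms, utilization):
    # M/M/1 queue: mean time in system W = S / (1 - rho).
    # Only meaningful while utilization < 1; at 1 the queue grows without bound.
    assert 0 <= utilization < 1
    return service_ms / (1 - utilization)

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{rho:.0%} utilization -> {mm1_latency_ms(10, rho):.0f} ms")
```

Going from 90% to 99% utilization multiplies latency by 10, even though the server is doing the same 10 ms of work per request. The extra time is pure queueing.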

Key Takeaway on Utilization

  • Never run systems at 100% utilization

  • Ideal range:

    • 60–80% utilization
    • Keep 20–40% headroom for spikes

Traffic Behavior

  • Traffic is bursty, not uniform:

    • Sudden spikes → Overload system
  • Even if average load is low:

    • Spikes can exceed capacity

Bottlenecks

  • Bottleneck = Specific component causing slowness

  • Common mistake:

    • Jumping to solutions without identifying bottleneck

Wrong Approaches

  • Add caching blindly
  • Upgrade database
  • Add more servers (horizontal scaling)

Example: Misidentified Bottleneck

  • API appears slow → Assumption: Database is slow

  • Action:

    • Added caching (Redis)
  • Result:

    • No improvement

Actual Issue

  • Logging function:

    • Synchronous remote logging
    • Took ~500 ms
  • DB query:

    • Only ~10 ms

Lesson

  • Never guess → Always measure

Measurement & Debugging

Key Principle

  • Always measure each component:

    • DB queries
    • API calls
    • Serialization
    • Network latency

Profiling

  • Profiling = Measuring where application spends time

  • Features:

    • Tracks function execution
    • Measures CPU usage
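In Python, for instance, the standard-library cProfile module does this (the workload function here is invented for illustration):

```python
import cProfile
import io
import pstats

def cpu_heavy():
    # A CPU-bound loop the profiler can attribute time to.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
cpu_heavy()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # table of call counts and per-function time
```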

Flame Graph

  • Visual representation of profiling:

    • Wide blocks → More time-consuming functions
    • Stacked blocks → Call hierarchy

CPU-bound vs IO-bound Tasks

  • CPU-bound:

    • Computation-heavy (good for profilers)
  • IO-bound:

    • DB queries
    • API calls
    • File operations
    • Network latency
  • Backend systems are mostly IO-bound, and CPU profilers reveal little there: the thread is waiting on IO, not computing.


Distributed Tracing

  • Tracks a request across system components:

    • API → DB → External service → Response
  • Helps identify:

    • Where time is spent
    • Exact bottleneck location
  • Example insight:

    • Business logic: 2 ms
    • DB query: 800 ms → Actual bottleneck
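Production systems use real tracers such as OpenTelemetry or Jaeger, but the core idea is just timing named spans along the request path. A toy sketch (span names and sleep durations are made up; the sleeps stand in for real work):

```python
import time
from contextlib import contextmanager

spans = {}  # span name -> elapsed seconds

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans[name] = time.perf_counter() - start

with span("handle_request"):
    with span("business_logic"):
        time.sleep(0.002)   # stand-in for 2 ms of logic
    with span("db_query"):
        time.sleep(0.05)    # stand-in for the slow query

# The widest child span points at the bottleneck.
bottleneck = max(("business_logic", "db_query"), key=lambda n: spans[n])
print(bottleneck)  # db_query
```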

Database as Bottleneck

  • Databases are often bottlenecks because they:

    • Persist data on disk
    • Handle concurrency (reads/writes)
    • Execute complex queries
    • Ensure consistency

N + 1 Query Problem

Definition

  • 1 query to fetch a list of N items
  • N additional queries, one per item, to fetch each item’s details

Example (Frontend Perspective)

  • Fetch 20 blog posts:

    • 1 API call → get posts
    • 20 API calls → get authors
  • Total = 21 API calls


Problem with N + 1

  • Linear growth:

    • N items → N+1 queries
  • High overhead:

    • Network latency
    • Query parsing & execution
    • Connection setup
  • Example:

    • 1000 queries × 5 ms = 5 seconds latency

Solution to N + 1

  • Batch fetching:

    • Collect all IDs
    • Fetch in a single query
  • Result:

    • 2 queries total:

      • Fetch posts
      • Fetch all authors
  • Complexity:

    • Constant (O(1) queries), not linear

Server-Side Perspective

  • N + 1 occurs at:

    • Backend → Database level
  • Example:

    • Looping over posts and querying author per post
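A minimal sketch of both versions using Python's built-in sqlite3 (schema and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER)")
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, f"post {i}", 1 + i % 2) for i in range(20)])

posts = conn.execute("SELECT id, title, author_id FROM posts").fetchall()

# N+1: one query per post's author -> 1 + 20 = 21 round trips.
slow = {aid: conn.execute("SELECT name FROM authors WHERE id = ?", (aid,)).fetchone()[0]
        for _, _, aid in posts}

# Batched: collect the IDs, fetch them in a single IN (...) query -> 2 round trips.
ids = sorted({aid for _, _, aid in posts})
marks = ",".join("?" * len(ids))
fast = dict(conn.execute(f"SELECT id, name FROM authors WHERE id IN ({marks})", ids))

print(slow == fast)  # True: same data, 21 queries vs 2
```

ORMs expose the same fix as eager loading (e.g. join or prefetch options); the mechanism underneath is this batch query.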

Key Takeaways

  • Latency is the core performance metric

  • Averages are misleading → Use percentiles (P50, P90, P99)

  • Latency increases non-linearly with load

  • Maintain headroom (20–40%)

  • Always identify bottlenecks before optimizing

  • Use:

    • Profiling → CPU issues
    • Distributed tracing → IO issues
  • Avoid:

    • Blind optimizations
    • N + 1 query patterns