Scaling & Performance in Backend Systems
Introduction
-
Scaling and Performance are core concepts in backend systems and infrastructure.
-
Their meaning varies across domains (frontend vs backend), but this discussion focuses on backend engineering.
-
Goal:
-
Build intuition for:
- How systems behave under load
- Where bottlenecks occur
- How to think during system failures
-
Learn concepts that apply universally across systems.
-
What is Performance?
-
Whether a system is considered fast is judged by the user's experience.
-
Example flow:
- User clicks a button → Browser sends request → Server processes → Database/API calls → Response → UI renders.
-
The total time taken for this flow is called:
Latency
-
Latency = Time from user action → final response rendering.
-
It is the primary metric of performance.
-
When users say:
- “App is slow” → They are referring to high latency.
Latency Characteristics
-
Latency is not constant:
- One request: 50 ms
- Another: 200 ms
-
Reasons for variation:
- Cache hits (CDN, Redis)
- Server load (concurrent requests)
- Network variability
- Database query complexity
Why Average Latency is Misleading
-
Example:
- 99% requests → 50 ms
- 1% requests → 5 seconds
- Average ≈ 100 ms → Looks good, but misleading
-
Real impact:
- At 1M requests/day → 10,000 users experience 5s delay
- These users have terrible experience, hidden by averages
Percentiles (Better Metric for Latency)
- Instead of averages, use percentiles:
Key Percentiles
-
P50 (Median):
- 50% of users experience ≤ this latency
-
P90:
- 90% of users experience ≤ this latency
- 10% of users experience worse latency
-
P99:
- 99% of users experience ≤ this latency
- 1% of users experience worst-case latency
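The contrast between averages and percentiles can be sketched with a nearest-rank percentile function (a simplification; production monitoring usually uses histogram-based estimators). The 98%/2% latency split here is an illustrative sample, not from the notes above:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the latency p% of requests complete within."""
    ordered = sorted(samples)
    rank = max(0, int(len(ordered) * p / 100) - 1)
    return ordered[rank]

# Illustrative sample: 98% of requests take 50 ms, 2% take 5000 ms
latencies = [50] * 980 + [5000] * 20

avg = sum(latencies) / len(latencies)
print(avg)                          # 149.0 ms — looks acceptable
print(percentile(latencies, 50))    # P50 = 50 ms
print(percentile(latencies, 99))    # P99 = 5000 ms — the hidden tail
```

The average hides the tail entirely; P99 exposes it immediately.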
Importance of P95 / P99
-
Backend engineers focus heavily on P95 / P99 because:
-
They represent the worst-performing requests
-
Often involve:
- Complex business logic
- Heavy DB queries
- External API calls
-
These requests often belong to:
- High-value users (e.g., payments, purchases)
Throughput
-
Latency → Time per request
-
Throughput → Number of requests handled per unit time
-
Measured as:
- Requests per second (RPS)
- Requests per minute
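Throughput can be measured directly by running a handler in a loop and dividing request count by elapsed time. A minimal sketch; `fake_handler` is a hypothetical stand-in for real request processing:

```python
import time

def measure_throughput(handler, n_requests=1000):
    """Run handler n_requests times and report requests per second (RPS)."""
    start = time.perf_counter()
    for _ in range(n_requests):
        handler()
    elapsed = time.perf_counter() - start
    return n_requests / elapsed

def fake_handler():
    # hypothetical stand-in for real request processing
    sum(range(100))

print(f"{measure_throughput(fake_handler):.0f} RPS")
```

Real load tests run handlers concurrently (many clients at once) rather than in a single loop, but the metric is the same.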
Latency vs Throughput Relationship
-
At low load:
- Low throughput → Low latency
-
As load increases:
- Throughput ↑ → Latency ↑
-
After a threshold:
- Latency increases dramatically (non-linear)
Real-World Questions Answered by Throughput
-
Can the system handle:
- Black Friday traffic?
- Sudden spikes (email campaigns)?
- Viral traffic?
-
Helps determine:
- Maximum concurrent users
- When scaling is required
Utilization
-
Utilization = % of system capacity being used
-
Examples:
- 0% → Idle system
- 100% → Fully saturated (risk of collapse)
Utilization vs Latency (Critical Concept)
-
Expected (wrong intuition):
- Linear increase in latency
-
Reality:
- Latency grows exponentially near 100% utilization
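This non-linear blow-up follows from basic queueing theory, which is an added assumption here, not part of the notes above. For a single-server M/M/1 queue with service time S and utilization ρ, average time in system is W = S / (1 − ρ), which diverges as ρ → 1:

```python
def avg_latency_ms(utilization, service_time_ms=10):
    """M/M/1 average time in system: W = S / (1 - rho).

    Blows up as utilization approaches 1, even though the server's
    processing speed never changes — all the extra time is queueing.
    """
    if utilization >= 1:
        raise ValueError("system is saturated; queue grows without bound")
    return service_time_ms / (1 - utilization)

for rho in (0.5, 0.8, 0.9, 0.99):
    print(rho, round(avg_latency_ms(rho)))
# 0.5 → 20 ms, 0.8 → 50 ms, 0.9 → 100 ms, 0.99 → 1000 ms
```

Note the jump from 0.9 to 0.99: a 10% increase in load, a 10× increase in latency.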
Ice Cream Shop Analogy
-
Low utilization:
- No queue → Instant service → Low latency
-
High utilization:
- Long queue → Wait time increases → High latency
-
Key insight:
- Processing speed stays the same, but waiting time increases
Highway Analogy
-
50% capacity → Smooth traffic
-
80% → Slower, constrained
-
90% → Unpredictable
-
100% → Traffic jam
-
Insight:
- Small increases near capacity → Huge latency spikes
Key Takeaway on Utilization
-
Never run systems at 100% utilization
-
Ideal range:
- 60–80% utilization
- Keep 20% headroom for spikes
Traffic Behavior
-
Traffic is bursty, not uniform:
- Sudden spikes → Overload system
-
Even if average load is low:
- Spikes can exceed capacity
Bottlenecks
-
Bottleneck = Specific component causing slowness
-
Common mistake:
- Jumping to solutions without identifying bottleneck
Wrong Approaches
- Add caching blindly
- Upgrade database
- Add more servers (horizontal scaling)
Example: Misidentified Bottleneck
-
API appears slow → Assumption: Database is slow
-
Action:
- Added caching (Redis)
-
Result:
- No improvement
Actual Issue
-
Logging function:
- Synchronous remote logging
- Took ~500 ms
-
DB query:
- Only ~10 ms
Lesson
- Never guess → Always measure
Measurement & Debugging
Key Principle
-
Always measure each component:
- DB queries
- API calls
- Serialization
- Network latency
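Per-component timing can be sketched with a small context manager around each stage. The stage names and sleep durations are hypothetical stand-ins (scaled down from the logging example above):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock milliseconds spent in a named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# hypothetical request handler broken into measurable stages
with timed("db_query"):
    time.sleep(0.01)   # stand-in for a fast ~10 ms query
with timed("remote_logging"):
    time.sleep(0.05)   # stand-in for a slow synchronous log call

print(timings)  # remote_logging dominates, not the database
```

Measuring each stage separately is what reveals a surprise like the logging call above; guessing would have pointed at the database.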
Profiling
-
Profiling = Measuring where the application spends its time
-
Features:
- Tracks function execution
- Measures CPU usage
Flame Graph
-
Visual representation of profiling:
- Wide blocks → More time-consuming functions
- Stacked blocks → Call hierarchy
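In Python, the standard library's `cProfile` produces the raw data behind a flame graph. A minimal sketch with a hypothetical CPU-heavy handler:

```python
import cProfile
import io
import pstats

def expensive():
    # hypothetical CPU-bound work
    return sum(i * i for i in range(200_000))

def handler():
    return expensive()

profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# Print the top entries by cumulative time — the "widest blocks"
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

Tools like py-spy or speedscope can render the same data as an actual flame graph.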
CPU-bound vs IO-bound Tasks
-
CPU-bound:
- Computation-heavy (good for profilers)
-
IO-bound:
- DB queries
- API calls
- File operations
- Network latency
-
Backend systems are mostly IO-bound, and CPU profilers are poorly suited to finding IO bottlenecks.
Distributed Tracing
-
Tracks a request across system components:
- API → DB → External service → Response
-
Helps identify:
- Where time is spent
- Exact bottleneck location
-
Example insight:
- Business logic: 2 ms
- DB query: 800 ms → Actual bottleneck
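The idea can be illustrated with a toy span recorder (real systems use OpenTelemetry, Jaeger, or similar; the stage names and sleeps below are hypothetical, scaled down from the 800 ms example):

```python
import time

class Span:
    """Toy tracing span: records a stage name and its duration."""
    def __init__(self, name, trace):
        self.name, self.trace = name, trace
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        duration_ms = (time.perf_counter() - self.start) * 1000
        self.trace.append((self.name, duration_ms))

trace = []
with Span("handle_request", trace):
    with Span("business_logic", trace):
        time.sleep(0.002)   # fast, as in the example insight above
    with Span("db_query", trace):
        time.sleep(0.08)    # stand-in for the slow query

for name, ms in sorted(trace, key=lambda t: -t[1]):
    print(f"{name}: {ms:.0f} ms")   # db_query dominates
```

Unlike a CPU profiler, this captures wall-clock time spent waiting on IO, which is exactly where backend bottlenecks usually hide.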
Database as Bottleneck
-
Databases are often bottlenecks because they:
- Persist data on disk
- Handle concurrency (reads/writes)
- Execute complex queries
- Ensure consistency
N + 1 Query Problem
Definition
- 1 query to fetch N items
- N additional queries, one per item, to fetch each item's details
Example (Frontend Perspective)
-
Fetch 20 blog posts:
- 1 API call → get posts
- 20 API calls → get authors
-
Total = 21 API calls
Problem with N + 1
-
Linear growth:
- N items → N+1 queries
-
High overhead:
- Network latency
- Query parsing & execution
- Connection setup
-
Example:
- 1000 queries × 5 ms = 5 seconds latency
Solution to N + 1
-
Batch fetching:
- Collect all IDs
- Fetch in a single query
-
Result:
-
2 queries total:
- Fetch posts
- Fetch all authors
-
Complexity:
- Constant (O(1) queries), not linear
Server-Side Perspective
-
N + 1 occurs at:
- Backend → Database level
-
Example:
- Looping over posts and querying author per post
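Both patterns can be shown side by side with an in-memory SQLite database (the schema and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO posts VALUES (1, 'Queues', 1), (2, 'Caches', 2), (3, 'Tracing', 1);
""")

# N+1 pattern: one query for the posts, then one query per post for its author
posts = conn.execute("SELECT id, title, author_id FROM posts").fetchall()
n_plus_1 = []
for _, title, author_id in posts:
    (name,) = conn.execute(
        "SELECT name FROM authors WHERE id = ?", (author_id,)
    ).fetchone()
    n_plus_1.append((title, name))        # 1 + N = 4 queries total

# Batched pattern: collect the IDs, fetch all authors in a single query
ids = {author_id for _, _, author_id in posts}
placeholders = ",".join("?" * len(ids))
names = dict(conn.execute(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})", tuple(ids)
))
batched = [(title, names[author_id]) for _, title, author_id in posts]

assert n_plus_1 == batched  # same result, but a constant 2 queries
```

Most ORMs offer this batching directly (e.g. eager-loading of relations), so the fix is often a one-line change once the pattern is spotted.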
Key Takeaways
-
Latency is the core performance metric
-
Averages are misleading → Use percentiles (P50, P90, P99)
-
Latency increases non-linearly with load
-
Maintain headroom (20–40%)
-
Always identify bottlenecks before optimizing
-
Use:
- Profiling → CPU issues
- Distributed tracing → IO issues
-
Avoid:
- Blind optimizations
- N + 1 query patterns