Logging, Monitoring, and Observability
Overview
- Logging, monitoring, and observability are practices, not strict rules.
- These practices exist on a spectrum → no system is “perfectly implemented.”
- Widely used across companies, startups, and production systems.
- Closely tied to code implementation and infrastructure setup.
- Essential for modern distributed systems running across:
  - Multiple servers
  - Different regions
  - Global users
Why We Need Them
- Modern applications are distributed and complex.
- Need visibility into:
  - Application behavior
  - Infrastructure state
  - Request lifecycle

Key Goal
- Track what is happening in the system across:
  - Services
  - Servers
  - Databases
  - Requests
Logging

Definition
- Logging = recording events happening in the application.

What to Log
- Important events:
  - User actions (e.g., login)
  - API requests
  - Database queries
  - Errors and failures
  - Security-related events

Metadata in Logs
- Logs should include context:
  - User ID
  - Request ID
  - Timestamp
  - Function/method triggered
  - Latency
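The context fields above can be attached to every log line as one machine-readable record per event. A minimal stdlib sketch, assuming made-up field names (`user_id`, `request_id`, `latency_ms`) rather than any fixed standard:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

def log_event(event: str, **context) -> str:
    """Serialize an event plus its context as one JSON log line."""
    record = {"event": event, "timestamp": time.time(), **context}
    line = json.dumps(record)
    logger.info(line)
    return line

# Hypothetical login event with the context fields listed above:
line = log_event(
    "user_login",
    user_id="u-42",
    request_id=str(uuid.uuid4()),
    function="AuthService.login",
    latency_ms=12.5,
)
```

Because each line carries its own context, a later search for one `request_id` reconstructs everything that happened during that request.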
Purpose
- Acts like a journal/diary of the application.
- Helps answer:
  - What happened?
  - When did it happen?
  - Why did it happen?
Monitoring

Definition
- Monitoring = tracking system state over time.

What It Tracks
- System health and performance:
  - CPU usage
  - Memory usage
  - Requests per second
  - Database connections
  - Error rates

Characteristics
- Provides near real-time data (usually a ~10–15 second delay).
- Aggregates raw data into metrics.

Purpose
- Detect issues and patterns:
  - Performance degradation
  - Increased error rates
  - Resource bottlenecks
Observability

Definition
- Observability = the ability to infer a system's internal state from its external outputs.

Three Pillars of Observability
- Logs
  - Records of events
- Metrics
  - Numerical data over time (e.g., error rate, throughput)
- Traces
  - Track a request's flow across components

Traces (Transactions)
- A trace represents a full request journey.
- Tracks:
  - Where the request started (frontend, load balancer, backend)
  - Which components it passed through:
    - Handler layer
    - Service layer
    - Validation layer
    - Repository layer
    - Database

Purpose
- Helps identify:
  - Where a failure occurred
  - Latency bottlenecks
  - Component interactions
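The idea of a trace can be reduced to a toy: model one request as the ordered list of components it passed through, each marked ok or failed, and walk the list to find the failure point. The layer names are illustrative, borrowed from the layering above:

```python
def first_failure(trace):
    """Return the first component that failed, or None if all succeeded."""
    for component, ok in trace:
        if not ok:
            return component
    return None

# One request's journey through the system, in order:
trace = [
    ("load_balancer", True),
    ("handler", True),
    ("service", True),
    ("validation", True),
    ("repository", False),  # the DB insert failed here
]

failed_at = first_failure(trace)  # "repository"
```

Real tracers record far more per step (timing, attributes, parent/child spans), but failure-point identification works exactly like this walk.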
Monitoring vs Observability

Traditional Monitoring
- Detects that something is wrong
- Provides alerts (e.g., high error rate)
- Limited debugging capability

Observability
- Detects what is wrong and where
- Uses logs + metrics + traces
- Enables deep debugging
How They Work Together

Workflow Example
1. Alert triggered
   - Example: error rate > 80%
   - Notification via Slack/webhook
2. Check metrics
   - Error rate, throughput, response time
3. Inspect logs
   - View failed requests
   - Identify error messages
4. Analyze traces
   - Track the request path
   - Identify the exact failure point

Key Mapping
- Logs → What happened
- Metrics → Trends and patterns
- Traces → Component interactions
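The alerting step in the workflow is just a threshold check over counters. A minimal sketch, using the 80% figure from the example; the notification hook itself (Slack/webhook) is left out:

```python
ALERT_THRESHOLD = 0.80  # matches the "error rate > 80%" example

def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return failed_requests / total_requests if total_requests else 0.0

def should_alert(total: int, failed: int,
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the error rate crosses the alerting threshold."""
    return error_rate(total, failed) > threshold

# 85/100 failures trips the alert; 10/100 does not:
alerts = [should_alert(100, 85), should_alert(100, 10)]  # [True, False]
```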
Metrics

Definition
- Metrics = quantitative data about system behavior.

Examples
- Number of requests processed
- Number of failed requests
- Error rate
- Throughput
- Response time
- Business metrics (e.g., todos created)

Characteristics
- Can be real-time or historical
- Configurable based on system needs
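Aggregating raw request data into the metrics listed above can be sketched with the stdlib alone; production systems export these numbers to a metrics store (e.g., Prometheus) instead of keeping them in process:

```python
import statistics

class Metrics:
    """Toy in-process aggregator for per-request observations."""

    def __init__(self):
        self.response_times_ms = []
        self.failures = 0

    def observe(self, response_time_ms: float, failed: bool = False):
        self.response_times_ms.append(response_time_ms)
        if failed:
            self.failures += 1

    def summary(self) -> dict:
        n = len(self.response_times_ms)
        return {
            "requests": n,
            "error_rate": self.failures / n if n else 0.0,
            "avg_response_ms": statistics.fmean(self.response_times_ms) if n else 0.0,
        }

m = Metrics()
for ms, failed in [(10, False), (30, True), (20, False), (40, False)]:
    m.observe(ms, failed)

summary = m.summary()
# {'requests': 4, 'error_rate': 0.25, 'avg_response_ms': 25.0}
```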
Logging Best Practices

Log Levels
- Debug
  - Detailed logs for development
  - Disabled in production
- Info
  - General operations (e.g., successful actions)
- Warn
  - Non-critical issues (e.g., wrong password)
- Error
  - Failures (e.g., a DB query failure)
- Fatal
  - Critical failure → application crash/restart
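These levels map directly onto Python's stdlib `logging` module (which uses CRITICAL in place of FATAL). Raising the configured level to INFO in production is what makes DEBUG output disappear:

```python
import logging

# INFO and above will be emitted; DEBUG is dropped.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("app")

log.debug("raw SQL: SELECT * FROM todos")          # suppressed at INFO level
log.info("todo created successfully")
log.warning("login failed: wrong password")
log.error("database query failed")
log.critical("cannot connect to database, shutting down")

suppressed = not log.isEnabledFor(logging.DEBUG)   # True at INFO level
```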
Structured vs Unstructured Logging

Unstructured Logging (Development)
- Human-readable text
- Colored, formatted output
- Easy to debug locally

Structured Logging (Production)
- JSON format
- Machine-readable
- Easier for tools to parse

Why Structured Logs in Production?
- Tools (e.g., the ELK or Grafana stacks) need structured data
- Easier extraction of:
  - User ID
  - Request ID
  - Error details
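Switching to structured output can be as small as swapping the formatter. A minimal JSON formatter sketch with the stdlib; libraries such as python-json-logger do this more completely, and the extra field names here are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object instead of formatted text."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Pick up request context attached to the record (e.g., via `extra=`):
        for key in ("user_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

# Build a record by hand to show the output shape:
record = logging.LogRecord("app", logging.ERROR, "example.py", 0,
                           "db query failed", None, None)
record.user_id = "u-42"
line = JsonFormatter().format(record)
```

Because the output is JSON, a log pipeline can filter on `level` or `user_id` without regex-parsing free text.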
Instrumentation

Definition
- Instrumentation = measuring and collecting data from code.

Purpose
- Adds visibility into:
  - Function execution
  - Request lifecycle
  - System behavior

OpenTelemetry
- Open standard for observability
- Provides:
  - SDKs
  - APIs
  - Tools
- Works across languages:
  - Node.js, Go, Python, etc.
- Enables consistent instrumentation
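OpenTelemetry itself is a third-party SDK, so the toy below only imitates the *shape* of its tracer API (`start_as_current_span`, `set_attribute`) with the stdlib, to show what consistent instrumentation looks like in code:

```python
import contextlib
import time

class Span:
    def __init__(self, name):
        self.name = name
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value

class Tracer:
    """Toy tracer mimicking the OpenTelemetry span API shape."""

    def __init__(self):
        self.finished = []

    @contextlib.contextmanager
    def start_as_current_span(self, name):
        span = Span(name)
        start = time.perf_counter()
        try:
            yield span
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
            self.finished.append(span)

tracer = Tracer()
with tracer.start_as_current_span("create_todo") as span:
    span.set_attribute("user.id", "u-42")  # attribute key is illustrative

names = [s.name for s in tracer.finished]
```

With the real SDK, the `with` block and attribute calls look the same in every supported language, which is the consistency the standard is after.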
Example Workflow in Code

Request Lifecycle
1. Middleware
   - Creates a transaction (trace)
   - Adds metadata:
     - Service name
     - Environment
     - IP address
     - User ID
     - Request ID
2. Context passing
   - Transaction stored in context
   - Accessible across layers
3. Service layer
   - Extract the transaction
   - Add attributes (e.g., user ID, title)
4. Logging
   - Log important events:
     - Operation start
     - Validation
     - DB execution
     - Errors
     - Success
5. Error handling
   - Log the error with:
     - Level = error
     - Metadata
   - Attach the error to the trace
6. Success case
   - Log the success event
   - Add attributes to the trace
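The context-passing step above can be sketched with Python's `contextvars`: middleware stores the transaction once, and deeper layers read it back without threading it through every function signature. The transaction shape and service names are illustrative:

```python
import contextvars
import uuid

current_txn = contextvars.ContextVar("current_txn")

def middleware(handler, user_id):
    """Create a transaction, store it in context, run the handler."""
    txn = {"request_id": str(uuid.uuid4()),
           "user_id": user_id,
           "attributes": {}}
    token = current_txn.set(txn)
    try:
        return handler()
    finally:
        current_txn.reset(token)  # clean up when the request ends

def create_todo_service():
    # Service layer: extract the transaction from context, add attributes.
    txn = current_txn.get()
    txn["attributes"]["title"] = "buy milk"
    return {"ok": True, "request_id": txn["request_id"]}

result = middleware(create_todo_service, user_id="u-42")
```

This is the same mechanism OpenTelemetry's Python SDK uses under the hood for "current span" propagation, which is why the transaction stays accessible across layers without explicit parameters.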
Monitoring Implementation
- Middleware instruments each request
- Tracks:
  - Response time
  - Error rate
  - Throughput
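Such middleware is essentially a wrapper around each handler that records timing and success/failure counters; error rate and throughput fall out of those counters. A sketch with made-up handler names:

```python
import time

class Monitor:
    """Toy monitoring middleware: counts requests, errors, and total time."""

    def __init__(self):
        self.total = 0
        self.errors = 0
        self.total_time_ms = 0.0

    def wrap(self, handler):
        def instrumented(*args, **kwargs):
            self.total += 1
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                self.errors += 1
                raise
            finally:
                self.total_time_ms += (time.perf_counter() - start) * 1000
        return instrumented

monitor = Monitor()

@monitor.wrap
def get_todos():
    return ["buy milk"]

get_todos()
stats = (monitor.total, monitor.errors)  # (1, 0)
```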
Observability Tools

Open Source Stack
- Grafana → Dashboards
- Prometheus → Metrics
- Loki / Promtail → Logs
- Jaeger → Traces

Proprietary Tools
- New Relic
- Datadog

Trade-offs
- Open source:
  - More control
  - More setup complexity
- Proprietary:
  - Easier integration
  - Less maintenance
Dashboard Insights (Example)

Metrics View
- Error rate
- Average response time
- Throughput

Logs View
- Error details:
  - Status code
  - API route
  - Timestamp
  - Metadata

Traces View
- Full request path
- Component-level breakdown
- Failure point identification
Key Takeaways
- Logging, monitoring, and observability are essential for production systems.
- They must be implemented both in code and in infrastructure.
- Observability provides deep system understanding:
  - Logs → events
  - Metrics → trends
  - Traces → flow
- Implementation is incremental and evolving, never perfect.
Final Note
- Not a single “skill” but a core engineering practice.
- Critical for:
  - Debugging
  - Performance optimization
  - Reliability