pgstream provides comprehensive observability through OpenTelemetry metrics and distributed tracing. This document outlines the available metrics and how to use them for monitoring your data streaming pipeline.

Overview

pgstream instruments all major components of the data streaming pipeline with metrics and traces using OpenTelemetry. The instrumentation covers:
  • WAL replication and processing
  • Snapshot generation
  • Kafka read/write operations
  • PostgreSQL query operations
  • Search indexing operations
  • Data transformations
  • Go runtime metrics (memory, GC, goroutines, etc.)
  • Go profiling (CPU, memory, goroutines, etc.)

Quick Start with Local Setup

pgstream includes a local observability setup using SigNoz with a pre-configured dashboard that visualizes all the metrics described below. To get started:
# Start the SigNoz observability stack on its own
docker-compose -f build/docker/docker-compose-signoz.yml --profile instrumentation up

# Or run the pg2pg example pipeline with instrumentation enabled
docker-compose -f build/docker/docker-compose.yml -f build/docker/docker-compose-signoz.yml --profile pg2pg --profile instrumentation up

# Open the SigNoz UI
open http://localhost:8080

The dashboard includes panels for all key metrics and alerting rules, and provides a comprehensive view of your pgstream deployment's health and performance.
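To confirm the stack came up before opening the UI, you can list the compose services; a quick check using the same compose file:

# Confirm the SigNoz services are running
docker-compose -f build/docker/docker-compose-signoz.yml ps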

Metrics

All metrics follow the pgstream.* naming convention and include relevant attributes for filtering and aggregation.

WAL Replication

pgstream.replication.lag
Type: ObservableGauge
Unit: bytes
Description: Replication lag in bytes accrued by the pgstream WAL consumer
Attributes: None
Usage: Monitor replication health and detect lag buildup that could indicate performance issues.

⚠️ Important: This metric only tracks pgstream’s consumer lag. It’s strongly recommended to also monitor your source PostgreSQL metrics, particularly the built-in replication lag metrics (pg_stat_replication.flush_lag, pg_stat_replication.replay_lag), to get a complete picture of replication health.
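For example, the built-in lag columns can be checked directly on the source database; a minimal sketch, where the connection flags are placeholders to adapt to your environment:

# Source-side replication lag as seen by PostgreSQL (flush_lag/replay_lag are intervals)
psql -h <source-host> -U postgres -c \
  "SELECT application_name, state, flush_lag, replay_lag FROM pg_stat_replication;"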

WAL Event Processing

pgstream.target.processing.lag
Type: Histogram
Unit: ns
Description: Time between WAL event creation and processing

pgstream.target.processing.latency
Type: Histogram
Unit: ns
Description: Time taken to process a WAL event

Attributes:
  • target: Target type (e.g., “postgres”, “kafka”, “search”)
Usage: Monitor end-to-end processing latency for each target and identify bottlenecks in the processing pipeline.

Snapshot Operations

pgstream.snapshot.generator.latency
Type: Histogram
Unit: ms
Description: Time taken to snapshot a source PostgreSQL database
Attributes:
  • snapshot_schema: List of schemas being snapshotted
  • snapshot_tables: List of tables being snapshotted
Usage: Monitor snapshot performance and identify slow-running snapshot operations.

Kafka Operations

Writer Metrics

pgstream.kafka.writer.batch.size
Type: Histogram
Unit: messages
Description: Distribution of message batch sizes

pgstream.kafka.writer.batch.bytes
Type: Histogram
Unit: bytes
Description: Distribution of message batch sizes in bytes

pgstream.kafka.writer.latency
Type: Histogram
Unit: ms
Description: Time taken to send messages to Kafka

Reader Metrics

Message size
Type: Histogram
Unit: bytes
Description: Distribution of message sizes read from Kafka

Fetch latency
Type: Histogram
Unit: ms
Description: Time taken to fetch messages from Kafka

Commit latency
Type: Histogram
Unit: ms
Description: Time taken to commit offsets to Kafka

Commit batch size
Type: Histogram
Unit: offsets
Description: Distribution of offset batch sizes committed

Attributes: None
Usage: Monitor Kafka throughput, latency, and batch efficiency to optimize performance.

PostgreSQL Operations

pgstream.postgres.querier.latency
Type: Histogram
Unit: ms
Description: Time taken to perform PostgreSQL queries
Attributes:
  • query_type: Type of SQL operation (e.g., “SELECT”, “INSERT”, “UPDATE”, “DELETE”, “tx”)
  • query: The actual SQL query (for non-transaction operations)
Usage: Monitor database performance and identify slow queries.

Search Operations

pgstream.search.store.doc.errors
Type: Counter
Unit: errors
Description: Count of failed document indexing operations
Attributes:
  • severity: Error severity level (one of “NONE”, “DATALOSS”, “IGNORED”, “RETRIABLE”)
Usage: Monitor search indexing health and error rates.

Data Transformations

Transformation latency
Type: Histogram
Unit: ms
Description: Time taken to transform data values
Attributes:
  • transformer_type: Type of transformer being used
Usage: Monitor transformation performance and identify slow transformers.

Go Runtime Metrics

pgstream automatically collects Go runtime metrics using the OpenTelemetry Go runtime instrumentation. These metrics are essential for monitoring application health and performance:
  • runtime.go.mem.heap_alloc (Gauge, bytes): Bytes of allocated heap objects
  • runtime.go.mem.heap_idle (Gauge, bytes): Bytes in idle (unused) heap spans
  • runtime.go.mem.heap_inuse (Gauge, bytes): Bytes in in-use heap spans
  • runtime.go.mem.heap_objects (Gauge, objects): Number of allocated heap objects
  • runtime.go.mem.heap_released (Gauge, bytes): Bytes of physical memory returned to the OS
  • runtime.go.mem.heap_sys (Gauge, bytes): Bytes of heap memory obtained from the OS
  • runtime.go.gc.count (Counter, collections): Number of completed GC cycles
  • runtime.go.gc.pause_ns (Histogram, ns): Amount of nanoseconds in GC stop-the-world pauses
  • runtime.go.goroutines (Gauge, goroutines): Number of goroutines that currently exist
  • runtime.go.mem.lookups (Counter, lookups): Number of pointer lookups performed by the runtime
  • GC metadata memory (Gauge, bytes): Bytes of memory in garbage collection metadata
Usage: Monitor application resource usage, detect memory leaks, track GC performance, and identify goroutine leaks.

Configuration

Enabling Metrics

Configure metrics collection in your pgstream configuration file:
instrumentation:
  metrics:
    endpoint: "0.0.0.0:4317"
    collection_interval: 60 # collection interval for metrics in seconds. Defaults to 60s
  traces:
    endpoint: "0.0.0.0:4317"
    sample_ratio: 0.5 # ratio of traces that will be sampled. Must be between 0.0 and 1.0, where 0 means no traces are sampled and 1 means all traces are sampled.
Or using environment variables:
PGSTREAM_METRICS_ENDPOINT="http://localhost:4317"
PGSTREAM_METRICS_COLLECTION_INTERVAL=60s
PGSTREAM_TRACES_ENDPOINT="http://localhost:4317"
PGSTREAM_TRACES_SAMPLE_RATIO=0.5
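For example, a minimal shell setup, assuming a local OTLP collector listening on port 4317 and an existing pgstream configuration file:

# Point pgstream at the local collector and start streaming with instrumentation
export PGSTREAM_METRICS_ENDPOINT="http://localhost:4317"
export PGSTREAM_TRACES_ENDPOINT="http://localhost:4317"
pgstream run --config config.yaml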

Monitoring Dashboards

Pre-built pgstream Dashboard

The included SigNoz dashboard provides:
  • Overview Panel: System health and key metrics at a glance
  • Replication Health: Both pgstream and PostgreSQL replication metrics
  • Processing Performance: Latency and throughput across all components
  • Error Tracking: Error rates and failure patterns
  • Resource Utilization: Memory, CPU, and network usage
  • Go Runtime Health: Memory usage, GC performance, and goroutine tracking
[Screenshot: pgstream SigNoz dashboard]

Key Performance Indicators (KPIs)

  1. Application Health
    • runtime.go.mem.heap_alloc - Memory usage trends
    • runtime.go.goroutines - Goroutine count (detect leaks)
    • runtime.go.gc.pause_ns - GC pause times (performance impact)
  2. Replication Health
    • pgstream.replication.lag - Should remain low and stable
    • pgstream.target.processing.lag - End-to-end processing delay
    • PostgreSQL replication lag - Monitor source database metrics
  3. Throughput
    • pgstream.kafka.writer.batch.size - Messages per batch
    • pgstream.kafka.writer.batch.bytes - Bytes per batch
    • rate(pgstream.postgres.querier.latency_count[5m]) - Query rate
  4. Latency
    • pgstream.target.processing.latency (p95, p99) - Processing latency percentiles
    • pgstream.kafka.writer.latency (p95, p99) - Kafka write latency
    • pgstream.postgres.querier.latency (p95, p99) - Database query latency
  5. Error Rates
    • rate(pgstream.search.store.doc.errors[5m]) - Search indexing errors
    • Error logs from distributed traces

Source PostgreSQL Monitoring

⚠️ Critical: pgstream metrics only track the consumer-side lag. For complete replication monitoring, you must also monitor your source PostgreSQL instance:

Essential PostgreSQL Metrics

-- WAL generation rate: sample the current WAL position periodically;
-- the difference between two samples is the WAL generated in that interval
SELECT pg_current_wal_lsn();

-- Replication slot lag for all consumers
SELECT
    slot_name,
    active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS lag_mb
FROM pg_replication_slots;
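A rough way to estimate the WAL generation rate from the shell, assuming psql can connect with your default credentials (adapt the connection flags to your environment):

# Sample the WAL position 60 seconds apart and compute the delta in MB
lsn1=$(psql -Atc "SELECT pg_current_wal_lsn();")
sleep 60
lsn2=$(psql -Atc "SELECT pg_current_wal_lsn();")
psql -Atc "SELECT pg_wal_lsn_diff('$lsn2', '$lsn1') / 1024 / 1024 AS wal_mb_per_minute;"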

Distributed Tracing

pgstream automatically creates traces for all major operations. Traces include:
  • Complete request flows from WAL event to target system
  • Database query execution details
  • Kafka produce/consume operations
  • Transformation pipeline execution
  • Error context and stack traces
Use the included SigNoz setup or tools like Jaeger/Zipkin to visualize and analyze trace data for debugging and performance optimization.

[Screenshot: pgstream SigNoz tracing view]

Profiling

pgstream includes built-in Go profiling capabilities using Go’s net/http/pprof package for performance analysis and debugging.

Enabling Profiling

Profiling can be enabled using the --profile flag on supported commands:
# Profile a one-off snapshot
pgstream snapshot --profile --config config.yaml

# Profile a long-running replication process
pgstream run --profile --config config.yaml

Profiling Modes

1. HTTP Endpoint (All Commands)

When --profile is enabled, pgstream exposes a profiling HTTP server at localhost:6060 with the following endpoints:
  • /debug/pprof/: Profile index page
  • /debug/pprof/profile: CPU profile (30-second sample)
  • /debug/pprof/heap: Memory heap profile
  • /debug/pprof/goroutine: Goroutine profile
  • /debug/pprof/allocs: Memory allocation profile
  • /debug/pprof/block: Block profile
  • /debug/pprof/mutex: Mutex profile
  • /debug/pprof/trace: Execution trace
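Profiles can also be captured to a file and inspected later; a minimal sketch using the standard net/http/pprof handlers (the seconds parameter controls the sampling window):

# Capture a 30-second CPU profile, then open it in the pprof web UI
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof -http=:8080 cpu.prof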

2. File Output (Snapshot Command Only)

For the snapshot command, profiling also generates profile files:
  • cpu.prof - CPU profile for the entire snapshot operation
  • mem.prof - Memory profile taken at the end of the operation
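These files can be analyzed offline with the standard pprof tooling, for example:

# Inspect the snapshot profiles in the pprof web UI
go tool pprof -http=:8080 cpu.prof
go tool pprof -http=:8080 mem.prof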

Using Profiling Data

With Go’s pprof Tool

  1. CPU Hotspots: Identify functions consuming the most CPU time
    go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile
    
  2. Memory Usage: Find memory allocation patterns and potential leaks
    go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap
    
  3. Goroutine Analysis: Debug goroutine leaks or blocking operations
    go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine
    
  4. Blocked Operations:
    go tool pprof -http=:8080 http://localhost:6060/debug/pprof/block
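
Beyond these pprof views, the trace endpoint captures an execution trace that can be explored with Go's trace viewer; a short sketch:

# Capture a 5-second execution trace and open it in the trace viewer
curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
go tool trace trace.out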
    

Considerations

  • Local Only: The profiling server binds to localhost:6060 and is not accessible externally
  • Disable in Production: Only enable profiling for debugging/optimization sessions
  • Performance Impact: Profiling overhead is low but not zero; capture profiles deliberately rather than leaving them running

Best Practices

  1. Use the Pre-built Dashboard: Start with the included SigNoz dashboard for immediate visibility
  2. Monitor Both Sides: Track both pgstream metrics AND source PostgreSQL replication metrics
  3. Watch Runtime Metrics: Monitor memory usage, GC performance, and goroutine counts for application health
  4. Set Up Comprehensive Alerting: Configure alerts for both application and infrastructure metrics
  5. Regular Health Checks: Use the dashboard to perform regular health assessments
  6. Analyze Traces for Debugging: Leverage distributed tracing for complex issue resolution
  7. Capacity Planning: Use historical metrics data to plan for scaling needs
  8. Enable Profiling Only When Needed: Don’t run profiling continuously in production
  9. Collect Sufficient Profile Data: Let profiling run for adequate time to collect meaningful samples

Troubleshooting

  • High Memory Usage: Check runtime.go.mem.heap_alloc and look for memory leaks in processing pipelines
  • Goroutine Leaks: Monitor runtime.go.goroutines and investigate if count keeps growing
  • GC Pressure: High runtime.go.gc.count rate may indicate memory allocation issues
  • High Lag: Check both pgstream.replication.lag and PostgreSQL pg_stat_replication for the root cause
  • Write Failures: Monitor pgstream.kafka.writer.latency and check Kafka connectivity
  • Slow Snapshots: Use pgstream.snapshot.generator.latency with schema/table attributes to identify problematic tables
  • Query Performance: Analyze pgstream.postgres.querier.latency by query_type to find slow operations
  • Missing Data: Check PostgreSQL replication slot status and WAL retention policies

Getting Help

If you encounter issues with observability setup:
  1. Check the SigNoz dashboard for immediate insights
  2. Review both pgstream and PostgreSQL metrics
  3. Examine Go runtime metrics for application health issues
  4. Use profiling to drill down into performance bottlenecks
  5. Examine distributed traces for detailed execution flow
  6. Consult the troubleshooting section above
  7. Check PostgreSQL logs for replication related errors