Overview
pgstream instruments all major components of the data streaming pipeline with metrics and traces using OpenTelemetry. The instrumentation covers:

- WAL replication and processing
- Snapshot generation
- Kafka read/write operations
- PostgreSQL query operations
- Search indexing operations
- Data transformations
- Go runtime metrics (memory, GC, goroutines, etc.)
- Go profiling (CPU, memory, goroutines, etc.)
Quick Start with Local Setup
pgstream includes a local observability setup using SigNoz with a pre-configured dashboard that visualizes all the metrics described below. To get started:
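A minimal sketch of the usual flow, assuming the SigNoz stack is started from a Docker Compose file shipped with the repository (the compose path, profile name, and config flag below are assumptions; adjust them to your checkout):

```bash
# Assumed path and profile name for the bundled observability stack.
docker compose -f build/docker/docker-compose.yml --profile signoz up -d

# Run pgstream with instrumentation enabled so metrics and traces reach the
# local collector (flag name assumed; check `pgstream run --help`).
pgstream run --config pgstream.yaml
```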
Metrics

All metrics follow the pgstream.* naming convention and include relevant attributes for filtering and aggregation.
WAL Replication
pgstream.replication.lag
Unit: bytes
Description: Replication lag in bytes accrued by the pgstream WAL consumer
Note: This metric only reflects the lag accrued on the pgstream consumer side. Combine it with the source PostgreSQL replication metrics (pg_stat_replication.flush_lag, pg_stat_replication.replay_lag) to get a complete picture of replication health.
WAL Event Processing
pgstream.target.processing.lag
Unit: ns
Description: Time between WAL event creation and processing
pgstream.target.processing.latency
Unit: ns
Description: Time taken to process a WAL event
target: Target type (e.g., “postgres”, “kafka”, “search”)
Snapshot Operations
pgstream.snapshot.generator.latency
Unit: ms
Description: Time taken to snapshot a source PostgreSQL database
snapshot_schema: List of schemas being snapshotted
snapshot_tables: List of tables being snapshotted
Kafka Operations
Writer Metrics
pgstream.kafka.writer.batch.size
Unit: messages
Description: Distribution of message batch sizes
pgstream.kafka.writer.batch.bytes
Unit: bytes
Description: Distribution of message batch sizes in bytes
pgstream.kafka.writer.latency
Unit: ms
Description: Time taken to send messages to Kafka
Reader Metrics
pgstream.kafka.reader.msg.bytes
Unit: bytes
Description: Distribution of message sizes read from Kafka
pgstream.kafka.reader.fetch.latency
Unit: ms
Description: Time taken to fetch messages from Kafka
pgstream.kafka.reader.commit.latency
Unit: ms
Description: Time taken to commit offsets to Kafka
pgstream.kafka.reader.commit.batch.size
Unit: offsets
Description: Distribution of offset batch sizes committed
PostgreSQL Operations
pgstream.postgres.querier.latency
Unit: ms
Description: Time taken to perform PostgreSQL queries
query_type: Type of SQL operation (e.g., “SELECT”, “INSERT”, “UPDATE”, “DELETE”, “tx”)
query: The actual SQL query (for non-transaction operations)
Search Operations
pgstream.search.store.doc.errors
Unit: errors
Description: Count of failed document indexing operations
severity: Error severity level (one of “NONE”, “DATALOSS”, “IGNORED”, “RETRIABLE”)
Data Transformations
pgstream.transformer.latency
Unit: ms
Description: Time taken to transform data values
transformer_type: Type of transformer being used
Go Runtime Metrics
pgstream automatically collects Go runtime metrics using the OpenTelemetry Go runtime instrumentation. These metrics are essential for monitoring application health and performance:

runtime.go.mem.heap_alloc
Unit: bytes
Description: Bytes of allocated heap objects
runtime.go.mem.heap_idle
Unit: bytes
Description: Bytes in idle (unused) heap spans
runtime.go.mem.heap_inuse
Unit: bytes
Description: Bytes in in-use heap spans
runtime.go.mem.heap_objects
Unit: objects
Description: Number of allocated heap objects
runtime.go.mem.heap_released
Unit: bytes
Description: Bytes of physical memory returned to the OS
runtime.go.mem.heap_sys
Unit: bytes
Description: Bytes of heap memory obtained from the OS
runtime.go.gc.count
Unit: collections
Description: Number of completed GC cycles
runtime.go.gc.pause_ns
Unit: ns
Description: Amount of nanoseconds in GC stop-the-world pauses
runtime.go.goroutines
Unit: goroutines
Description: Number of goroutines that currently exist
runtime.go.lookups
Unit: lookups
Description: Number of pointer lookups performed by the runtime
runtime.go.mem.gc_sys
Unit: bytes
Description: Bytes of memory in garbage collection metadata
Configuration
Enabling Metrics
Configure metrics collection in your pgstream configuration file:
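A sketch of what this can look like, assuming a YAML config with an instrumentation section that exports OTLP metrics and traces; the key names below are assumptions, so verify them against the configuration reference for your pgstream version:

```bash
# Append an illustrative instrumentation block to the pgstream YAML config.
# Key names and values are assumptions; point the endpoint at your collector
# (e.g. the local SigNoz OTLP endpoint).
cat >> pgstream.yaml <<'EOF'
instrumentation:
  metrics:
    endpoint: "http://localhost:4317"   # OTLP collector endpoint (assumed key)
    collection_interval: 60             # export interval in seconds (assumed key)
  traces:
    endpoint: "http://localhost:4317"   # OTLP collector endpoint (assumed key)
    sample_ratio: 0.5                   # fraction of traces sampled (assumed key)
EOF
```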
Monitoring Dashboards

Pre-built pgstream Dashboard
The included SigNoz dashboard provides:

- Overview Panel: System health and key metrics at a glance
- Replication Health: Both pgstream and PostgreSQL replication metrics
- Processing Performance: Latency and throughput across all components
- Error Tracking: Error rates and failure patterns
- Resource Utilization: Memory, CPU, and network usage
- Go Runtime Health: Memory usage, GC performance, and goroutine tracking
Key Performance Indicators (KPIs)
- Application Health
  - runtime.go.mem.heap_alloc - Memory usage trends
  - runtime.go.goroutines - Goroutine count (detect leaks)
  - runtime.go.gc.pause_ns - GC pause times (performance impact)
- Replication Health
  - pgstream.replication.lag - Should remain low and stable
  - pgstream.target.processing.lag - End-to-end processing delay
  - PostgreSQL replication lag - Monitor source database metrics
- Throughput
  - pgstream.kafka.writer.batch.size - Messages per batch
  - pgstream.kafka.writer.batch.bytes - Bytes per batch
  - rate(pgstream.postgres.querier.latency_count[5m]) - Query rate
- Latency
  - pgstream.target.processing.latency (p95, p99) - Processing latency percentiles
  - pgstream.kafka.writer.latency (p95, p99) - Kafka write latency
  - pgstream.postgres.querier.latency (p95, p99) - Database query latency
- Error Rates
  - rate(pgstream.search.store.doc.errors[5m]) - Search indexing errors
  - Error logs from distributed traces
Source PostgreSQL Monitoring
⚠️ Critical: pgstream metrics only track the consumer-side lag. For complete replication monitoring, you must also monitor your source PostgreSQL instance:

Essential PostgreSQL Metrics
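For example, the standard pg_stat_replication view on the source database exposes the per-consumer lag columns mentioned above (the connection string is a placeholder):

```bash
# Inspect replication state and lag for WAL consumers on the source database
# (connection string is a placeholder).
psql "$SOURCE_POSTGRES_URL" -c "
  SELECT application_name,
         state,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
         write_lag,
         flush_lag,
         replay_lag
  FROM pg_stat_replication;"
```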
Distributed Tracing
pgstream automatically creates traces for all major operations. Traces include:

- Complete request flows from WAL event to target system
- Database query execution details
- Kafka produce/consume operations
- Transformation pipeline execution
- Error context and stack traces
Profiling
pgstream includes built-in Go profiling capabilities using Go’s net/http/pprof package for performance analysis and debugging.
Enabling Profiling
Profiling can be enabled using the --profile flag on supported commands:
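For example, with the snapshot command (other flags are omitted; any command that supports --profile works the same way):

```bash
# Run a snapshot with profiling enabled; while the command runs, the pprof
# HTTP server listens on localhost:6060 (add your usual snapshot flags/config).
pgstream snapshot --profile
```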
Profiling Modes
1. HTTP Endpoint (All Commands)
When --profile is enabled, pgstream exposes a profiling HTTP server at localhost:6060 with the following endpoints:
- http://localhost:6060/debug/pprof/ - Index of all available profiles
- http://localhost:6060/debug/pprof/profile - CPU profile (30-second sample by default)
- http://localhost:6060/debug/pprof/heap - Heap memory profile
- http://localhost:6060/debug/pprof/goroutine - Stack traces of all current goroutines
- http://localhost:6060/debug/pprof/allocs - Sampled memory allocations
- http://localhost:6060/debug/pprof/block - Stack traces that led to blocking on synchronization primitives
- http://localhost:6060/debug/pprof/mutex - Stack traces of holders of contended mutexes
- http://localhost:6060/debug/pprof/trace - Execution trace
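Profiles can also be captured over HTTP and saved for later analysis; the seconds query parameter is standard net/http/pprof behavior and controls the sampling window:

```bash
# Capture a 30-second CPU profile, a heap snapshot, and a 5-second execution
# trace from a pgstream process started with --profile.
curl -o cpu.pprof  'http://localhost:6060/debug/pprof/profile?seconds=30'
curl -o heap.pprof 'http://localhost:6060/debug/pprof/heap'
curl -o trace.out  'http://localhost:6060/debug/pprof/trace?seconds=5'
```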
2. File Output (Snapshot Command Only)
For the snapshot command, profiling also generates profile files:
- cpu.prof - CPU profile for the entire snapshot operation
- mem.prof - Memory profile taken at the end of the operation
Using Profiling Data
With Go’s pprof Tool
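A few typical invocations, assuming a local Go toolchain; the URLs and file names match the endpoints and snapshot output described above:

```bash
# Interactive CPU analysis against a running process (30-second sample).
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Browse the current heap profile in a web UI on port 8080.
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/heap'

# Analyze the files written by `pgstream snapshot --profile`.
go tool pprof -http=:8080 cpu.prof
go tool pprof -http=:8080 mem.prof
```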
Typical things to look for:

- CPU Hotspots: Identify functions consuming the most CPU time
- Memory Usage: Find memory allocation patterns and potential leaks
- Goroutine Analysis: Debug goroutine leaks or blocking operations
- Blocked Operations: Inspect where goroutines block on synchronization primitives (block and mutex profiles)
Considerations
- Local Only: The profiling server binds to localhost:6060 and is not accessible externally
- Disable in Production: Only enable profiling for debugging/optimization sessions
- Performance Impact: Profiling has minimal overhead but should be used carefully
Best Practices
- Use the Pre-built Dashboard: Start with the included SigNoz dashboard for immediate visibility
- Monitor Both Sides: Track both pgstream metrics AND source PostgreSQL replication metrics
- Watch Runtime Metrics: Monitor memory usage, GC performance, and goroutine counts for application health
- Set Up Comprehensive Alerting: Configure alerts for both application and infrastructure metrics
- Regular Health Checks: Use the dashboard to perform regular health assessments
- Analyze Traces for Debugging: Leverage distributed tracing for complex issue resolution
- Capacity Planning: Use historical metrics data to plan for scaling needs
- Enable Profiling Only When Needed: Don’t run profiling continuously in production
- Collect Sufficient Profile Data: Let profiling run for adequate time to collect meaningful samples
Troubleshooting
- High Memory Usage: Check runtime.go.mem.heap_alloc and look for memory leaks in processing pipelines
- Goroutine Leaks: Monitor runtime.go.goroutines and investigate if the count keeps growing
- GC Pressure: A high runtime.go.gc.count rate may indicate memory allocation issues
- High Lag: Check both pgstream.replication.lag and PostgreSQL pg_stat_replication for the root cause
- Write Failures: Monitor pgstream.kafka.writer.latency and check Kafka connectivity
- Slow Snapshots: Use pgstream.snapshot.generator.latency with schema/table attributes to identify problematic tables
- Query Performance: Analyze pgstream.postgres.querier.latency by query_type to find slow operations
- Missing Data: Check PostgreSQL replication slot status and WAL retention policies (see the example below)
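For the last point, the replication slot status and retained WAL can be checked directly on the source database (connection string is a placeholder):

```bash
# Verify the pgstream replication slot is active and see how much WAL it retains
# (connection string is a placeholder).
psql "$SOURCE_POSTGRES_URL" -c "
  SELECT slot_name,
         active,
         restart_lsn,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"
```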
Getting Help
If you encounter issues with observability setup:

- Check the SigNoz dashboard for immediate insights
- Review both pgstream and PostgreSQL metrics
- Examine Go runtime metrics for application health issues
- Use profiling to drill down into performance bottlenecks
- Examine distributed traces for detailed execution flow
- Consult the troubleshooting section above
- Check PostgreSQL logs for replication related errors