pgstream provides comprehensive observability through OpenTelemetry metrics and distributed tracing. This document outlines the available metrics and how to use them for monitoring your data streaming pipeline.Documentation Index
Fetch the complete documentation index at: https://xata.io/docs/llms.txt
Use this file to discover all available pages before exploring further.
Overview
pgstream instruments all major components of the data streaming pipeline with metrics and traces using OpenTelemetry. The instrumentation covers:- WAL replication and processing
- Snapshot generation
- Kafka read/write operations
- PostgreSQL query operations
- Search indexing operations
- Data transformations
- Go runtime metrics (memory, GC, goroutines, etc.)
- Go profiling (CPU, memory, goroutines, etc.)
Quick Start with Local Setup
pgstream includes a local observability setup using SigNoz with a pre-configured dashboard that visualizes all the metrics described below. To get started:Metrics
All metrics follow thepgstream.* naming convention and include relevant attributes for filtering and aggregation.
WAL Replication
pgstream.replication.lag
pgstream.replication.lag
Unit: bytes
Description: Replication lag in bytes accrued by the pgstream WAL consumer
pg_stat_replication.flush_lag, pg_stat_replication.replay_lag) to get a complete picture of replication health.
WAL Event Processing
pgstream.target.processing.lag
pgstream.target.processing.lag
Unit: ns
Description: Time between WAL event creation and processing
pgstream.target.processing.latency
pgstream.target.processing.latency
Unit: ns
Description: Time taken to process a WAL event
target: Target type (e.g., “postgres”, “kafka”, “search”)
Snapshot Operations
pgstream.snapshot.generator.latency
pgstream.snapshot.generator.latency
Unit: ms
Description: Time taken to snapshot a source PostgreSQL database
snapshot_schema: List of schemas being snapshottedsnapshot_tables: List of tables being snapshotted
Kafka Operations
Writer Metrics
pgstream.kafka.writer.batch.size
pgstream.kafka.writer.batch.size
Unit: messages
Description: Distribution of message batch sizes
pgstream.kafka.writer.batch.bytes
pgstream.kafka.writer.batch.bytes
Unit: bytes
Description: Distribution of message batch sizes in bytes
pgstream.kafka.writer.latency
pgstream.kafka.writer.latency
Unit: ms
Description: Time taken to send messages to Kafka
Reader Metrics
pgstream.kafka.reader.msg.bytes
pgstream.kafka.reader.msg.bytes
Unit: bytes
Description: Distribution of message sizes read from Kafka
pgstream.kafka.reader.fetch.latency
pgstream.kafka.reader.fetch.latency
Unit: ms
Description: Time taken to fetch messages from Kafka
pgstream.kafka.reader.commit.latency
pgstream.kafka.reader.commit.latency
Unit: ms
Description: Time taken to commit offsets to Kafka
pgstream.kafka.reader.commit.batch.size
pgstream.kafka.reader.commit.batch.size
Unit: offsets
Description: Distribution of offset batch sizes committed
PostgreSQL Operations
pgstream.postgres.querier.latency
pgstream.postgres.querier.latency
Unit: ms
Description: Time taken to perform PostgreSQL queries
query_type: Type of SQL operation (e.g., “SELECT”, “INSERT”, “UPDATE”, “DELETE”, “tx”)query: The actual SQL query (for non-transaction operations)
Search Operations
pgstream.search.store.doc.errors
pgstream.search.store.doc.errors
Unit: errors
Description: Count of failed document indexing operations
severity: Error severity level (one of “NONE”, “DATALOSS”, “IGNORED”, “RETRIABLE”)
Data Transformations
pgstream.transformer.latency
pgstream.transformer.latency
Unit: ms
Description: Time taken to transform data values
transformer_type: Type of transformer being used
Go Runtime Metrics
pgstream automatically collects Go runtime metrics using the OpenTelemetry Go runtime instrumentation. These metrics are essential for monitoring application health and performance:runtime.go.mem.heap_alloc
runtime.go.mem.heap_alloc
Unit: bytes
Description: Bytes of allocated heap objects
runtime.go.mem.heap_idle
runtime.go.mem.heap_idle
Unit: bytes
Description: Bytes in idle (unused) heap spans
runtime.go.mem.heap_inuse
runtime.go.mem.heap_inuse
Unit: bytes
Description: Bytes in in-use heap spans
runtime.go.mem.heap_objects
runtime.go.mem.heap_objects
Unit: objects
Description: Number of allocated heap objects
runtime.go.mem.heap_released
runtime.go.mem.heap_released
Unit: bytes
Description: Bytes of physical memory returned to the OS
runtime.go.mem.heap_sys
runtime.go.mem.heap_sys
Unit: bytes
Description: Bytes of heap memory obtained from the OS
runtime.go.gc.count
runtime.go.gc.count
Unit: collections
Description: Number of completed GC cycles
runtime.go.gc.pause_ns
runtime.go.gc.pause_ns
Unit: ns
Description: Amount of nanoseconds in GC stop-the-world pauses
runtime.go.goroutines
runtime.go.goroutines
Unit: goroutines
Description: Number of goroutines that currently exist
runtime.go.lookups
runtime.go.lookups
Unit: lookups
Description: Number of pointer lookups performed by the runtime
runtime.go.mem.gc_sys
runtime.go.mem.gc_sys
Unit: bytes
Description: Bytes of memory in garbage collection metadata
Configuration
Enabling Metrics
Configure metrics collection in your pgstream configuration file:Monitoring Dashboards
Pre-built pgstream Dashboard
The included SigNoz dashboard provides:- Overview Panel: System health and key metrics at a glance
- Replication Health: Both pgstream and PostgreSQL replication metrics
- Processing Performance: Latency and throughput across all components
- Error Tracking: Error rates and failure patterns
- Resource Utilization: Memory, CPU, and network usage
- Go Runtime Health: Memory usage, GC performance, and goroutine tracking
Key Performance Indicators (KPIs)
-
Application Health
runtime.go.mem.heap_alloc- Memory usage trendsruntime.go.goroutines- Goroutine count (detect leaks)runtime.go.gc.pause_ns- GC pause times (performance impact)
-
Replication Health
pgstream.replication.lag- Should remain low and stablepgstream.target.processing.lag- End-to-end processing delay- PostgreSQL replication lag - Monitor source database metrics
-
Throughput
pgstream.kafka.writer.batch.size- Messages per batchpgstream.kafka.writer.batch.bytes- Bytes per batchrate(pgstream.postgres.querier.latency_count[5m])- Query rate
-
Latency
pgstream.target.processing.latency(p95, p99) - Processing latency percentilespgstream.kafka.writer.latency(p95, p99) - Kafka write latencypgstream.postgres.querier.latency(p95, p99) - Database query latency
-
Error Rates
rate(pgstream.search.store.doc.errors[5m])- Search indexing errors- Error logs from distributed traces
Source PostgreSQL Monitoring
⚠️ Critical: pgstream metrics only track the consumer-side lag. For complete replication monitoring, you must also monitor your source PostgreSQL instance:Essential PostgreSQL Metrics
Distributed Tracing
pgstream automatically creates traces for all major operations. Traces include:- Complete request flows from WAL event to target system
- Database query execution details
- Kafka produce/consume operations
- Transformation pipeline execution
- Error context and stack traces
Profiling
pgstream includes built-in Go profiling capabilities using Go’snet/http/pprof package for performance analysis and debugging.
Enabling Profiling
Profiling can be enabled using the--profile flag on supported commands:
Profiling Modes
1. HTTP Endpoint (All Commands)
When--profile is enabled, pgstream exposes a profiling HTTP server at localhost:6060 with the following endpoints:
http://localhost:6060/debug/pprof/
http://localhost:6060/debug/pprof/
http://localhost:6060/debug/pprof/profile
http://localhost:6060/debug/pprof/profile
http://localhost:6060/debug/pprof/heap
http://localhost:6060/debug/pprof/heap
http://localhost:6060/debug/pprof/goroutine
http://localhost:6060/debug/pprof/goroutine
http://localhost:6060/debug/pprof/allocs
http://localhost:6060/debug/pprof/allocs
http://localhost:6060/debug/pprof/block
http://localhost:6060/debug/pprof/block
http://localhost:6060/debug/pprof/mutex
http://localhost:6060/debug/pprof/mutex
http://localhost:6060/debug/pprof/trace
http://localhost:6060/debug/pprof/trace
2. File Output (Snapshot Command Only)
For thesnapshot command, profiling also generates profile files:
cpu.prof- CPU profile for the entire snapshot operationmem.prof- Memory profile taken at the end of the operation
Using Profiling Data
With Go’s pprof Tool
-
CPU Hotspots: Identify functions consuming the most CPU time
-
Memory Usage: Find memory allocation patterns and potential leaks
-
Goroutine Analysis: Debug goroutine leaks or blocking operations
-
Blocked Operations:
Considerations
- Local Only: The profiling server binds to
localhost:6060and is not accessible externally - Disable in Production: Only enable profiling for debugging/optimization sessions
- Performance Impact: Profiling has minimal overhead but should be used carefully
Best Practices
- Use the Pre-built Dashboard: Start with the included SigNoz dashboard for immediate visibility
- Monitor Both Sides: Track both pgstream metrics AND source PostgreSQL replication metrics
- Watch Runtime Metrics: Monitor memory usage, GC performance, and goroutine counts for application health
- Set Up Comprehensive Alerting: Configure alerts for both application and infrastructure metrics
- Regular Health Checks: Use the dashboard to perform regular health assessments
- Analyze Traces for Debugging: Leverage distributed tracing for complex issue resolution
- Capacity Planning: Use historical metrics data to plan for scaling needs
- Enable Profiling Only When Needed: Don’t run profiling continuously in production
- Collect Sufficient Profile Data: Let profiling run for adequate time to collect meaningful samples
Troubleshooting
- High Memory Usage: Check
runtime.go.mem.heap_allocand look for memory leaks in processing pipelines - Goroutine Leaks: Monitor
runtime.go.goroutinesand investigate if count keeps growing - GC Pressure: High
runtime.go.gc.countrate may indicate memory allocation issues - High Lag: Check both
pgstream.replication.lagand PostgreSQLpg_stat_replicationfor the root cause - Write Failures: Monitor
pgstream.kafka.writer.latencyand check Kafka connectivity - Slow Snapshots: Use
pgstream.snapshot.generator.latencywith schema/table attributes to identify problematic tables - Query Performance: Analyze
pgstream.postgres.querier.latencybyquery_typeto find slow operations - Missing Data: Check PostgreSQL replication slot status and WAL retention policies
Getting Help
If you encounter issues with observability setup:- Check the SigNoz dashboard for immediate insights
- Review both pgstream and PostgreSQL metrics
- Examine Go runtime metrics for application health issues
- Use profiling to drill down into performance bottlenecks
- Examine distributed traces for detailed execution flow
- Consult the troubleshooting section above
- Check PostgreSQL logs for replication related errors