Telegraf and TimescaleDB Metrics Collection Guide#
This guide demonstrates how to set up a metrics collection pipeline using Telegraf to collect StatsD metrics and store them in PostgreSQL with TimescaleDB using Docker Compose.
Overview#
This setup creates a metrics pipeline that:
- Collects StatsD metrics via Telegraf on UDP port 8125
- Creates individual PostgreSQL tables for each metric type
- Stores metric tags as JSONB for flexible querying
- Automatically creates hypertables for time-series optimization
- Provides a complete Docker Compose configuration for easy deployment
Important Note on Table Structure#
The Telegraf PostgreSQL output plugin with the configuration in this guide creates individual tables for each metric name. For example:
- quickdid.http.request.count becomes table "quickdid.http.request.count"
- quickdid.resolver.rate_limit.available_permits becomes table "quickdid.resolver.rate_limit.available_permits"
Each table has the following structure:
- time (timestamptz) - The timestamp of the metric
- tags (jsonb) - All tags stored as a JSON object
- Metric-specific value columns (e.g., value for counters and gauges; mean, "99_percentile", etc. for timings and histograms)
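To see exactly which columns Telegraf created for a given metric, you can query the information schema. This is a small helper query, not part of the generated setup; the table name is one of the examples above:
-- List the columns Telegraf created for one metric table
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'quickdid.http.request.count'
ORDER BY ordinal_position;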
Prerequisites#
- Docker and Docker Compose installed
- Basic understanding of StatsD metrics format
- Familiarity with PostgreSQL/TimescaleDB concepts
Project Structure#
Create the following directory structure:
metrics-stack/
├── docker-compose.yml
├── telegraf/
│ └── telegraf.conf
├── init-scripts/
│ └── 01-init.sql
├── test-scripts/
│ ├── send-metrics.sh
│ └── verify-queries.sql
└── .env
Configuration Files#
1. Environment Variables (.env)#
Create a .env file to store sensitive configuration:
# PostgreSQL/TimescaleDB Configuration
POSTGRES_DB=metrics
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secretpassword
# Telegraf Database User
TELEGRAF_DB_USER=postgres
TELEGRAF_DB_PASSWORD=secretpassword
# TimescaleDB Settings
TIMESCALE_TELEMETRY=off
2. Telegraf Configuration (telegraf/telegraf.conf)#
Create the Telegraf configuration file:
# Global Telegraf Agent Configuration
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = false
hostname = "telegraf-agent"
omit_hostname = false
# StatsD Input Plugin
[[inputs.statsd]]
service_address = ":8125" # Listen on UDP port 8125 for StatsD metrics
protocol = "udp"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = [50, 90, 95, 99]
metric_separator = "."
allowed_pending_messages = 10000
datadog_extensions = true
datadog_distributions = true
# PostgreSQL (TimescaleDB) Output Plugin
[[outputs.postgresql]]
connection = "host=timescaledb user=${TELEGRAF_DB_USER} password=${TELEGRAF_DB_PASSWORD} dbname=${POSTGRES_DB} sslmode=disable"
schema = "public"
# Create individual tables for each metric with hypertable support
create_templates = [
'''CREATE TABLE IF NOT EXISTS {{.table}} ({{.columns}})''',
'''SELECT create_hypertable({{.table|quoteLiteral}}, 'time', if_not_exists => TRUE)''',
]
# Store all tags as JSONB for flexible querying
tags_as_jsonb = true
# Keep fields as separate columns for better performance on aggregations
fields_as_jsonb = false
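For a counter metric, the two create_templates above expand to roughly the following SQL. This is an illustrative sketch only; the exact column list and types come from the plugin's type mapping for the metric's fields, not from this guide:
-- Approximate DDL generated for a StatsD counter such as quickdid.http.request.count
CREATE TABLE IF NOT EXISTS "quickdid.http.request.count" (
    "time" timestamptz,
    "tags" jsonb,
    "value" bigint
);
SELECT create_hypertable('"quickdid.http.request.count"', 'time', if_not_exists => TRUE);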
3. Docker Compose Configuration (docker-compose.yml)#
Create the Docker Compose file:
version: '3.8'
services:
timescaledb:
image: timescale/timescaledb:latest-pg17
container_name: timescaledb
restart: unless-stopped
environment:
POSTGRES_DB: ${POSTGRES_DB}
POSTGRES_USER: ${POSTGRES_USER}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
TIMESCALE_TELEMETRY: ${TIMESCALE_TELEMETRY}
ports:
- "5442:5432"
volumes:
- timescale_data:/var/lib/postgresql/data
- ./init-scripts:/docker-entrypoint-initdb.d:ro
command:
- postgres
- -c
- shared_buffers=1GB
- -c
- effective_cache_size=3GB
- -c
- maintenance_work_mem=512MB
- -c
- work_mem=32MB
- -c
- timescaledb.max_background_workers=8
- -c
- max_parallel_workers_per_gather=2
- -c
- max_parallel_workers=8
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
interval: 10s
timeout: 5s
retries: 5
networks:
- metrics_network
telegraf:
image: telegraf:1.35
container_name: telegraf
restart: unless-stopped
environment:
TELEGRAF_DB_USER: ${TELEGRAF_DB_USER}
TELEGRAF_DB_PASSWORD: ${TELEGRAF_DB_PASSWORD}
POSTGRES_DB: ${POSTGRES_DB}
ports:
- "8125:8125/udp" # StatsD UDP port
volumes:
- ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
depends_on:
timescaledb:
condition: service_healthy
networks:
- metrics_network
command: ["telegraf", "--config", "/etc/telegraf/telegraf.conf"]
redis:
image: redis:7-alpine
container_name: redis
restart: unless-stopped
ports:
- "6379:6379"
volumes:
- redis_data:/data
command: redis-server --appendonly yes --appendfsync everysec
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
networks:
- metrics_network
networks:
metrics_network:
driver: bridge
volumes:
timescale_data:
redis_data:
4. Database Initialization Script (optional)#
Create init-scripts/01-init.sql to set up the TimescaleDB extension:
-- Enable TimescaleDB extension
CREATE EXTENSION IF NOT EXISTS timescaledb;
-- Enable additional useful extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
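After the stack starts, you can confirm the extensions are active. A simple verification query to run inside psql, not part of the init script:
-- Verify extensions after startup
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('timescaledb', 'pg_stat_statements');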
Test Scripts#
1. Send Test Metrics Script (test-scripts/send-metrics.sh)#
Create a script to send various types of metrics:
#!/bin/bash
# Send test metrics to StatsD/Telegraf
echo "Sending test metrics to StatsD on localhost:8125..."
# Counter metrics
for i in {1..10}; do
echo "quickdid.http.request.count:1|c|#method:GET,path:/resolve,status:200" | nc -u -w0 localhost 8125
echo "quickdid.http.request.count:1|c|#method:POST,path:/api,status:201" | nc -u -w0 localhost 8125
echo "quickdid.http.request.count:1|c|#method:GET,path:/resolve,status:404" | nc -u -w0 localhost 8125
done
# Gauge metrics
echo "quickdid.resolver.rate_limit.available_permits:10|g" | nc -u -w0 localhost 8125
echo "quickdid.resolver.rate_limit.available_permits:8|g" | nc -u -w0 localhost 8125
echo "quickdid.resolver.rate_limit.available_permits:5|g" | nc -u -w0 localhost 8125
# Timing metrics (in milliseconds)
for i in {1..20}; do
duration=$((RANDOM % 100 + 10))
echo "quickdid.http.request.duration_ms:${duration}|ms|#method:GET,path:/resolve,status:200" | nc -u -w0 localhost 8125
done
for i in {1..10}; do
duration=$((RANDOM % 200 + 50))
echo "quickdid.http.request.duration_ms:${duration}|ms|#method:POST,path:/api,status:201" | nc -u -w0 localhost 8125
done
# Histogram metrics
for i in {1..15}; do
resolution_time=$((RANDOM % 500 + 50))
echo "quickdid.resolver.resolution_time:${resolution_time}|h|#resolver:redis" | nc -u -w0 localhost 8125
echo "quickdid.resolver.resolution_time:$((resolution_time * 2))|h|#resolver:base" | nc -u -w0 localhost 8125
done
# Cache metrics
echo "quickdid.cache.hit.count:45|c|#cache_type:redis" | nc -u -w0 localhost 8125
echo "quickdid.cache.miss.count:5|c|#cache_type:redis" | nc -u -w0 localhost 8125
echo "quickdid.cache.size:1024|g|#cache_type:memory" | nc -u -w0 localhost 8125
echo "Metrics sent! Wait 15 seconds for Telegraf to flush..."
sleep 15
echo "Done!"
2. Verify Queries Script (test-scripts/verify-queries.sql)#
Create a SQL script to verify all queries work correctly:
-- Test script to verify all metrics queries work correctly
-- Run this after sending test metrics with send-metrics.sh
\echo '===== CHECKING AVAILABLE TABLES ====='
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
AND table_name LIKE 'quickdid%'
ORDER BY table_name;
\echo ''
\echo '===== CHECKING TABLE STRUCTURES ====='
\echo 'Structure of quickdid.http.request.count table:'
\d "quickdid.http.request.count"
\echo ''
\echo 'Structure of quickdid.http.request.duration_ms table:'
\d "quickdid.http.request.duration_ms"
\echo ''
\echo '===== QUERY 1: Recent HTTP Request Counts ====='
SELECT
time,
tags,
tags->>'method' as method,
tags->>'path' as path,
tags->>'status' as status,
value
FROM "quickdid.http.request.count"
WHERE time > NOW() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;
\echo ''
\echo '===== QUERY 2: HTTP Request Duration Statistics by Endpoint ====='
SELECT
time_bucket('1 minute', time) AS minute,
tags->>'method' as method,
tags->>'path' as path,
tags->>'status' as status,
COUNT(*) as request_count,
AVG(mean) as avg_duration_ms,
MAX("99_percentile") as p99_duration_ms,
MIN(mean) as min_duration_ms
FROM "quickdid.http.request.duration_ms"
WHERE time > NOW() - INTERVAL '1 hour'
AND tags IS NOT NULL
GROUP BY minute, tags->>'method', tags->>'path', tags->>'status'
ORDER BY minute DESC
LIMIT 10;
\echo ''
\echo '===== QUERY 3: Rate Limiter Status Over Time ====='
SELECT
time,
value as available_permits
FROM "quickdid.resolver.rate_limit.available_permits"
WHERE time > NOW() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;
\echo ''
\echo '===== QUERY 4: Resolver Performance Comparison ====='
SELECT
tags->>'resolver' as resolver_type,
COUNT(*) as sample_count,
AVG(mean) as avg_resolution_time_ms,
MAX("99_percentile") as p99_resolution_time_ms,
MIN(mean) as min_resolution_time_ms
FROM "quickdid.resolver.resolution_time"
WHERE time > NOW() - INTERVAL '1 hour'
AND tags->>'resolver' IS NOT NULL
GROUP BY tags->>'resolver'
ORDER BY avg_resolution_time_ms;
\echo ''
\echo '===== QUERY 5: Cache Hit Rate Analysis ====='
WITH cache_stats AS (
SELECT
'hits' as metric_type,
SUM(value) as total_count
FROM "quickdid.cache.hit.count"
WHERE time > NOW() - INTERVAL '1 hour'
UNION ALL
SELECT
'misses' as metric_type,
SUM(value) as total_count
FROM "quickdid.cache.miss.count"
WHERE time > NOW() - INTERVAL '1 hour'
)
SELECT
SUM(CASE WHEN metric_type = 'hits' THEN total_count ELSE 0 END) as total_hits,
SUM(CASE WHEN metric_type = 'misses' THEN total_count ELSE 0 END) as total_misses,
CASE
WHEN SUM(total_count) > 0 THEN
ROUND(100.0 * SUM(CASE WHEN metric_type = 'hits' THEN total_count ELSE 0 END) / SUM(total_count), 2)
ELSE 0
END as hit_rate_percentage
FROM cache_stats;
\echo ''
\echo '===== QUERY 6: Hypertable Information ====='
SELECT
hypertable_schema,
hypertable_name,
owner,
num_dimensions,
num_chunks,
compression_enabled
FROM timescaledb_information.hypertables
WHERE hypertable_name LIKE 'quickdid%'
ORDER BY hypertable_name;
\echo ''
\echo '===== QUERY 7: HTTP Error Rate by Endpoint ====='
WITH status_counts AS (
SELECT
time_bucket('5 minutes', time) as period,
tags->>'path' as path,
CASE
WHEN (tags->>'status')::int >= 400 THEN 'error'
ELSE 'success'
END as status_category,
SUM(value) as request_count
FROM "quickdid.http.request.count"
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY period, path, status_category
)
SELECT
period,
path,
SUM(CASE WHEN status_category = 'error' THEN request_count ELSE 0 END) as error_count,
SUM(CASE WHEN status_category = 'success' THEN request_count ELSE 0 END) as success_count,
CASE
WHEN SUM(request_count) > 0 THEN
ROUND(100.0 * SUM(CASE WHEN status_category = 'error' THEN request_count ELSE 0 END) / SUM(request_count), 2)
ELSE 0
END as error_rate_percentage
FROM status_counts
GROUP BY period, path
HAVING SUM(request_count) > 0
ORDER BY period DESC, error_rate_percentage DESC;
\echo ''
\echo '===== TEST COMPLETED ====='
Usage#
Starting the Stack#
- Navigate to your project directory:
cd metrics-stack
- Make the test scripts executable:
chmod +x test-scripts/send-metrics.sh
- Start the services:
docker-compose up -d
- Check the logs to ensure everything is running:
docker-compose logs -f
- Wait for services to be fully ready (about 30 seconds)
Running the Test Suite#
- Send test metrics:
./test-scripts/send-metrics.sh
- Verify all queries work:
docker exec -i timescaledb psql -U postgres -d metrics < test-scripts/verify-queries.sql
Manual Querying#
Connect to TimescaleDB to run queries manually:
# Connect to the database
docker exec -it timescaledb psql -U postgres -d metrics
# List all metric tables
\dt "quickdid*"
# Describe a specific table structure
\d "quickdid.http.request.duration_ms"
# Query with JSONB tag filtering
SELECT
time,
tags->>'method' as method,
mean as avg_ms,
"99_percentile" as p99_ms
FROM "quickdid.http.request.duration_ms"
WHERE tags @> '{"method": "GET"}'::jsonb
AND time > NOW() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;
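If tag sets vary between emitters, it helps to see which tag keys actually appear on a metric before writing filters. A small helper query, not part of the original test suite:
-- Discover the tag keys present on a metric over the last hour
SELECT DISTINCT jsonb_object_keys(tags) AS tag_key
FROM "quickdid.http.request.duration_ms"
WHERE time > NOW() - INTERVAL '1 hour';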
Advanced Configuration#
Continuous Aggregates for Performance#
Create continuous aggregates for frequently queried data:
-- Create hourly aggregates for HTTP metrics
CREATE MATERIALIZED VIEW http_metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS hour,
tags->>'method' as method,
tags->>'path' as path,
tags->>'status' as status,
COUNT(*) as request_count,
AVG(mean) as avg_duration_ms,
MAX("99_percentile") as p99_duration_ms,
MIN(mean) as min_duration_ms
FROM "quickdid.http.request.duration_ms"
WHERE tags IS NOT NULL
GROUP BY hour, method, path, status
WITH NO DATA;
-- Add refresh policy
SELECT add_continuous_aggregate_policy('http_metrics_hourly',
start_offset => INTERVAL '3 hours',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
-- Manually refresh to populate initial data
CALL refresh_continuous_aggregate('http_metrics_hourly', NULL, NULL);
-- Query the aggregate
SELECT * FROM http_metrics_hourly
ORDER BY hour DESC, request_count DESC
LIMIT 20;
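Depending on your TimescaleDB version, queries against the aggregate may return only already-materialized data. If you want recent, not-yet-materialized rows included as well, you can enable real-time aggregation (optional; check the default behavior for your version):
-- Enable real-time aggregation so recent, unmaterialized data is included
ALTER MATERIALIZED VIEW http_metrics_hourly
SET (timescaledb.materialized_only = false);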
Data Retention Policies#
Set up automatic data retention:
-- Add retention policy to drop data older than 30 days
SELECT add_retention_policy('"quickdid.http.request.count"', INTERVAL '30 days');
SELECT add_retention_policy('"quickdid.http.request.duration_ms"', INTERVAL '30 days');
-- View retention policies
SELECT js.* FROM timescaledb_information.job_stats js
JOIN timescaledb_information.jobs j ON js.job_id = j.job_id
WHERE j.proc_name LIKE '%retention%';
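If you later need to change a retention interval, drop the existing policy first and re-add it. Shown here for one of the tables above, with a 90-day interval as an example:
-- Replace an existing retention policy with a different interval
SELECT remove_retention_policy('"quickdid.http.request.count"');
SELECT add_retention_policy('"quickdid.http.request.count"', INTERVAL '90 days');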
Compression for Storage Optimization#
Enable compression for older data:
-- Enable compression on a hypertable
ALTER TABLE "quickdid.http.request.duration_ms" SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'tags'
);
-- Add compression policy (compress chunks older than 7 days)
SELECT add_compression_policy('"quickdid.http.request.duration_ms"', INTERVAL '7 days');
-- Manually compress old chunks
SELECT compress_chunk(format('%I.%I', c.chunk_schema, c.chunk_name)::regclass)
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'quickdid.http.request.duration_ms'
AND c.range_end < NOW() - INTERVAL '7 days'
AND NOT c.is_compressed;
-- Check compression status (in TimescaleDB 2.x this is exposed as a function rather than a view;
-- it reports chunk counts and before/after compression sizes)
SELECT *
FROM hypertable_compression_stats('"quickdid.http.request.duration_ms"');
Monitoring and Maintenance#
Health Checks#
-- Check chunk distribution and per-chunk size
SELECT
    hypertable_name,
    chunk_name,
    range_start,
    range_end,
    is_compressed,
    pg_size_pretty(pg_total_relation_size(format('%I.%I', chunk_schema, chunk_name)::regclass)) as size
FROM timescaledb_information.chunks
WHERE hypertable_name LIKE 'quickdid%'
ORDER BY hypertable_name, range_start DESC
LIMIT 20;
-- Check background jobs (schedule info comes from the jobs view, run history from job_stats)
SELECT
    j.job_id,
    j.application_name,
    j.proc_name,
    j.schedule_interval,
    js.last_run_started_at,
    js.last_successful_finish,
    js.next_start
FROM timescaledb_information.jobs j
LEFT JOIN timescaledb_information.job_stats js ON js.job_id = j.job_id
ORDER BY j.job_id;
-- Check total hypertable sizes
SELECT
    hypertable_name,
    pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)) as total_size
FROM timescaledb_information.hypertables
WHERE hypertable_name LIKE 'quickdid%';
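Exact COUNT(*) over large hypertables can be slow; TimescaleDB's approximate_row_count() reads catalog statistics instead. A quick health-check helper, not from the original guide:
-- Fast approximate row count for a metric hypertable
SELECT approximate_row_count('"quickdid.http.request.count"');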
Troubleshooting#
- Tables not being created:
  - Check Telegraf logs: docker-compose logs telegraf | grep -i error
  - Verify PostgreSQL connectivity: docker exec telegraf telegraf --test
  - Ensure metrics are being received: docker-compose logs telegraf | grep statsd
- Queries returning no data:
  - Verify tables exist: \dt "quickdid*" in psql
  - Check table contents: SELECT COUNT(*) FROM "quickdid.http.request.count";
  - Verify time ranges in WHERE clauses
- Performance issues:
  - Check if hypertables are created: query timescaledb_information.hypertables
  - Verify compression is working if enabled
  - Consider creating appropriate indexes on JSONB paths (see also the GIN index sketch below):
    CREATE INDEX idx_http_method ON "quickdid.http.request.duration_ms" ((tags->>'method'));
    CREATE INDEX idx_http_path ON "quickdid.http.request.duration_ms" ((tags->>'path'));
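If you filter with JSONB containment (tags @> '{...}') rather than a single extracted key, a GIN index over the whole tags column is usually the better fit. A sketch, assuming the same duration table used above:
-- GIN index to support containment queries such as tags @> '{"method": "GET"}'
CREATE INDEX IF NOT EXISTS idx_http_duration_tags
ON "quickdid.http.request.duration_ms" USING GIN (tags);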
Integration with QuickDID#
To integrate with QuickDID, configure it to send metrics to the Telegraf StatsD endpoint:
# Set environment variables for QuickDID
export METRICS_ADAPTER=statsd
export METRICS_STATSD_HOST=localhost:8125
export METRICS_PREFIX=quickdid.
export METRICS_TAGS=env:production,service:quickdid
# Start QuickDID
cargo run
QuickDID will automatically send metrics to Telegraf, which will store them in TimescaleDB for analysis.
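Assuming the env and service tags configured above are attached to every metric QuickDID emits, you can verify end-to-end delivery with a containment query:
-- Confirm QuickDID metrics arrived with the configured static tags
SELECT time, tags, value
FROM "quickdid.http.request.count"
WHERE tags @> '{"env": "production", "service": "quickdid"}'::jsonb
ORDER BY time DESC
LIMIT 5;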
Key Differences from Generic Metrics Table Approach#
This configuration creates individual tables per metric instead of a single generic metrics table. Benefits include:
- Better performance: Each metric has its own optimized schema
- Clearer data model: Tables directly represent metrics
- Easier querying: No need to filter by metric name
- Type safety: Each metric's fields have appropriate types
- Efficient compression: Per-metric compression strategies
Trade-offs:
- More tables to manage (mitigated by TimescaleDB automation)
- Need to know metric names upfront for queries
- Schema changes require table alterations
Security Considerations#
- Use strong passwords: Update the default passwords in .env
- Enable SSL: Configure sslmode=require in production
- Network isolation: Use Docker networks to isolate services
- Access control: Create separate database users with minimal permissions (see also the default-privileges sketch below):
  CREATE USER metrics_reader WITH PASSWORD 'readonly_password';
  GRANT CONNECT ON DATABASE metrics TO metrics_reader;
  GRANT USAGE ON SCHEMA public TO metrics_reader;
  GRANT SELECT ON ALL TABLES IN SCHEMA public TO metrics_reader;
- Regular updates: Keep Docker images updated for security patches
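Because Telegraf creates a new table for each metric name, a plain GRANT only covers tables that already exist. Assuming Telegraf connects as the postgres role configured in .env, default privileges keep the read-only user working for tables created later:
-- Extend read access to tables Telegraf creates later for new metrics
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
GRANT SELECT ON TABLES TO metrics_reader;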
Performance Tuning#
PostgreSQL/TimescaleDB Settings#
The docker-compose.yml includes optimized settings. Adjust based on your hardware:
- shared_buffers: 25% of system RAM
- effective_cache_size: 75% of system RAM
- maintenance_work_mem: 5% of system RAM
- work_mem: RAM / max_connections / 2
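You can confirm the values actually in effect by connecting to the container with psql and checking each setting:
-- Verify the effective settings
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW maintenance_work_mem;
SHOW work_mem;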
Telegraf Buffer Settings#
For high-volume metrics, adjust in telegraf.conf:
[agent]
metric_batch_size = 5000 # Increase for high volume
metric_buffer_limit = 100000 # Increase buffer size
flush_interval = "5s" # Decrease for more frequent writes
Conclusion#
This setup provides a robust metrics collection and storage solution with:
- Individual metric tables for optimal performance and clarity
- JSONB tag storage for flexible querying
- TimescaleDB hypertables for efficient time-series storage
- Comprehensive test suite to verify functionality
- Production-ready configuration with compression and retention policies
The system correctly handles StatsD metrics from QuickDID and provides powerful querying capabilities through PostgreSQL's JSONB support and TimescaleDB's time-series functions.