QuickDID is a high-performance AT Protocol identity resolution service written in Rust. It provides handle-to-DID resolution with Redis-backed caching and queue processing.

Telegraf and TimescaleDB Metrics Collection Guide#

This guide demonstrates how to set up a metrics collection pipeline using Telegraf to collect StatsD metrics and store them in PostgreSQL with TimescaleDB using Docker Compose.

Overview#

This setup creates a metrics pipeline that:

  • Collects StatsD metrics via Telegraf on UDP port 8125
  • Creates individual PostgreSQL tables for each metric type
  • Stores metric tags as JSONB for flexible querying
  • Automatically creates hypertables for time-series optimization
  • Provides a complete Docker Compose configuration for easy deployment

Important Note on Table Structure#

The Telegraf PostgreSQL output plugin with the configuration in this guide creates individual tables for each metric name. For example:

  • quickdid.http.request.count becomes table "quickdid.http.request.count"
  • quickdid.resolver.rate_limit.available_permits becomes table "quickdid.resolver.rate_limit.available_permits"

Each table has the following structure (a sample row is shown after this list):

  • time (timestamptz) - The timestamp of the metric
  • tags (jsonb) - All tags stored as a JSON object
  • Metric-specific value columns that depend on the metric type (e.g., value for counters and gauges; mean, upper, lower, count, and percentile fields such as "99_percentile" for timings)
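
For orientation, here is what a row in one of the counter tables might look like once metrics start flowing. The values below are purely illustrative; the host and metric_type tags are ones Telegraf typically adds on its own.

-- Peek at the most recent row of a counter table
SELECT time, tags, value
FROM "quickdid.http.request.count"
ORDER BY time DESC
LIMIT 1;

-- Illustrative result (hypothetical values):
--   time  -> 2025-01-01 12:00:10+00
--   tags  -> {"host": "telegraf-agent", "method": "GET", "path": "/resolve", "status": "200", "metric_type": "counter"}
--   value -> 10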

Prerequisites#

  • Docker and Docker Compose installed
  • Basic understanding of StatsD metrics format
  • Familiarity with PostgreSQL/TimescaleDB concepts

Project Structure#

Create the following directory structure:

metrics-stack/
├── docker-compose.yml
├── telegraf/
│   └── telegraf.conf
├── init-scripts/
│   └── 01-init.sql
├── test-scripts/
│   ├── send-metrics.sh
│   └── verify-queries.sql
└── .env

Configuration Files#

1. Environment Variables (.env)#

Create a .env file to store sensitive configuration:

# PostgreSQL/TimescaleDB Configuration
POSTGRES_DB=metrics
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secretpassword

# Telegraf Database User
TELEGRAF_DB_USER=postgres
TELEGRAF_DB_PASSWORD=secretpassword

# TimescaleDB Settings
TIMESCALE_TELEMETRY=off

2. Telegraf Configuration (telegraf/telegraf.conf)#

Create the Telegraf configuration file:

# Global Telegraf Agent Configuration
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  hostname = "telegraf-agent"
  omit_hostname = false

# StatsD Input Plugin
[[inputs.statsd]]
  service_address = ":8125"  # Listen on UDP port 8125 for StatsD metrics
  protocol = "udp"
  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [50, 90, 95, 99]
  metric_separator = "."
  allowed_pending_messages = 10000
  datadog_extensions = true
  datadog_distributions = true

# PostgreSQL (TimescaleDB) Output Plugin
[[outputs.postgresql]]
  connection = "host=timescaledb user=${TELEGRAF_DB_USER} password=${TELEGRAF_DB_PASSWORD} dbname=${POSTGRES_DB} sslmode=disable"
  schema = "public"
  
  # Create individual tables for each metric with hypertable support
  create_templates = [
    '''CREATE TABLE IF NOT EXISTS {{.table}} ({{.columns}})''',
    '''SELECT create_hypertable({{.table|quoteLiteral}}, 'time', if_not_exists => TRUE)''',
  ]
  
  # Store all tags as JSONB for flexible querying
  tags_as_jsonb = true
  
  # Keep fields as separate columns for better performance on aggregations
  fields_as_jsonb = false
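
With these templates, the first time Telegraf flushes a given metric it issues DDL roughly like the following. This is a sketch for a counter metric; the plugin chooses the exact column types itself, so do not treat it as verbatim output.

-- Approximate DDL generated for a counter metric (sketch, not verbatim plugin output)
CREATE TABLE IF NOT EXISTS "quickdid.http.request.count" (
    "time" timestamp with time zone,
    "tags" jsonb,
    "value" bigint
);

-- The second template then turns the new table into a hypertable
SELECT create_hypertable('"quickdid.http.request.count"', 'time', if_not_exists => TRUE);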

3. Docker Compose Configuration (docker-compose.yml)#

Create the Docker Compose file:

version: '3.8'

services:
  timescaledb:
    image: timescale/timescaledb:latest-pg17
    container_name: timescaledb
    restart: unless-stopped
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      TIMESCALE_TELEMETRY: ${TIMESCALE_TELEMETRY}
    ports:
      - "5442:5432"
    volumes:
      - timescale_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d:ro
    command:
      - postgres
      - -c
      - shared_buffers=1GB
      - -c
      - effective_cache_size=3GB
      - -c
      - maintenance_work_mem=512MB
      - -c
      - work_mem=32MB
      - -c
      - timescaledb.max_background_workers=8
      - -c
      - max_parallel_workers_per_gather=2
      - -c
      - max_parallel_workers=8
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - metrics_network

  telegraf:
    image: telegraf:1.35
    container_name: telegraf
    restart: unless-stopped
    environment:
      TELEGRAF_DB_USER: ${TELEGRAF_DB_USER}
      TELEGRAF_DB_PASSWORD: ${TELEGRAF_DB_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "8125:8125/udp"  # StatsD UDP port
    volumes:
      - ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      timescaledb:
        condition: service_healthy
    networks:
      - metrics_network
    command: ["telegraf", "--config", "/etc/telegraf/telegraf.conf"]

  redis:
    image: redis:7-alpine
    container_name: redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes --appendfsync everysec
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - metrics_network

networks:
  metrics_network:
    driver: bridge

volumes:
  timescale_data:
  redis_data:

4. Database Initialization Script (optional)#

Create init-scripts/01-init.sql to set up the TimescaleDB extension:

-- Enable TimescaleDB extension
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Enable additional useful extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
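
Once the container has initialized, you can confirm the extensions are active with a quick query, for example:

-- Run inside psql: docker exec -it timescaledb psql -U postgres -d metrics
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('timescaledb', 'pg_stat_statements');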

Test Scripts#

1. Send Test Metrics Script (test-scripts/send-metrics.sh)#

Create a script to send various types of metrics:

#!/bin/bash

# Send test metrics to StatsD/Telegraf

echo "Sending test metrics to StatsD on localhost:8125..."

# Counter metrics
for i in {1..10}; do
  echo "quickdid.http.request.count:1|c|#method:GET,path:/resolve,status:200" | nc -u -w0 localhost 8125
  echo "quickdid.http.request.count:1|c|#method:POST,path:/api,status:201" | nc -u -w0 localhost 8125
  echo "quickdid.http.request.count:1|c|#method:GET,path:/resolve,status:404" | nc -u -w0 localhost 8125
done

# Gauge metrics
echo "quickdid.resolver.rate_limit.available_permits:10|g" | nc -u -w0 localhost 8125
echo "quickdid.resolver.rate_limit.available_permits:8|g" | nc -u -w0 localhost 8125
echo "quickdid.resolver.rate_limit.available_permits:5|g" | nc -u -w0 localhost 8125

# Timing metrics (in milliseconds)
for i in {1..20}; do
  duration=$((RANDOM % 100 + 10))
  echo "quickdid.http.request.duration_ms:${duration}|ms|#method:GET,path:/resolve,status:200" | nc -u -w0 localhost 8125
done

for i in {1..10}; do
  duration=$((RANDOM % 200 + 50))
  echo "quickdid.http.request.duration_ms:${duration}|ms|#method:POST,path:/api,status:201" | nc -u -w0 localhost 8125
done

# Histogram metrics
for i in {1..15}; do
  resolution_time=$((RANDOM % 500 + 50))
  echo "quickdid.resolver.resolution_time:${resolution_time}|h|#resolver:redis" | nc -u -w0 localhost 8125
  echo "quickdid.resolver.resolution_time:$((resolution_time * 2))|h|#resolver:base" | nc -u -w0 localhost 8125
done

# Cache metrics
echo "quickdid.cache.hit.count:45|c|#cache_type:redis" | nc -u -w0 localhost 8125
echo "quickdid.cache.miss.count:5|c|#cache_type:redis" | nc -u -w0 localhost 8125
echo "quickdid.cache.size:1024|g|#cache_type:memory" | nc -u -w0 localhost 8125

echo "Metrics sent! Wait 15 seconds for Telegraf to flush..."
sleep 15
echo "Done!"

2. Verify Queries Script (test-scripts/verify-queries.sql)#

Create a SQL script to verify all queries work correctly:

-- Test script to verify all metrics queries work correctly
-- Run this after sending test metrics with send-metrics.sh

\echo '===== CHECKING AVAILABLE TABLES ====='
SELECT table_name 
FROM information_schema.tables 
WHERE table_schema = 'public' 
  AND table_name LIKE 'quickdid%'
ORDER BY table_name;

\echo ''
\echo '===== CHECKING TABLE STRUCTURES ====='
\echo 'Structure of quickdid.http.request.count table:'
\d "quickdid.http.request.count"

\echo ''
\echo 'Structure of quickdid.http.request.duration_ms table:'
\d "quickdid.http.request.duration_ms"

\echo ''
\echo '===== QUERY 1: Recent HTTP Request Counts ====='
SELECT 
    time,
    tags,
    tags->>'method' as method,
    tags->>'path' as path,
    tags->>'status' as status,
    value
FROM "quickdid.http.request.count"
WHERE time > NOW() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;

\echo ''
\echo '===== QUERY 2: HTTP Request Duration Statistics by Endpoint ====='
SELECT 
    time_bucket('1 minute', time) AS minute,
    tags->>'method' as method,
    tags->>'path' as path,
    tags->>'status' as status,
    COUNT(*) as request_count,
    AVG(mean) as avg_duration_ms,
    MAX("99_percentile") as p99_duration_ms,
    MIN(mean) as min_duration_ms
FROM "quickdid.http.request.duration_ms"
WHERE time > NOW() - INTERVAL '1 hour'
  AND tags IS NOT NULL
GROUP BY minute, tags->>'method', tags->>'path', tags->>'status'
ORDER BY minute DESC
LIMIT 10;

\echo ''
\echo '===== QUERY 3: Rate Limiter Status Over Time ====='
SELECT 
    time,
    value as available_permits
FROM "quickdid.resolver.rate_limit.available_permits"
WHERE time > NOW() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;

\echo ''
\echo '===== QUERY 4: Resolver Performance Comparison ====='
SELECT 
    tags->>'resolver' as resolver_type,
    COUNT(*) as sample_count,
    AVG(mean) as avg_resolution_time_ms,
    MAX("99_percentile") as p99_resolution_time_ms,
    MIN(mean) as min_resolution_time_ms
FROM "quickdid.resolver.resolution_time"
WHERE time > NOW() - INTERVAL '1 hour'
  AND tags->>'resolver' IS NOT NULL
GROUP BY tags->>'resolver'
ORDER BY avg_resolution_time_ms;

\echo ''
\echo '===== QUERY 5: Cache Hit Rate Analysis ====='
WITH cache_stats AS (
    SELECT 
        'hits' as metric_type,
        SUM(value) as total_count
    FROM "quickdid.cache.hit.count"
    WHERE time > NOW() - INTERVAL '1 hour'
    UNION ALL
    SELECT 
        'misses' as metric_type,
        SUM(value) as total_count
    FROM "quickdid.cache.miss.count"
    WHERE time > NOW() - INTERVAL '1 hour'
)
SELECT 
    SUM(CASE WHEN metric_type = 'hits' THEN total_count ELSE 0 END) as total_hits,
    SUM(CASE WHEN metric_type = 'misses' THEN total_count ELSE 0 END) as total_misses,
    CASE 
        WHEN SUM(total_count) > 0 THEN 
            ROUND(100.0 * SUM(CASE WHEN metric_type = 'hits' THEN total_count ELSE 0 END) / SUM(total_count), 2)
        ELSE 0 
    END as hit_rate_percentage
FROM cache_stats;

\echo ''
\echo '===== QUERY 6: Hypertable Information ====='
SELECT 
    hypertable_schema,
    hypertable_name,
    owner,
    num_dimensions,
    num_chunks,
    compression_enabled
FROM timescaledb_information.hypertables
WHERE hypertable_name LIKE 'quickdid%'
ORDER BY hypertable_name;

\echo ''
\echo '===== QUERY 7: HTTP Error Rate by Endpoint ====='
WITH status_counts AS (
    SELECT 
        time_bucket('5 minutes', time) as period,
        tags->>'path' as path,
        CASE 
            WHEN (tags->>'status')::int >= 400 THEN 'error'
            ELSE 'success'
        END as status_category,
        SUM(value) as request_count
    FROM "quickdid.http.request.count"
    WHERE time > NOW() - INTERVAL '1 hour'
    GROUP BY period, path, status_category
)
SELECT 
    period,
    path,
    SUM(CASE WHEN status_category = 'error' THEN request_count ELSE 0 END) as error_count,
    SUM(CASE WHEN status_category = 'success' THEN request_count ELSE 0 END) as success_count,
    CASE 
        WHEN SUM(request_count) > 0 THEN 
            ROUND(100.0 * SUM(CASE WHEN status_category = 'error' THEN request_count ELSE 0 END) / SUM(request_count), 2)
        ELSE 0 
    END as error_rate_percentage
FROM status_counts
GROUP BY period, path
HAVING SUM(request_count) > 0
ORDER BY period DESC, error_rate_percentage DESC;

\echo ''
\echo '===== TEST COMPLETED ====='

Usage#

Starting the Stack#

  1. Navigate to your project directory:
cd metrics-stack
  2. Make the test scripts executable:
chmod +x test-scripts/send-metrics.sh
  3. Start the services:
docker-compose up -d
  4. Check the logs to ensure everything is running:
docker-compose logs -f
  5. Wait for services to be fully ready (about 30 seconds)

Running the Test Suite#

  1. Send test metrics:
./test-scripts/send-metrics.sh
  2. Verify all queries work:
docker exec -i timescaledb psql -U postgres -d metrics < test-scripts/verify-queries.sql

Manual Querying#

Connect to TimescaleDB to run queries manually:

# Connect to the database
docker exec -it timescaledb psql -U postgres -d metrics

# List all metric tables
\dt "quickdid*"

# Describe a specific table structure
\d "quickdid.http.request.duration_ms"

# Query with JSONB tag filtering
SELECT 
    time,
    tags->>'method' as method,
    mean as avg_ms,
    "99_percentile" as p99_ms
FROM "quickdid.http.request.duration_ms"
WHERE tags @> '{"method": "GET"}'::jsonb
  AND time > NOW() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;
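
If you chart gauges such as the rate limiter permits, you may want one row per interval even when no samples arrived in that interval. TimescaleDB's time_bucket_gapfill and locf functions can do this; the query below is a sketch, and gap filling needs an explicit bounded time range in the WHERE clause.

-- One row per minute for the last hour, carrying the last value forward across gaps
SELECT
    time_bucket_gapfill('1 minute', time) AS minute,
    locf(AVG(value)) AS available_permits
FROM "quickdid.resolver.rate_limit.available_permits"
WHERE time > NOW() - INTERVAL '1 hour'
  AND time <= NOW()
GROUP BY minute
ORDER BY minute;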

Advanced Configuration#

Continuous Aggregates for Performance#

Create continuous aggregates for frequently queried data:

-- Create hourly aggregates for HTTP metrics
CREATE MATERIALIZED VIEW http_metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS hour,
    tags->>'method' as method,
    tags->>'path' as path,
    tags->>'status' as status,
    COUNT(*) as request_count,
    AVG(mean) as avg_duration_ms,
    MAX("99_percentile") as p99_duration_ms,
    MIN(mean) as min_duration_ms
FROM "quickdid.http.request.duration_ms"
WHERE tags IS NOT NULL
GROUP BY hour, method, path, status
WITH NO DATA;

-- Add refresh policy
SELECT add_continuous_aggregate_policy('http_metrics_hourly',
    start_offset => INTERVAL '3 hours',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- Manually refresh to populate initial data
CALL refresh_continuous_aggregate('http_metrics_hourly', NULL, NULL);

-- Query the aggregate
SELECT * FROM http_metrics_hourly
ORDER BY hour DESC, request_count DESC
LIMIT 20;
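
By default, queries against the aggregate only see data that the refresh policy has already materialized. If you also want the most recent, not-yet-materialized interval included at query time, real-time aggregation can be enabled; the default behavior varies by TimescaleDB version, so check yours before changing it.

-- Include not-yet-materialized recent data when querying the aggregate
ALTER MATERIALIZED VIEW http_metrics_hourly
    SET (timescaledb.materialized_only = false);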

Data Retention Policies#

Set up automatic data retention:

-- Add retention policy to drop data older than 30 days
SELECT add_retention_policy('"quickdid.http.request.count"', INTERVAL '30 days');
SELECT add_retention_policy('"quickdid.http.request.duration_ms"', INTERVAL '30 days');

-- View retention policies
SELECT js.* FROM timescaledb_information.job_stats js
JOIN timescaledb_information.jobs j ON js.job_id = j.job_id
WHERE j.proc_name LIKE '%retention%';
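
Retention policies can be changed later; one way to widen the window is to drop the policy and re-add it, for example:

-- Replace the 30-day policy with a 90-day policy on one table
SELECT remove_retention_policy('"quickdid.http.request.count"');
SELECT add_retention_policy('"quickdid.http.request.count"', INTERVAL '90 days');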

Compression for Storage Optimization#

Enable compression for older data:

-- Enable compression on a hypertable
ALTER TABLE "quickdid.http.request.duration_ms" SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tags'
);

-- Add compression policy (compress chunks older than 7 days)
SELECT add_compression_policy('"quickdid.http.request.duration_ms"', INTERVAL '7 days');

-- Manually compress old chunks
SELECT compress_chunk(format('%I.%I', c.chunk_schema, c.chunk_name)::regclass)
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'quickdid.http.request.duration_ms'
AND c.range_end < NOW() - INTERVAL '7 days'
AND NOT c.is_compressed;

-- Check compression status (run per hypertable)
SELECT 
    total_chunks,
    number_compressed_chunks,
    pg_size_pretty(before_compression_total_bytes) as before_compression,
    pg_size_pretty(after_compression_total_bytes) as after_compression
FROM hypertable_compression_stats('"quickdid.http.request.duration_ms"');
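
If you later need to backfill or correct rows that fall inside already-compressed chunks, those chunks can be decompressed first, mirroring the manual compression query above. This is a sketch; adjust the time range to the data you actually need to touch.

-- Decompress recent chunks before backfilling or updating their rows
SELECT decompress_chunk(format('%I.%I', c.chunk_schema, c.chunk_name)::regclass)
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'quickdid.http.request.duration_ms'
  AND c.range_end > NOW() - INTERVAL '14 days'
  AND c.is_compressed;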

Monitoring and Maintenance#

Health Checks#

-- Check chunk distribution
SELECT 
    hypertable_name,
    chunk_name,
    range_start,
    range_end,
    is_compressed,
    pg_size_pretty(pg_total_relation_size(format('%I.%I', chunk_schema, chunk_name)::regclass)) as size
FROM timescaledb_information.chunks
WHERE hypertable_name LIKE 'quickdid%'
ORDER BY hypertable_name, range_start DESC
LIMIT 20;

-- Check background jobs (compression, retention, and continuous aggregate policies)
SELECT 
    j.job_id,
    j.application_name,
    j.proc_name,
    j.schedule_interval,
    js.last_run_started_at,
    js.last_successful_finish,
    js.next_start
FROM timescaledb_information.jobs j
LEFT JOIN timescaledb_information.job_stats js ON js.job_id = j.job_id
ORDER BY j.job_id;

-- Check hypertable sizes
SELECT 
    hypertable_name,
    pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)) as total_size
FROM timescaledb_information.hypertables
WHERE hypertable_name LIKE 'quickdid%';

Troubleshooting#

  1. Tables not being created:

    • Check Telegraf logs: docker-compose logs telegraf | grep -i error
    • Verify PostgreSQL connectivity: docker exec telegraf telegraf --test
    • Ensure metrics are being received: docker-compose logs telegraf | grep statsd
  2. Queries returning no data:

    • Verify tables exist: \dt "quickdid*" in psql
    • Check table contents: SELECT COUNT(*) FROM "quickdid.http.request.count";
    • Verify time ranges in WHERE clauses
  3. Performance issues:

    • Check if hypertables are created: Query timescaledb_information.hypertables
    • Verify compression is working if enabled
    • Consider creating appropriate indexes on JSONB paths (a GIN alternative is sketched after this list):
    CREATE INDEX idx_http_method ON "quickdid.http.request.duration_ms" ((tags->>'method'));
    CREATE INDEX idx_http_path ON "quickdid.http.request.duration_ms" ((tags->>'path'));
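
If most of your queries filter with JSONB containment, as in the tags @> '{"method": "GET"}' example in the Manual Querying section, a single GIN index on the tags column is an alternative to per-key expression indexes (a sketch):

-- GIN index supporting containment filters such as: tags @> '{"method": "GET"}'
CREATE INDEX IF NOT EXISTS idx_http_duration_tags
    ON "quickdid.http.request.duration_ms" USING GIN (tags);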
    

Integration with QuickDID#

To integrate with QuickDID, configure it to send metrics to the Telegraf StatsD endpoint:

# Set environment variables for QuickDID
export METRICS_ADAPTER=statsd
export METRICS_STATSD_HOST=localhost:8125
export METRICS_PREFIX=quickdid.
export METRICS_TAGS=env:production,service:quickdid

# Start QuickDID
cargo run

QuickDID will automatically send metrics to Telegraf, which will store them in TimescaleDB for analysis.

Key Differences from Generic Metrics Table Approach#

This configuration creates individual tables per metric instead of a single generic metrics table. Benefits include:

  1. Better performance: Each metric has its own optimized schema
  2. Clearer data model: Tables directly represent metrics
  3. Easier querying: No need to filter by metric name
  4. Type safety: Each metric's fields have appropriate types
  5. Efficient compression: Per-metric compression strategies

Trade-offs:

  • More tables to manage (mitigated by TimescaleDB automation)
  • Need to know metric names upfront for queries
  • Schema changes require table alterations

Security Considerations#

  1. Use strong passwords: Update the default passwords in .env
  2. Enable SSL: Configure sslmode=require in production
  3. Network isolation: Use Docker networks to isolate services
  4. Access control: Create separate database users with minimal permissions (see the note after this list for tables created later):
    CREATE USER metrics_reader WITH PASSWORD 'readonly_password';
    GRANT CONNECT ON DATABASE metrics TO metrics_reader;
    GRANT USAGE ON SCHEMA public TO metrics_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO metrics_reader;
    
  5. Regular updates: Keep Docker images updated for security patches
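
Note that Telegraf creates new tables as new metric names appear, so GRANT SELECT ON ALL TABLES only covers tables that exist when it runs. To cover future metric tables as well, default privileges can be set for the role Telegraf connects as (postgres, per the .env above):

-- Make tables created later by the Telegraf connection role readable too
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
    GRANT SELECT ON TABLES TO metrics_reader;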

Performance Tuning#

PostgreSQL/TimescaleDB Settings#

The docker-compose.yml includes optimized settings. Adjust them for your hardware (a psql-based alternative is shown after this list):

  • shared_buffers: 25% of system RAM
  • effective_cache_size: 75% of system RAM
  • maintenance_work_mem: 5% of system RAM
  • work_mem: RAM / max_connections / 2
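
For reference, these parameters can also be inspected and, where allowed, adjusted from psql instead of editing the compose command. The values below are illustrative; shared_buffers still requires a container restart, while work_mem only needs a reload.

-- Inspect current settings
SHOW shared_buffers;
SHOW work_mem;

-- Persist a new value to postgresql.auto.conf and reload
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();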

Telegraf Buffer Settings#

For high-volume metrics, adjust in telegraf.conf:

[agent]
  metric_batch_size = 5000      # Increase for high volume
  metric_buffer_limit = 100000  # Increase buffer size
  flush_interval = "5s"          # Decrease for more frequent writes

Conclusion#

This setup provides a robust metrics collection and storage solution with:

  • Individual metric tables for optimal performance and clarity
  • JSONB tag storage for flexible querying
  • TimescaleDB hypertables for efficient time-series storage
  • Comprehensive test suite to verify functionality
  • Production-ready configuration with compression and retention policies

The system correctly handles StatsD metrics from QuickDID and provides powerful querying capabilities through PostgreSQL's JSONB support and TimescaleDB's time-series functions.