# Telegraf and TimescaleDB Metrics Collection Guide

This guide demonstrates how to set up a metrics collection pipeline with Docker Compose, using Telegraf to collect StatsD metrics and store them in PostgreSQL with the TimescaleDB extension.

## Overview

This setup creates a metrics pipeline that:

- Collects StatsD metrics via Telegraf on UDP port 8125
- Stores metrics in PostgreSQL with the TimescaleDB extension
- Automatically creates hypertables for time-series optimization
- Provides a complete Docker Compose configuration for easy deployment

## Prerequisites

- Docker and Docker Compose installed
- Basic understanding of the StatsD metrics format
- Familiarity with PostgreSQL/TimescaleDB concepts

## Project Structure

Create the following directory structure:

```
metrics-stack/
├── docker-compose.yml
├── telegraf/
│   └── telegraf.conf
├── init-scripts/
│   └── 01-init.sql
└── .env
```

## Configuration Files

### 1. Environment Variables (.env)

Create a `.env` file to store sensitive configuration:

```env
# PostgreSQL/TimescaleDB Configuration
POSTGRES_DB=metrics
POSTGRES_USER=postgres
POSTGRES_PASSWORD=secretpassword

# Telegraf Database User
TELEGRAF_DB_USER=postgres
TELEGRAF_DB_PASSWORD=secretpassword

# TimescaleDB Settings
TIMESCALE_TELEMETRY=off
```

### 2. Telegraf Configuration (telegraf/telegraf.conf)

Create the Telegraf configuration file:

```toml
# Global Telegraf Agent Configuration
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  hostname = "telegraf-agent"
  omit_hostname = false

# StatsD Input Plugin
[[inputs.statsd]]
  service_address = ":8125"   # Listen on UDP port 8125 for StatsD metrics
  protocol = "udp"
  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [50, 90, 95, 99]
  metric_separator = "."
  allowed_pending_messages = 10000
  datadog_extensions = true
  datadog_distributions = true

# PostgreSQL (TimescaleDB) Output Plugin
[[outputs.postgresql]]
  connection = "host=timescaledb user=${TELEGRAF_DB_USER} password=${TELEGRAF_DB_PASSWORD} dbname=${POSTGRES_DB} sslmode=disable"
  schema = "public"
  create_templates = [
    '''CREATE TABLE IF NOT EXISTS {{.table}} ({{.columns}})''',
    '''SELECT create_hypertable({{.table|quoteLiteral}}, 'time', if_not_exists => TRUE)''',
  ]
  tags_as_jsonb = true
```
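
For illustration, with `tags_as_jsonb = true` the output plugin creates one table per metric name and expands the `create_templates` roughly as sketched below in Python. This is a simplified approximation of the Go template substitution, not the plugin's actual code, and the column list is illustrative:

```python
def render_templates(table: str, columns: str) -> list:
    """Approximate the outputs.postgresql create_templates expansion.

    {{.table}} becomes a quoted SQL identifier, while
    {{.table|quoteLiteral}} becomes a SQL string literal.
    """
    ident = '"' + table.replace('"', '""') + '"'      # quoted identifier
    literal = "'" + table.replace("'", "''") + "'"    # quoted literal
    return [
        f"CREATE TABLE IF NOT EXISTS {ident} ({columns})",
        f"SELECT create_hypertable({literal}, 'time', if_not_exists => TRUE)",
    ]

# Illustrative column list for a counter metric
for stmt in render_templates(
    "quickdid.http.request.count",
    '"time" timestamptz, "tags" jsonb, "value" double precision',
):
    print(stmt)
```

Because the metric name (dots included) becomes the table name, later queries must double-quote it, e.g. `SELECT * FROM "quickdid.http.request.count"`.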

### 3. Docker Compose Configuration (docker-compose.yml)

Create the Docker Compose file:

```yaml
version: '3.8'

services:
  timescaledb:
    image: timescale/timescaledb-ha:pg17
    container_name: timescaledb
    restart: unless-stopped
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      TIMESCALE_TELEMETRY: ${TIMESCALE_TELEMETRY}
    ports:
      - "5442:5432"
    volumes:
      - timescale_data:/home/postgres/pgdata/data
      - ./init-scripts:/docker-entrypoint-initdb.d:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - metrics_network

  telegraf:
    image: telegraf
    container_name: telegraf
    restart: unless-stopped
    environment:
      TELEGRAF_DB_USER: ${TELEGRAF_DB_USER}
      TELEGRAF_DB_PASSWORD: ${TELEGRAF_DB_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "8125:8125/udp"   # StatsD UDP port
    volumes:
      - ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      timescaledb:
        condition: service_healthy
    networks:
      - metrics_network
    command: ["telegraf", "--config", "/etc/telegraf/telegraf.conf"]

  redis:
    image: redis:7-alpine
    container_name: redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes --appendfsync everysec
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - metrics_network

networks:
  metrics_network:
    driver: bridge

volumes:
  timescale_data:
  redis_data:
```

### 4. Database Initialization Script (optional)

Create `init-scripts/01-init.sql` to make sure the TimescaleDB extension is enabled (the `timescale/timescaledb-ha` image normally enables it already, so this is a safeguard):

```sql
-- Enable the TimescaleDB extension
CREATE EXTENSION IF NOT EXISTS timescaledb;
```

## Usage

### Starting the Stack

1. Navigate to your project directory:

   ```bash
   cd metrics-stack
   ```

2. Start the services:

   ```bash
   docker-compose up -d
   ```

3. Check the logs to ensure everything is running:

   ```bash
   docker-compose logs -f
   ```

### Sending Metrics

Send test metrics to Telegraf using the StatsD protocol:

```bash
# Send a counter metric
echo "quickdid.http.request.count:1|c|#method:GET,path:/resolve,status:200" | nc -u -w0 localhost 8125

# Send a gauge metric
echo "quickdid.resolver.rate_limit.available_permits:10|g" | nc -u -w0 localhost 8125

# Send a timing metric
echo "quickdid.http.request.duration_ms:42|ms|#method:POST,path:/api,status:201" | nc -u -w0 localhost 8125

# Send a histogram
echo "quickdid.resolver.resolution_time:100|h|#resolver:redis" | nc -u -w0 localhost 8125
```
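
If you prefer to script this, here is a minimal Python sketch that builds the same Datadog-extended StatsD lines and sends them over UDP. The host, port, and metric names match the examples above; the helper names are this sketch's own:

```python
import socket

def statsd_line(name, value, mtype, tags=None):
    """Format a StatsD line with optional Datadog-style #key:value tags."""
    line = f"{name}:{value}|{mtype}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return line

def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP datagram; StatsD has no acknowledgements."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))

line = statsd_line("quickdid.http.request.count", 1, "c",
                   {"method": "GET", "path": "/resolve", "status": "200"})
print(line)  # quickdid.http.request.count:1|c|#method:GET,path:/resolve,status:200
send(line)
```

Because it is UDP, `send` succeeds even when nothing is listening, so check the Telegraf logs to confirm delivery.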

### Querying Metrics

Connect to TimescaleDB to query your metrics:

```bash
# Connect to the database
docker exec -it timescaledb psql -U postgres -d metrics
```

Then run queries inside `psql`:

```sql
-- Query recent metrics
SELECT
    time,
    tags,
    value
FROM "quickdid.http.request.count"
WHERE time > NOW() - INTERVAL '1 hour'
ORDER BY time DESC
LIMIT 10;

-- Aggregate a timing metric by its tags
SELECT
    time_bucket('1 minute', time) AS minute,
    tags->>'method' AS method,
    tags->>'path' AS path,
    tags->>'status' AS status,
    AVG(mean::numeric) AS avg_duration
FROM "quickdid.http.request.duration_ms"
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY minute, method, path, status
ORDER BY minute DESC;
```
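
The `tags` JSONB column queried above is populated from the Datadog-style `#key:value` suffix on each StatsD line. A rough Python sketch of that mapping (illustrative only, not Telegraf's actual parser):

```python
import json

def parse_statsd(line):
    """Split a StatsD line with Datadog tags into name, value, type, and tags."""
    parts = line.split("|")
    name, _, value = parts[0].partition(":")
    mtype = parts[1]
    tags = {}
    for part in parts[2:]:
        if part.startswith("#"):
            for pair in part[1:].split(","):
                key, _, val = pair.partition(":")
                tags[key] = val
    return {"name": name, "value": value, "type": mtype, "tags": tags}

m = parse_statsd("quickdid.http.request.count:1|c|#method:GET,path:/resolve,status:200")
print(json.dumps(m["tags"]))  # roughly the JSONB stored in the tags column
```

This is why `tags->>'method'` in the SQL above recovers exactly the values sent from the client.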

## Advanced Configuration

### Continuous Aggregates

Create continuous aggregates for better query performance. With this Telegraf configuration each metric gets its own table in the `public` schema, so the aggregate targets a specific metric table:

```sql
-- Create a continuous aggregate for HTTP request durations
CREATE MATERIALIZED VIEW http_metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS hour,
    tags->>'method' AS method,
    tags->>'path' AS path,
    tags->>'status' AS status,
    COUNT(*) AS sample_count,
    AVG(mean) AS avg_duration,
    MAX(upper) AS max_duration,
    MIN(lower) AS min_duration
FROM "quickdid.http.request.duration_ms"
GROUP BY hour, method, path, status
WITH NO DATA;

-- Add refresh policy
SELECT add_continuous_aggregate_policy('http_metrics_hourly',
    start_offset => INTERVAL '3 hours',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

## Monitoring and Maintenance

### Health Checks

Check the health of your TimescaleDB hypertables:

```sql
-- View hypertable information
SELECT * FROM timescaledb_information.hypertables;

-- Check chunk information
SELECT
    hypertable_name,
    chunk_name,
    range_start,
    range_end,
    is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'quickdid.http.request.duration_ms'
ORDER BY range_start DESC
LIMIT 10;

-- Check compression status (TimescaleDB 2.x)
SELECT
    total_chunks,
    number_compressed_chunks,
    before_compression_total_bytes,
    after_compression_total_bytes
FROM hypertable_compression_stats('"quickdid.http.request.duration_ms"');
```

### Troubleshooting

1. **Telegraf can't connect to TimescaleDB:**
   - Check that TimescaleDB is healthy: `docker-compose ps`
   - Verify the credentials in the `.env` file
   - Check network connectivity between the containers

2. **Metrics not appearing in the database:**
   - Check the Telegraf logs: `docker-compose logs telegraf`
   - Verify StatsD metrics are being received: `docker-compose logs telegraf | grep statsd`
   - Check table creation: connect to the database and run `\dt` (the metric tables live in the `public` schema)

3. **Performance issues:**
   - Check chunk sizes and adjust `chunk_time_interval` if needed
   - Verify compression is working
   - Consider creating indexes on frequently queried columns

## Integration with QuickDID

To integrate with QuickDID, configure it to send metrics to the Telegraf StatsD endpoint:

```bash
# Set environment variables for QuickDID
export METRICS_ADAPTER=statsd
export METRICS_STATSD_HOST=localhost:8125
export METRICS_PREFIX=quickdid.
export METRICS_TAGS=env:production,service:quickdid

# Start QuickDID
cargo run
```

QuickDID will automatically send metrics to Telegraf, which will store them in TimescaleDB for analysis.

## Backup and Restore

### Backup

Create a backup of your metrics data:

```bash
# Backup the entire database
docker exec timescaledb pg_dump -U postgres metrics > metrics_backup.sql

# Export a time range from one metric table
# (pg_dump cannot filter rows, so use COPY instead)
docker exec timescaledb psql -U postgres -d metrics -c \
  "COPY (SELECT * FROM \"quickdid.http.request.count\" WHERE time >= '2024-01-01' AND time < '2024-02-01') TO STDOUT (FORMAT csv, HEADER)" \
  > metrics_january.csv
```
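
To automate dated backups with simple retention, here is a hedged Python sketch; the file naming scheme and the 30-day window are arbitrary choices, not part of this stack:

```python
import datetime
import pathlib
import subprocess

BACKUP_DIR = pathlib.Path("backups")
KEEP_DAYS = 30  # arbitrary retention window

def backup_name(day):
    """Dated, lexically sortable filename, e.g. metrics_backup_2024-01-31.sql"""
    return f"metrics_backup_{day.isoformat()}.sql"

def run_backup(today=None):
    """Dump the metrics database via docker exec, then prune old dumps."""
    today = today or datetime.date.today()
    BACKUP_DIR.mkdir(exist_ok=True)
    out = BACKUP_DIR / backup_name(today)
    with out.open("wb") as fh:
        subprocess.run(
            ["docker", "exec", "timescaledb", "pg_dump", "-U", "postgres", "metrics"],
            stdout=fh, check=True,
        )
    # Prune backups older than the retention window, judging by filename date
    cutoff = today - datetime.timedelta(days=KEEP_DAYS)
    for old in BACKUP_DIR.glob("metrics_backup_*.sql"):
        stamp = old.stem.replace("metrics_backup_", "")
        if datetime.date.fromisoformat(stamp) < cutoff:
            old.unlink()
    return out
```

Run it from cron or a systemd timer on the Docker host.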

### Restore

Restore from a backup:

```bash
# Restore the database
docker exec -i timescaledb psql -U postgres metrics < metrics_backup.sql
```

For full-database dumps, TimescaleDB recommends wrapping the restore with `SELECT timescaledb_pre_restore();` before loading and `SELECT timescaledb_post_restore();` afterwards.

## Security Considerations

1. **Use strong passwords:** Replace the default passwords in `.env`
2. **Enable SSL:** Configure `sslmode=require` in production
3. **Network isolation:** Use Docker networks to isolate services
4. **Access control:** Limit the database user's permissions to the minimum required
5. **Regular updates:** Keep Docker images updated for security patches

## Performance Tuning

### TimescaleDB Configuration

Add these settings to tune PostgreSQL for your workload, scaling the memory values to your host:

```yaml
# In docker-compose.yml under the timescaledb service
command:
  - postgres
  - -c
  - shared_buffers=1GB
  - -c
  - effective_cache_size=3GB
  - -c
  - maintenance_work_mem=512MB
  - -c
  - work_mem=32MB
  - -c
  - timescaledb.max_background_workers=8
  - -c
  - max_parallel_workers_per_gather=2
  - -c
  - max_parallel_workers=8
```
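
The memory sizes above follow the common heuristic of roughly 25% of RAM for `shared_buffers` and 75% for `effective_cache_size`, so they imply about 4 GB for this container. A small sketch to derive values for other hosts; the percentages are rules of thumb, not hard requirements:

```python
def pg_memory_settings(ram_gb):
    """Rule-of-thumb sizing: 25% of RAM for shared_buffers, 75% for effective_cache_size."""
    return {
        "shared_buffers": f"{ram_gb * 0.25:g}GB",
        "effective_cache_size": f"{ram_gb * 0.75:g}GB",
    }

print(pg_memory_settings(4))   # {'shared_buffers': '1GB', 'effective_cache_size': '3GB'}
print(pg_memory_settings(16))  # {'shared_buffers': '4GB', 'effective_cache_size': '12GB'}
```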

### Telegraf Buffer Settings

Adjust the buffer settings based on your metric volume:

```toml
[agent]
  metric_batch_size = 5000       # Increase for high volume
  metric_buffer_limit = 100000   # Increase buffer size
  flush_interval = "5s"          # Decrease for more frequent writes
```
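
`metric_buffer_limit` also bounds how long Telegraf can ride out a database outage before it starts dropping metrics. A quick sanity check, where the ingest rate is an assumed example figure:

```python
metric_buffer_limit = 100_000   # metrics held while the output is down
ingest_rate = 1_000             # assumed metrics per second

# Seconds of outage the buffer absorbs before metrics are dropped
headroom_s = metric_buffer_limit / ingest_rate
print(headroom_s)  # 100.0
```

If your expected outages are longer than this headroom, raise the limit or add a durable queue in front of Telegraf.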

## Conclusion

This setup provides a robust metrics collection and storage solution using:

- **Telegraf** for flexible metric collection via StatsD
- **TimescaleDB** for efficient time-series data storage
- **Docker Compose** for easy deployment and management

TimescaleDB's compression, retention policies, and continuous aggregates give the stack room to grow; configure them to match your retention and query needs before treating the deployment as production-ready.