
netfilter: conntrack: refine gc worker heuristics, redux

This further refines the changes made to conntrack gc_worker in
commit e0df8cae6c16 ("netfilter: conntrack: refine gc worker heuristics").

The main idea of that change was to reduce the scan interval when evictions
take place.

However, on the reporter's setup, there are 1-2 million conntrack entries
in total and roughly 8k new (and closing) connections per second.

In this case we'll always evict at least one entry per gc cycle, so the
scan interval stays pinned at 1 jiffy because this test:

    } else if (expired_count) {
            gc_work->next_gc_run /= 2U;
            next_run = msecs_to_jiffies(1);

is true almost all the time.

Given we scan ~10k entries per run, it's clearly wrong to reduce the
interval based on a nonzero eviction count; it only wastes cpu cycles since
the vast majority of conntrack entries are not timed out.

Thus only look at the ratio (scanned entries vs. evicted entries) to make
a decision on whether to reduce or not.

Because the evictor is supposed to kick in only when the system turns idle
after a busy period, pick a high ratio -- 50% here. We thus keep the idea
of increasing the scan rate when it's likely that the table contains many
expired entries.

In order to not let timed-out entries hang around for too long
(important when event logging is used, in which case we want to destroy
events in a timely fashion), we now scan the full table within at most
GC_MAX_SCAN_JIFFIES (16 seconds) even in the worst-case scenario where all
timed-out entries sit in the same slot.

I tested this with a vm under synflood (with
sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3).

While flood is ongoing, interval now stays at its max rate
(GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms).

With feedback from Nicolas Dichtel.

Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Fixes: b87a2f9199ea82eaadc ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

authored by Florian Westphal, committed by Pablo Neira Ayuso
e5072053 524b698d

+20 -19
net/netfilter/nf_conntrack_core.c
···
 static __read_mostly bool nf_conntrack_locks_all;

 /* every gc cycle scans at most 1/GC_MAX_BUCKETS_DIV part of table */
-#define GC_MAX_BUCKETS_DIV	64u
-/* upper bound of scan intervals */
-#define GC_INTERVAL_MAX	(2 * HZ)
+#define GC_MAX_BUCKETS_DIV	128u
+/* upper bound of full table scan */
+#define GC_MAX_SCAN_JIFFIES	(16u * HZ)
+/* desired ratio of entries found to be expired */
+#define GC_EVICT_RATIO	50u

 static struct conntrack_gc_work conntrack_gc_work;
···
 static void gc_worker(struct work_struct *work)
 {
+	unsigned int min_interval = max(HZ / GC_MAX_BUCKETS_DIV, 1u);
 	unsigned int i, goal, buckets = 0, expired_count = 0;
 	struct conntrack_gc_work *gc_work;
 	unsigned int ratio, scanned = 0;
···
 	 * 1. Minimize time until we notice a stale entry
 	 * 2. Maximize scan intervals to not waste cycles
 	 *
-	 * Normally, expired_count will be 0, this increases the next_run time
-	 * to priorize 2) above.
+	 * Normally, expire ratio will be close to 0.
 	 *
-	 * As soon as a timed-out entry is found, move towards 1) and increase
-	 * the scan frequency.
-	 * In case we have lots of evictions next scan is done immediately.
+	 * As soon as a sizeable fraction of the entries have expired
+	 * increase scan frequency.
 	 */
 	ratio = scanned ? expired_count * 100 / scanned : 0;
-	if (ratio >= 90) {
-		gc_work->next_gc_run = 0;
-		next_run = 0;
-	} else if (expired_count) {
-		gc_work->next_gc_run /= 2U;
-		next_run = msecs_to_jiffies(1);
+	if (ratio > GC_EVICT_RATIO) {
+		gc_work->next_gc_run = min_interval;
 	} else {
-		if (gc_work->next_gc_run < GC_INTERVAL_MAX)
-			gc_work->next_gc_run += msecs_to_jiffies(1);
+		unsigned int max = GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV;

-		next_run = gc_work->next_gc_run;
+		BUILD_BUG_ON((GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV) == 0);
+
+		gc_work->next_gc_run += min_interval;
+		if (gc_work->next_gc_run > max)
+			gc_work->next_gc_run = max;
 	}

+	next_run = gc_work->next_gc_run;
 	gc_work->last_bucket = i;
 	queue_delayed_work(system_long_wq, &gc_work->dwork, next_run);
 }
···
 static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
 {
 	INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
-	gc_work->next_gc_run = GC_INTERVAL_MAX;
+	gc_work->next_gc_run = HZ;
 	gc_work->exiting = false;
 }
···
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);

 	conntrack_gc_work_init(&conntrack_gc_work);
-	queue_delayed_work(system_long_wq, &conntrack_gc_work.dwork, GC_INTERVAL_MAX);
+	queue_delayed_work(system_long_wq, &conntrack_gc_work.dwork, HZ);

 	return 0;