Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.

2) Add DATA_SENT state to SCTP connection tracking helper, from
Sriram Yagnaraman.

3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.

4) Add bitmask support for ipset, from Vishwanath Pai.

5) Handle icmpv6 redirects as RELATED, from Florian Westphal.

6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
from Li Qiong.

7) A large batch of IPVS updates to replace the timer-based estimators
with kthreads, scaling with the number of CPUs and the workload
(millions of estimators).

Julian Anastasov says:

This patchset implements stats estimation in kthread context.
It replaces the code that runs on a single CPU in timer context every 2
seconds, causing latency splats as shown in reports [1], [2], [3].
The solution targets setups with thousands of IPVS services,
destinations and multi-CPU boxes.

Spread the estimation over multiple (configured) CPUs and multiple
time slots (timer ticks) by using multiple chains organized under RCU
rules. When stats are not needed, it is still recommended to set
run_estimation=0, as was already possible before this change.

RCU Locking:

- As stats are now RCU-protected, tot_stats, svc and dest, which
hold estimator structures, are now always freed from an RCU
callback. This ensures an RCU grace period after the
ip_vs_stop_estimator() call.

Kthread data:

- every kthread works on its own data structure, and all
such structures are attached to an array. For now, the number
of kthreads is limited based on the number of CPUs.

- even when a kthread structure exists, its task
may not be running, e.g. before the first service is added,
while the sysctl var is set to an empty cpulist, or
when run_estimation is set to 0 to disable the estimation.

- the allocated kthread context may grow from 1 to 50
allocated structures for timer ticks, which saves memory for
setups with a small number of estimators

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows a limited number
of estimators. Kthread 0 is also used to initially
calculate the max number of estimators to allow in every
chain, considering a sub-100-microsecond cond_resched
rate. This number can range from 1 to hundreds.
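The sizing described above follows from a few constants in the patch; this is a numeric sketch in Python (HZ=1000 is an assumption here, and `align_down` is a stand-in for the kernel's ALIGN_DOWN):

```python
# Sketch of the IPVS estimator tick/chain sizing from the patch.
# Helper name is illustrative; the constants mirror the kernel defines.

def align_down(x, a):
    """Round x down to a multiple of a (like the kernel's ALIGN_DOWN)."""
    return x - (x % a)

IPVS_EST_NTICKS = 50                            # ticks per 2-second period
HZ = 1000                                       # assumed jiffies rate
IPVS_EST_TICK = (2 * HZ) // IPVS_EST_NTICKS     # 40ms per tick, in jiffies
IPVS_EST_LOAD_DIVISOR = 8                       # target 12.5% CPU load

# Chain load factor in 100us units: 2s * 1000ms * 10 / 8 / 50 = 50 -> 48
IPVS_EST_CHAIN_FACTOR = align_down(
    2 * 1000 * 10 // IPVS_EST_LOAD_DIVISOR // IPVS_EST_NTICKS, 8)

print(IPVS_EST_TICK)          # 40
print(IPVS_EST_CHAIN_FACTOR)  # 48
```

So each tick is 40ms of the 2-second period, and a chain budget of 48 units of 100us (4.8ms) keeps the per-tick work near the 12.5% load target.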

- kthread 0 has the additional job of optimizing the
adding of estimators: they are first added to a
temp list (est_temp_list), and kthread 0 later
distributes them to the other kthreads. The optimization
is based on the fact that a newly added estimator
needs to be estimated only after 2 seconds, so there is
time to offload the chain insertion from the controlling
process to kthread 0.

- to add new estimators, we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, a common case when all
services and dests are initially configured, we may
spread the estimators over more chains and, as a result,
reduce the initial delay below 2 seconds.
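The add_row placement can be modeled as a ring of ticks (a toy Python model, not the kernel code; names are mine):

```python
# Toy model: estimators live on a ring of 50 ticks; one tick is processed
# every 40ms, so a full pass takes 2 seconds. Linking a new estimator at
# the row just before the currently estimated one means the scanner
# reaches it only after (almost) a full pass, i.e. ~2 seconds later.

NTICKS = 50
TICK_MS = 40

def delay_until_estimated(est_row, add_row):
    """Ticks until a scanner currently at est_row reaches add_row."""
    return (add_row - est_row) % NTICKS or NTICKS

est_row = 7                        # tick currently being processed
add_row = (est_row - 1) % NTICKS   # insert just before it
ticks = delay_until_estimated(est_row, add_row)
print(ticks * TICK_MS)             # 1960 ms, just under 2 seconds
```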

Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.

The new IPVS estimators do not use workqueue infrastructure
because:

- The estimation can take a long time when using many IPVS rules (e.g.
millions of estimator structures), especially when the box has multiple
CPUs, due to the for_each_possible_cpu usage that expects packets from
any CPU. With the est_nice sysctl we have more control over how to
prioritize the estimation kthreads compared to other processes/kthreads
that have latency requirements (such as servers). As a benefit, we can
see these kthreads in top and decide whether further control is needed
to limit their CPU usage (max number of structures to estimate per
kthread).

- with kthreads we run read-mostly code, with no write/lock
operations needed to process the estimators at 2-second intervals.

- work items are one-shot: as estimators are processed every
2 seconds, they would need to be re-added every time. This again
loads the timers (add_timer) if we use delayed work, as there are
no kthreads to do the timing.

[1] Report from Yunhong Jiang:
https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
ipvs: run_estimation should control the kthread tasks
ipvs: add est_cpulist and est_nice sysctl vars
ipvs: use kthreads for stats estimation
ipvs: use u64_stats_t for the per-cpu counters
ipvs: use common functions for stats allocation
ipvs: add rcu protection to stats
netfilter: flowtable: add a 'default' case to flowtable datapath
netfilter: conntrack: set icmpv6 redirects as RELATED
netfilter: ipset: Add support for new bitmask parameter
netfilter: conntrack: merge ipv4+ipv6 confirm functions
netfilter: conntrack: add sctp DATA_SENT state
netfilter: nft_inner: fix IS_ERR() vs NULL check
====================

Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

+1738 -367
+22 -2
Documentation/networking/ipvs-sysctl.rst
···
129 129 threshold. When the mode 3 is set, the always mode drop rate
130 130 is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
131 131
132 + est_cpulist - CPULIST
133 +     Allowed CPUs for estimation kthreads
134 +
135 +     Syntax: standard cpulist format
136 +     empty list - stop kthread tasks and estimation
137 +     default - the system's housekeeping CPUs for kthreads
138 +
139 +     Example:
140 +     "all": all possible CPUs
141 +     "0-N": all possible CPUs, N denotes last CPU number
142 +     "0,1-N:1/2": first and all CPUs with odd number
143 +     "": empty list
144 +
145 + est_nice - INTEGER
146 +     default 0
147 +     Valid range: -20 (more favorable) .. 19 (less favorable)
148 +
149 +     Niceness value to use for the estimation kthreads (scheduling
150 +     priority)
151 +
132 152 expire_nodest_conn - BOOLEAN
133 153     - 0 - disabled (default)
134 154     - not 0 - enabled
···
324 304     0 - disabled
325 305     not 0 - enabled (default)
326 306
327 -     If disabled, the estimation will be stop, and you can't see
328 -     any update on speed estimation data.
307 +     If disabled, the estimation will be suspended and kthread tasks
308 +     stopped.
329 309
330 310     You can always re-enable estimation by setting this value to 1.
331 311     But be careful, the first estimation after re-enable is not
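The cpulist syntax in the new est_cpulist doc can be illustrated with a small parser (a sketch, not the kernel's parser; it handles plain IDs, ranges, and the `a-b:used/group` stride form from the example, but not keywords like "all"):

```python
def parse_cpulist(s):
    """Parse a kernel-style cpulist: comma-separated N, A-B, or
    A-B:used/group (take 'used' CPUs from every 'group' CPUs)."""
    cpus = set()
    if not s:
        return cpus  # empty list selects no CPUs
    for entry in s.split(","):
        stride = None
        if ":" in entry:
            entry, stride = entry.split(":")
            used, group = (int(x) for x in stride.split("/"))
        if "-" in entry:
            a, b = (int(x) for x in entry.split("-"))
        else:
            a = b = int(entry)
        if stride is None:
            cpus.update(range(a, b + 1))
        else:
            for start in range(a, b + 1, group):
                cpus.update(range(start, min(start + used, b + 1)))
    return cpus

# The doc's "0,1-N:1/2" example with N=7: CPU 0 plus every odd CPU
assert sorted(parse_cpulist("0,1-7:1/2")) == [0, 1, 3, 5, 7]
```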
+10
include/linux/netfilter/ipset/ip_set.h
···
515 515     *skbinfo = ext->skbinfo;
516 516 }
517 517
518 + static inline void
519 + nf_inet_addr_mask_inplace(union nf_inet_addr *a1,
520 +                           const union nf_inet_addr *mask)
521 + {
522 +     a1->all[0] &= mask->all[0];
523 +     a1->all[1] &= mask->all[1];
524 +     a1->all[2] &= mask->all[2];
525 +     a1->all[3] &= mask->all[3];
526 + }
527 +
518 528 #define IP_SET_INIT_KEXT(skb, opt, set) \
519 529     { .bytes = (skb)->len, .packets = 1, .target = true,\
520 530       .timeout = ip_set_adt_opt_timeout(opt, set) }
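The nf_inet_addr_mask_inplace() helper added here simply ANDs the four 32-bit words of an address with a bitmask; a Python sketch of the same operation on a word array (helper name is mine):

```python
def addr_mask_inplace(addr, mask):
    """AND each 32-bit word of addr with mask, mirroring the
    nf_inet_addr_mask_inplace() helper."""
    for i in range(4):
        addr[i] &= mask[i]

# An IPv6 address as four 32-bit words, masked to its upper 64 bits
addr = [0x20010db8, 0x00000001, 0xdeadbeef, 0x00000001]
mask = [0xffffffff, 0xffffffff, 0x00000000, 0x00000000]
addr_mask_inplace(addr, mask)
assert addr == [0x20010db8, 0x00000001, 0, 0]  # lower 64 bits cleared
```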
+160 -11
include/net/ip_vs.h
···
29 29 #include <net/netfilter/nf_conntrack.h>
30 30 #endif
31 31 #include <net/net_namespace.h> /* Netw namespace */
32 + #include <linux/sched/isolation.h>
32 33
33 34 #define IP_VS_HDR_INVERSE 1
34 35 #define IP_VS_HDR_ICMP 2
···
42 41
43 42 /* Connections' size value needed by ip_vs_ctl.c */
44 43 extern int ip_vs_conn_tab_size;
44 +
45 + extern struct mutex __ip_vs_mutex;
45 46
46 47 struct ip_vs_iphdr {
47 48     int hdr_flags; /* ipvs flags */
···
354 351
355 352 /* counters per cpu */
356 353 struct ip_vs_counters {
357 -     __u64 conns; /* connections scheduled */
358 -     __u64 inpkts; /* incoming packets */
359 -     __u64 outpkts; /* outgoing packets */
360 -     __u64 inbytes; /* incoming bytes */
361 -     __u64 outbytes; /* outgoing bytes */
354 +     u64_stats_t conns; /* connections scheduled */
355 +     u64_stats_t inpkts; /* incoming packets */
356 +     u64_stats_t outpkts; /* outgoing packets */
357 +     u64_stats_t inbytes; /* incoming bytes */
358 +     u64_stats_t outbytes; /* outgoing bytes */
362 359 };
363 360 /* Stats per cpu */
364 361 struct ip_vs_cpu_stats {
···
366 363     struct u64_stats_sync syncp;
367 364 };
368 365
366 + /* Default nice for estimator kthreads */
367 + #define IPVS_EST_NICE 0
368 +
369 369 /* IPVS statistics objects */
370 370 struct ip_vs_estimator {
371 -     struct list_head list;
371 +     struct hlist_node list;
372 372
373 373     u64 last_inbytes;
374 374     u64 last_outbytes;
···
384 378     u64 outpps;
385 379     u64 inbps;
386 380     u64 outbps;
381 +
382 +     s32 ktid:16,  /* kthread ID, -1=temp list */
383 +         ktrow:8,  /* row/tick ID for kthread */
384 +         ktcid:8;  /* chain ID for kthread tick */
387 385 };
388 386
389 387 /*
···
413 403     struct ip_vs_cpu_stats __percpu *cpustats; /* per cpu counters */
414 404     spinlock_t lock; /* spin lock */
415 405     struct ip_vs_kstats kstats0; /* reset values */
406 + };
407 +
408 + struct ip_vs_stats_rcu {
409 +     struct ip_vs_stats s;
410 +     struct rcu_head rcu_head;
411 + };
412 +
413 + int ip_vs_stats_init_alloc(struct ip_vs_stats *s);
414 + struct ip_vs_stats *ip_vs_stats_alloc(void);
415 + void ip_vs_stats_release(struct ip_vs_stats *stats);
416 + void ip_vs_stats_free(struct ip_vs_stats *stats);
417 +
418 + /* Process estimators in multiple timer ticks (20/50/100, see ktrow) */
419 + #define IPVS_EST_NTICKS 50
420 + /* Estimation uses a 2-second period containing ticks (in jiffies) */
421 + #define IPVS_EST_TICK ((2 * HZ) / IPVS_EST_NTICKS)
422 +
423 + /* Limit of CPU load per kthread (8 for 12.5%), ratio of CPU capacity (1/C).
424 +  * Value of 4 and above ensures kthreads will take work without exceeding
425 +  * the CPU capacity under different circumstances.
426 +  */
427 + #define IPVS_EST_LOAD_DIVISOR 8
428 +
429 + /* Kthreads should not have work that exceeds the CPU load above 50% */
430 + #define IPVS_EST_CPU_KTHREADS (IPVS_EST_LOAD_DIVISOR / 2)
431 +
432 + /* Desired number of chains per timer tick (chain load factor in 100us units),
433 +  * 48=4.8ms of 40ms tick (12% CPU usage):
434 +  * 2 sec * 1000 ms in sec * 10 (100us in ms) / 8 (12.5%) / 50
435 +  */
436 + #define IPVS_EST_CHAIN_FACTOR \
437 +     ALIGN_DOWN(2 * 1000 * 10 / IPVS_EST_LOAD_DIVISOR / IPVS_EST_NTICKS, 8)
438 +
439 + /* Compiled number of chains per tick
440 +  * The defines should match cond_resched_rcu
441 +  */
442 + #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
443 + #define IPVS_EST_TICK_CHAINS IPVS_EST_CHAIN_FACTOR
444 + #else
445 + #define IPVS_EST_TICK_CHAINS 1
446 + #endif
447 +
448 + #if IPVS_EST_NTICKS > 127
449 + #error Too many timer ticks for ktrow
450 + #endif
451 +
452 + /* Multiple chains processed in same tick */
453 + struct ip_vs_est_tick_data {
454 +     struct hlist_head chains[IPVS_EST_TICK_CHAINS];
455 +     DECLARE_BITMAP(present, IPVS_EST_TICK_CHAINS);
456 +     DECLARE_BITMAP(full, IPVS_EST_TICK_CHAINS);
457 +     int chain_len[IPVS_EST_TICK_CHAINS];
458 + };
459 +
460 + /* Context for estimation kthread */
461 + struct ip_vs_est_kt_data {
462 +     struct netns_ipvs *ipvs;
463 +     struct task_struct *task; /* task if running */
464 +     struct ip_vs_est_tick_data __rcu *ticks[IPVS_EST_NTICKS];
465 +     DECLARE_BITMAP(avail, IPVS_EST_NTICKS); /* tick has space for ests */
466 +     unsigned long est_timer; /* estimation timer (jiffies) */
467 +     struct ip_vs_stats *calc_stats; /* Used for calculation */
468 +     int tick_len[IPVS_EST_NTICKS]; /* est count */
469 +     int id; /* ktid per netns */
470 +     int chain_max; /* max ests per tick chain */
471 +     int tick_max; /* max ests per tick */
472 +     int est_count; /* attached ests to kthread */
473 +     int est_max_count; /* max ests per kthread */
474 +     int add_row; /* row for new ests */
475 +     int est_row; /* estimated row */
416 476 };
417 477
418 478 struct dst_entry;
···
768 688     union nf_inet_addr vaddr; /* virtual IP address */
769 689     __u32 vfwmark; /* firewall mark of service */
770 690
691 +     struct rcu_head rcu_head;
771 692     struct list_head t_list; /* in dest_trash */
772 693     unsigned int in_rs_table:1; /* we are in rs_table */
773 694 };
···
950 869     atomic_t conn_count; /* connection counter */
951 870
952 871     /* ip_vs_ctl */
953 -     struct ip_vs_stats tot_stats; /* Statistics & est. */
872 +     struct ip_vs_stats_rcu *tot_stats; /* Statistics & est. */
954 873
955 874     int num_services; /* no of virtual services */
956 875     int num_services6; /* IPv6 virtual services */
···
1013 932     int sysctl_schedule_icmp;
1014 933     int sysctl_ignore_tunneled;
1015 934     int sysctl_run_estimation;
935 + #ifdef CONFIG_SYSCTL
936 +     cpumask_var_t sysctl_est_cpulist; /* kthread cpumask */
937 +     int est_cpulist_valid; /* cpulist set */
938 +     int sysctl_est_nice; /* kthread nice */
939 +     int est_stopped; /* stop tasks */
940 + #endif
1016 941
1017 942     /* ip_vs_lblc */
1018 943     int sysctl_lblc_expiration;
···
1029 942     struct ctl_table_header *lblcr_ctl_header;
1030 943     struct ctl_table *lblcr_ctl_table;
1031 944     /* ip_vs_est */
1032 -     struct list_head est_list; /* estimator list */
1033 -     spinlock_t est_lock;
1034 -     struct timer_list est_timer; /* Estimation timer */
945 +     struct delayed_work est_reload_work;/* Reload kthread tasks */
946 +     struct mutex est_mutex; /* protect kthread tasks */
947 +     struct hlist_head est_temp_list; /* Ests during calc phase */
948 +     struct ip_vs_est_kt_data **est_kt_arr; /* Array of kthread data ptrs */
949 +     unsigned long est_max_threads;/* Hard limit of kthreads */
950 +     int est_calc_phase; /* Calculation phase */
951 +     int est_chain_max; /* Calculated chain_max */
952 +     int est_kt_count; /* Allocated ptrs */
953 +     int est_add_ktid; /* ktid where to add ests */
954 +     atomic_t est_genid; /* kthreads reload genid */
955 +     atomic_t est_genid_done; /* applied genid */
1035 956     /* ip_vs_sync */
1036 957     spinlock_t sync_lock;
1037 958     struct ipvs_master_sync_state *ms;
···
1172 1077     return ipvs->sysctl_run_estimation;
1173 1078 }
1174 1079
1080 + static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
1081 + {
1082 +     if (ipvs->est_cpulist_valid)
1083 +         return ipvs->sysctl_est_cpulist;
1084 +     else
1085 +         return housekeeping_cpumask(HK_TYPE_KTHREAD);
1086 + }
1087 +
1088 + static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
1089 + {
1090 +     return ipvs->sysctl_est_nice;
1091 + }
1092 +
1175 1093 #else
1176 1094
1177 1095 static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
···
1280 1172 static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
1281 1173 {
1282 1174     return 1;
1175 + }
1176 +
1177 + static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
1178 + {
1179 +     return housekeeping_cpumask(HK_TYPE_KTHREAD);
1180 + }
1181 +
1182 + static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
1183 + {
1184 +     return IPVS_EST_NICE;
1283 1185 }
1284 1186
1285 1187 #endif
···
1593 1475 void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct ip_vs_conn *cp, int pkts);
1594 1476
1595 1477 /* IPVS rate estimator prototypes (from ip_vs_est.c) */
1596 - void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
1478 + int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
1597 1479 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
1598 1480 void ip_vs_zero_estimator(struct ip_vs_stats *stats);
1599 1481 void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
1482 + void ip_vs_est_reload_start(struct netns_ipvs *ipvs);
1483 + int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
1484 +                             struct ip_vs_est_kt_data *kd);
1485 + void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
1486 +
1487 + static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
1488 + {
1489 + #ifdef CONFIG_SYSCTL
1490 +     /* Stop tasks while cpulist is empty or if disabled with flag */
1491 +     ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
1492 +                         (ipvs->est_cpulist_valid &&
1493 +                          cpumask_empty(sysctl_est_cpulist(ipvs)));
1494 + #endif
1495 + }
1496 +
1497 + static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
1498 + {
1499 + #ifdef CONFIG_SYSCTL
1500 +     return ipvs->est_stopped;
1501 + #else
1502 +     return false;
1503 + #endif
1504 + }
1505 +
1506 + static inline int ip_vs_est_max_threads(struct netns_ipvs *ipvs)
1507 + {
1508 +     unsigned int limit = IPVS_EST_CPU_KTHREADS *
1509 +                          cpumask_weight(sysctl_est_cpulist(ipvs));
1510 +
1511 +     return max(1U, limit);
1512 + }
1600 1513
1601 1514 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
1602 1515 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
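The ip_vs_est_max_threads() limit works out simply from the load divisor; a Python sketch (cpumask_weight is just the number of allowed CPUs):

```python
# Mirror of the kthread hard-limit arithmetic in ip_vs_est_max_threads():
# up to 4 kthreads per allowed CPU (LOAD_DIVISOR 8 -> 12.5% load each,
# capped at 50% total per CPU), with a floor of 1.

IPVS_EST_LOAD_DIVISOR = 8
IPVS_EST_CPU_KTHREADS = IPVS_EST_LOAD_DIVISOR // 2  # 4

def est_max_threads(n_allowed_cpus):
    """Hard limit of estimation kthreads for a given cpulist weight."""
    return max(1, IPVS_EST_CPU_KTHREADS * n_allowed_cpus)

print(est_max_threads(16))  # 64
print(est_max_threads(0))   # 1
```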
+1 -2
include/net/netfilter/nf_conntrack_core.h
···
71 71     return ret;
72 72 }
73 73
74 - unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff,
75 -                         struct nf_conn *ct, enum ip_conntrack_info ctinfo);
74 + unsigned int nf_confirm(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
76 75
77 76 void print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
78 77                  const struct nf_conntrack_l4proto *proto);
+2
include/uapi/linux/netfilter/ipset/ip_set.h
···
85 85     IPSET_ATTR_CADT_LINENO = IPSET_ATTR_LINENO, /* 9 */
86 86     IPSET_ATTR_MARK, /* 10 */
87 87     IPSET_ATTR_MARKMASK, /* 11 */
88 +     IPSET_ATTR_BITMASK, /* 12 */
88 89     /* Reserve empty slots */
89 90     IPSET_ATTR_CADT_MAX = 16,
90 91     /* Create-only specific attributes */
···
154 153     IPSET_ERR_COMMENT,
155 154     IPSET_ERR_INVALID_MARKMASK,
156 155     IPSET_ERR_SKBINFO,
156 +     IPSET_ERR_BITMASK_NETMASK_EXCL,
157 157
158 158     /* Type specific error codes */
159 159     IPSET_ERR_TYPE_SPECIFIC = 4352,
+1
include/uapi/linux/netfilter/nf_conntrack_sctp.h
···
16 16     SCTP_CONNTRACK_SHUTDOWN_ACK_SENT,
17 17     SCTP_CONNTRACK_HEARTBEAT_SENT,
18 18     SCTP_CONNTRACK_HEARTBEAT_ACKED,
19 +     SCTP_CONNTRACK_DATA_SENT,
19 20     SCTP_CONNTRACK_MAX
20 21 };
21 22
+1 -31
net/bridge/netfilter/nf_conntrack_bridge.c
···
366 366     return br_dev_queue_push_xmit(net, sk, skb);
367 367 }
368 368
369 - static unsigned int nf_ct_bridge_confirm(struct sk_buff *skb)
370 - {
371 -     enum ip_conntrack_info ctinfo;
372 -     struct nf_conn *ct;
373 -     int protoff;
374 -
375 -     ct = nf_ct_get(skb, &ctinfo);
376 -     if (!ct || ctinfo == IP_CT_RELATED_REPLY)
377 -         return nf_conntrack_confirm(skb);
378 -
379 -     switch (skb->protocol) {
380 -     case htons(ETH_P_IP):
381 -         protoff = skb_network_offset(skb) + ip_hdrlen(skb);
382 -         break;
383 -     case htons(ETH_P_IPV6): {
384 -         unsigned char pnum = ipv6_hdr(skb)->nexthdr;
385 -         __be16 frag_off;
386 -
387 -         protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &pnum,
388 -                                    &frag_off);
389 -         if (protoff < 0 || (frag_off & htons(~0x7)) != 0)
390 -             return nf_conntrack_confirm(skb);
391 -     }
392 -         break;
393 -     default:
394 -         return NF_ACCEPT;
395 -     }
396 -     return nf_confirm(skb, protoff, ct, ctinfo);
397 - }
398 -
399 369 static unsigned int nf_ct_bridge_post(void *priv, struct sk_buff *skb,
400 370                                       const struct nf_hook_state *state)
401 371 {
402 372     int ret;
403 373
404 -     ret = nf_ct_bridge_confirm(skb);
374 +     ret = nf_confirm(priv, skb, state);
405 375     if (ret != NF_ACCEPT)
406 376         return ret;
407 377
+62 -9
net/netfilter/ipset/ip_set_hash_gen.h
···
159 159     (SET_WITH_TIMEOUT(set) && \
160 160      ip_set_timeout_expired(ext_timeout(d, set)))
161 161
162 + #if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
163 + static const union nf_inet_addr onesmask = {
164 +     .all[0] = 0xffffffff,
165 +     .all[1] = 0xffffffff,
166 +     .all[2] = 0xffffffff,
167 +     .all[3] = 0xffffffff
168 + };
169 +
170 + static const union nf_inet_addr zeromask = {};
171 + #endif
172 +
162 173 #endif /* _IP_SET_HASH_GEN_H */
163 174
164 175 #ifndef MTYPE
···
294 283     u32 markmask; /* markmask value for mark mask to store */
295 284 #endif
296 285     u8 bucketsize; /* max elements in an array block */
297 - #ifdef IP_SET_HASH_WITH_NETMASK
286 + #if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
298 287     u8 netmask; /* netmask value for subnets to store */
288 +     union nf_inet_addr bitmask; /* stores bitmask */
299 289 #endif
300 290     struct list_head ad; /* Resize add|del backlist */
301 291     struct mtype_elem next; /* temporary storage for uadd */
···
471 459     /* Resizing changes htable_bits, so we ignore it */
472 460     return x->maxelem == y->maxelem &&
473 461            a->timeout == b->timeout &&
474 - #ifdef IP_SET_HASH_WITH_NETMASK
475 -            x->netmask == y->netmask &&
462 + #if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
463 +            nf_inet_addr_cmp(&x->bitmask, &y->bitmask) &&
476 464 #endif
477 465 #ifdef IP_SET_HASH_WITH_MARKMASK
478 466            x->markmask == y->markmask &&
···
1276 1264               htonl(jhash_size(htable_bits))) ||
1277 1265         nla_put_net32(skb, IPSET_ATTR_MAXELEM, htonl(h->maxelem)))
1278 1266         goto nla_put_failure;
1267 + #ifdef IP_SET_HASH_WITH_BITMASK
1268 +     /* if netmask is set to anything other than HOST_MASK we know that the user supplied netmask
1269 +      * and not bitmask. These two are mutually exclusive. */
1270 +     if (h->netmask == HOST_MASK && !nf_inet_addr_cmp(&onesmask, &h->bitmask)) {
1271 +         if (set->family == NFPROTO_IPV4) {
1272 +             if (nla_put_ipaddr4(skb, IPSET_ATTR_BITMASK, h->bitmask.ip))
1273 +                 goto nla_put_failure;
1274 +         } else if (set->family == NFPROTO_IPV6) {
1275 +             if (nla_put_ipaddr6(skb, IPSET_ATTR_BITMASK, &h->bitmask.in6))
1276 +                 goto nla_put_failure;
1277 +         }
1278 +     }
1279 + #endif
1279 1280 #ifdef IP_SET_HASH_WITH_NETMASK
1280 -     if (h->netmask != HOST_MASK &&
1281 -         nla_put_u8(skb, IPSET_ATTR_NETMASK, h->netmask))
1281 +     if (h->netmask != HOST_MASK && nla_put_u8(skb, IPSET_ATTR_NETMASK, h->netmask))
1282 1282         goto nla_put_failure;
1283 1283 #endif
1284 1284 #ifdef IP_SET_HASH_WITH_MARKMASK
···
1453 1429     u32 markmask;
1454 1430 #endif
1455 1431     u8 hbits;
1456 - #ifdef IP_SET_HASH_WITH_NETMASK
1457 -     u8 netmask;
1432 + #if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
1433 +     int ret __attribute__((unused)) = 0;
1434 +     u8 netmask = set->family == NFPROTO_IPV4 ? 32 : 128;
1435 +     union nf_inet_addr bitmask = onesmask;
1458 1436 #endif
1459 1437     size_t hsize;
1460 1438     struct htype *h;
···
1494 1468 #endif
1495 1469
1496 1470 #ifdef IP_SET_HASH_WITH_NETMASK
1497 -     netmask = set->family == NFPROTO_IPV4 ? 32 : 128;
1498 1471     if (tb[IPSET_ATTR_NETMASK]) {
1499 1472         netmask = nla_get_u8(tb[IPSET_ATTR_NETMASK]);
1500 1473
1501 1474         if ((set->family == NFPROTO_IPV4 && netmask > 32) ||
1502 1475             (set->family == NFPROTO_IPV6 && netmask > 128) ||
1503 1476             netmask == 0)
1477 +             return -IPSET_ERR_INVALID_NETMASK;
1478 +
1479 +         /* we convert netmask to bitmask and store it */
1480 +         if (set->family == NFPROTO_IPV4)
1481 +             bitmask.ip = ip_set_netmask(netmask);
1482 +         else
1483 +             ip6_netmask(&bitmask, netmask);
1484 +     }
1485 + #endif
1486 +
1487 + #ifdef IP_SET_HASH_WITH_BITMASK
1488 +     if (tb[IPSET_ATTR_BITMASK]) {
1489 +         /* bitmask and netmask do the same thing, allow only one of these options */
1490 +         if (tb[IPSET_ATTR_NETMASK])
1491 +             return -IPSET_ERR_BITMASK_NETMASK_EXCL;
1492 +
1493 +         if (set->family == NFPROTO_IPV4) {
1494 +             ret = ip_set_get_ipaddr4(tb[IPSET_ATTR_BITMASK], &bitmask.ip);
1495 +             if (ret || !bitmask.ip)
1496 +                 return -IPSET_ERR_INVALID_NETMASK;
1497 +         } else if (set->family == NFPROTO_IPV6) {
1498 +             ret = ip_set_get_ipaddr6(tb[IPSET_ATTR_BITMASK], &bitmask);
1499 +             if (ret || ipv6_addr_any(&bitmask.in6))
1500 +                 return -IPSET_ERR_INVALID_NETMASK;
1501 +         }
1502 +
1503 +         if (nf_inet_addr_cmp(&bitmask, &zeromask))
1504 1504             return -IPSET_ERR_INVALID_NETMASK;
1505 1505     }
1506 1506 #endif
···
1570 1518     for (i = 0; i < ahash_numof_locks(hbits); i++)
1571 1519         spin_lock_init(&t->hregion[i].lock);
1572 1520     h->maxelem = maxelem;
1573 - #ifdef IP_SET_HASH_WITH_NETMASK
1521 + #if defined(IP_SET_HASH_WITH_NETMASK) || defined(IP_SET_HASH_WITH_BITMASK)
1522 +     h->bitmask = bitmask;
1574 1523     h->netmask = netmask;
1575 1524 #endif
1576 1525 #ifdef IP_SET_HASH_WITH_MARKMASK
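The create path above converts a prefix-style netmask into the stored bitmask; a Python sketch of that conversion (a generic stand-in for ip_set_netmask()/ip6_netmask(), function name is mine):

```python
def prefix_to_bitmask(prefix, family_bits=32):
    """Convert a CIDR prefix length to a contiguous bitmask
    (a stand-in for the kernel's ip_set_netmask()/ip6_netmask())."""
    if not 0 < prefix <= family_bits:
        raise ValueError("invalid netmask")
    full = (1 << family_bits) - 1
    return (full << (family_bits - prefix)) & full

assert prefix_to_bitmask(24) == 0xffffff00                # /24 -> 255.255.255.0
assert prefix_to_bitmask(64, 128) >> 64 == (1 << 64) - 1  # upper 64 bits set
```

A user-supplied IPSET_ATTR_BITMASK skips this conversion and can be non-contiguous, which is why the two attributes are mutually exclusive.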
+8 -11
net/netfilter/ipset/ip_set_hash_ip.c
···
24 24 /* 2 Comments support */
25 25 /* 3 Forceadd support */
26 26 /* 4 skbinfo support */
27 - #define IPSET_TYPE_REV_MAX 5 /* bucketsize, initval support */
27 + /* 5 bucketsize, initval support */
28 + #define IPSET_TYPE_REV_MAX 6 /* bitmask support */
28 29
29 30 MODULE_LICENSE("GPL");
30 31 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@netfilter.org>");
···
35 34 /* Type specific function prefix */
36 35 #define HTYPE hash_ip
37 36 #define IP_SET_HASH_WITH_NETMASK
37 + #define IP_SET_HASH_WITH_BITMASK
38 38
39 39 /* IPv4 variant */
40 40
···
88 86     __be32 ip;
89 87
90 88     ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &ip);
91 -     ip &= ip_set_netmask(h->netmask);
89 +     ip &= h->bitmask.ip;
92 90     if (ip == 0)
93 91         return -EINVAL;
94 92
···
121 119     if (ret)
122 120         return ret;
123 121
124 -     ip &= ip_set_hostmask(h->netmask);
122 +     ip &= ntohl(h->bitmask.ip);
125 123     e.ip = htonl(ip);
126 124     if (e.ip == 0)
127 125         return -IPSET_ERR_HASH_ELEM;
···
187 185     return ipv6_addr_equal(&ip1->ip.in6, &ip2->ip.in6);
188 186 }
189 187
190 - static void
191 - hash_ip6_netmask(union nf_inet_addr *ip, u8 prefix)
192 - {
193 -     ip6_netmask(ip, prefix);
194 - }
195 -
196 188 static bool
197 189 hash_ip6_data_list(struct sk_buff *skb, const struct hash_ip6_elem *e)
198 190 {
···
223 227     struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
224 228
225 229     ip6addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip.in6);
226 -     hash_ip6_netmask(&e.ip, h->netmask);
230 +     nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
227 231     if (ipv6_addr_any(&e.ip.in6))
228 232         return -EINVAL;
229 233
···
262 266     if (ret)
263 267         return ret;
264 268
265 -     hash_ip6_netmask(&e.ip, h->netmask);
269 +     nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
266 270     if (ipv6_addr_any(&e.ip.in6))
267 271         return -IPSET_ERR_HASH_ELEM;
268 272
···
289 293     [IPSET_ATTR_RESIZE] = { .type = NLA_U8 },
290 294     [IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 },
291 295     [IPSET_ATTR_NETMASK] = { .type = NLA_U8 },
296 +     [IPSET_ATTR_BITMASK] = { .type = NLA_NESTED },
292 297     [IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 },
293 298 },
294 299 .adt_policy = {
+23 -1
net/netfilter/ipset/ip_set_hash_ipport.c
···
26 26 /* 3 Comments support added */
27 27 /* 4 Forceadd support added */
28 28 /* 5 skbinfo support added */
29 - #define IPSET_TYPE_REV_MAX 6 /* bucketsize, initval support added */
29 + /* 6 bucketsize, initval support added */
30 + #define IPSET_TYPE_REV_MAX 7 /* bitmask support added */
30 31
31 32 MODULE_LICENSE("GPL");
32 33 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@netfilter.org>");
···
36 35
37 36 /* Type specific function prefix */
38 37 #define HTYPE hash_ipport
38 + #define IP_SET_HASH_WITH_NETMASK
39 + #define IP_SET_HASH_WITH_BITMASK
39 40
40 41 /* IPv4 variant */
41 42
···
95 92     ipset_adtfn adtfn = set->variant->adt[adt];
96 93     struct hash_ipport4_elem e = { .ip = 0 };
97 94     struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
95 +     const struct MTYPE *h = set->data;
98 96
99 97     if (!ip_set_get_ip4_port(skb, opt->flags & IPSET_DIM_TWO_SRC,
100 98                              &e.port, &e.proto))
101 99         return -EINVAL;
102 100
103 101     ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip);
102 +     e.ip &= h->bitmask.ip;
103 +     if (e.ip == 0)
104 +         return -EINVAL;
104 105     return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
105 106 }
106 107
···
135 128     ret = ip_set_get_extensions(set, tb, &ext);
136 129     if (ret)
137 130         return ret;
131 +
132 +     e.ip &= h->bitmask.ip;
133 +     if (e.ip == 0)
134 +         return -EINVAL;
138 135
139 136     e.port = nla_get_be16(tb[IPSET_ATTR_PORT]);
140 137
···
264 253     ipset_adtfn adtfn = set->variant->adt[adt];
265 254     struct hash_ipport6_elem e = { .ip = { .all = { 0 } } };
266 255     struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
256 +     const struct MTYPE *h = set->data;
267 257
268 258     if (!ip_set_get_ip6_port(skb, opt->flags & IPSET_DIM_TWO_SRC,
269 259                              &e.port, &e.proto))
270 260         return -EINVAL;
271 261
272 262     ip6addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip.in6);
263 +     nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
264 +     if (ipv6_addr_any(&e.ip.in6))
265 +         return -EINVAL;
266 +
273 267     return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
274 268 }
275 269
···
313 297     ret = ip_set_get_extensions(set, tb, &ext);
314 298     if (ret)
315 299         return ret;
300 +
301 +     nf_inet_addr_mask_inplace(&e.ip, &h->bitmask);
302 +     if (ipv6_addr_any(&e.ip.in6))
303 +         return -EINVAL;
316 304
317 305     e.port = nla_get_be16(tb[IPSET_ATTR_PORT]);
318 306
···
376 356     [IPSET_ATTR_PROTO] = { .type = NLA_U8 },
377 357     [IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 },
378 358     [IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 },
359 +     [IPSET_ATTR_NETMASK] = { .type = NLA_U8 },
360 +     [IPSET_ATTR_BITMASK] = { .type = NLA_NESTED },
379 361 },
380 362 .adt_policy = {
381 363     [IPSET_ATTR_IP] = { .type = NLA_NESTED },
+21 -5
net/netfilter/ipset/ip_set_hash_netnet.c
···
23 23 #define IPSET_TYPE_REV_MIN 0
24 24 /* 1 Forceadd support added */
25 25 /* 2 skbinfo support added */
26 - #define IPSET_TYPE_REV_MAX 3 /* bucketsize, initval support added */
26 + /* 3 bucketsize, initval support added */
27 + #define IPSET_TYPE_REV_MAX 4 /* bitmask support added */
27 28
28 29 MODULE_LICENSE("GPL");
29 30 MODULE_AUTHOR("Oliver Smith <oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>");
···
34 33 /* Type specific function prefix */
35 34 #define HTYPE hash_netnet
36 35 #define IP_SET_HASH_WITH_NETS
36 + #define IP_SET_HASH_WITH_NETMASK
37 + #define IP_SET_HASH_WITH_BITMASK
37 38 #define IPSET_NET_COUNT 2
38 39
39 40 /* IPv4 variants */
···
156 153
157 154     ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip[0]);
158 155     ip4addrptr(skb, opt->flags & IPSET_DIM_TWO_SRC, &e.ip[1]);
159 -     e.ip[0] &= ip_set_netmask(e.cidr[0]);
160 -     e.ip[1] &= ip_set_netmask(e.cidr[1]);
156 +     e.ip[0] &= (ip_set_netmask(e.cidr[0]) & h->bitmask.ip);
157 +     e.ip[1] &= (ip_set_netmask(e.cidr[1]) & h->bitmask.ip);
161 158
162 159     return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
163 160 }
···
216 213
217 214     if (adt == IPSET_TEST || !(tb[IPSET_ATTR_IP_TO] ||
218 215                                tb[IPSET_ATTR_IP2_TO])) {
219 -         e.ip[0] = htonl(ip & ip_set_hostmask(e.cidr[0]));
220 -         e.ip[1] = htonl(ip2_from & ip_set_hostmask(e.cidr[1]));
216 +         e.ip[0] = htonl(ip & ntohl(h->bitmask.ip) & ip_set_hostmask(e.cidr[0]));
217 +         e.ip[1] = htonl(ip2_from & ntohl(h->bitmask.ip) & ip_set_hostmask(e.cidr[1]));
221 218         ret = adtfn(set, &e, &ext, &ext, flags);
222 219         return ip_set_enomatch(ret, flags, adt, set) ? -ret :
223 220                ip_set_eexist(ret, flags) ? 0 : ret;
···
407 404     ip6_netmask(&e.ip[0], e.cidr[0]);
408 405     ip6_netmask(&e.ip[1], e.cidr[1]);
409 406
407 +     nf_inet_addr_mask_inplace(&e.ip[0], &h->bitmask);
408 +     nf_inet_addr_mask_inplace(&e.ip[1], &h->bitmask);
409 +     if (e.cidr[0] == HOST_MASK && ipv6_addr_any(&e.ip[0].in6))
410 +         return -EINVAL;
411 +
410 412     return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
411 413 }
412 414
···
422 414     ipset_adtfn adtfn = set->variant->adt[adt];
423 415     struct hash_netnet6_elem e = { };
424 416     struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
417 +     const struct hash_netnet6 *h = set->data;
425 418     int ret;
426 419
427 420     if (tb[IPSET_ATTR_LINENO])
···
462 453     ip6_netmask(&e.ip[0], e.cidr[0]);
463 454     ip6_netmask(&e.ip[1], e.cidr[1]);
464 455
456 +     nf_inet_addr_mask_inplace(&e.ip[0], &h->bitmask);
457 +     nf_inet_addr_mask_inplace(&e.ip[1], &h->bitmask);
458 +     if (e.cidr[0] == HOST_MASK && ipv6_addr_any(&e.ip[0].in6))
459 +         return -IPSET_ERR_HASH_ELEM;
460 +
465 461     if (tb[IPSET_ATTR_CADT_FLAGS]) {
466 462         u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
467 463
···
498 484     [IPSET_ATTR_RESIZE] = { .type = NLA_U8 },
499 485     [IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 },
500 486     [IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 },
487 +     [IPSET_ATTR_NETMASK] = { .type = NLA_U8 },
488 +     [IPSET_ATTR_BITMASK] = { .type = NLA_NESTED },
501 489 },
502 490 .adt_policy = {
503 491     [IPSET_ATTR_IP] = { .type = NLA_NESTED },
net/netfilter/ipvs/ip_vs_core.c  +22 -18
··· 132 132 133 133 s = this_cpu_ptr(dest->stats.cpustats); 134 134 u64_stats_update_begin(&s->syncp); 135 - s->cnt.inpkts++; 136 - s->cnt.inbytes += skb->len; 135 + u64_stats_inc(&s->cnt.inpkts); 136 + u64_stats_add(&s->cnt.inbytes, skb->len); 137 137 u64_stats_update_end(&s->syncp); 138 138 139 139 svc = rcu_dereference(dest->svc); 140 140 s = this_cpu_ptr(svc->stats.cpustats); 141 141 u64_stats_update_begin(&s->syncp); 142 - s->cnt.inpkts++; 143 - s->cnt.inbytes += skb->len; 142 + u64_stats_inc(&s->cnt.inpkts); 143 + u64_stats_add(&s->cnt.inbytes, skb->len); 144 144 u64_stats_update_end(&s->syncp); 145 145 146 - s = this_cpu_ptr(ipvs->tot_stats.cpustats); 146 + s = this_cpu_ptr(ipvs->tot_stats->s.cpustats); 147 147 u64_stats_update_begin(&s->syncp); 148 - s->cnt.inpkts++; 149 - s->cnt.inbytes += skb->len; 148 + u64_stats_inc(&s->cnt.inpkts); 149 + u64_stats_add(&s->cnt.inbytes, skb->len); 150 150 u64_stats_update_end(&s->syncp); 151 151 152 152 local_bh_enable(); ··· 168 168 169 169 s = this_cpu_ptr(dest->stats.cpustats); 170 170 u64_stats_update_begin(&s->syncp); 171 - s->cnt.outpkts++; 172 - s->cnt.outbytes += skb->len; 171 + u64_stats_inc(&s->cnt.outpkts); 172 + u64_stats_add(&s->cnt.outbytes, skb->len); 173 173 u64_stats_update_end(&s->syncp); 174 174 175 175 svc = rcu_dereference(dest->svc); 176 176 s = this_cpu_ptr(svc->stats.cpustats); 177 177 u64_stats_update_begin(&s->syncp); 178 - s->cnt.outpkts++; 179 - s->cnt.outbytes += skb->len; 178 + u64_stats_inc(&s->cnt.outpkts); 179 + u64_stats_add(&s->cnt.outbytes, skb->len); 180 180 u64_stats_update_end(&s->syncp); 181 181 182 - s = this_cpu_ptr(ipvs->tot_stats.cpustats); 182 + s = this_cpu_ptr(ipvs->tot_stats->s.cpustats); 183 183 u64_stats_update_begin(&s->syncp); 184 - s->cnt.outpkts++; 185 - s->cnt.outbytes += skb->len; 184 + u64_stats_inc(&s->cnt.outpkts); 185 + u64_stats_add(&s->cnt.outbytes, skb->len); 186 186 u64_stats_update_end(&s->syncp); 187 187 188 188 local_bh_enable(); ··· 200 200 201 201 s = 
this_cpu_ptr(cp->dest->stats.cpustats); 202 202 u64_stats_update_begin(&s->syncp); 203 - s->cnt.conns++; 203 + u64_stats_inc(&s->cnt.conns); 204 204 u64_stats_update_end(&s->syncp); 205 205 206 206 s = this_cpu_ptr(svc->stats.cpustats); 207 207 u64_stats_update_begin(&s->syncp); 208 - s->cnt.conns++; 208 + u64_stats_inc(&s->cnt.conns); 209 209 u64_stats_update_end(&s->syncp); 210 210 211 - s = this_cpu_ptr(ipvs->tot_stats.cpustats); 211 + s = this_cpu_ptr(ipvs->tot_stats->s.cpustats); 212 212 u64_stats_update_begin(&s->syncp); 213 - s->cnt.conns++; 213 + u64_stats_inc(&s->cnt.conns); 214 214 u64_stats_update_end(&s->syncp); 215 215 216 216 local_bh_enable(); ··· 2448 2448 ip_vs_conn_cleanup(); 2449 2449 ip_vs_protocol_cleanup(); 2450 2450 ip_vs_control_cleanup(); 2451 + /* common rcu_barrier() used by: 2452 + * - ip_vs_control_cleanup() 2453 + */ 2454 + rcu_barrier(); 2451 2455 pr_info("ipvs unloaded.\n"); 2452 2456 } 2453 2457
net/netfilter/ipvs/ip_vs_ctl.c  +362 -86
··· 49 49 50 50 MODULE_ALIAS_GENL_FAMILY(IPVS_GENL_NAME); 51 51 52 - /* semaphore for IPVS sockopts. And, [gs]etsockopt may sleep. */ 53 - static DEFINE_MUTEX(__ip_vs_mutex); 52 + DEFINE_MUTEX(__ip_vs_mutex); /* Serialize configuration with sockopt/netlink */ 54 53 55 54 /* sysctl variables */ 56 55 ··· 239 240 DEFENSE_TIMER_PERIOD); 240 241 } 241 242 #endif 243 + 244 + static void est_reload_work_handler(struct work_struct *work) 245 + { 246 + struct netns_ipvs *ipvs = 247 + container_of(work, struct netns_ipvs, est_reload_work.work); 248 + int genid_done = atomic_read(&ipvs->est_genid_done); 249 + unsigned long delay = HZ / 10; /* repeat startups after failure */ 250 + bool repeat = false; 251 + int genid; 252 + int id; 253 + 254 + mutex_lock(&ipvs->est_mutex); 255 + genid = atomic_read(&ipvs->est_genid); 256 + for (id = 0; id < ipvs->est_kt_count; id++) { 257 + struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id]; 258 + 259 + /* netns clean up started, abort delayed work */ 260 + if (!ipvs->enable) 261 + goto unlock; 262 + if (!kd) 263 + continue; 264 + /* New config ? 
Stop kthread tasks */ 265 + if (genid != genid_done) 266 + ip_vs_est_kthread_stop(kd); 267 + if (!kd->task && !ip_vs_est_stopped(ipvs)) { 268 + /* Do not start kthreads above 0 in calc phase */ 269 + if ((!id || !ipvs->est_calc_phase) && 270 + ip_vs_est_kthread_start(ipvs, kd) < 0) 271 + repeat = true; 272 + } 273 + } 274 + 275 + atomic_set(&ipvs->est_genid_done, genid); 276 + 277 + if (repeat) 278 + queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 279 + delay); 280 + 281 + unlock: 282 + mutex_unlock(&ipvs->est_mutex); 283 + } 242 284 243 285 int 244 286 ip_vs_use_count_inc(void) ··· 511 471 512 472 static void ip_vs_service_free(struct ip_vs_service *svc) 513 473 { 514 - free_percpu(svc->stats.cpustats); 474 + ip_vs_stats_release(&svc->stats); 515 475 kfree(svc); 516 476 } 517 477 ··· 523 483 ip_vs_service_free(svc); 524 484 } 525 485 526 - static void __ip_vs_svc_put(struct ip_vs_service *svc, bool do_delay) 486 + static void __ip_vs_svc_put(struct ip_vs_service *svc) 527 487 { 528 488 if (atomic_dec_and_test(&svc->refcnt)) { 529 489 IP_VS_DBG_BUF(3, "Removing service %u/%s:%u\n", 530 490 svc->fwmark, 531 491 IP_VS_DBG_ADDR(svc->af, &svc->addr), 532 492 ntohs(svc->port)); 533 - if (do_delay) 534 - call_rcu(&svc->rcu_head, ip_vs_service_rcu_free); 535 - else 536 - ip_vs_service_free(svc); 493 + call_rcu(&svc->rcu_head, ip_vs_service_rcu_free); 537 494 } 538 495 } 539 496 ··· 817 780 return dest; 818 781 } 819 782 783 + static void ip_vs_dest_rcu_free(struct rcu_head *head) 784 + { 785 + struct ip_vs_dest *dest; 786 + 787 + dest = container_of(head, struct ip_vs_dest, rcu_head); 788 + ip_vs_stats_release(&dest->stats); 789 + ip_vs_dest_put_and_free(dest); 790 + } 791 + 820 792 static void ip_vs_dest_free(struct ip_vs_dest *dest) 821 793 { 822 794 struct ip_vs_service *svc = rcu_dereference_protected(dest->svc, 1); 823 795 824 796 __ip_vs_dst_cache_reset(dest); 825 - __ip_vs_svc_put(svc, false); 826 - free_percpu(dest->stats.cpustats); 827 - 
ip_vs_dest_put_and_free(dest); 797 + __ip_vs_svc_put(svc); 798 + call_rcu(&dest->rcu_head, ip_vs_dest_rcu_free); 828 799 } 829 800 830 801 /* ··· 856 811 } 857 812 } 858 813 814 + static void ip_vs_stats_rcu_free(struct rcu_head *head) 815 + { 816 + struct ip_vs_stats_rcu *rs = container_of(head, 817 + struct ip_vs_stats_rcu, 818 + rcu_head); 819 + 820 + ip_vs_stats_release(&rs->s); 821 + kfree(rs); 822 + } 823 + 859 824 static void 860 825 ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src) 861 826 { 862 827 #define IP_VS_SHOW_STATS_COUNTER(c) dst->c = src->kstats.c - src->kstats0.c 863 828 864 - spin_lock_bh(&src->lock); 829 + spin_lock(&src->lock); 865 830 866 831 IP_VS_SHOW_STATS_COUNTER(conns); 867 832 IP_VS_SHOW_STATS_COUNTER(inpkts); ··· 881 826 882 827 ip_vs_read_estimator(dst, src); 883 828 884 - spin_unlock_bh(&src->lock); 829 + spin_unlock(&src->lock); 885 830 } 886 831 887 832 static void ··· 902 847 static void 903 848 ip_vs_zero_stats(struct ip_vs_stats *stats) 904 849 { 905 - spin_lock_bh(&stats->lock); 850 + spin_lock(&stats->lock); 906 851 907 852 /* get current counters as zero point, rates are zeroed */ 908 853 ··· 916 861 917 862 ip_vs_zero_estimator(stats); 918 863 919 - spin_unlock_bh(&stats->lock); 864 + spin_unlock(&stats->lock); 865 + } 866 + 867 + /* Allocate fields after kzalloc */ 868 + int ip_vs_stats_init_alloc(struct ip_vs_stats *s) 869 + { 870 + int i; 871 + 872 + spin_lock_init(&s->lock); 873 + s->cpustats = alloc_percpu(struct ip_vs_cpu_stats); 874 + if (!s->cpustats) 875 + return -ENOMEM; 876 + 877 + for_each_possible_cpu(i) { 878 + struct ip_vs_cpu_stats *cs = per_cpu_ptr(s->cpustats, i); 879 + 880 + u64_stats_init(&cs->syncp); 881 + } 882 + return 0; 883 + } 884 + 885 + struct ip_vs_stats *ip_vs_stats_alloc(void) 886 + { 887 + struct ip_vs_stats *s = kzalloc(sizeof(*s), GFP_KERNEL); 888 + 889 + if (s && ip_vs_stats_init_alloc(s) >= 0) 890 + return s; 891 + kfree(s); 892 + return NULL; 893 + } 894 + 895 + void 
ip_vs_stats_release(struct ip_vs_stats *stats) 896 + { 897 + free_percpu(stats->cpustats); 898 + } 899 + 900 + void ip_vs_stats_free(struct ip_vs_stats *stats) 901 + { 902 + if (stats) { 903 + ip_vs_stats_release(stats); 904 + kfree(stats); 905 + } 920 906 } 921 907 922 908 /* ··· 1019 923 if (old_svc != svc) { 1020 924 ip_vs_zero_stats(&dest->stats); 1021 925 __ip_vs_bind_svc(dest, svc); 1022 - __ip_vs_svc_put(old_svc, true); 926 + __ip_vs_svc_put(old_svc); 1023 927 } 1024 928 } 1025 929 ··· 1038 942 spin_unlock_bh(&dest->dst_lock); 1039 943 1040 944 if (add) { 1041 - ip_vs_start_estimator(svc->ipvs, &dest->stats); 1042 945 list_add_rcu(&dest->n_list, &svc->destinations); 1043 946 svc->num_dests++; 1044 947 sched = rcu_dereference_protected(svc->scheduler, 1); ··· 1058 963 ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest) 1059 964 { 1060 965 struct ip_vs_dest *dest; 1061 - unsigned int atype, i; 966 + unsigned int atype; 967 + int ret; 1062 968 1063 969 EnterFunction(2); 1064 970 1065 971 #ifdef CONFIG_IP_VS_IPV6 1066 972 if (udest->af == AF_INET6) { 1067 - int ret; 1068 - 1069 973 atype = ipv6_addr_type(&udest->addr.in6); 1070 974 if ((!(atype & IPV6_ADDR_UNICAST) || 1071 975 atype & IPV6_ADDR_LINKLOCAL) && ··· 1086 992 if (dest == NULL) 1087 993 return -ENOMEM; 1088 994 1089 - dest->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats); 1090 - if (!dest->stats.cpustats) 995 + ret = ip_vs_stats_init_alloc(&dest->stats); 996 + if (ret < 0) 1091 997 goto err_alloc; 1092 998 1093 - for_each_possible_cpu(i) { 1094 - struct ip_vs_cpu_stats *ip_vs_dest_stats; 1095 - ip_vs_dest_stats = per_cpu_ptr(dest->stats.cpustats, i); 1096 - u64_stats_init(&ip_vs_dest_stats->syncp); 1097 - } 999 + ret = ip_vs_start_estimator(svc->ipvs, &dest->stats); 1000 + if (ret < 0) 1001 + goto err_stats; 1098 1002 1099 1003 dest->af = udest->af; 1100 1004 dest->protocol = svc->protocol; ··· 1109 1017 1110 1018 INIT_HLIST_NODE(&dest->d_list); 1111 1019 
spin_lock_init(&dest->dst_lock); 1112 - spin_lock_init(&dest->stats.lock); 1113 1020 __ip_vs_update_dest(svc, dest, udest, 1); 1114 1021 1115 1022 LeaveFunction(2); 1116 1023 return 0; 1117 1024 1025 + err_stats: 1026 + ip_vs_stats_release(&dest->stats); 1027 + 1118 1028 err_alloc: 1119 1029 kfree(dest); 1120 - return -ENOMEM; 1030 + return ret; 1121 1031 } 1122 1032 1123 1033 ··· 1181 1087 IP_VS_DBG_ADDR(svc->af, &dest->vaddr), 1182 1088 ntohs(dest->vport)); 1183 1089 1090 + ret = ip_vs_start_estimator(svc->ipvs, &dest->stats); 1091 + if (ret < 0) 1092 + goto err; 1184 1093 __ip_vs_update_dest(svc, dest, udest, 1); 1185 - ret = 0; 1186 1094 } else { 1187 1095 /* 1188 1096 * Allocate and initialize the dest structure 1189 1097 */ 1190 1098 ret = ip_vs_new_dest(svc, udest); 1191 1099 } 1100 + 1101 + err: 1192 1102 LeaveFunction(2); 1193 1103 1194 1104 return ret; ··· 1382 1284 ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u, 1383 1285 struct ip_vs_service **svc_p) 1384 1286 { 1385 - int ret = 0, i; 1287 + int ret = 0; 1386 1288 struct ip_vs_scheduler *sched = NULL; 1387 1289 struct ip_vs_pe *pe = NULL; 1388 1290 struct ip_vs_service *svc = NULL; ··· 1442 1344 ret = -ENOMEM; 1443 1345 goto out_err; 1444 1346 } 1445 - svc->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats); 1446 - if (!svc->stats.cpustats) { 1447 - ret = -ENOMEM; 1347 + ret = ip_vs_stats_init_alloc(&svc->stats); 1348 + if (ret < 0) 1448 1349 goto out_err; 1449 - } 1450 - 1451 - for_each_possible_cpu(i) { 1452 - struct ip_vs_cpu_stats *ip_vs_stats; 1453 - ip_vs_stats = per_cpu_ptr(svc->stats.cpustats, i); 1454 - u64_stats_init(&ip_vs_stats->syncp); 1455 - } 1456 - 1457 1350 1458 1351 /* I'm the first user of the service */ 1459 1352 atomic_set(&svc->refcnt, 0); ··· 1461 1372 1462 1373 INIT_LIST_HEAD(&svc->destinations); 1463 1374 spin_lock_init(&svc->sched_lock); 1464 - spin_lock_init(&svc->stats.lock); 1465 1375 1466 1376 /* Bind the scheduler */ 1467 1377 if (sched) { 
··· 1469 1381 goto out_err; 1470 1382 sched = NULL; 1471 1383 } 1384 + 1385 + ret = ip_vs_start_estimator(ipvs, &svc->stats); 1386 + if (ret < 0) 1387 + goto out_err; 1472 1388 1473 1389 /* Bind the ct retriever */ 1474 1390 RCU_INIT_POINTER(svc->pe, pe); ··· 1486 1394 if (svc->pe && svc->pe->conn_out) 1487 1395 atomic_inc(&ipvs->conn_out_counter); 1488 1396 1489 - ip_vs_start_estimator(ipvs, &svc->stats); 1490 - 1491 1397 /* Count only IPv4 services for old get/setsockopt interface */ 1492 1398 if (svc->af == AF_INET) 1493 1399 ipvs->num_services++; ··· 1496 1406 ip_vs_svc_hash(svc); 1497 1407 1498 1408 *svc_p = svc; 1499 - /* Now there is a service - full throttle */ 1500 - ipvs->enable = 1; 1409 + 1410 + if (!ipvs->enable) { 1411 + /* Now there is a service - full throttle */ 1412 + ipvs->enable = 1; 1413 + 1414 + /* Start estimation for first time */ 1415 + ip_vs_est_reload_start(ipvs); 1416 + } 1417 + 1501 1418 return 0; 1502 1419 1503 1420 ··· 1668 1571 /* 1669 1572 * Free the service if nobody refers to it 1670 1573 */ 1671 - __ip_vs_svc_put(svc, true); 1574 + __ip_vs_svc_put(svc); 1672 1575 1673 1576 /* decrease the module use count */ 1674 1577 ip_vs_use_count_dec(); ··· 1858 1761 } 1859 1762 } 1860 1763 1861 - ip_vs_zero_stats(&ipvs->tot_stats); 1764 + ip_vs_zero_stats(&ipvs->tot_stats->s); 1862 1765 return 0; 1863 1766 } 1864 1767 ··· 1938 1841 *valp = val; 1939 1842 } 1940 1843 return rc; 1844 + } 1845 + 1846 + static int ipvs_proc_est_cpumask_set(struct ctl_table *table, void *buffer) 1847 + { 1848 + struct netns_ipvs *ipvs = table->extra2; 1849 + cpumask_var_t *valp = table->data; 1850 + cpumask_var_t newmask; 1851 + int ret; 1852 + 1853 + if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) 1854 + return -ENOMEM; 1855 + 1856 + ret = cpulist_parse(buffer, newmask); 1857 + if (ret) 1858 + goto out; 1859 + 1860 + mutex_lock(&ipvs->est_mutex); 1861 + 1862 + if (!ipvs->est_cpulist_valid) { 1863 + if (!zalloc_cpumask_var(valp, GFP_KERNEL)) { 1864 + ret = 
-ENOMEM; 1865 + goto unlock; 1866 + } 1867 + ipvs->est_cpulist_valid = 1; 1868 + } 1869 + cpumask_and(newmask, newmask, &current->cpus_mask); 1870 + cpumask_copy(*valp, newmask); 1871 + /* est_max_threads may depend on cpulist size */ 1872 + ipvs->est_max_threads = ip_vs_est_max_threads(ipvs); 1873 + ipvs->est_calc_phase = 1; 1874 + ip_vs_est_reload_start(ipvs); 1875 + 1876 + unlock: 1877 + mutex_unlock(&ipvs->est_mutex); 1878 + 1879 + out: 1880 + free_cpumask_var(newmask); 1881 + return ret; 1882 + } 1883 + 1884 + static int ipvs_proc_est_cpumask_get(struct ctl_table *table, void *buffer, 1885 + size_t size) 1886 + { 1887 + struct netns_ipvs *ipvs = table->extra2; 1888 + cpumask_var_t *valp = table->data; 1889 + struct cpumask *mask; 1890 + int ret; 1891 + 1892 + mutex_lock(&ipvs->est_mutex); 1893 + 1894 + if (ipvs->est_cpulist_valid) 1895 + mask = *valp; 1896 + else 1897 + mask = (struct cpumask *)housekeeping_cpumask(HK_TYPE_KTHREAD); 1898 + ret = scnprintf(buffer, size, "%*pbl\n", cpumask_pr_args(mask)); 1899 + 1900 + mutex_unlock(&ipvs->est_mutex); 1901 + 1902 + return ret; 1903 + } 1904 + 1905 + static int ipvs_proc_est_cpulist(struct ctl_table *table, int write, 1906 + void *buffer, size_t *lenp, loff_t *ppos) 1907 + { 1908 + int ret; 1909 + 1910 + /* Ignore both read and write(append) if *ppos not 0 */ 1911 + if (*ppos || !*lenp) { 1912 + *lenp = 0; 1913 + return 0; 1914 + } 1915 + if (write) { 1916 + /* proc_sys_call_handler() appends terminator */ 1917 + ret = ipvs_proc_est_cpumask_set(table, buffer); 1918 + if (ret >= 0) 1919 + *ppos += *lenp; 1920 + } else { 1921 + /* proc_sys_call_handler() allocates 1 byte for terminator */ 1922 + ret = ipvs_proc_est_cpumask_get(table, buffer, *lenp + 1); 1923 + if (ret >= 0) { 1924 + *lenp = ret; 1925 + *ppos += *lenp; 1926 + ret = 0; 1927 + } 1928 + } 1929 + return ret; 1930 + } 1931 + 1932 + static int ipvs_proc_est_nice(struct ctl_table *table, int write, 1933 + void *buffer, size_t *lenp, loff_t *ppos) 1934 + { 
1935 + struct netns_ipvs *ipvs = table->extra2; 1936 + int *valp = table->data; 1937 + int val = *valp; 1938 + int ret; 1939 + 1940 + struct ctl_table tmp_table = { 1941 + .data = &val, 1942 + .maxlen = sizeof(int), 1943 + .mode = table->mode, 1944 + }; 1945 + 1946 + ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos); 1947 + if (write && ret >= 0) { 1948 + if (val < MIN_NICE || val > MAX_NICE) { 1949 + ret = -EINVAL; 1950 + } else { 1951 + mutex_lock(&ipvs->est_mutex); 1952 + if (*valp != val) { 1953 + *valp = val; 1954 + ip_vs_est_reload_start(ipvs); 1955 + } 1956 + mutex_unlock(&ipvs->est_mutex); 1957 + } 1958 + } 1959 + return ret; 1960 + } 1961 + 1962 + static int ipvs_proc_run_estimation(struct ctl_table *table, int write, 1963 + void *buffer, size_t *lenp, loff_t *ppos) 1964 + { 1965 + struct netns_ipvs *ipvs = table->extra2; 1966 + int *valp = table->data; 1967 + int val = *valp; 1968 + int ret; 1969 + 1970 + struct ctl_table tmp_table = { 1971 + .data = &val, 1972 + .maxlen = sizeof(int), 1973 + .mode = table->mode, 1974 + }; 1975 + 1976 + ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos); 1977 + if (write && ret >= 0) { 1978 + mutex_lock(&ipvs->est_mutex); 1979 + if (*valp != val) { 1980 + *valp = val; 1981 + ip_vs_est_reload_start(ipvs); 1982 + } 1983 + mutex_unlock(&ipvs->est_mutex); 1984 + } 1985 + return ret; 1941 1986 } 1942 1987 1943 1988 /* ··· 2256 2017 .procname = "run_estimation", 2257 2018 .maxlen = sizeof(int), 2258 2019 .mode = 0644, 2259 - .proc_handler = proc_dointvec, 2020 + .proc_handler = ipvs_proc_run_estimation, 2021 + }, 2022 + { 2023 + .procname = "est_cpulist", 2024 + .maxlen = NR_CPUS, /* unused */ 2025 + .mode = 0644, 2026 + .proc_handler = ipvs_proc_est_cpulist, 2027 + }, 2028 + { 2029 + .procname = "est_nice", 2030 + .maxlen = sizeof(int), 2031 + .mode = 0644, 2032 + .proc_handler = ipvs_proc_est_nice, 2260 2033 }, 2261 2034 #ifdef CONFIG_IP_VS_DEBUG 2262 2035 { ··· 2506 2255 seq_puts(seq, 2507 2256 " Conns 
Packets Packets Bytes Bytes\n"); 2508 2257 2509 - ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats); 2258 + ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats->s); 2510 2259 seq_printf(seq, "%8LX %8LX %8LX %16LX %16LX\n\n", 2511 2260 (unsigned long long)show.conns, 2512 2261 (unsigned long long)show.inpkts, ··· 2530 2279 static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v) 2531 2280 { 2532 2281 struct net *net = seq_file_single_net(seq); 2533 - struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats; 2282 + struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats->s; 2534 2283 struct ip_vs_cpu_stats __percpu *cpustats = tot_stats->cpustats; 2535 2284 struct ip_vs_kstats kstats; 2536 2285 int i; ··· 2548 2297 2549 2298 do { 2550 2299 start = u64_stats_fetch_begin(&u->syncp); 2551 - conns = u->cnt.conns; 2552 - inpkts = u->cnt.inpkts; 2553 - outpkts = u->cnt.outpkts; 2554 - inbytes = u->cnt.inbytes; 2555 - outbytes = u->cnt.outbytes; 2300 + conns = u64_stats_read(&u->cnt.conns); 2301 + inpkts = u64_stats_read(&u->cnt.inpkts); 2302 + outpkts = u64_stats_read(&u->cnt.outpkts); 2303 + inbytes = u64_stats_read(&u->cnt.inbytes); 2304 + outbytes = u64_stats_read(&u->cnt.outbytes); 2556 2305 } while (u64_stats_fetch_retry(&u->syncp, start)); 2557 2306 2558 2307 seq_printf(seq, "%3X %8LX %8LX %8LX %16LX %16LX\n", ··· 4278 4027 static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs) 4279 4028 { 4280 4029 struct net *net = ipvs->net; 4281 - int idx; 4282 4030 struct ctl_table *tbl; 4031 + int idx, ret; 4283 4032 4284 4033 atomic_set(&ipvs->dropentry, 0); 4285 4034 spin_lock_init(&ipvs->dropentry_lock); 4286 4035 spin_lock_init(&ipvs->droppacket_lock); 4287 4036 spin_lock_init(&ipvs->securetcp_lock); 4037 + INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler); 4038 + INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work, 4039 + expire_nodest_conn_handler); 4040 + ipvs->est_stopped = 0; 4288 4041 4289 4042 if (!net_eq(net, &init_net)) { 
4290 4043 tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL); ··· 4349 4094 tbl[idx++].data = &ipvs->sysctl_schedule_icmp; 4350 4095 tbl[idx++].data = &ipvs->sysctl_ignore_tunneled; 4351 4096 ipvs->sysctl_run_estimation = 1; 4097 + tbl[idx].extra2 = ipvs; 4352 4098 tbl[idx++].data = &ipvs->sysctl_run_estimation; 4099 + 4100 + ipvs->est_cpulist_valid = 0; 4101 + tbl[idx].extra2 = ipvs; 4102 + tbl[idx++].data = &ipvs->sysctl_est_cpulist; 4103 + 4104 + ipvs->sysctl_est_nice = IPVS_EST_NICE; 4105 + tbl[idx].extra2 = ipvs; 4106 + tbl[idx++].data = &ipvs->sysctl_est_nice; 4107 + 4353 4108 #ifdef CONFIG_IP_VS_DEBUG 4354 4109 /* Global sysctls must be ro in non-init netns */ 4355 4110 if (!net_eq(net, &init_net)) 4356 4111 tbl[idx++].mode = 0444; 4357 4112 #endif 4358 4113 4114 + ret = -ENOMEM; 4359 4115 ipvs->sysctl_hdr = register_net_sysctl(net, "net/ipv4/vs", tbl); 4360 - if (ipvs->sysctl_hdr == NULL) { 4361 - if (!net_eq(net, &init_net)) 4362 - kfree(tbl); 4363 - return -ENOMEM; 4364 - } 4365 - ip_vs_start_estimator(ipvs, &ipvs->tot_stats); 4116 + if (!ipvs->sysctl_hdr) 4117 + goto err; 4366 4118 ipvs->sysctl_tbl = tbl; 4119 + 4120 + ret = ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s); 4121 + if (ret < 0) 4122 + goto err; 4123 + 4367 4124 /* Schedule defense work */ 4368 - INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler); 4369 4125 queue_delayed_work(system_long_wq, &ipvs->defense_work, 4370 4126 DEFENSE_TIMER_PERIOD); 4371 4127 4372 - /* Init delayed work for expiring no dest conn */ 4373 - INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work, 4374 - expire_nodest_conn_handler); 4375 - 4376 4128 return 0; 4129 + 4130 + err: 4131 + unregister_net_sysctl_table(ipvs->sysctl_hdr); 4132 + if (!net_eq(net, &init_net)) 4133 + kfree(tbl); 4134 + return ret; 4377 4135 } 4378 4136 4379 4137 static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs) ··· 4397 4129 cancel_delayed_work_sync(&ipvs->defense_work); 4398 4130 
cancel_work_sync(&ipvs->defense_work.work); 4399 4131 unregister_net_sysctl_table(ipvs->sysctl_hdr); 4400 - ip_vs_stop_estimator(ipvs, &ipvs->tot_stats); 4132 + ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s); 4133 + 4134 + if (ipvs->est_cpulist_valid) 4135 + free_cpumask_var(ipvs->sysctl_est_cpulist); 4401 4136 4402 4137 if (!net_eq(net, &init_net)) 4403 4138 kfree(ipvs->sysctl_tbl); ··· 4422 4151 4423 4152 int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs) 4424 4153 { 4425 - int i, idx; 4154 + int ret = -ENOMEM; 4155 + int idx; 4426 4156 4427 4157 /* Initialize rs_table */ 4428 4158 for (idx = 0; idx < IP_VS_RTAB_SIZE; idx++) ··· 4436 4164 atomic_set(&ipvs->nullsvc_counter, 0); 4437 4165 atomic_set(&ipvs->conn_out_counter, 0); 4438 4166 4167 + INIT_DELAYED_WORK(&ipvs->est_reload_work, est_reload_work_handler); 4168 + 4439 4169 /* procfs stats */ 4440 - ipvs->tot_stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats); 4441 - if (!ipvs->tot_stats.cpustats) 4442 - return -ENOMEM; 4443 - 4444 - for_each_possible_cpu(i) { 4445 - struct ip_vs_cpu_stats *ipvs_tot_stats; 4446 - ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats.cpustats, i); 4447 - u64_stats_init(&ipvs_tot_stats->syncp); 4448 - } 4449 - 4450 - spin_lock_init(&ipvs->tot_stats.lock); 4170 + ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL); 4171 + if (!ipvs->tot_stats) 4172 + goto out; 4173 + if (ip_vs_stats_init_alloc(&ipvs->tot_stats->s) < 0) 4174 + goto err_tot_stats; 4451 4175 4452 4176 #ifdef CONFIG_PROC_FS 4453 4177 if (!proc_create_net("ip_vs", 0, ipvs->net->proc_net, ··· 4458 4190 goto err_percpu; 4459 4191 #endif 4460 4192 4461 - if (ip_vs_control_net_init_sysctl(ipvs)) 4193 + ret = ip_vs_control_net_init_sysctl(ipvs); 4194 + if (ret < 0) 4462 4195 goto err; 4463 4196 4464 4197 return 0; ··· 4476 4207 4477 4208 err_vs: 4478 4209 #endif 4479 - free_percpu(ipvs->tot_stats.cpustats); 4480 - return -ENOMEM; 4210 + ip_vs_stats_release(&ipvs->tot_stats->s); 4211 + 4212 + 
err_tot_stats: 4213 + kfree(ipvs->tot_stats); 4214 + 4215 + out: 4216 + return ret; 4481 4217 } 4482 4218 4483 4219 void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs) 4484 4220 { 4485 4221 ip_vs_trash_cleanup(ipvs); 4486 4222 ip_vs_control_net_cleanup_sysctl(ipvs); 4223 + cancel_delayed_work_sync(&ipvs->est_reload_work); 4487 4224 #ifdef CONFIG_PROC_FS 4488 4225 remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net); 4489 4226 remove_proc_entry("ip_vs_stats", ipvs->net->proc_net); 4490 4227 remove_proc_entry("ip_vs", ipvs->net->proc_net); 4491 4228 #endif 4492 - free_percpu(ipvs->tot_stats.cpustats); 4229 + call_rcu(&ipvs->tot_stats->rcu_head, ip_vs_stats_rcu_free); 4493 4230 } 4494 4231 4495 4232 int __init ip_vs_register_nl_ioctl(void) ··· 4555 4280 { 4556 4281 EnterFunction(2); 4557 4282 unregister_netdevice_notifier(&ip_vs_dst_notifier); 4283 + /* relying on common rcu_barrier() in ip_vs_cleanup() */ 4558 4284 LeaveFunction(2); 4559 4285 }
net/netfilter/ipvs/ip_vs_est.c  +815 -69
··· 30 30 long interval, it is easy to implement a user level daemon which 31 31 periodically reads those statistical counters and measure rate. 32 32 33 - Currently, the measurement is activated by slow timer handler. Hope 34 - this measurement will not introduce too much load. 35 - 36 33 We measure rate during the last 8 seconds every 2 seconds: 37 34 38 35 avgrate = avgrate*(1-W) + rate*W ··· 44 47 to 32-bit values for conns, packets, bps, cps and pps. 45 48 46 49 * A lot of code is taken from net/core/gen_estimator.c 50 + 51 + KEY POINTS: 52 + - cpustats counters are updated per-cpu in SoftIRQ context with BH disabled 53 + - kthreads read the cpustats to update the estimators (svcs, dests, total) 54 + - the states of estimators can be read (get stats) or modified (zero stats) 55 + from processes 56 + 57 + KTHREADS: 58 + - estimators are added initially to est_temp_list and later kthread 0 59 + distributes them to one or many kthreads for estimation 60 + - kthread contexts are created and attached to array 61 + - the kthread tasks are started when first service is added, before that 62 + the total stats are not estimated 63 + - when configuration (cpulist/nice) is changed, the tasks are restarted 64 + by work (est_reload_work) 65 + - kthread tasks are stopped while the cpulist is empty 66 + - the kthread context holds lists with estimators (chains) which are 67 + processed every 2 seconds 68 + - as estimators can be added dynamically and in bursts, we try to spread 69 + them to multiple chains which are estimated at different time 70 + - on start, kthread 0 enters calculation phase to determine the chain limits 71 + and the limit of estimators per kthread 72 + - est_add_ktid: ktid where to add new ests, can point to empty slot where 73 + we should add kt data 47 74 */ 48 75 76 + static struct lock_class_key __ipvs_est_key; 49 77 50 - /* 51 - * Make a summary from each cpu 52 - */ 53 - static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum, 54 - struct 
ip_vs_cpu_stats __percpu *stats) 55 - { 56 - int i; 57 - bool add = false; 78 + static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs); 79 + static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs); 58 80 59 - for_each_possible_cpu(i) { 60 - struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i); 61 - unsigned int start; 62 - u64 conns, inpkts, outpkts, inbytes, outbytes; 63 - 64 - if (add) { 65 - do { 66 - start = u64_stats_fetch_begin(&s->syncp); 67 - conns = s->cnt.conns; 68 - inpkts = s->cnt.inpkts; 69 - outpkts = s->cnt.outpkts; 70 - inbytes = s->cnt.inbytes; 71 - outbytes = s->cnt.outbytes; 72 - } while (u64_stats_fetch_retry(&s->syncp, start)); 73 - sum->conns += conns; 74 - sum->inpkts += inpkts; 75 - sum->outpkts += outpkts; 76 - sum->inbytes += inbytes; 77 - sum->outbytes += outbytes; 78 - } else { 79 - add = true; 80 - do { 81 - start = u64_stats_fetch_begin(&s->syncp); 82 - sum->conns = s->cnt.conns; 83 - sum->inpkts = s->cnt.inpkts; 84 - sum->outpkts = s->cnt.outpkts; 85 - sum->inbytes = s->cnt.inbytes; 86 - sum->outbytes = s->cnt.outbytes; 87 - } while (u64_stats_fetch_retry(&s->syncp, start)); 88 - } 89 - } 90 - } 91 - 92 - 93 - static void estimation_timer(struct timer_list *t) 81 + static void ip_vs_chain_estimation(struct hlist_head *chain) 94 82 { 95 83 struct ip_vs_estimator *e; 84 + struct ip_vs_cpu_stats *c; 96 85 struct ip_vs_stats *s; 97 86 u64 rate; 98 - struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer); 99 87 100 - if (!sysctl_run_estimation(ipvs)) 101 - goto skip; 88 + hlist_for_each_entry_rcu(e, chain, list) { 89 + u64 conns, inpkts, outpkts, inbytes, outbytes; 90 + u64 kconns = 0, kinpkts = 0, koutpkts = 0; 91 + u64 kinbytes = 0, koutbytes = 0; 92 + unsigned int start; 93 + int i; 102 94 103 - spin_lock(&ipvs->est_lock); 104 - list_for_each_entry(e, &ipvs->est_list, list) { 95 + if (kthread_should_stop()) 96 + break; 97 + 105 98 s = container_of(e, struct ip_vs_stats, est); 99 + for_each_possible_cpu(i) { 100 + c = 
per_cpu_ptr(s->cpustats, i); 101 + do { 102 + start = u64_stats_fetch_begin(&c->syncp); 103 + conns = u64_stats_read(&c->cnt.conns); 104 + inpkts = u64_stats_read(&c->cnt.inpkts); 105 + outpkts = u64_stats_read(&c->cnt.outpkts); 106 + inbytes = u64_stats_read(&c->cnt.inbytes); 107 + outbytes = u64_stats_read(&c->cnt.outbytes); 108 + } while (u64_stats_fetch_retry(&c->syncp, start)); 109 + kconns += conns; 110 + kinpkts += inpkts; 111 + koutpkts += outpkts; 112 + kinbytes += inbytes; 113 + koutbytes += outbytes; 114 + } 106 115 107 116 spin_lock(&s->lock); 108 - ip_vs_read_cpu_stats(&s->kstats, s->cpustats); 117 + 118 + s->kstats.conns = kconns; 119 + s->kstats.inpkts = kinpkts; 120 + s->kstats.outpkts = koutpkts; 121 + s->kstats.inbytes = kinbytes; 122 + s->kstats.outbytes = koutbytes; 109 123 110 124 /* scaled by 2^10, but divided 2 seconds */ 111 125 rate = (s->kstats.conns - e->last_conns) << 9; ··· 141 133 e->outbps += ((s64)rate - (s64)e->outbps) >> 2; 142 134 spin_unlock(&s->lock); 143 135 } 144 - spin_unlock(&ipvs->est_lock); 145 - 146 - skip: 147 - mod_timer(&ipvs->est_timer, jiffies + 2*HZ); 148 136 } 149 137 150 - void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats) 138 + static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row) 139 + { 140 + struct ip_vs_est_tick_data *td; 141 + int cid; 142 + 143 + rcu_read_lock(); 144 + td = rcu_dereference(kd->ticks[row]); 145 + if (!td) 146 + goto out; 147 + for_each_set_bit(cid, td->present, IPVS_EST_TICK_CHAINS) { 148 + if (kthread_should_stop()) 149 + break; 150 + ip_vs_chain_estimation(&td->chains[cid]); 151 + cond_resched_rcu(); 152 + td = rcu_dereference(kd->ticks[row]); 153 + if (!td) 154 + break; 155 + } 156 + 157 + out: 158 + rcu_read_unlock(); 159 + } 160 + 161 + static int ip_vs_estimation_kthread(void *data) 162 + { 163 + struct ip_vs_est_kt_data *kd = data; 164 + struct netns_ipvs *ipvs = kd->ipvs; 165 + int row = kd->est_row; 166 + unsigned long now; 167 + int id 
= kd->id; 168 + long gap; 169 + 170 + if (id > 0) { 171 + if (!ipvs->est_chain_max) 172 + return 0; 173 + } else { 174 + if (!ipvs->est_chain_max) { 175 + ipvs->est_calc_phase = 1; 176 + /* commit est_calc_phase before reading est_genid */ 177 + smp_mb(); 178 + } 179 + 180 + /* kthread 0 will handle the calc phase */ 181 + if (ipvs->est_calc_phase) 182 + ip_vs_est_calc_phase(ipvs); 183 + } 184 + 185 + while (1) { 186 + if (!id && !hlist_empty(&ipvs->est_temp_list)) 187 + ip_vs_est_drain_temp_list(ipvs); 188 + set_current_state(TASK_IDLE); 189 + if (kthread_should_stop()) 190 + break; 191 + 192 + /* before estimation, check if we should sleep */ 193 + now = jiffies; 194 + gap = kd->est_timer - now; 195 + if (gap > 0) { 196 + if (gap > IPVS_EST_TICK) { 197 + kd->est_timer = now - IPVS_EST_TICK; 198 + gap = IPVS_EST_TICK; 199 + } 200 + schedule_timeout(gap); 201 + } else { 202 + __set_current_state(TASK_RUNNING); 203 + if (gap < -8 * IPVS_EST_TICK) 204 + kd->est_timer = now; 205 + } 206 + 207 + if (kd->tick_len[row]) 208 + ip_vs_tick_estimation(kd, row); 209 + 210 + row++; 211 + if (row >= IPVS_EST_NTICKS) 212 + row = 0; 213 + WRITE_ONCE(kd->est_row, row); 214 + kd->est_timer += IPVS_EST_TICK; 215 + } 216 + __set_current_state(TASK_RUNNING); 217 + 218 + return 0; 219 + } 220 + 221 + /* Schedule stop/start for kthread tasks */ 222 + void ip_vs_est_reload_start(struct netns_ipvs *ipvs) 223 + { 224 + /* Ignore reloads before first service is added */ 225 + if (!ipvs->enable) 226 + return; 227 + ip_vs_est_stopped_recalc(ipvs); 228 + /* Bump the kthread configuration genid */ 229 + atomic_inc(&ipvs->est_genid); 230 + queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0); 231 + } 232 + 233 + /* Start kthread task with current configuration */ 234 + int ip_vs_est_kthread_start(struct netns_ipvs *ipvs, 235 + struct ip_vs_est_kt_data *kd) 236 + { 237 + unsigned long now; 238 + int ret = 0; 239 + long gap; 240 + 241 + lockdep_assert_held(&ipvs->est_mutex); 242 + 243 + 
if (kd->task) 244 + goto out; 245 + now = jiffies; 246 + gap = kd->est_timer - now; 247 + /* Sync est_timer if task is starting later */ 248 + if (abs(gap) > 4 * IPVS_EST_TICK) 249 + kd->est_timer = now; 250 + kd->task = kthread_create(ip_vs_estimation_kthread, kd, "ipvs-e:%d:%d", 251 + ipvs->gen, kd->id); 252 + if (IS_ERR(kd->task)) { 253 + ret = PTR_ERR(kd->task); 254 + kd->task = NULL; 255 + goto out; 256 + } 257 + 258 + set_user_nice(kd->task, sysctl_est_nice(ipvs)); 259 + set_cpus_allowed_ptr(kd->task, sysctl_est_cpulist(ipvs)); 260 + 261 + pr_info("starting estimator thread %d...\n", kd->id); 262 + wake_up_process(kd->task); 263 + 264 + out: 265 + return ret; 266 + } 267 + 268 + void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd) 269 + { 270 + if (kd->task) { 271 + pr_info("stopping estimator thread %d...\n", kd->id); 272 + kthread_stop(kd->task); 273 + kd->task = NULL; 274 + } 275 + } 276 + 277 + /* Apply parameters to kthread */ 278 + static void ip_vs_est_set_params(struct netns_ipvs *ipvs, 279 + struct ip_vs_est_kt_data *kd) 280 + { 281 + kd->chain_max = ipvs->est_chain_max; 282 + /* We are using single chain on RCU preemption */ 283 + if (IPVS_EST_TICK_CHAINS == 1) 284 + kd->chain_max *= IPVS_EST_CHAIN_FACTOR; 285 + kd->tick_max = IPVS_EST_TICK_CHAINS * kd->chain_max; 286 + kd->est_max_count = IPVS_EST_NTICKS * kd->tick_max; 287 + } 288 + 289 + /* Create and start estimation kthread in a free or new array slot */ 290 + static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs) 291 + { 292 + struct ip_vs_est_kt_data *kd = NULL; 293 + int id = ipvs->est_kt_count; 294 + int ret = -ENOMEM; 295 + void *arr = NULL; 296 + int i; 297 + 298 + if ((unsigned long)ipvs->est_kt_count >= ipvs->est_max_threads && 299 + ipvs->enable && ipvs->est_max_threads) 300 + return -EINVAL; 301 + 302 + mutex_lock(&ipvs->est_mutex); 303 + 304 + for (i = 0; i < id; i++) { 305 + if (!ipvs->est_kt_arr[i]) 306 + break; 307 + } 308 + if (i >= id) { 309 + arr = 
krealloc_array(ipvs->est_kt_arr, id + 1, 310 + sizeof(struct ip_vs_est_kt_data *), 311 + GFP_KERNEL); 312 + if (!arr) 313 + goto out; 314 + ipvs->est_kt_arr = arr; 315 + } else { 316 + id = i; 317 + } 318 + 319 + kd = kzalloc(sizeof(*kd), GFP_KERNEL); 320 + if (!kd) 321 + goto out; 322 + kd->ipvs = ipvs; 323 + bitmap_fill(kd->avail, IPVS_EST_NTICKS); 324 + kd->est_timer = jiffies; 325 + kd->id = id; 326 + ip_vs_est_set_params(ipvs, kd); 327 + 328 + /* Pre-allocate stats used in calc phase */ 329 + if (!id && !kd->calc_stats) { 330 + kd->calc_stats = ip_vs_stats_alloc(); 331 + if (!kd->calc_stats) 332 + goto out; 333 + } 334 + 335 + /* Start kthread tasks only when services are present */ 336 + if (ipvs->enable && !ip_vs_est_stopped(ipvs)) { 337 + ret = ip_vs_est_kthread_start(ipvs, kd); 338 + if (ret < 0) 339 + goto out; 340 + } 341 + 342 + if (arr) 343 + ipvs->est_kt_count++; 344 + ipvs->est_kt_arr[id] = kd; 345 + kd = NULL; 346 + /* Use most recent kthread for new ests */ 347 + ipvs->est_add_ktid = id; 348 + ret = 0; 349 + 350 + out: 351 + mutex_unlock(&ipvs->est_mutex); 352 + if (kd) { 353 + ip_vs_stats_free(kd->calc_stats); 354 + kfree(kd); 355 + } 356 + 357 + return ret; 358 + } 359 + 360 + /* Select ktid where to add new ests: available, unused or new slot */ 361 + static void ip_vs_est_update_ktid(struct netns_ipvs *ipvs) 362 + { 363 + int ktid, best = ipvs->est_kt_count; 364 + struct ip_vs_est_kt_data *kd; 365 + 366 + for (ktid = 0; ktid < ipvs->est_kt_count; ktid++) { 367 + kd = ipvs->est_kt_arr[ktid]; 368 + if (kd) { 369 + if (kd->est_count < kd->est_max_count) { 370 + best = ktid; 371 + break; 372 + } 373 + } else if (ktid < best) { 374 + best = ktid; 375 + } 376 + } 377 + ipvs->est_add_ktid = best; 378 + } 379 + 380 + /* Add estimator to current kthread (est_add_ktid) */ 381 + static int ip_vs_enqueue_estimator(struct netns_ipvs *ipvs, 382 + struct ip_vs_estimator *est) 383 + { 384 + struct ip_vs_est_kt_data *kd = NULL; 385 + struct ip_vs_est_tick_data 
*td; 386 + int ktid, row, crow, cid, ret; 387 + int delay = est->ktrow; 388 + 389 + BUILD_BUG_ON_MSG(IPVS_EST_TICK_CHAINS > 127, 390 + "Too many chains for ktcid"); 391 + 392 + if (ipvs->est_add_ktid < ipvs->est_kt_count) { 393 + kd = ipvs->est_kt_arr[ipvs->est_add_ktid]; 394 + if (kd) 395 + goto add_est; 396 + } 397 + 398 + ret = ip_vs_est_add_kthread(ipvs); 399 + if (ret < 0) 400 + goto out; 401 + kd = ipvs->est_kt_arr[ipvs->est_add_ktid]; 402 + 403 + add_est: 404 + ktid = kd->id; 405 + /* For small number of estimators prefer to use few ticks, 406 + * otherwise try to add into the last estimated row. 407 + * est_row and add_row point after the row we should use 408 + */ 409 + if (kd->est_count >= 2 * kd->tick_max || delay < IPVS_EST_NTICKS - 1) 410 + crow = READ_ONCE(kd->est_row); 411 + else 412 + crow = kd->add_row; 413 + crow += delay; 414 + if (crow >= IPVS_EST_NTICKS) 415 + crow -= IPVS_EST_NTICKS; 416 + /* Assume initial delay ? */ 417 + if (delay >= IPVS_EST_NTICKS - 1) { 418 + /* Preserve initial delay or decrease it if no space in tick */ 419 + row = crow; 420 + if (crow < IPVS_EST_NTICKS - 1) { 421 + crow++; 422 + row = find_last_bit(kd->avail, crow); 423 + } 424 + if (row >= crow) 425 + row = find_last_bit(kd->avail, IPVS_EST_NTICKS); 426 + } else { 427 + /* Preserve delay or increase it if no space in tick */ 428 + row = IPVS_EST_NTICKS; 429 + if (crow > 0) 430 + row = find_next_bit(kd->avail, IPVS_EST_NTICKS, crow); 431 + if (row >= IPVS_EST_NTICKS) 432 + row = find_first_bit(kd->avail, IPVS_EST_NTICKS); 433 + } 434 + 435 + td = rcu_dereference_protected(kd->ticks[row], 1); 436 + if (!td) { 437 + td = kzalloc(sizeof(*td), GFP_KERNEL); 438 + if (!td) { 439 + ret = -ENOMEM; 440 + goto out; 441 + } 442 + rcu_assign_pointer(kd->ticks[row], td); 443 + } 444 + 445 + cid = find_first_zero_bit(td->full, IPVS_EST_TICK_CHAINS); 446 + 447 + kd->est_count++; 448 + kd->tick_len[row]++; 449 + if (!td->chain_len[cid]) 450 + __set_bit(cid, td->present); 451 + 
td->chain_len[cid]++; 452 + est->ktid = ktid; 453 + est->ktrow = row; 454 + est->ktcid = cid; 455 + hlist_add_head_rcu(&est->list, &td->chains[cid]); 456 + 457 + if (td->chain_len[cid] >= kd->chain_max) { 458 + __set_bit(cid, td->full); 459 + if (kd->tick_len[row] >= kd->tick_max) 460 + __clear_bit(row, kd->avail); 461 + } 462 + 463 + /* Update est_add_ktid to point to first available/empty kt slot */ 464 + if (kd->est_count == kd->est_max_count) 465 + ip_vs_est_update_ktid(ipvs); 466 + 467 + ret = 0; 468 + 469 + out: 470 + return ret; 471 + } 472 + 473 + /* Start estimation for stats */ 474 + int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats) 151 475 { 152 476 struct ip_vs_estimator *est = &stats->est; 477 + int ret; 153 478 154 - INIT_LIST_HEAD(&est->list); 479 + if (!ipvs->est_max_threads && ipvs->enable) 480 + ipvs->est_max_threads = ip_vs_est_max_threads(ipvs); 155 481 156 - spin_lock_bh(&ipvs->est_lock); 157 - list_add(&est->list, &ipvs->est_list); 158 - spin_unlock_bh(&ipvs->est_lock); 482 + est->ktid = -1; 483 + est->ktrow = IPVS_EST_NTICKS - 1; /* Initial delay */ 484 + 485 + /* We prefer this code to be short, kthread 0 will requeue the 486 + * estimator to available chain. If tasks are disabled, we 487 + * will not allocate much memory, just for kt 0. 
488 + */ 489 + ret = 0; 490 + if (!ipvs->est_kt_count || !ipvs->est_kt_arr[0]) 491 + ret = ip_vs_est_add_kthread(ipvs); 492 + if (ret >= 0) 493 + hlist_add_head(&est->list, &ipvs->est_temp_list); 494 + else 495 + INIT_HLIST_NODE(&est->list); 496 + return ret; 159 497 } 160 498 499 + static void ip_vs_est_kthread_destroy(struct ip_vs_est_kt_data *kd) 500 + { 501 + if (kd) { 502 + if (kd->task) { 503 + pr_info("stop unused estimator thread %d...\n", kd->id); 504 + kthread_stop(kd->task); 505 + } 506 + ip_vs_stats_free(kd->calc_stats); 507 + kfree(kd); 508 + } 509 + } 510 + 511 + /* Unlink estimator from chain */ 161 512 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats) 162 513 { 163 514 struct ip_vs_estimator *est = &stats->est; 515 + struct ip_vs_est_tick_data *td; 516 + struct ip_vs_est_kt_data *kd; 517 + int ktid = est->ktid; 518 + int row = est->ktrow; 519 + int cid = est->ktcid; 164 520 165 - spin_lock_bh(&ipvs->est_lock); 166 - list_del(&est->list); 167 - spin_unlock_bh(&ipvs->est_lock); 521 + /* Failed to add to chain ? */ 522 + if (hlist_unhashed(&est->list)) 523 + return; 524 + 525 + /* On return, estimator can be freed, dequeue it now */ 526 + 527 + /* In est_temp_list ? 
*/ 528 + if (ktid < 0) { 529 + hlist_del(&est->list); 530 + goto end_kt0; 531 + } 532 + 533 + hlist_del_rcu(&est->list); 534 + kd = ipvs->est_kt_arr[ktid]; 535 + td = rcu_dereference_protected(kd->ticks[row], 1); 536 + __clear_bit(cid, td->full); 537 + td->chain_len[cid]--; 538 + if (!td->chain_len[cid]) 539 + __clear_bit(cid, td->present); 540 + kd->tick_len[row]--; 541 + __set_bit(row, kd->avail); 542 + if (!kd->tick_len[row]) { 543 + RCU_INIT_POINTER(kd->ticks[row], NULL); 544 + kfree_rcu(td); 545 + } 546 + kd->est_count--; 547 + if (kd->est_count) { 548 + /* This kt slot can become available just now, prefer it */ 549 + if (ktid < ipvs->est_add_ktid) 550 + ipvs->est_add_ktid = ktid; 551 + return; 552 + } 553 + 554 + if (ktid > 0) { 555 + mutex_lock(&ipvs->est_mutex); 556 + ip_vs_est_kthread_destroy(kd); 557 + ipvs->est_kt_arr[ktid] = NULL; 558 + if (ktid == ipvs->est_kt_count - 1) { 559 + ipvs->est_kt_count--; 560 + while (ipvs->est_kt_count > 1 && 561 + !ipvs->est_kt_arr[ipvs->est_kt_count - 1]) 562 + ipvs->est_kt_count--; 563 + } 564 + mutex_unlock(&ipvs->est_mutex); 565 + 566 + /* This slot is now empty, prefer another available kt slot */ 567 + if (ktid == ipvs->est_add_ktid) 568 + ip_vs_est_update_ktid(ipvs); 569 + } 570 + 571 + end_kt0: 572 + /* kt 0 is freed after all other kthreads and chains are empty */ 573 + if (ipvs->est_kt_count == 1 && hlist_empty(&ipvs->est_temp_list)) { 574 + kd = ipvs->est_kt_arr[0]; 575 + if (!kd || !kd->est_count) { 576 + mutex_lock(&ipvs->est_mutex); 577 + if (kd) { 578 + ip_vs_est_kthread_destroy(kd); 579 + ipvs->est_kt_arr[0] = NULL; 580 + } 581 + ipvs->est_kt_count--; 582 + mutex_unlock(&ipvs->est_mutex); 583 + ipvs->est_add_ktid = 0; 584 + } 585 + } 586 + } 587 + 588 + /* Register all ests from est_temp_list to kthreads */ 589 + static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs) 590 + { 591 + struct ip_vs_estimator *est; 592 + 593 + while (1) { 594 + int max = 16; 595 + 596 + mutex_lock(&__ip_vs_mutex); 597 + 
598 + while (max-- > 0) { 599 + est = hlist_entry_safe(ipvs->est_temp_list.first, 600 + struct ip_vs_estimator, list); 601 + if (est) { 602 + if (kthread_should_stop()) 603 + goto unlock; 604 + hlist_del_init(&est->list); 605 + if (ip_vs_enqueue_estimator(ipvs, est) >= 0) 606 + continue; 607 + est->ktid = -1; 608 + hlist_add_head(&est->list, 609 + &ipvs->est_temp_list); 610 + /* Abort, some entries will not be estimated 611 + * until next attempt 612 + */ 613 + } 614 + goto unlock; 615 + } 616 + mutex_unlock(&__ip_vs_mutex); 617 + cond_resched(); 618 + } 619 + 620 + unlock: 621 + mutex_unlock(&__ip_vs_mutex); 622 + } 623 + 624 + /* Calculate limits for all kthreads */ 625 + static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max) 626 + { 627 + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq); 628 + struct ip_vs_est_kt_data *kd; 629 + struct hlist_head chain; 630 + struct ip_vs_stats *s; 631 + int cache_factor = 4; 632 + int i, loops, ntest; 633 + s32 min_est = 0; 634 + ktime_t t1, t2; 635 + s64 diff, val; 636 + int max = 8; 637 + int ret = 1; 638 + 639 + INIT_HLIST_HEAD(&chain); 640 + mutex_lock(&__ip_vs_mutex); 641 + kd = ipvs->est_kt_arr[0]; 642 + mutex_unlock(&__ip_vs_mutex); 643 + s = kd ? 
kd->calc_stats : NULL; 644 + if (!s) 645 + goto out; 646 + hlist_add_head(&s->est.list, &chain); 647 + 648 + loops = 1; 649 + /* Get best result from many tests */ 650 + for (ntest = 0; ntest < 12; ntest++) { 651 + if (!(ntest & 3)) { 652 + /* Wait for cpufreq frequency transition */ 653 + wait_event_idle_timeout(wq, kthread_should_stop(), 654 + HZ / 50); 655 + if (!ipvs->enable || kthread_should_stop()) 656 + goto stop; 657 + } 658 + 659 + local_bh_disable(); 660 + rcu_read_lock(); 661 + 662 + /* Put stats in cache */ 663 + ip_vs_chain_estimation(&chain); 664 + 665 + t1 = ktime_get(); 666 + for (i = loops * cache_factor; i > 0; i--) 667 + ip_vs_chain_estimation(&chain); 668 + t2 = ktime_get(); 669 + 670 + rcu_read_unlock(); 671 + local_bh_enable(); 672 + 673 + if (!ipvs->enable || kthread_should_stop()) 674 + goto stop; 675 + cond_resched(); 676 + 677 + diff = ktime_to_ns(ktime_sub(t2, t1)); 678 + if (diff <= 1 * NSEC_PER_USEC) { 679 + /* Do more loops on low time resolution */ 680 + loops *= 2; 681 + continue; 682 + } 683 + if (diff >= NSEC_PER_SEC) 684 + continue; 685 + val = diff; 686 + do_div(val, loops); 687 + if (!min_est || val < min_est) { 688 + min_est = val; 689 + /* goal: 95usec per chain */ 690 + val = 95 * NSEC_PER_USEC; 691 + if (val >= min_est) { 692 + do_div(val, min_est); 693 + max = (int)val; 694 + } else { 695 + max = 1; 696 + } 697 + } 698 + } 699 + 700 + out: 701 + if (s) 702 + hlist_del_init(&s->est.list); 703 + *chain_max = max; 704 + return ret; 705 + 706 + stop: 707 + ret = 0; 708 + goto out; 709 + } 710 + 711 + /* Calculate the parameters and apply them in context of kt #0 712 + * ECP: est_calc_phase 713 + * ECM: est_chain_max 714 + * ECP ECM Insert Chain enable Description 715 + * --------------------------------------------------------------------------- 716 + * 0 0 est_temp_list 0 create kt #0 context 717 + * 0 0 est_temp_list 0->1 service added, start kthread #0 task 718 + * 0->1 0 est_temp_list 1 kt task #0 started, enters calc phase 
719 + * 1 0 est_temp_list 1 kt #0: determine est_chain_max, 720 + * stop tasks, move ests to est_temp_list 721 + * and free kd for kthreads 1..last 722 + * 1->0 0->N kt chains 1 ests can go to kthreads 723 + * 0 N kt chains 1 drain est_temp_list, create new kthread 724 + * contexts, start tasks, estimate 725 + */ 726 + static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs) 727 + { 728 + int genid = atomic_read(&ipvs->est_genid); 729 + struct ip_vs_est_tick_data *td; 730 + struct ip_vs_est_kt_data *kd; 731 + struct ip_vs_estimator *est; 732 + struct ip_vs_stats *stats; 733 + int id, row, cid, delay; 734 + bool last, last_td; 735 + int chain_max; 736 + int step; 737 + 738 + if (!ip_vs_est_calc_limits(ipvs, &chain_max)) 739 + return; 740 + 741 + mutex_lock(&__ip_vs_mutex); 742 + 743 + /* Stop all other tasks, so that we can immediately move the 744 + * estimators to est_temp_list without RCU grace period 745 + */ 746 + mutex_lock(&ipvs->est_mutex); 747 + for (id = 1; id < ipvs->est_kt_count; id++) { 748 + /* netns clean up started, abort */ 749 + if (!ipvs->enable) 750 + goto unlock2; 751 + kd = ipvs->est_kt_arr[id]; 752 + if (!kd) 753 + continue; 754 + ip_vs_est_kthread_stop(kd); 755 + } 756 + mutex_unlock(&ipvs->est_mutex); 757 + 758 + /* Move all estimators to est_temp_list but carefully, 759 + * all estimators and kthread data can be released while 760 + * we reschedule. Even for kthread 0. 
761 + */ 762 + step = 0; 763 + 764 + /* Order entries in est_temp_list in ascending delay, so now 765 + * walk delay(desc), id(desc), cid(asc) 766 + */ 767 + delay = IPVS_EST_NTICKS; 768 + 769 + next_delay: 770 + delay--; 771 + if (delay < 0) 772 + goto end_dequeue; 773 + 774 + last_kt: 775 + /* Destroy contexts backwards */ 776 + id = ipvs->est_kt_count; 777 + 778 + next_kt: 779 + if (!ipvs->enable || kthread_should_stop()) 780 + goto unlock; 781 + id--; 782 + if (id < 0) 783 + goto next_delay; 784 + kd = ipvs->est_kt_arr[id]; 785 + if (!kd) 786 + goto next_kt; 787 + /* kt 0 can exist with empty chains */ 788 + if (!id && kd->est_count <= 1) 789 + goto next_delay; 790 + 791 + row = kd->est_row + delay; 792 + if (row >= IPVS_EST_NTICKS) 793 + row -= IPVS_EST_NTICKS; 794 + td = rcu_dereference_protected(kd->ticks[row], 1); 795 + if (!td) 796 + goto next_kt; 797 + 798 + cid = 0; 799 + 800 + walk_chain: 801 + if (kthread_should_stop()) 802 + goto unlock; 803 + step++; 804 + if (!(step & 63)) { 805 + /* Give chance estimators to be added (to est_temp_list) 806 + * and deleted (releasing kthread contexts) 807 + */ 808 + mutex_unlock(&__ip_vs_mutex); 809 + cond_resched(); 810 + mutex_lock(&__ip_vs_mutex); 811 + 812 + /* Current kt released ? */ 813 + if (id >= ipvs->est_kt_count) 814 + goto last_kt; 815 + if (kd != ipvs->est_kt_arr[id]) 816 + goto next_kt; 817 + /* Current td released ? 
*/ 818 + if (td != rcu_dereference_protected(kd->ticks[row], 1)) 819 + goto next_kt; 820 + /* No fatal changes on the current kd and td */ 821 + } 822 + est = hlist_entry_safe(td->chains[cid].first, struct ip_vs_estimator, 823 + list); 824 + if (!est) { 825 + cid++; 826 + if (cid >= IPVS_EST_TICK_CHAINS) 827 + goto next_kt; 828 + goto walk_chain; 829 + } 830 + /* We can cheat and increase est_count to protect kt 0 context 831 + * from release but we prefer to keep the last estimator 832 + */ 833 + last = kd->est_count <= 1; 834 + /* Do not free kt #0 data */ 835 + if (!id && last) 836 + goto next_delay; 837 + last_td = kd->tick_len[row] <= 1; 838 + stats = container_of(est, struct ip_vs_stats, est); 839 + ip_vs_stop_estimator(ipvs, stats); 840 + /* Tasks are stopped, move without RCU grace period */ 841 + est->ktid = -1; 842 + est->ktrow = row - kd->est_row; 843 + if (est->ktrow < 0) 844 + est->ktrow += IPVS_EST_NTICKS; 845 + hlist_add_head(&est->list, &ipvs->est_temp_list); 846 + /* kd freed ? */ 847 + if (last) 848 + goto next_kt; 849 + /* td freed ? */ 850 + if (last_td) 851 + goto next_kt; 852 + goto walk_chain; 853 + 854 + end_dequeue: 855 + /* All estimators removed while calculating ? 
*/ 856 + if (!ipvs->est_kt_count) 857 + goto unlock; 858 + kd = ipvs->est_kt_arr[0]; 859 + if (!kd) 860 + goto unlock; 861 + kd->add_row = kd->est_row; 862 + ipvs->est_chain_max = chain_max; 863 + ip_vs_est_set_params(ipvs, kd); 864 + 865 + pr_info("using max %d ests per chain, %d per kthread\n", 866 + kd->chain_max, kd->est_max_count); 867 + 868 + /* Try to keep tot_stats in kt0, enqueue it early */ 869 + if (ipvs->tot_stats && !hlist_unhashed(&ipvs->tot_stats->s.est.list) && 870 + ipvs->tot_stats->s.est.ktid == -1) { 871 + hlist_del(&ipvs->tot_stats->s.est.list); 872 + hlist_add_head(&ipvs->tot_stats->s.est.list, 873 + &ipvs->est_temp_list); 874 + } 875 + 876 + mutex_lock(&ipvs->est_mutex); 877 + 878 + /* We completed the calc phase, new calc phase not requested */ 879 + if (genid == atomic_read(&ipvs->est_genid)) 880 + ipvs->est_calc_phase = 0; 881 + 882 + unlock2: 883 + mutex_unlock(&ipvs->est_mutex); 884 + 885 + unlock: 886 + mutex_unlock(&__ip_vs_mutex); 168 887 } 169 888 170 889 void ip_vs_zero_estimator(struct ip_vs_stats *stats) ··· 926 191 927 192 int __net_init ip_vs_estimator_net_init(struct netns_ipvs *ipvs) 928 193 { 929 - INIT_LIST_HEAD(&ipvs->est_list); 930 - spin_lock_init(&ipvs->est_lock); 931 - timer_setup(&ipvs->est_timer, estimation_timer, 0); 932 - mod_timer(&ipvs->est_timer, jiffies + 2 * HZ); 194 + INIT_HLIST_HEAD(&ipvs->est_temp_list); 195 + ipvs->est_kt_arr = NULL; 196 + ipvs->est_max_threads = 0; 197 + ipvs->est_calc_phase = 0; 198 + ipvs->est_chain_max = 0; 199 + ipvs->est_kt_count = 0; 200 + ipvs->est_add_ktid = 0; 201 + atomic_set(&ipvs->est_genid, 0); 202 + atomic_set(&ipvs->est_genid_done, 0); 203 + __mutex_init(&ipvs->est_mutex, "ipvs->est_mutex", &__ipvs_est_key); 933 204 return 0; 934 205 } 935 206 936 207 void __net_exit ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs) 937 208 { 938 - del_timer_sync(&ipvs->est_timer); 209 + int i; 210 + 211 + for (i = 0; i < ipvs->est_kt_count; i++) 212 + 
ip_vs_est_kthread_destroy(ipvs->est_kt_arr[i]); 213 + kfree(ipvs->est_kt_arr); 214 + mutex_destroy(&ipvs->est_mutex); 939 215 }
+61 -75
net/netfilter/nf_conntrack_proto.c
··· 121 121 }; 122 122 EXPORT_SYMBOL_GPL(nf_ct_l4proto_find); 123 123 124 - unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff, 125 - struct nf_conn *ct, enum ip_conntrack_info ctinfo) 126 - { 127 - const struct nf_conn_help *help; 128 - 129 - help = nfct_help(ct); 130 - if (help) { 131 - const struct nf_conntrack_helper *helper; 132 - int ret; 133 - 134 - /* rcu_read_lock()ed by nf_hook_thresh */ 135 - helper = rcu_dereference(help->helper); 136 - if (helper) { 137 - ret = helper->help(skb, 138 - protoff, 139 - ct, ctinfo); 140 - if (ret != NF_ACCEPT) 141 - return ret; 142 - } 143 - } 144 - 145 - if (test_bit(IPS_SEQ_ADJUST_BIT, &ct->status) && 146 - !nf_is_loopback_packet(skb)) { 147 - if (!nf_ct_seq_adjust(skb, ct, ctinfo, protoff)) { 148 - NF_CT_STAT_INC_ATOMIC(nf_ct_net(ct), drop); 149 - return NF_DROP; 150 - } 151 - } 152 - 153 - /* We've seen it coming out the other side: confirm it */ 154 - return nf_conntrack_confirm(skb); 155 - } 156 - EXPORT_SYMBOL_GPL(nf_confirm); 157 - 158 124 static bool in_vrf_postrouting(const struct nf_hook_state *state) 159 125 { 160 126 #if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV) ··· 131 165 return false; 132 166 } 133 167 134 - static unsigned int ipv4_confirm(void *priv, 135 - struct sk_buff *skb, 136 - const struct nf_hook_state *state) 168 + unsigned int nf_confirm(void *priv, 169 + struct sk_buff *skb, 170 + const struct nf_hook_state *state) 137 171 { 172 + const struct nf_conn_help *help; 138 173 enum ip_conntrack_info ctinfo; 174 + unsigned int protoff; 139 175 struct nf_conn *ct; 176 + bool seqadj_needed; 177 + __be16 frag_off; 178 + u8 pnum; 140 179 141 180 ct = nf_ct_get(skb, &ctinfo); 142 - if (!ct || ctinfo == IP_CT_RELATED_REPLY) 143 - return nf_conntrack_confirm(skb); 144 - 145 - if (in_vrf_postrouting(state)) 181 + if (!ct || in_vrf_postrouting(state)) 146 182 return NF_ACCEPT; 147 183 148 - return nf_confirm(skb, 149 - skb_network_offset(skb) + ip_hdrlen(skb), 150 - ct, ctinfo); 184 + help = 
nfct_help(ct); 185 + 186 + seqadj_needed = test_bit(IPS_SEQ_ADJUST_BIT, &ct->status) && !nf_is_loopback_packet(skb); 187 + if (!help && !seqadj_needed) 188 + return nf_conntrack_confirm(skb); 189 + 190 + /* helper->help() do not expect ICMP packets */ 191 + if (ctinfo == IP_CT_RELATED_REPLY) 192 + return nf_conntrack_confirm(skb); 193 + 194 + switch (nf_ct_l3num(ct)) { 195 + case NFPROTO_IPV4: 196 + protoff = skb_network_offset(skb) + ip_hdrlen(skb); 197 + break; 198 + case NFPROTO_IPV6: 199 + pnum = ipv6_hdr(skb)->nexthdr; 200 + protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &pnum, &frag_off); 201 + if (protoff < 0 || (frag_off & htons(~0x7)) != 0) 202 + return nf_conntrack_confirm(skb); 203 + break; 204 + default: 205 + return nf_conntrack_confirm(skb); 206 + } 207 + 208 + if (help) { 209 + const struct nf_conntrack_helper *helper; 210 + int ret; 211 + 212 + /* rcu_read_lock()ed by nf_hook */ 213 + helper = rcu_dereference(help->helper); 214 + if (helper) { 215 + ret = helper->help(skb, 216 + protoff, 217 + ct, ctinfo); 218 + if (ret != NF_ACCEPT) 219 + return ret; 220 + } 221 + } 222 + 223 + if (seqadj_needed && 224 + !nf_ct_seq_adjust(skb, ct, ctinfo, protoff)) { 225 + NF_CT_STAT_INC_ATOMIC(nf_ct_net(ct), drop); 226 + return NF_DROP; 227 + } 228 + 229 + /* We've seen it coming out the other side: confirm it */ 230 + return nf_conntrack_confirm(skb); 151 231 } 232 + EXPORT_SYMBOL_GPL(nf_confirm); 152 233 153 234 static unsigned int ipv4_conntrack_in(void *priv, 154 235 struct sk_buff *skb, ··· 243 230 .priority = NF_IP_PRI_CONNTRACK, 244 231 }, 245 232 { 246 - .hook = ipv4_confirm, 233 + .hook = nf_confirm, 247 234 .pf = NFPROTO_IPV4, 248 235 .hooknum = NF_INET_POST_ROUTING, 249 236 .priority = NF_IP_PRI_CONNTRACK_CONFIRM, 250 237 }, 251 238 { 252 - .hook = ipv4_confirm, 239 + .hook = nf_confirm, 253 240 .pf = NFPROTO_IPV4, 254 241 .hooknum = NF_INET_LOCAL_IN, 255 242 .priority = NF_IP_PRI_CONNTRACK_CONFIRM, ··· 386 373 .owner = THIS_MODULE, 387 374 }; 
388 375 389 - static unsigned int ipv6_confirm(void *priv, 390 - struct sk_buff *skb, 391 - const struct nf_hook_state *state) 392 - { 393 - struct nf_conn *ct; 394 - enum ip_conntrack_info ctinfo; 395 - unsigned char pnum = ipv6_hdr(skb)->nexthdr; 396 - __be16 frag_off; 397 - int protoff; 398 - 399 - ct = nf_ct_get(skb, &ctinfo); 400 - if (!ct || ctinfo == IP_CT_RELATED_REPLY) 401 - return nf_conntrack_confirm(skb); 402 - 403 - if (in_vrf_postrouting(state)) 404 - return NF_ACCEPT; 405 - 406 - protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &pnum, 407 - &frag_off); 408 - if (protoff < 0 || (frag_off & htons(~0x7)) != 0) { 409 - pr_debug("proto header not found\n"); 410 - return nf_conntrack_confirm(skb); 411 - } 412 - 413 - return nf_confirm(skb, protoff, ct, ctinfo); 414 - } 415 - 416 376 static unsigned int ipv6_conntrack_in(void *priv, 417 377 struct sk_buff *skb, 418 378 const struct nf_hook_state *state) ··· 414 428 .priority = NF_IP6_PRI_CONNTRACK, 415 429 }, 416 430 { 417 - .hook = ipv6_confirm, 431 + .hook = nf_confirm, 418 432 .pf = NFPROTO_IPV6, 419 433 .hooknum = NF_INET_POST_ROUTING, 420 434 .priority = NF_IP6_PRI_LAST, 421 435 }, 422 436 { 423 - .hook = ipv6_confirm, 437 + .hook = nf_confirm, 424 438 .pf = NFPROTO_IPV6, 425 439 .hooknum = NF_INET_LOCAL_IN, 426 440 .priority = NF_IP6_PRI_LAST - 1,
+53
net/netfilter/nf_conntrack_proto_icmpv6.c
···
 	nf_l4proto_log_invalid(skb, state, IPPROTO_ICMPV6, "%s", msg);
 }
 
+static noinline_for_stack int
+nf_conntrack_icmpv6_redirect(struct nf_conn *tmpl, struct sk_buff *skb,
+			     unsigned int dataoff,
+			     const struct nf_hook_state *state)
+{
+	u8 hl = ipv6_hdr(skb)->hop_limit;
+	union nf_inet_addr outer_daddr;
+	union {
+		struct nd_opt_hdr nd_opt;
+		struct rd_msg rd_msg;
+	} tmp;
+	const struct nd_opt_hdr *nd_opt;
+	const struct rd_msg *rd_msg;
+
+	rd_msg = skb_header_pointer(skb, dataoff, sizeof(*rd_msg), &tmp.rd_msg);
+	if (!rd_msg) {
+		icmpv6_error_log(skb, state, "short redirect");
+		return -NF_ACCEPT;
+	}
+
+	if (rd_msg->icmph.icmp6_code != 0)
+		return NF_ACCEPT;
+
+	if (hl != 255 || !(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) {
+		icmpv6_error_log(skb, state, "invalid saddr or hoplimit for redirect");
+		return -NF_ACCEPT;
+	}
+
+	dataoff += sizeof(*rd_msg);
+
+	/* warning: rd_msg no longer usable after this call */
+	nd_opt = skb_header_pointer(skb, dataoff, sizeof(*nd_opt), &tmp.nd_opt);
+	if (!nd_opt || nd_opt->nd_opt_len == 0) {
+		icmpv6_error_log(skb, state, "redirect without options");
+		return -NF_ACCEPT;
+	}
+
+	/* We could call ndisc_parse_options(), but it would need
+	 * skb_linearize() and a bit more work.
+	 */
+	if (nd_opt->nd_opt_type != ND_OPT_REDIRECT_HDR)
+		return NF_ACCEPT;
+
+	memcpy(&outer_daddr.ip6, &ipv6_hdr(skb)->daddr,
+	       sizeof(outer_daddr.ip6));
+	dataoff += 8;
+	return nf_conntrack_inet_error(tmpl, skb, dataoff, state,
+				       IPPROTO_ICMPV6, &outer_daddr);
+}
+
 int nf_conntrack_icmpv6_error(struct nf_conn *tmpl,
 			      struct sk_buff *skb,
 			      unsigned int dataoff,
···
 		nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 		return NF_ACCEPT;
 	}
+
+	if (icmp6h->icmp6_type == NDISC_REDIRECT)
+		return nf_conntrack_icmpv6_redirect(tmpl, skb, dataoff, state);
 
 	/* is not error message ? */
 	if (icmp6h->icmp6_type >= 128
+61 -43
net/netfilter/nf_conntrack_proto_sctp.c
···
 	[SCTP_CONNTRACK_SHUTDOWN_ACK_SENT]	= 3 SECS,
 	[SCTP_CONNTRACK_HEARTBEAT_SENT]		= 30 SECS,
 	[SCTP_CONNTRACK_HEARTBEAT_ACKED]	= 210 SECS,
+	[SCTP_CONNTRACK_DATA_SENT]		= 30 SECS,
 };
 
 #define	SCTP_FLAG_HEARTBEAT_VTAG_FAILED	1
···
 #define sSA SCTP_CONNTRACK_SHUTDOWN_ACK_SENT
 #define sHS SCTP_CONNTRACK_HEARTBEAT_SENT
 #define sHA SCTP_CONNTRACK_HEARTBEAT_ACKED
+#define sDS SCTP_CONNTRACK_DATA_SENT
 #define sIV SCTP_CONNTRACK_MAX
 
 /*
···
 COOKIE ECHOED - We have seen a COOKIE_ECHO chunk in the original direction.
 ESTABLISHED - We have seen a COOKIE_ACK in the reply direction.
 SHUTDOWN_SENT - We have seen a SHUTDOWN chunk in the original direction.
-SHUTDOWN_RECD - We have seen a SHUTDOWN chunk in the reply directoin.
+SHUTDOWN_RECD - We have seen a SHUTDOWN chunk in the reply direction.
 SHUTDOWN_ACK_SENT - We have seen a SHUTDOWN_ACK chunk in the direction opposite
 		    to that of the SHUTDOWN chunk.
 CLOSED - We have seen a SHUTDOWN_COMPLETE chunk in the direction of
 	 the SHUTDOWN chunk. Connection is closed.
 HEARTBEAT_SENT - We have seen a HEARTBEAT in a new flow.
-HEARTBEAT_ACKED - We have seen a HEARTBEAT-ACK in the direction opposite to
-		  that of the HEARTBEAT chunk. Secondary connection is
-		  established.
+HEARTBEAT_ACKED - We have seen a HEARTBEAT-ACK/DATA/SACK in the direction
+		  opposite to that of the HEARTBEAT/DATA chunk. Secondary connection
+		  is established.
+DATA_SENT - We have seen a DATA/SACK in a new flow.
 */
 
 /* TODO
···
 */
 
 /* SCTP conntrack state transitions */
-static const u8 sctp_conntracks[2][11][SCTP_CONNTRACK_MAX] = {
+static const u8 sctp_conntracks[2][12][SCTP_CONNTRACK_MAX] = {
 	{
 /* ORIGINAL */
-/*                  sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA */
-/* init         */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCW, sHA},
-/* init_ack     */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA},
-/* abort        */ {sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL},
-/* shutdown     */ {sCL, sCL, sCW, sCE, sSS, sSS, sSR, sSA, sCL, sSS},
-/* shutdown_ack */ {sSA, sCL, sCW, sCE, sES, sSA, sSA, sSA, sSA, sHA},
-/* error        */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA},/* Can't have Stale cookie*/
-/* cookie_echo  */ {sCL, sCL, sCE, sCE, sES, sSS, sSR, sSA, sCL, sHA},/* 5.2.4 - Big TODO */
-/* cookie_ack   */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA},/* Can't come in orig dir */
-/* shutdown_comp*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sCL, sCL, sHA},
-/* heartbeat    */ {sHS, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA},
-/* heartbeat_ack*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA}
+/*                  sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS */
+/* init         */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCW, sHA, sCW},
+/* init_ack     */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},
+/* abort        */ {sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL},
+/* shutdown     */ {sCL, sCL, sCW, sCE, sSS, sSS, sSR, sSA, sCL, sSS, sCL},
+/* shutdown_ack */ {sSA, sCL, sCW, sCE, sES, sSA, sSA, sSA, sSA, sHA, sSA},
+/* error        */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},/* Can't have Stale cookie*/
+/* cookie_echo  */ {sCL, sCL, sCE, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},/* 5.2.4 - Big TODO */
+/* cookie_ack   */ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sCL, sHA, sCL},/* Can't come in orig dir */
+/* shutdown_comp*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sCL, sCL, sHA, sCL},
+/* heartbeat    */ {sHS, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS},
+/* heartbeat_ack*/ {sCL, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS},
+/* data/sack    */ {sDS, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS}
 	},
 	{
 /* REPLY */
-/*                  sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA */
-/* init         */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA},/* INIT in sCL Big TODO */
-/* init_ack     */ {sIV, sCW, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA},
-/* abort        */ {sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sIV, sCL},
-/* shutdown     */ {sIV, sCL, sCW, sCE, sSR, sSS, sSR, sSA, sIV, sSR},
-/* shutdown_ack */ {sIV, sCL, sCW, sCE, sES, sSA, sSA, sSA, sIV, sHA},
-/* error        */ {sIV, sCL, sCW, sCL, sES, sSS, sSR, sSA, sIV, sHA},
-/* cookie_echo  */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA},/* Can't come in reply dir */
-/* cookie_ack   */ {sIV, sCL, sCW, sES, sES, sSS, sSR, sSA, sIV, sHA},
-/* shutdown_comp*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sCL, sIV, sHA},
-/* heartbeat    */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA},
-/* heartbeat_ack*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHA, sHA}
+/*                  sNO, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sDS */
+/* init         */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA, sIV},/* INIT in sCL Big TODO */
+/* init_ack     */ {sIV, sCW, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA, sIV},
+/* abort        */ {sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sIV, sCL, sIV},
+/* shutdown     */ {sIV, sCL, sCW, sCE, sSR, sSS, sSR, sSA, sIV, sSR, sIV},
+/* shutdown_ack */ {sIV, sCL, sCW, sCE, sES, sSA, sSA, sSA, sIV, sHA, sIV},
+/* error        */ {sIV, sCL, sCW, sCL, sES, sSS, sSR, sSA, sIV, sHA, sIV},
+/* cookie_echo  */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sIV, sHA, sIV},/* Can't come in reply dir */
+/* cookie_ack   */ {sIV, sCL, sCW, sES, sES, sSS, sSR, sSA, sIV, sHA, sIV},
+/* shutdown_comp*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sCL, sIV, sHA, sIV},
+/* heartbeat    */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHS, sHA, sHA},
+/* heartbeat_ack*/ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHA, sHA, sHA},
+/* data/sack    */ {sIV, sCL, sCW, sCE, sES, sSS, sSR, sSA, sHA, sHA, sHA},
 	}
 };
 
···
 		pr_debug("SCTP_CID_HEARTBEAT_ACK");
 		i = 10;
 		break;
+	case SCTP_CID_DATA:
+	case SCTP_CID_SACK:
+		pr_debug("SCTP_CID_DATA/SACK");
+		i = 11;
+		break;
 	default:
 		/* Other chunks like DATA or SACK do not change the state */
 		pr_debug("Unknown chunk type, Will stay in %s\n",
···
 			 ih->init_tag);
 
 		ct->proto.sctp.vtag[IP_CT_DIR_REPLY] = ih->init_tag;
-	} else if (sch->type == SCTP_CID_HEARTBEAT) {
+	} else if (sch->type == SCTP_CID_HEARTBEAT ||
+		   sch->type == SCTP_CID_DATA ||
+		   sch->type == SCTP_CID_SACK) {
 		pr_debug("Setting vtag %x for secondary conntrack\n",
 			 sh->vtag);
 		ct->proto.sctp.vtag[IP_CT_DIR_ORIGINAL] = sh->vtag;
···
 
 		if (!sctp_new(ct, skb, sh, dataoff))
 			return -NF_ACCEPT;
-	}
-
-	/* Check the verification tag (Sec 8.5) */
-	if (!test_bit(SCTP_CID_INIT, map) &&
-	    !test_bit(SCTP_CID_SHUTDOWN_COMPLETE, map) &&
-	    !test_bit(SCTP_CID_COOKIE_ECHO, map) &&
-	    !test_bit(SCTP_CID_ABORT, map) &&
-	    !test_bit(SCTP_CID_SHUTDOWN_ACK, map) &&
-	    !test_bit(SCTP_CID_HEARTBEAT, map) &&
-	    !test_bit(SCTP_CID_HEARTBEAT_ACK, map) &&
-	    sh->vtag != ct->proto.sctp.vtag[dir]) {
-		pr_debug("Verification tag check failed\n");
-		goto out;
+	} else {
+		/* Check the verification tag (Sec 8.5) */
+		if (!test_bit(SCTP_CID_INIT, map) &&
+		    !test_bit(SCTP_CID_SHUTDOWN_COMPLETE, map) &&
+		    !test_bit(SCTP_CID_COOKIE_ECHO, map) &&
+		    !test_bit(SCTP_CID_ABORT, map) &&
+		    !test_bit(SCTP_CID_SHUTDOWN_ACK, map
&& 402 + !test_bit(SCTP_CID_HEARTBEAT, map) && 403 + !test_bit(SCTP_CID_HEARTBEAT_ACK, map) && 404 + sh->vtag != ct->proto.sctp.vtag[dir]) { 405 + pr_debug("Verification tag check failed\n"); 406 + goto out; 407 + } 420 408 } 421 409 422 410 old_state = new_state = SCTP_CONNTRACK_NONE; ··· 475 463 ct->proto.sctp.vtag[!dir] = 0; 476 464 } else if (ct->proto.sctp.flags & SCTP_FLAG_HEARTBEAT_VTAG_FAILED) { 477 465 ct->proto.sctp.flags &= ~SCTP_FLAG_HEARTBEAT_VTAG_FAILED; 466 + } 467 + } else if (sch->type == SCTP_CID_DATA || sch->type == SCTP_CID_SACK) { 468 + if (ct->proto.sctp.vtag[dir] == 0) { 469 + pr_debug("Setting vtag %x for dir %d\n", sh->vtag, dir); 470 + ct->proto.sctp.vtag[dir] = sh->vtag; 478 471 } 479 472 } 480 473 ··· 701 684 [CTA_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT] = { .type = NLA_U32 }, 702 685 [CTA_TIMEOUT_SCTP_HEARTBEAT_SENT] = { .type = NLA_U32 }, 703 686 [CTA_TIMEOUT_SCTP_HEARTBEAT_ACKED] = { .type = NLA_U32 }, 687 + [CTA_TIMEOUT_SCTP_DATA_SENT] = { .type = NLA_U32 }, 704 688 }; 705 689 #endif /* CONFIG_NF_CONNTRACK_TIMEOUT */ 706 690
net/netfilter/nf_conntrack_standalone.c (+8)
···
 	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT,
 	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_HEARTBEAT_SENT,
 	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_HEARTBEAT_ACKED,
+	NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_DATA_SENT,
 #endif
 #ifdef CONFIG_NF_CT_PROTO_DCCP
 	NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_REQUEST,
···
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
 	},
+	[NF_SYSCTL_CT_PROTO_TIMEOUT_SCTP_DATA_SENT] = {
+		.procname	= "nf_conntrack_sctp_timeout_data_sent",
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_jiffies,
+	},
 #endif
 #ifdef CONFIG_NF_CT_PROTO_DCCP
 	[NF_SYSCTL_CT_PROTO_TIMEOUT_DCCP_REQUEST] = {
···
 	XASSIGN(SHUTDOWN_ACK_SENT, sn);
 	XASSIGN(HEARTBEAT_SENT, sn);
 	XASSIGN(HEARTBEAT_ACKED, sn);
+	XASSIGN(DATA_SENT, sn);
 #undef XASSIGN
 #endif
 }
net/netfilter/nf_flow_table_ip.c (+8)
···
 		if (ret == NF_DROP)
 			flow_offload_teardown(flow);
 		break;
+	default:
+		WARN_ON_ONCE(1);
+		ret = NF_DROP;
+		break;
 	}
 
 	return ret;
···
 		ret = nf_flow_queue_xmit(state->net, skb, tuplehash, ETH_P_IPV6);
 		if (ret == NF_DROP)
 			flow_offload_teardown(flow);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		ret = NF_DROP;
 		break;
 	}
 
net/netfilter/nf_tables_api.c (+2 -2)
···
 		return -EINVAL;
 
 	type = __nft_expr_type_get(ctx->family, tb[NFTA_EXPR_NAME]);
-	if (IS_ERR(type))
-		return PTR_ERR(type);
+	if (!type)
+		return -ENOENT;
 
 	if (!type->inner_ops)
 		return -EOPNOTSUPP;
tools/testing/selftests/netfilter/conntrack_icmp_related.sh (+34 -2)
···
 for i in 1 2;do ip netns del nsrouter$i;done
 }
 
+trap cleanup EXIT
+
 ipv4() {
 	echo -n 192.168.$1.2
 }
···
 table inet filter {
 	counter unknown { }
 	counter related { }
+	counter redir4 { }
+	counter redir6 { }
 	chain input {
 		type filter hook input priority 0; policy accept;
-		meta l4proto { icmp, icmpv6 } ct state established,untracked accept
 
+		icmp type "redirect" ct state "related" counter name "redir4" accept
+		icmpv6 type "nd-redirect" ct state "related" counter name "redir6" accept
+
+		meta l4proto { icmp, icmpv6 } ct state established,untracked accept
 		meta l4proto { icmp, icmpv6 } ct state "related" counter name "related" accept
+
 		counter name "unknown" drop
 	}
 }
···
 	echo "ERROR: icmp error RELATED state test has failed"
 fi
 
-cleanup
+# add 'bad' route, expect icmp REDIRECT to be generated
+ip netns exec nsclient1 ip route add 192.168.1.42 via 192.168.1.1
+ip netns exec nsclient1 ip route add dead:1::42 via dead:1::1
+
+ip netns exec "nsclient1" ping -q -c 2 192.168.1.42 > /dev/null
+
+expect="packets 1 bytes 112"
+check_counter nsclient1 "redir4" "$expect"
+if [ $? -ne 0 ];then
+	ret=1
+fi
+
+ip netns exec "nsclient1" ping -c 1 dead:1::42 > /dev/null
+expect="packets 1 bytes 192"
+check_counter nsclient1 "redir6" "$expect"
+if [ $? -ne 0 ];then
+	ret=1
+fi
+
+if [ $ret -eq 0 ];then
+	echo "PASS: icmp redirects had RELATED state"
+else
+	echo "ERROR: icmp redirect RELATED state test has failed"
+fi
+
 exit $ret