Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS updates for net-next

The following patchset contains Netfilter/IPVS fixes and OVS NAT
support, more specifically this batch is composed of:

1) Fix a crash in ipset when performing a parallel flush/dump with
set:list type, from Jozsef Kadlecsik.

2) Make sure NFACCT_FILTER_* netlink attributes are in place before
accessing them, from Phil Turnbull.

3) Check return error code from ip_vs_fill_iph_skb_off() in IPVS SIP
helper, from Arnd Bergmann.

4) Add a workaround to IPVS to reschedule existing connections to a new
destination server by dropping the packet and waiting for retransmission
of the TCP SYN packet, from Julian Anastasov.

5) Allow connection rescheduling in IPVS when in CLOSE state, also
from Julian.

6) Fix wrong offset of SIP Call-ID in IPVS helper, from Marco Angaroni.

7) Validate IPSET_ATTR_ETHER netlink attribute length, from Jozsef.

8) Check match/targetinfo netlink attribute size in nft_compat,
patch from Florian Westphal.

9) Check for integer overflow on 32-bit systems in x_tables, from
Florian Westphal.

Several patches from Jarno Rajahalme to prepare the introduction of
NAT support to OVS based on the Netfilter infrastructure:

10) Schedule IP_CT_NEW_REPLY definition for removal in
nf_conntrack_common.h.

11) Simplify checksumming recalculation in nf_nat.

12) Add comments to the openvswitch conntrack code, from Jarno.

13) Update the CT state key only after successful nf_conntrack_in()
invocation.

14) Find existing conntrack entry after upcall.

15) Handle NF_REPEAT case due to templates in nf_conntrack_in().

16) Call the conntrack helper functions once the conntrack has been
confirmed.

17) And finally, add the NAT interface to OVS.

The batch closes with:

18) Cleanup to use spin_unlock_wait() instead of
spin_lock()/spin_unlock(), from Nicholas Mc Guire.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+796 -135
+17
include/net/ip_vs.h
···
 }
 #endif /* CONFIG_IP_VS_NFCT */
 
+/* Really using conntrack? */
+static inline bool ip_vs_conn_uses_conntrack(struct ip_vs_conn *cp,
+					     struct sk_buff *skb)
+{
+#ifdef CONFIG_IP_VS_NFCT
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+
+	if (!(cp->flags & IP_VS_CONN_F_NFCT))
+		return false;
+	ct = nf_ct_get(skb, &ctinfo);
+	if (ct && !nf_ct_is_untracked(ct))
+		return true;
+#endif
+	return false;
+}
+
 static inline int
 ip_vs_dest_conn_overhead(struct ip_vs_dest *dest)
 {
+9 -3
include/uapi/linux/netfilter/nf_conntrack_common.h
···
 
 	IP_CT_ESTABLISHED_REPLY = IP_CT_ESTABLISHED + IP_CT_IS_REPLY,
 	IP_CT_RELATED_REPLY = IP_CT_RELATED + IP_CT_IS_REPLY,
-	IP_CT_NEW_REPLY = IP_CT_NEW + IP_CT_IS_REPLY,
-	/* Number of distinct IP_CT types (no NEW in reply dirn). */
-	IP_CT_NUMBER = IP_CT_IS_REPLY * 2 - 1
+	/* No NEW in reply direction. */
+
+	/* Number of distinct IP_CT types. */
+	IP_CT_NUMBER,
+
+	/* only for userspace compatibility */
+#ifndef __KERNEL__
+	IP_CT_NEW_REPLY = IP_CT_NUMBER,
+#endif
 };
 
 #define NF_CT_STATE_INVALID_BIT (1 << 0)
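The enum change above relies on implicit enumeration to keep the numeric values stable. The following userspace sketch checks that at compile time. The first four enumerators are not part of the hunk and are taken from the upstream header as an assumption; `ct_number()` is a hypothetical helper added only so the value can be inspected.

```c
#include <assert.h>

/* Userspace copy of the reworked enum. The first four enumerators are
 * not shown in the hunk; they are assumed from the upstream header. */
enum ip_conntrack_info {
	IP_CT_ESTABLISHED,
	IP_CT_RELATED,
	IP_CT_NEW,
	IP_CT_IS_REPLY,

	IP_CT_ESTABLISHED_REPLY = IP_CT_ESTABLISHED + IP_CT_IS_REPLY,
	IP_CT_RELATED_REPLY = IP_CT_RELATED + IP_CT_IS_REPLY,
	/* No NEW in reply direction. */

	/* Number of distinct IP_CT types. */
	IP_CT_NUMBER,

	/* Userspace-only alias, as in the #ifndef __KERNEL__ branch. */
	IP_CT_NEW_REPLY = IP_CT_NUMBER,
};

/* The removed definitions were IP_CT_IS_REPLY * 2 - 1 and
 * IP_CT_NEW + IP_CT_IS_REPLY; the new ones must match them. */
static_assert(IP_CT_NUMBER == IP_CT_IS_REPLY * 2 - 1,
	      "IP_CT_NUMBER changed value");
static_assert(IP_CT_NEW_REPLY == IP_CT_NEW + IP_CT_IS_REPLY,
	      "IP_CT_NEW_REPLY changed value");

/* Hypothetical helper: exposes the count for run-time inspection. */
int ct_number(void)
{
	return IP_CT_NUMBER;
}
```

If the two static assertions compile, userspace built against the new header sees the same values as before, while kernel code can no longer name `IP_CT_NEW_REPLY`.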
+49
include/uapi/linux/openvswitch.h
···
 #define OVS_CS_F_REPLY_DIR 0x08 /* Flow is in the reply direction. */
 #define OVS_CS_F_INVALID 0x10 /* Could not track connection. */
 #define OVS_CS_F_TRACKED 0x20 /* Conntrack has occurred. */
+#define OVS_CS_F_SRC_NAT 0x40 /* Packet's source address/port was
+			       * mangled by NAT.
+			       */
+#define OVS_CS_F_DST_NAT 0x80 /* Packet's destination address/port
+			       * was mangled by NAT.
+			       */
+
+#define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT)
 
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
···
  * mask. For each bit set in the mask, the corresponding bit in the value is
  * copied to the connection tracking label field in the connection.
  * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG.
+ * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address
+ * translation (NAT) on the packet.
  */
 enum ovs_ct_attr {
 	OVS_CT_ATTR_UNSPEC,
···
 	OVS_CT_ATTR_LABELS, /* labels to associate with this connection. */
 	OVS_CT_ATTR_HELPER, /* netlink helper to assist detection of
 			       related connections. */
+	OVS_CT_ATTR_NAT, /* Nested OVS_NAT_ATTR_* */
 	__OVS_CT_ATTR_MAX
 };
 
 #define OVS_CT_ATTR_MAX (__OVS_CT_ATTR_MAX - 1)
+
+/**
+ * enum ovs_nat_attr - Attributes for %OVS_CT_ATTR_NAT.
+ *
+ * @OVS_NAT_ATTR_SRC: Flag for Source NAT (mangle source address/port).
+ * @OVS_NAT_ATTR_DST: Flag for Destination NAT (mangle destination
+ * address/port). Only one of (@OVS_NAT_ATTR_SRC, @OVS_NAT_ATTR_DST) may be
+ * specified. Effective only for packets for ct_state NEW connections.
+ * Packets of committed connections are mangled by the NAT action according to
+ * the committed NAT type regardless of the flags specified. As a corollary, a
+ * NAT action without a NAT type flag will only mangle packets of committed
+ * connections. The following NAT attributes only apply for NEW
+ * (non-committed) connections, and they may be included only when the CT
+ * action has the @OVS_CT_ATTR_COMMIT flag and either @OVS_NAT_ATTR_SRC or
+ * @OVS_NAT_ATTR_DST is also included.
+ * @OVS_NAT_ATTR_IP_MIN: struct in_addr or struct in6_addr
+ * @OVS_NAT_ATTR_IP_MAX: struct in_addr or struct in6_addr
+ * @OVS_NAT_ATTR_PROTO_MIN: u16 L4 protocol specific lower boundary (port)
+ * @OVS_NAT_ATTR_PROTO_MAX: u16 L4 protocol specific upper boundary (port)
+ * @OVS_NAT_ATTR_PERSISTENT: Flag for persistent IP mapping across reboots
+ * @OVS_NAT_ATTR_PROTO_HASH: Flag for pseudo random L4 port mapping (MD5)
+ * @OVS_NAT_ATTR_PROTO_RANDOM: Flag for fully randomized L4 port mapping
+ */
+enum ovs_nat_attr {
+	OVS_NAT_ATTR_UNSPEC,
+	OVS_NAT_ATTR_SRC,
+	OVS_NAT_ATTR_DST,
+	OVS_NAT_ATTR_IP_MIN,
+	OVS_NAT_ATTR_IP_MAX,
+	OVS_NAT_ATTR_PROTO_MIN,
+	OVS_NAT_ATTR_PROTO_MAX,
+	OVS_NAT_ATTR_PERSISTENT,
+	OVS_NAT_ATTR_PROTO_HASH,
+	OVS_NAT_ATTR_PROTO_RANDOM,
+	__OVS_NAT_ATTR_MAX,
+};
+
+#define OVS_NAT_ATTR_MAX (__OVS_NAT_ATTR_MAX - 1)
 
 /**
  * enum ovs_action_attr - Action types.
+8 -22
net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
···
 				    u8 proto, void *data, __sum16 *check,
 				    int datalen, int oldlen)
 {
-	const struct iphdr *iph = ip_hdr(skb);
-	struct rtable *rt = skb_rtable(skb);
-
 	if (skb->ip_summed != CHECKSUM_PARTIAL) {
-		if (!(rt->rt_flags & RTCF_LOCAL) &&
-		    (!skb->dev || skb->dev->features &
-		     (NETIF_F_IP_CSUM | NETIF_F_HW_CSUM))) {
-			skb->ip_summed = CHECKSUM_PARTIAL;
-			skb->csum_start = skb_headroom(skb) +
-					  skb_network_offset(skb) +
-					  ip_hdrlen(skb);
-			skb->csum_offset = (void *)check - data;
-			*check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
-						    datalen, proto, 0);
-		} else {
-			*check = 0;
-			*check = csum_tcpudp_magic(iph->saddr, iph->daddr,
-						   datalen, proto,
-						   csum_partial(data, datalen,
-								0));
-			if (proto == IPPROTO_UDP && !*check)
-				*check = CSUM_MANGLED_0;
-		}
+		const struct iphdr *iph = ip_hdr(skb);
+
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		skb->csum_start = skb_headroom(skb) + skb_network_offset(skb) +
+			ip_hdrlen(skb);
+		skb->csum_offset = (void *)check - data;
+		*check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
+					    proto, 0);
 	} else
 		inet_proto_csum_replace2(check, skb,
 					 htons(oldlen), htons(datalen), true);
+8 -22
net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
···
 				    u8 proto, void *data, __sum16 *check,
 				    int datalen, int oldlen)
 {
-	const struct ipv6hdr *ipv6h = ipv6_hdr(skb);
-	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
-
 	if (skb->ip_summed != CHECKSUM_PARTIAL) {
-		if (!(rt->rt6i_flags & RTF_LOCAL) &&
-		    (!skb->dev || skb->dev->features &
-		     (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))) {
-			skb->ip_summed = CHECKSUM_PARTIAL;
-			skb->csum_start = skb_headroom(skb) +
-					  skb_network_offset(skb) +
-					  (data - (void *)skb->data);
-			skb->csum_offset = (void *)check - data;
-			*check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
-						  datalen, proto, 0);
-		} else {
-			*check = 0;
-			*check = csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
-						 datalen, proto,
-						 csum_partial(data, datalen,
-							      0));
-			if (proto == IPPROTO_UDP && !*check)
-				*check = CSUM_MANGLED_0;
-		}
+		const struct ipv6hdr *ipv6h = ipv6_hdr(skb);
+
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		skb->csum_start = skb_headroom(skb) + skb_network_offset(skb) +
+			(data - (void *)skb->data);
+		skb->csum_offset = (void *)check - data;
+		*check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
+					  datalen, proto, 0);
 	} else
 		inet_proto_csum_replace2(check, skb,
 					 htons(oldlen), htons(datalen), true);
+2
net/netfilter/ipset/ip_set_bitmap_ipmac.c
···
 
 	e.id = ip_to_id(map, ip);
 	if (tb[IPSET_ATTR_ETHER]) {
+		if (nla_len(tb[IPSET_ATTR_ETHER]) != ETH_ALEN)
+			return -IPSET_ERR_PROTOCOL;
 		memcpy(e.ether, nla_data(tb[IPSET_ATTR_ETHER]), ETH_ALEN);
 		e.add_mac = 1;
 	}
+3
net/netfilter/ipset/ip_set_core.c
···
 	if (unlikely(protocol_failed(attr)))
 		return -IPSET_ERR_PROTOCOL;
 
+	/* Must wait for flush to be really finished in list:set */
+	rcu_barrier();
+
 	/* Commands are serialized and references are
 	 * protected by the ip_set_ref_lock.
 	 * External systems (i.e. xt_set) must call
+2 -1
net/netfilter/ipset/ip_set_hash_mac.c
···
 	if (tb[IPSET_ATTR_LINENO])
 		*lineno = nla_get_u32(tb[IPSET_ATTR_LINENO]);
 
-	if (unlikely(!tb[IPSET_ATTR_ETHER]))
+	if (unlikely(!tb[IPSET_ATTR_ETHER] ||
+		     nla_len(tb[IPSET_ATTR_ETHER]) != ETH_ALEN))
 		return -IPSET_ERR_PROTOCOL;
 
 	ret = ip_set_get_extensions(set, tb, &ext);
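Both IPSET_ATTR_ETHER hunks (bitmap:ip,mac and hash:mac) apply the same rule: check the attribute's payload length before copying a fixed number of bytes out of it. A minimal userspace sketch of that pattern follows. `struct fake_attr` and `copy_ether()` are hypothetical stand-ins for the kernel's netlink attribute and the ipset parsing code, not real kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6 /* length of an Ethernet MAC address in bytes */

/* Hypothetical stand-in for a parsed netlink attribute. */
struct fake_attr {
	size_t len;                /* payload length, like nla_len() */
	const unsigned char *data; /* payload pointer, like nla_data() */
};

/* Copy a MAC address out of the attribute only if the attribute is
 * present and its payload is exactly ETH_ALEN bytes long.
 * Returns 0 on success, -1 on a malformed or missing attribute. */
int copy_ether(unsigned char dst[ETH_ALEN], const struct fake_attr *attr)
{
	assert(dst != NULL);

	if (attr == NULL || attr->len != ETH_ALEN)
		return -1;
	memcpy(dst, attr->data, ETH_ALEN);
	return 0;
}
```

Without the length check, a short attribute would let the fixed-size `memcpy()` read past the end of the payload, which is the class of bug the two hunks close.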
+26 -31
net/netfilter/ipset/ip_set_list_set.c
···
 struct set_elem {
 	struct rcu_head rcu;
 	struct list_head list;
+	struct ip_set *set;	/* Sigh, in order to cleanup reference */
 	ip_set_id_t id;
 } __aligned(__alignof__(u64));
 
···
 /* Userspace interfaces: we are protected by the nfnl mutex */
 
 static void
-__list_set_del(struct ip_set *set, struct set_elem *e)
+__list_set_del_rcu(struct rcu_head * rcu)
 {
+	struct set_elem *e = container_of(rcu, struct set_elem, rcu);
+	struct ip_set *set = e->set;
 	struct list_set *map = set->data;
 
 	ip_set_put_byindex(map->net, e->id);
-	/* We may call it, because we don't have a to be destroyed
-	 * extension which is used by the kernel.
-	 */
 	ip_set_ext_destroy(set, e);
-	kfree_rcu(e, rcu);
+	kfree(e);
 }
 
 static inline void
 list_set_del(struct ip_set *set, struct set_elem *e)
 {
 	list_del_rcu(&e->list);
-	__list_set_del(set, e);
+	call_rcu(&e->rcu, __list_set_del_rcu);
 }
 
 static inline void
-list_set_replace(struct ip_set *set, struct set_elem *e, struct set_elem *old)
+list_set_replace(struct set_elem *e, struct set_elem *old)
 {
 	list_replace_rcu(&old->list, &e->list);
-	__list_set_del(set, old);
+	call_rcu(&old->rcu, __list_set_del_rcu);
 }
 
 static void
···
 	struct set_elem *e, *n, *prev, *next;
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
 
-	if (SET_WITH_TIMEOUT(set))
-		set_cleanup_entries(set);
-
 	/* Find where to add the new entry */
 	n = prev = next = NULL;
 	list_for_each_entry(e, &map->members, list) {
···
 	if (!e)
 		return -ENOMEM;
 	e->id = d->id;
+	e->set = set;
 	INIT_LIST_HEAD(&e->list);
 	list_set_init_extensions(set, ext, e);
 	if (n)
-		list_set_replace(set, e, n);
+		list_set_replace(e, n);
 	else if (next)
 		list_add_tail_rcu(&e->list, &next->list);
 	else if (prev)
···
 
 	if (SET_WITH_TIMEOUT(set))
 		del_timer_sync(&map->gc);
+
 	list_for_each_entry_safe(e, n, &map->members, list) {
 		list_del(&e->list);
 		ip_set_put_byindex(map->net, e->id);
···
 	struct set_elem *e;
 	u32 n = 0;
 
-	list_for_each_entry(e, &map->members, list)
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &map->members, list)
 		n++;
+	rcu_read_unlock();
 
 	nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
 	if (!nested)
···
 	atd = ipset_nest_start(skb, IPSET_ATTR_ADT);
 	if (!atd)
 		return -EMSGSIZE;
-	list_for_each_entry(e, &map->members, list) {
-		if (i == first)
-			break;
-		i++;
-	}
 
 	rcu_read_lock();
-	list_for_each_entry_from(e, &map->members, list) {
-		i++;
-		if (SET_WITH_TIMEOUT(set) &&
-		    ip_set_timeout_expired(ext_timeout(e, set)))
+	list_for_each_entry_rcu(e, &map->members, list) {
+		if (i < first ||
+		    (SET_WITH_TIMEOUT(set) &&
+		     ip_set_timeout_expired(ext_timeout(e, set)))) {
+			i++;
 			continue;
-		nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
-		if (!nested) {
-			if (i == first) {
-				nla_nest_cancel(skb, atd);
-				ret = -EMSGSIZE;
-				goto out;
-			}
-			goto nla_put_failure;
 		}
+		nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
+		if (!nested)
+			goto nla_put_failure;
 		if (nla_put_string(skb, IPSET_ATTR_NAME,
 				   ip_set_name_byindex(map->net, e->id)))
 			goto nla_put_failure;
 		if (ip_set_put_extensions(skb, set, e, true))
 			goto nla_put_failure;
 		ipset_nest_end(skb, nested);
+		i++;
 	}
 
 	ipset_nest_end(skb, atd);
···
 nla_put_failure:
 	nla_nest_cancel(skb, nested);
 	if (unlikely(i == first)) {
+		nla_nest_cancel(skb, atd);
 		cb->args[IPSET_CB_ARG0] = 0;
 		ret = -EMSGSIZE;
+	} else {
+		cb->args[IPSET_CB_ARG0] = i;
 	}
-	cb->args[IPSET_CB_ARG0] = i - 1;
 	ipset_nest_end(skb, atd);
 out:
 	rcu_read_unlock();
+29 -9
net/netfilter/ipvs/ip_vs_core.c
···
 	switch (cp->protocol) {
 	case IPPROTO_TCP:
 		return (cp->state == IP_VS_TCP_S_TIME_WAIT) ||
+			(cp->state == IP_VS_TCP_S_CLOSE) ||
 			((conn_reuse_mode & 2) &&
 			 (cp->state == IP_VS_TCP_S_FIN_WAIT) &&
 			 (cp->flags & IP_VS_CONN_F_NOOUTPUT));
···
 	cp = pp->conn_in_get(ipvs, af, skb, &iph);
 
 	conn_reuse_mode = sysctl_conn_reuse_mode(ipvs);
-	if (conn_reuse_mode && !iph.fragoffs &&
-	    is_new_conn(skb, &iph) && cp &&
-	    ((unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest &&
-	      unlikely(!atomic_read(&cp->dest->weight))) ||
-	     unlikely(is_new_conn_expected(cp, conn_reuse_mode)))) {
-		if (!atomic_read(&cp->n_control))
-			ip_vs_conn_expire_now(cp);
-		__ip_vs_conn_put(cp);
-		cp = NULL;
+	if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) {
+		bool uses_ct = false, resched = false;
+
+		if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest &&
+		    unlikely(!atomic_read(&cp->dest->weight))) {
+			resched = true;
+			uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
+		} else if (is_new_conn_expected(cp, conn_reuse_mode)) {
+			uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
+			if (!atomic_read(&cp->n_control)) {
+				resched = true;
+			} else {
+				/* Do not reschedule controlling connection
+				 * that uses conntrack while it is still
+				 * referenced by controlled connection(s).
+				 */
+				resched = !uses_ct;
+			}
+		}
+
+		if (resched) {
+			if (!atomic_read(&cp->n_control))
+				ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (uses_ct)
+				return NF_DROP;
+			cp = NULL;
+		}
 	}
 
 	if (unlikely(!cp)) {
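The rescheduling logic in the second hunk reduces to a small decision table over four conditions. The sketch below restates it as a pure function so the cases can be read side by side. The enum, the function and the parameter names are hypothetical; each boolean stands for the corresponding kernel expression named in the comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdicts for an existing connection hit by a new SYN. */
enum resched_verdict {
	KEEP_CONN,    /* keep using the existing connection */
	RESCHED_NOW,  /* forget it and schedule this packet afresh */
	RESCHED_DROP, /* forget it, drop the packet (NF_DROP), wait for the
		       * retransmitted SYN */
};

/* dest_unavailable:  expire_nodest_conn is set and the dest has weight 0
 * new_conn_expected: is_new_conn_expected() returned true
 * has_controlled:    cp->n_control is non-zero
 * uses_ct:           ip_vs_conn_uses_conntrack() returned true */
enum resched_verdict
resched_decision(bool dest_unavailable, bool new_conn_expected,
		 bool has_controlled, bool uses_ct)
{
	bool resched = false, ct = false;

	if (dest_unavailable) {
		resched = true;
		ct = uses_ct;
	} else if (new_conn_expected) {
		ct = uses_ct;
		/* A controlling connection that uses conntrack is kept
		 * while controlled connections still reference it. */
		resched = !has_controlled || !uses_ct;
	}

	if (!resched)
		return KEEP_CONN;
	return ct ? RESCHED_DROP : RESCHED_NOW;
}

/* Walks the cases that the hunk distinguishes. */
void resched_selfcheck(void)
{
	assert(resched_decision(false, false, false, true) == KEEP_CONN);
	assert(resched_decision(true, false, true, true) == RESCHED_DROP);
	assert(resched_decision(false, true, true, true) == KEEP_CONN);
	assert(resched_decision(false, true, false, false) == RESCHED_NOW);
}
```

The drop-and-wait verdict corresponds to item 4 of the commit message: when conntrack is in use, the packet is dropped so that the retransmitted SYN can be scheduled to a new destination server.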
+3 -3
net/netfilter/ipvs/ip_vs_pe_sip.c
···
 	const char *dptr;
 	int retc;
 
-	ip_vs_fill_iph_skb(p->af, skb, false, &iph);
+	retc = ip_vs_fill_iph_skb(p->af, skb, false, &iph);
 
 	/* Only useful with UDP */
-	if (iph.protocol != IPPROTO_UDP)
+	if (!retc || iph.protocol != IPPROTO_UDP)
 		return -EINVAL;
 	/* todo: IPv6 fragments:
 	 * I think this only should be done for the first fragment. /HS
···
 	dptr = skb->data + dataoff;
 	datalen = skb->len - dataoff;
 
-	if (get_callid(dptr, dataoff, datalen, &matchoff, &matchlen))
+	if (get_callid(dptr, 0, datalen, &matchoff, &matchlen))
 		return -EINVAL;
 
 	/* N.B: pe_data is only set on success,
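The Call-ID fix is an offset-accounting issue: `dptr` already points `dataoff` bytes into the packet, so passing `dataoff` again as the search start skips the beginning of the SIP payload. The sketch below reproduces the effect in userspace. `find_from()` is a hypothetical substring search and only approximates what `get_callid()` does with its offset argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the index (relative to buf) of the first
 * occurrence of needle within buf[off..len), or -1 if there is none. */
long find_from(const char *buf, size_t off, size_t len, const char *needle)
{
	size_t n, i;

	assert(buf != NULL && needle != NULL);

	n = strlen(needle);
	if (n == 0 || len < n)
		return -1;
	for (i = off; i + n <= len; i++)
		if (memcmp(buf + i, needle, n) == 0)
			return (long)i;
	return -1;
}
```

With a base pointer that has already been advanced past the transport header, a start offset of 0 finds a header at the very beginning of the payload, while reusing the header length as the offset can miss it:

```c
const char pkt[] = "UDPHDR__Call-ID: a1\r\n"; /* 8 bytes of fake header */
size_t dataoff = 8;
const char *dptr = pkt + dataoff;
size_t datalen = strlen(pkt) - dataoff;

long fixed = find_from(dptr, 0, datalen, "Call-ID");        /* match at 0 */
long broken = find_from(dptr, dataoff, datalen, "Call-ID"); /* no match */
```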
+2 -4
net/netfilter/nf_conntrack_core.c
···
 	spin_lock(lock);
 	while (unlikely(nf_conntrack_locks_all)) {
 		spin_unlock(lock);
-		spin_lock(&nf_conntrack_locks_all_lock);
-		spin_unlock(&nf_conntrack_locks_all_lock);
+		spin_unlock_wait(&nf_conntrack_locks_all_lock);
 		spin_lock(lock);
 	}
 }
···
 	nf_conntrack_locks_all = true;
 
 	for (i = 0; i < CONNTRACK_LOCKS; i++) {
-		spin_lock(&nf_conntrack_locks[i]);
-		spin_unlock(&nf_conntrack_locks[i]);
+		spin_unlock_wait(&nf_conntrack_locks[i]);
 	}
 }
 
+6
net/netfilter/nft_compat.c
···
 	if (IS_ERR(match))
 		return ERR_PTR(-ENOENT);
 
+	if (match->matchsize > nla_len(tb[NFTA_MATCH_INFO]))
+		return ERR_PTR(-EINVAL);
+
 	/* This is the first time we use this match, allocate operations */
 	nft_match = kzalloc(sizeof(struct nft_xt), GFP_KERNEL);
 	if (nft_match == NULL)
···
 	target = xt_request_find_target(family, tg_name, rev);
 	if (IS_ERR(target))
 		return ERR_PTR(-ENOENT);
+
+	if (target->targetsize > nla_len(tb[NFTA_TARGET_INFO]))
+		return ERR_PTR(-EINVAL);
 
 	/* This is the first time we use this target, allocate operations */
 	nft_target = kzalloc(sizeof(struct nft_xt), GFP_KERNEL);
+3
net/netfilter/x_tables.c
···
 	struct xt_table_info *info = NULL;
 	size_t sz = sizeof(*info) + size;
 
+	if (sz < sizeof(*info))
+		return NULL;
+
 	/* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
 	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
 		return NULL;
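The x_tables check detects unsigned wraparound after the fact: if adding the header size to a caller-supplied size yields a sum smaller than the header size, the addition overflowed `size_t`. A minimal userspace sketch of the same test follows. `struct table_info_hdr` and `checked_alloc_size()` are hypothetical stand-ins for `struct xt_table_info` and the allocation path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed part of struct xt_table_info. */
struct table_info_hdr {
	unsigned int size;
	unsigned int number;
};

/* Compute header + payload size, rejecting sums that wrap around.
 * Returns 0 and stores the total in *out, or -1 on overflow. */
int checked_alloc_size(size_t payload, size_t *out)
{
	size_t sz = sizeof(struct table_info_hdr) + payload;

	assert(out != NULL);

	/* Unsigned addition wraps modulo SIZE_MAX + 1, so a wrapped sum
	 * is always smaller than either operand. */
	if (sz < sizeof(struct table_info_hdr))
		return -1;
	*out = sz;
	return 0;
}
```

On 32-bit systems `size_t` is 32 bits wide, so a payload size close to 4 GiB is enough to wrap the sum, which is why the commit message singles those systems out.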
+2 -1
net/openvswitch/Kconfig
···
 	tristate "Open vSwitch"
 	depends on INET
 	depends on !NF_CONNTRACK || \
-		   (NF_CONNTRACK && (!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6))
+		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
+				     (!NF_NAT || NF_NAT)))
 	select LIBCRC32C
 	select MPLS
 	select NET_MPLS_GSO
+622 -38
net/openvswitch/conntrack.c
···
 
 #include <linux/module.h>
 #include <linux/openvswitch.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/sctp.h>
 #include <net/ip.h>
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_labels.h>
+#include <net/netfilter/nf_conntrack_seqadj.h>
 #include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv6/nf_defrag_ipv6.h>
+
+#ifdef CONFIG_NF_NAT_NEEDED
+#include <linux/netfilter/nf_nat.h>
+#include <net/netfilter/nf_nat_core.h>
+#include <net/netfilter/nf_nat_l3proto.h>
+#endif
 
 #include "datapath.h"
 #include "conntrack.h"
···
 #include "flow_netlink.h"
 
 struct ovs_ct_len_tbl {
-	size_t maxlen;
-	size_t minlen;
+	int maxlen;
+	int minlen;
 };
 
 /* Metadata mark for masked write to conntrack mark */
···
 	struct ovs_key_ct_labels mask;
 };
 
+enum ovs_ct_nat {
+	OVS_CT_NAT = 1 << 0,     /* NAT for committed connections only. */
+	OVS_CT_SRC_NAT = 1 << 1, /* Source NAT for NEW connections. */
+	OVS_CT_DST_NAT = 1 << 2, /* Destination NAT for NEW connections. */
+};
+
 /* Conntrack action context for execution. */
 struct ovs_conntrack_info {
 	struct nf_conntrack_helper *helper;
 	struct nf_conntrack_zone zone;
 	struct nf_conn *ct;
 	u8 commit : 1;
+	u8 nat : 3;                 /* enum ovs_ct_nat */
 	u16 family;
 	struct md_mark mark;
 	struct md_labels labels;
+#ifdef CONFIG_NF_NAT_NEEDED
+	struct nf_nat_range range;  /* Only present for SRC NAT and DST NAT. */
+#endif
 };
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
···
 	switch (ctinfo) {
 	case IP_CT_ESTABLISHED_REPLY:
 	case IP_CT_RELATED_REPLY:
-	case IP_CT_NEW_REPLY:
 		ct_state |= OVS_CS_F_REPLY_DIR;
 		break;
 	default:
···
 		ct_state |= OVS_CS_F_RELATED;
 		break;
 	case IP_CT_NEW:
-	case IP_CT_NEW_REPLY:
 		ct_state |= OVS_CS_F_NEW;
 		break;
 	default:
···
 	ovs_ct_get_labels(ct, &key->ct.labels);
 }
 
-/* Update 'key' based on skb->nfct. If 'post_ct' is true, then OVS has
- * previously sent the packet to conntrack via the ct action.
+/* Update 'key' based on skb->nfct.  If 'post_ct' is true, then OVS has
+ * previously sent the packet to conntrack via the ct action.  If
+ * 'keep_nat_flags' is true, the existing NAT flags retained, else they are
+ * initialized from the connection status.
  */
 static void ovs_ct_update_key(const struct sk_buff *skb,
 			      const struct ovs_conntrack_info *info,
-			      struct sw_flow_key *key, bool post_ct)
+			      struct sw_flow_key *key, bool post_ct,
+			      bool keep_nat_flags)
 {
 	const struct nf_conntrack_zone *zone = &nf_ct_zone_dflt;
 	enum ip_conntrack_info ctinfo;
···
 	ct = nf_ct_get(skb, &ctinfo);
 	if (ct) {
 		state = ovs_ct_get_state(ctinfo);
+		/* All unconfirmed entries are NEW connections. */
 		if (!nf_ct_is_confirmed(ct))
 			state |= OVS_CS_F_NEW;
+		/* OVS persists the related flag for the duration of the
+		 * connection.
+		 */
 		if (ct->master)
 			state |= OVS_CS_F_RELATED;
+		if (keep_nat_flags) {
+			state |= key->ct.state & OVS_CS_F_NAT_MASK;
+		} else {
+			if (ct->status & IPS_SRC_NAT)
+				state |= OVS_CS_F_SRC_NAT;
+			if (ct->status & IPS_DST_NAT)
+				state |= OVS_CS_F_DST_NAT;
+		}
 		zone = nf_ct_zone(ct);
 	} else if (post_ct) {
 		state = OVS_CS_F_TRACKED | OVS_CS_F_INVALID;
···
 	__ovs_ct_update_key(key, state, zone, ct);
 }
 
+/* This is called to initialize CT key fields possibly coming in from the local
+ * stack.
+ */
 void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key)
 {
-	ovs_ct_update_key(skb, NULL, key, false);
+	ovs_ct_update_key(skb, NULL, key, false, false);
 }
 
 int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
···
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct;
 	u32 new_mark;
-
 
 	/* The connection could be invalid, in which case set_mark is no-op. */
 	ct = nf_ct_get(skb, &ctinfo);
···
 	enum ip_conntrack_info ctinfo;
 	unsigned int protoff;
 	struct nf_conn *ct;
+	int err;
 
 	ct = nf_ct_get(skb, &ctinfo);
 	if (!ct || ctinfo == IP_CT_RELATED_REPLY)
···
 		return NF_DROP;
 	}
 
-	return helper->help(skb, protoff, ct, ctinfo);
+	err = helper->help(skb, protoff, ct, ctinfo);
+	if (err != NF_ACCEPT)
+		return err;
+
+	/* Adjust seqs after helper.  This is needed due to some helpers (e.g.,
+	 * FTP with NAT) adusting the TCP payload size when mangling IP
+	 * addresses and/or port numbers in the text-based control connection.
+	 */
+	if (test_bit(IPS_SEQ_ADJUST_BIT, &ct->status) &&
+	    !nf_ct_seq_adjust(skb, ct, ctinfo, protoff))
+		return NF_DROP;
+	return NF_ACCEPT;
 }
 
 /* Returns 0 on success, -EINPROGRESS if 'skb' is stolen, or other nonzero
···
 	return __nf_ct_expect_find(net, zone, &tuple);
 }
 
+/* This replicates logic from nf_conntrack_core.c that is not exported. */
+static enum ip_conntrack_info
+ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h)
+{
+	const struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+	if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY)
+		return IP_CT_ESTABLISHED_REPLY;
+	/* Once we've had two way comms, always ESTABLISHED. */
+	if (test_bit(IPS_SEEN_REPLY_BIT, &ct->status))
+		return IP_CT_ESTABLISHED;
+	if (test_bit(IPS_EXPECTED_BIT, &ct->status))
+		return IP_CT_RELATED;
+	return IP_CT_NEW;
+}
+
+/* Find an existing connection which this packet belongs to without
+ * re-attributing statistics or modifying the connection state.  This allows an
+ * skb->nfct lost due to an upcall to be recovered during actions execution.
+ *
+ * Must be called with rcu_read_lock.
+ *
+ * On success, populates skb->nfct and skb->nfctinfo, and returns the
+ * connection.  Returns NULL if there is no existing entry.
+ */
+static struct nf_conn *
+ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone,
+		     u8 l3num, struct sk_buff *skb)
+{
+	struct nf_conntrack_l3proto *l3proto;
+	struct nf_conntrack_l4proto *l4proto;
+	struct nf_conntrack_tuple tuple;
+	struct nf_conntrack_tuple_hash *h;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+	unsigned int dataoff;
+	u8 protonum;
+
+	l3proto = __nf_ct_l3proto_find(l3num);
+	if (!l3proto) {
+		pr_debug("ovs_ct_find_existing: Can't get l3proto\n");
+		return NULL;
+	}
+	if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff,
+				 &protonum) <= 0) {
+		pr_debug("ovs_ct_find_existing: Can't get protonum\n");
+		return NULL;
+	}
+	l4proto = __nf_ct_l4proto_find(l3num, protonum);
+	if (!l4proto) {
+		pr_debug("ovs_ct_find_existing: Can't get l4proto\n");
+		return NULL;
+	}
+	if (!nf_ct_get_tuple(skb, skb_network_offset(skb), dataoff, l3num,
+			     protonum, net, &tuple, l3proto, l4proto)) {
+		pr_debug("ovs_ct_find_existing: Can't get tuple\n");
+		return NULL;
+	}
+
+	/* look for tuple match */
+	h = nf_conntrack_find_get(net, zone, &tuple);
+	if (!h)
+		return NULL;   /* Not found. */
+
+	ct = nf_ct_tuplehash_to_ctrack(h);
+
+	ctinfo = ovs_ct_get_info(h);
+	if (ctinfo == IP_CT_NEW) {
+		/* This should not happen. */
+		WARN_ONCE(1, "ovs_ct_find_existing: new packet for %p\n", ct);
+	}
+	skb->nfct = &ct->ct_general;
+	skb->nfctinfo = ctinfo;
+	return ct;
+}
+
 /* Determine whether skb->nfct is equal to the result of conntrack lookup. */
-static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb,
-			    const struct ovs_conntrack_info *info)
+static bool skb_nfct_cached(struct net *net,
+			    const struct sw_flow_key *key,
+			    const struct ovs_conntrack_info *info,
+			    struct sk_buff *skb)
 {
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct;
 
 	ct = nf_ct_get(skb, &ctinfo);
+	/* If no ct, check if we have evidence that an existing conntrack entry
+	 * might be found for this skb.  This happens when we lose a skb->nfct
+	 * due to an upcall.  If the connection was not confirmed, it is not
+	 * cached and needs to be run through conntrack again.
+	 */
+	if (!ct && key->ct.state & OVS_CS_F_TRACKED &&
+	    !(key->ct.state & OVS_CS_F_INVALID) &&
+	    key->ct.zone == info->zone.id)
+		ct = ovs_ct_find_existing(net, &info->zone, info->family, skb);
 	if (!ct)
 		return false;
 	if (!net_eq(net, read_pnet(&ct->ct_net)))
···
 	return true;
 }
 
+#ifdef CONFIG_NF_NAT_NEEDED
+/* Modelled after nf_nat_ipv[46]_fn().
+ * range is only used for new, uninitialized NAT state.
+ * Returns either NF_ACCEPT or NF_DROP.
+ */
+static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
+			      enum ip_conntrack_info ctinfo,
+			      const struct nf_nat_range *range,
+			      enum nf_nat_manip_type maniptype)
+{
+	int hooknum, nh_off, err = NF_ACCEPT;
+
+	nh_off = skb_network_offset(skb);
+	skb_pull(skb, nh_off);
+
+	/* See HOOK2MANIP(). */
+	if (maniptype == NF_NAT_MANIP_SRC)
+		hooknum = NF_INET_LOCAL_IN; /* Source NAT */
+	else
+		hooknum = NF_INET_LOCAL_OUT; /* Destination NAT */
+
+	switch (ctinfo) {
+	case IP_CT_RELATED:
+	case IP_CT_RELATED_REPLY:
+		if (skb->protocol == htons(ETH_P_IP) &&
+		    ip_hdr(skb)->protocol == IPPROTO_ICMP) {
+			if (!nf_nat_icmp_reply_translation(skb, ct, ctinfo,
+							   hooknum))
+				err = NF_DROP;
+			goto push;
+#if IS_ENABLED(CONFIG_NF_NAT_IPV6)
+		} else if (skb->protocol == htons(ETH_P_IPV6)) {
+			__be16 frag_off;
+			u8 nexthdr = ipv6_hdr(skb)->nexthdr;
+			int hdrlen = ipv6_skip_exthdr(skb,
+						      sizeof(struct ipv6hdr),
+						      &nexthdr, &frag_off);
+
+			if (hdrlen >= 0 && nexthdr == IPPROTO_ICMPV6) {
+				if (!nf_nat_icmpv6_reply_translation(skb, ct,
+								     ctinfo,
+								     hooknum,
+								     hdrlen))
+					err = NF_DROP;
+				goto push;
+			}
+#endif
+		}
+		/* Non-ICMP, fall thru to initialize if needed. */
+	case IP_CT_NEW:
+		/* Seen it before?  This can happen for loopback, retrans,
+		 * or local packets.
+		 */
+		if (!nf_nat_initialized(ct, maniptype)) {
+			/* Initialize according to the NAT action. */
+			err = (range && range->flags & NF_NAT_RANGE_MAP_IPS)
+				/* Action is set up to establish a new
+				 * mapping.
+				 */
+				? nf_nat_setup_info(ct, range, maniptype)
+				: nf_nat_alloc_null_binding(ct, hooknum);
+			if (err != NF_ACCEPT)
+				goto push;
+		}
+		break;
+
+	case IP_CT_ESTABLISHED:
+	case IP_CT_ESTABLISHED_REPLY:
+		break;
+
+	default:
+		err = NF_DROP;
+		goto push;
+	}
+
+	err = nf_nat_packet(ct, ctinfo, hooknum, skb);
+push:
+	skb_push(skb, nh_off);
+
+	return err;
+}
+
+static void ovs_nat_update_key(struct sw_flow_key *key,
+			       const struct sk_buff *skb,
+			       enum nf_nat_manip_type maniptype)
+{
+	if (maniptype == NF_NAT_MANIP_SRC) {
+		__be16 src;
+
+		key->ct.state |= OVS_CS_F_SRC_NAT;
+		if (key->eth.type == htons(ETH_P_IP))
+			key->ipv4.addr.src = ip_hdr(skb)->saddr;
+		else if (key->eth.type == htons(ETH_P_IPV6))
+			memcpy(&key->ipv6.addr.src, &ipv6_hdr(skb)->saddr,
+			       sizeof(key->ipv6.addr.src));
+		else
+			return;
+
+		if (key->ip.proto == IPPROTO_UDP)
+			src = udp_hdr(skb)->source;
+		else if (key->ip.proto == IPPROTO_TCP)
+			src = tcp_hdr(skb)->source;
+		else if (key->ip.proto == IPPROTO_SCTP)
+			src = sctp_hdr(skb)->source;
+		else
+			return;
+
+		key->tp.src = src;
+	} else {
+		__be16 dst;
+
+		key->ct.state |= OVS_CS_F_DST_NAT;
+		if (key->eth.type == htons(ETH_P_IP))
+			key->ipv4.addr.dst = ip_hdr(skb)->daddr;
+		else if (key->eth.type == htons(ETH_P_IPV6))
+			memcpy(&key->ipv6.addr.dst, &ipv6_hdr(skb)->daddr,
+			       sizeof(key->ipv6.addr.dst));
+		else
+			return;
+
+		if (key->ip.proto == IPPROTO_UDP)
+			dst = udp_hdr(skb)->dest;
+		else if (key->ip.proto == IPPROTO_TCP)
+			dst = tcp_hdr(skb)->dest;
+		else if (key->ip.proto == IPPROTO_SCTP)
+			dst = sctp_hdr(skb)->dest;
+		else
+			return;
+
+		key->tp.dst = dst;
+	}
+}
+
+/* Returns NF_DROP if the packet should be dropped, NF_ACCEPT otherwise. */
+static int ovs_ct_nat(struct net *net, struct sw_flow_key *key,
+		      const struct ovs_conntrack_info *info,
+		      struct sk_buff *skb, struct nf_conn *ct,
+		      enum ip_conntrack_info ctinfo)
+{
+	enum nf_nat_manip_type maniptype;
+	int err;
+
+	if (nf_ct_is_untracked(ct)) {
+		/* A NAT action may only be performed on tracked packets. */
+		return NF_ACCEPT;
+	}
+
+	/* Add NAT extension if not confirmed yet. */
+	if (!nf_ct_is_confirmed(ct) && !nf_ct_nat_ext_add(ct))
+		return NF_ACCEPT;   /* Can't NAT. */
+
+	/* Determine NAT type.
+	 * Check if the NAT type can be deduced from the tracked connection.
+	 * Make sure expected traffic is NATted only when committing.
+	 */
+	if (info->nat & OVS_CT_NAT && ctinfo != IP_CT_NEW &&
+	    ct->status & IPS_NAT_MASK &&
+	    (!(ct->status & IPS_EXPECTED_BIT) || info->commit)) {
+		/* NAT an established or related connection like before. */
+		if (CTINFO2DIR(ctinfo) == IP_CT_DIR_REPLY)
+			/* This is the REPLY direction for a connection
+			 * for which NAT was applied in the forward
+			 * direction.  Do the reverse NAT.
+			 */
+			maniptype = ct->status & IPS_SRC_NAT
+				? NF_NAT_MANIP_DST : NF_NAT_MANIP_SRC;
+		else
+			maniptype = ct->status & IPS_SRC_NAT
+				? NF_NAT_MANIP_SRC : NF_NAT_MANIP_DST;
+	} else if (info->nat & OVS_CT_SRC_NAT) {
+		maniptype = NF_NAT_MANIP_SRC;
+	} else if (info->nat & OVS_CT_DST_NAT) {
+		maniptype = NF_NAT_MANIP_DST;
+	} else {
+		return NF_ACCEPT; /* Connection is not NATed. */
+	}
+	err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range, maniptype);
+
+	/* Mark NAT done if successful and update the flow key. */
+	if (err == NF_ACCEPT)
+		ovs_nat_update_key(key, skb, maniptype);
+
+	return err;
+}
+#else /* !CONFIG_NF_NAT_NEEDED */
+static int ovs_ct_nat(struct net *net, struct sw_flow_key *key,
+		      const struct ovs_conntrack_info *info,
+		      struct sk_buff *skb, struct nf_conn *ct,
+		      enum ip_conntrack_info ctinfo)
+{
+	return NF_ACCEPT;
+}
+#endif
+
+/* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if
+ * not done already.  Update key with new CT state after passing the packet
+ * through conntrack.
+ * Note that if the packet is deemed invalid by conntrack, skb->nfct will be
+ * set to NULL and 0 will be returned.
+ */
 static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
 			   const struct ovs_conntrack_info *info,
 			   struct sk_buff *skb)
···
 	 * actually run the packet through conntrack twice unless it's for a
 	 * different zone.
 	 */
-	if (!skb_nfct_cached(net, skb, info)) {
+	bool cached = skb_nfct_cached(net, key, info, skb);
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+
+	if (!cached) {
 		struct nf_conn *tmpl = info->ct;
+		int err;
 
 		/* Associate skb with specified zone. */
 		if (tmpl) {
···
 			skb->nfctinfo = IP_CT_NEW;
 		}
 
-		if (nf_conntrack_in(net, info->family, NF_INET_PRE_ROUTING,
-				    skb) != NF_ACCEPT)
+		/* Repeat if requested, see nf_iterate(). */
+		do {
+			err = nf_conntrack_in(net, info->family,
+					      NF_INET_PRE_ROUTING, skb);
+		} while (err == NF_REPEAT);
+
+		if (err != NF_ACCEPT)
 			return -ENOENT;
 
-		if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
-			WARN_ONCE(1, "helper rejected packet");
+		/* Clear CT state NAT flags to mark that we have not yet done
+		 * NAT after the nf_conntrack_in() call.
We can actually clear 412 + * the whole state, as it will be re-initialized below. 413 + */ 414 + key->ct.state = 0; 415 + 416 + /* Update the key, but keep the NAT flags. */ 417 + ovs_ct_update_key(skb, info, key, true, true); 418 + } 419 + 420 + ct = nf_ct_get(skb, &ctinfo); 421 + if (ct) { 422 + /* Packets starting a new connection must be NATted before the 423 + * helper, so that the helper knows about the NAT. We enforce 424 + * this by delaying both NAT and helper calls for unconfirmed 425 + * connections until the committing CT action. For later 426 + * packets NAT and Helper may be called in either order. 427 + * 428 + * NAT will be done only if the CT action has NAT, and only 429 + * once per packet (per zone), as guarded by the NAT bits in 430 + * the key->ct.state. 431 + */ 432 + if (info->nat && !(key->ct.state & OVS_CS_F_NAT_MASK) && 433 + (nf_ct_is_confirmed(ct) || info->commit) && 434 + ovs_ct_nat(net, key, info, skb, ct, ctinfo) != NF_ACCEPT) { 435 + return -EINVAL; 436 + } 437 + 438 + /* Call the helper only if: 439 + * - nf_conntrack_in() was executed above ("!cached") for a 440 + * confirmed connection, or 441 + * - When committing an unconfirmed connection. 442 + */ 443 + if ((nf_ct_is_confirmed(ct) ? !cached : info->commit) && 444 + ovs_ct_helper(skb, info->family) != NF_ACCEPT) { 746 445 return -EINVAL; 747 446 } 748 447 } 749 - 750 - ovs_ct_update_key(skb, info, key, true); 751 448 752 449 return 0; 753 450 } ··· 795 420 { 796 421 struct nf_conntrack_expect *exp; 797 422 423 + /* If we pass an expected packet through nf_conntrack_in() the 424 + * expectation is typically removed, but the packet could still be 425 + * lost in upcall processing. To prevent this from happening we 426 + * perform an explicit expectation lookup. Expected connections are 427 + * always new, and will be passed through conntrack only when they are 428 + * committed, as it is OK to remove the expectation at that time. 
429 + */ 798 430 exp = ovs_ct_expect_find(net, &info->zone, info->family, skb); 799 431 if (exp) { 800 432 u8 state; 801 433 434 + /* NOTE: New connections are NATted and Helped only when 435 + * committed, so we are not calling into NAT here. 436 + */ 802 437 state = OVS_CS_F_TRACKED | OVS_CS_F_NEW | OVS_CS_F_RELATED; 803 438 __ovs_ct_update_key(key, state, &info->zone, exp->master); 804 - } else { 805 - int err; 806 - 807 - err = __ovs_ct_lookup(net, key, info, skb); 808 - if (err) 809 - return err; 810 - } 439 + } else 440 + return __ovs_ct_lookup(net, key, info, skb); 811 441 812 442 return 0; 813 443 } ··· 822 442 const struct ovs_conntrack_info *info, 823 443 struct sk_buff *skb) 824 444 { 825 - u8 state; 826 445 int err; 827 - 828 - state = key->ct.state; 829 - if (key->ct.zone == info->zone.id && 830 - ((state & OVS_CS_F_TRACKED) && !(state & OVS_CS_F_NEW))) { 831 - /* Previous lookup has shown that this connection is already 832 - * tracked and committed. Skip committing. 833 - */ 834 - return 0; 835 - } 836 446 837 447 err = __ovs_ct_lookup(net, key, info, skb); 838 448 if (err) 839 449 return err; 450 + /* This is a no-op if the connection has already been confirmed. 
*/ 840 451 if (nf_conntrack_confirm(skb) != NF_ACCEPT) 841 452 return -EINVAL; 842 453 ··· 912 541 return 0; 913 542 } 914 543 544 + #ifdef CONFIG_NF_NAT_NEEDED 545 + static int parse_nat(const struct nlattr *attr, 546 + struct ovs_conntrack_info *info, bool log) 547 + { 548 + struct nlattr *a; 549 + int rem; 550 + bool have_ip_max = false; 551 + bool have_proto_max = false; 552 + bool ip_vers = (info->family == NFPROTO_IPV6); 553 + 554 + nla_for_each_nested(a, attr, rem) { 555 + static const int ovs_nat_attr_lens[OVS_NAT_ATTR_MAX + 1][2] = { 556 + [OVS_NAT_ATTR_SRC] = {0, 0}, 557 + [OVS_NAT_ATTR_DST] = {0, 0}, 558 + [OVS_NAT_ATTR_IP_MIN] = {sizeof(struct in_addr), 559 + sizeof(struct in6_addr)}, 560 + [OVS_NAT_ATTR_IP_MAX] = {sizeof(struct in_addr), 561 + sizeof(struct in6_addr)}, 562 + [OVS_NAT_ATTR_PROTO_MIN] = {sizeof(u16), sizeof(u16)}, 563 + [OVS_NAT_ATTR_PROTO_MAX] = {sizeof(u16), sizeof(u16)}, 564 + [OVS_NAT_ATTR_PERSISTENT] = {0, 0}, 565 + [OVS_NAT_ATTR_PROTO_HASH] = {0, 0}, 566 + [OVS_NAT_ATTR_PROTO_RANDOM] = {0, 0}, 567 + }; 568 + int type = nla_type(a); 569 + 570 + if (type > OVS_NAT_ATTR_MAX) { 571 + OVS_NLERR(log, 572 + "Unknown NAT attribute (type=%d, max=%d).\n", 573 + type, OVS_NAT_ATTR_MAX); 574 + return -EINVAL; 575 + } 576 + 577 + if (nla_len(a) != ovs_nat_attr_lens[type][ip_vers]) { 578 + OVS_NLERR(log, 579 + "NAT attribute type %d has unexpected length (%d != %d).\n", 580 + type, nla_len(a), 581 + ovs_nat_attr_lens[type][ip_vers]); 582 + return -EINVAL; 583 + } 584 + 585 + switch (type) { 586 + case OVS_NAT_ATTR_SRC: 587 + case OVS_NAT_ATTR_DST: 588 + if (info->nat) { 589 + OVS_NLERR(log, 590 + "Only one type of NAT may be specified.\n" 591 + ); 592 + return -ERANGE; 593 + } 594 + info->nat |= OVS_CT_NAT; 595 + info->nat |= ((type == OVS_NAT_ATTR_SRC) 596 + ? 
OVS_CT_SRC_NAT : OVS_CT_DST_NAT); 597 + break; 598 + 599 + case OVS_NAT_ATTR_IP_MIN: 600 + nla_memcpy(&info->range.min_addr, a, nla_len(a)); 601 + info->range.flags |= NF_NAT_RANGE_MAP_IPS; 602 + break; 603 + 604 + case OVS_NAT_ATTR_IP_MAX: 605 + have_ip_max = true; 606 + nla_memcpy(&info->range.max_addr, a, 607 + sizeof(info->range.max_addr)); 608 + info->range.flags |= NF_NAT_RANGE_MAP_IPS; 609 + break; 610 + 611 + case OVS_NAT_ATTR_PROTO_MIN: 612 + info->range.min_proto.all = htons(nla_get_u16(a)); 613 + info->range.flags |= NF_NAT_RANGE_PROTO_SPECIFIED; 614 + break; 615 + 616 + case OVS_NAT_ATTR_PROTO_MAX: 617 + have_proto_max = true; 618 + info->range.max_proto.all = htons(nla_get_u16(a)); 619 + info->range.flags |= NF_NAT_RANGE_PROTO_SPECIFIED; 620 + break; 621 + 622 + case OVS_NAT_ATTR_PERSISTENT: 623 + info->range.flags |= NF_NAT_RANGE_PERSISTENT; 624 + break; 625 + 626 + case OVS_NAT_ATTR_PROTO_HASH: 627 + info->range.flags |= NF_NAT_RANGE_PROTO_RANDOM; 628 + break; 629 + 630 + case OVS_NAT_ATTR_PROTO_RANDOM: 631 + info->range.flags |= NF_NAT_RANGE_PROTO_RANDOM_FULLY; 632 + break; 633 + 634 + default: 635 + OVS_NLERR(log, "Unknown nat attribute (%d).\n", type); 636 + return -EINVAL; 637 + } 638 + } 639 + 640 + if (rem > 0) { 641 + OVS_NLERR(log, "NAT attribute has %d unknown bytes.\n", rem); 642 + return -EINVAL; 643 + } 644 + if (!info->nat) { 645 + /* Do not allow flags if no type is given. */ 646 + if (info->range.flags) { 647 + OVS_NLERR(log, 648 + "NAT flags may be given only when NAT range (SRC or DST) is also specified.\n" 649 + ); 650 + return -EINVAL; 651 + } 652 + info->nat = OVS_CT_NAT; /* NAT existing connections. */ 653 + } else if (!info->commit) { 654 + OVS_NLERR(log, 655 + "NAT attributes may be specified only when CT COMMIT flag is also specified.\n" 656 + ); 657 + return -EINVAL; 658 + } 659 + /* Allow missing IP_MAX. 
*/ 660 + if (info->range.flags & NF_NAT_RANGE_MAP_IPS && !have_ip_max) { 661 + memcpy(&info->range.max_addr, &info->range.min_addr, 662 + sizeof(info->range.max_addr)); 663 + } 664 + /* Allow missing PROTO_MAX. */ 665 + if (info->range.flags & NF_NAT_RANGE_PROTO_SPECIFIED && 666 + !have_proto_max) { 667 + info->range.max_proto.all = info->range.min_proto.all; 668 + } 669 + return 0; 670 + } 671 + #endif 672 + 915 673 static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = { 916 674 [OVS_CT_ATTR_COMMIT] = { .minlen = 0, .maxlen = 0 }, 917 675 [OVS_CT_ATTR_ZONE] = { .minlen = sizeof(u16), ··· 1050 550 [OVS_CT_ATTR_LABELS] = { .minlen = sizeof(struct md_labels), 1051 551 .maxlen = sizeof(struct md_labels) }, 1052 552 [OVS_CT_ATTR_HELPER] = { .minlen = 1, 1053 - .maxlen = NF_CT_HELPER_NAME_LEN } 553 + .maxlen = NF_CT_HELPER_NAME_LEN }, 554 + #ifdef CONFIG_NF_NAT_NEEDED 555 + /* NAT length is checked when parsing the nested attributes. */ 556 + [OVS_CT_ATTR_NAT] = { .minlen = 0, .maxlen = INT_MAX }, 557 + #endif 1054 558 }; 1055 559 1056 560 static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, ··· 1121 617 return -EINVAL; 1122 618 } 1123 619 break; 620 + #ifdef CONFIG_NF_NAT_NEEDED 621 + case OVS_CT_ATTR_NAT: { 622 + int err = parse_nat(a, info, log); 623 + 624 + if (err) 625 + return err; 626 + break; 627 + } 628 + #endif 1124 629 default: 1125 630 OVS_NLERR(log, "Unknown conntrack attr (%d)", 1126 631 type); ··· 1217 704 return err; 1218 705 } 1219 706 707 + #ifdef CONFIG_NF_NAT_NEEDED 708 + static bool ovs_ct_nat_to_attr(const struct ovs_conntrack_info *info, 709 + struct sk_buff *skb) 710 + { 711 + struct nlattr *start; 712 + 713 + start = nla_nest_start(skb, OVS_CT_ATTR_NAT); 714 + if (!start) 715 + return false; 716 + 717 + if (info->nat & OVS_CT_SRC_NAT) { 718 + if (nla_put_flag(skb, OVS_NAT_ATTR_SRC)) 719 + return false; 720 + } else if (info->nat & OVS_CT_DST_NAT) { 721 + if (nla_put_flag(skb, OVS_NAT_ATTR_DST)) 722 
+ return false; 723 + } else { 724 + goto out; 725 + } 726 + 727 + if (info->range.flags & NF_NAT_RANGE_MAP_IPS) { 728 + if (info->family == NFPROTO_IPV4) { 729 + if (nla_put_in_addr(skb, OVS_NAT_ATTR_IP_MIN, 730 + info->range.min_addr.ip) || 731 + (info->range.max_addr.ip 732 + != info->range.min_addr.ip && 733 + (nla_put_in_addr(skb, OVS_NAT_ATTR_IP_MAX, 734 + info->range.max_addr.ip)))) 735 + return false; 736 + #if IS_ENABLED(CONFIG_NF_NAT_IPV6) 737 + } else if (info->family == NFPROTO_IPV6) { 738 + if (nla_put_in6_addr(skb, OVS_NAT_ATTR_IP_MIN, 739 + &info->range.min_addr.in6) || 740 + (memcmp(&info->range.max_addr.in6, 741 + &info->range.min_addr.in6, 742 + sizeof(info->range.max_addr.in6)) && 743 + (nla_put_in6_addr(skb, OVS_NAT_ATTR_IP_MAX, 744 + &info->range.max_addr.in6)))) 745 + return false; 746 + #endif 747 + } else { 748 + return false; 749 + } 750 + } 751 + if (info->range.flags & NF_NAT_RANGE_PROTO_SPECIFIED && 752 + (nla_put_u16(skb, OVS_NAT_ATTR_PROTO_MIN, 753 + ntohs(info->range.min_proto.all)) || 754 + (info->range.max_proto.all != info->range.min_proto.all && 755 + nla_put_u16(skb, OVS_NAT_ATTR_PROTO_MAX, 756 + ntohs(info->range.max_proto.all))))) 757 + return false; 758 + 759 + if (info->range.flags & NF_NAT_RANGE_PERSISTENT && 760 + nla_put_flag(skb, OVS_NAT_ATTR_PERSISTENT)) 761 + return false; 762 + if (info->range.flags & NF_NAT_RANGE_PROTO_RANDOM && 763 + nla_put_flag(skb, OVS_NAT_ATTR_PROTO_HASH)) 764 + return false; 765 + if (info->range.flags & NF_NAT_RANGE_PROTO_RANDOM_FULLY && 766 + nla_put_flag(skb, OVS_NAT_ATTR_PROTO_RANDOM)) 767 + return false; 768 + out: 769 + nla_nest_end(skb, start); 770 + 771 + return true; 772 + } 773 + #endif 774 + 1220 775 int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info, 1221 776 struct sk_buff *skb) 1222 777 { ··· 1313 732 ct_info->helper->name)) 1314 733 return -EMSGSIZE; 1315 734 } 1316 - 735 + #ifdef CONFIG_NF_NAT_NEEDED 736 + if (ct_info->nat && !ovs_ct_nat_to_attr(ct_info, skb)) 737 
+ return -EMSGSIZE; 738 + #endif 1317 739 nla_nest_end(skb, start); 1318 740 1319 741 return 0;
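The "Determine NAT type" step in ovs_ct_nat() picks the manipulation type from the connection status and the packet direction: in the reply direction, the translation applied in the forward direction has to be reversed. The following is a minimal userspace sketch of that selection only. The names manip_type, STATUS_SRC_NAT and pick_manip are hypothetical stand-ins invented for illustration, not kernel identifiers, and the bit value is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's manip types and status bit. */
enum manip_type { MANIP_SRC, MANIP_DST };

#define STATUS_SRC_NAT 0x1u /* assumed bit: forward direction used source NAT */

/* Choose the manipulation for a packet of an already-NATted connection.
 * A reply packet must undo the forward translation, so a source-NATted
 * connection needs destination manipulation on replies, and vice versa.
 */
static enum manip_type pick_manip(unsigned int status, bool is_reply)
{
	bool src_nat = (status & STATUS_SRC_NAT) != 0;

	if (is_reply)
		return src_nat ? MANIP_DST : MANIP_SRC;
	return src_nat ? MANIP_SRC : MANIP_DST;
}
```

As in the patch, a connection without the source-NAT bit is treated as destination-NATted, which is only meaningful once the caller has already checked that some NAT bit is set.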
+2 -1
net/openvswitch/conntrack.h
···
 
 #define CT_SUPPORTED_MASK (OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | \
 			   OVS_CS_F_RELATED | OVS_CS_F_REPLY_DIR | \
-			   OVS_CS_F_INVALID | OVS_CS_F_TRACKED)
+			   OVS_CS_F_INVALID | OVS_CS_F_TRACKED | \
+			   OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT)
 #else
 #include <linux/errno.h>
 
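To illustrate what widening CT_SUPPORTED_MASK changes, here is a small userspace sketch of a supported-bits check, assuming the mask is used to reject state values that carry bits outside it. The flag values, the OLD_MASK/NEW_MASK names and the state_supported() helper are assumptions made for this example and are not taken from the patch.

```c
#include <stdbool.h>

/* Illustrative flag values, assumed for this sketch only. */
#define CS_F_NEW         0x01u
#define CS_F_ESTABLISHED 0x02u
#define CS_F_RELATED     0x04u
#define CS_F_REPLY_DIR   0x08u
#define CS_F_INVALID     0x10u
#define CS_F_TRACKED     0x20u
#define CS_F_SRC_NAT     0x40u
#define CS_F_DST_NAT     0x80u

/* Mask before the change: no NAT bits. */
#define OLD_MASK (CS_F_NEW | CS_F_ESTABLISHED | CS_F_RELATED | \
		  CS_F_REPLY_DIR | CS_F_INVALID | CS_F_TRACKED)
/* Mask after the change: the two NAT bits are accepted as well. */
#define NEW_MASK (OLD_MASK | CS_F_SRC_NAT | CS_F_DST_NAT)

/* A state value is supported when it has no bits outside the mask. */
static bool state_supported(unsigned int state, unsigned int mask)
{
	return (state & ~mask) == 0;
}
```

With these assumed values, a state of CS_F_TRACKED | CS_F_SRC_NAT fails the check against OLD_MASK and passes against NEW_MASK.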