Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'Handle-multiple-received-packets-at-each-stage'

Edward Cree says:

====================
Handle multiple received packets at each stage

This patch series adds the capability for the network stack to receive a
list of packets and process them as a unit, rather than handling each
packet singly in sequence. This is done by factoring out the existing
datapath code at each layer and wrapping it in list handling code.

The motivation for this change is twofold:
* Instruction cache locality. Currently, running the entire network
stack receive path on a packet involves more code than will fit in the
lowest-level icache, meaning that when the next packet is handled, the
code has to be reloaded from more distant caches. By handling packets
in "row-major order", we ensure that the code at each layer is hot for
most of the list. (There is a corresponding downside in _data_ cache
locality, since we are now touching every packet at every layer, but in
practice there is easily enough room in dcache to hold one cacheline of
each of the 64 packets in a NAPI poll.)
* Reduction of indirect calls. Owing to Spectre mitigations, indirect
function calls are now more expensive than ever; they are also heavily
used in the network stack's architecture (see [1]). By replacing 64
indirect calls to the next-layer per-packet function with a single
indirect call to the next-layer list function, we can save CPU cycles.

Drivers pass an SKB list to the stack at the end of the NAPI poll; this
gives a natural batch size (the NAPI poll weight) and avoids waiting at
the software level for further packets to make a larger batch (which
would add latency). It also means that the batch size is automatically
tuned by the existing interrupt moderation mechanism.
The stack then runs each layer of processing over all the packets in the
list before proceeding to the next layer. Where the 'next layer' (or
the context in which it must run) differs among the packets, the stack
splits the list; this 'late demux' means that packets which differ only
in later headers (e.g. same L2/L3 but different L4) can traverse the
early part of the stack together.
Also, where the next layer is not (yet) list-aware, the stack can revert
to calling the rest of the stack in a loop; this allows gradual/creeping
listification, with no 'flag day' patch needed to listify everything.

Patches 1-2 simply place received packets on a list during the event
processing loop on the sfc EF10 architecture, then call the normal stack
for each packet singly at the end of the NAPI poll. (Analogues of patch
#2 for other NIC drivers should be fairly straightforward.)
Patches 3-9 extend the list processing as far as the IP receive handler.

Patches 1-2 alone give about a 10% improvement in packet rate in the
baseline test; adding patches 3-9 raises this to around 25%.

Performance measurements were made with netperf UDP_STREAM, using 1-byte
packets and a single core to handle interrupts on the RX side; this was
in order to measure, as simply as possible, the packet rate handled by a
single core. Figures are in Mbit/s; since each packet carries a single
byte (eight bits) of payload, divide by 8 to obtain Mpps. The setup was
tuned for maximum reproducibility, rather than raw performance.
Full details and more results (both with and without retpolines) from a
previous version of the patch series are presented in [2].

The baseline test uses four streams, and multiple RXQs all bound to a
single CPU (the netperf binary is bound to a neighbouring CPU). These
tests were run with retpolines.
net-next: 6.91 Mb/s (datum)
after 9: 8.46 Mb/s (+22.5%)
Note however that these results are not robust; changes in the parameters
of the test sometimes shrink the gain to single-digit percentages. For
instance, when using only a single RXQ, only a 4% gain was seen.

One test variation was the use of software filtering/firewall rules.
Adding a single iptables rule (UDP port drop on a port range not matching
the test traffic), thus making the netfilter hook have work to do,
reduced baseline performance but showed a similar gain from the patches:
net-next: 5.02 Mb/s (datum)
after 9: 6.78 Mb/s (+35.1%)

Similarly, testing with a set of TC flower filters (kindly supplied by
Cong Wang) gave the following:
net-next: 6.83 Mb/s (datum)
after 9: 8.86 Mb/s (+29.7%)

These data suggest that the batching approach remains effective in the
presence of software switching rules, and perhaps even improves the
performance of those rules by allowing them and their codepaths to stay
in cache between packets.

Changes from v3:
* Fixed build error when CONFIG_NETFILTER=n (thanks kbuild).

Changes from v2:
* Used standard list handling (and skb->list) instead of the skb-queue
functions (that use skb->next, skb->prev).
- As part of this, changed from a "dequeue, process, enqueue" model to
using list_for_each_safe, list_del, and (new) list_cut_before.
* Altered __netif_receive_skb_core() changes in patch 6 as per Willem de
Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming).
* Removed patches to Generic XDP, since they were producing no benefit.
I may revisit them later.
* Removed RFC tags.

Changes from v1:
* Rebased across 2 years' net-next movement (surprisingly straightforward).
- Added Generic XDP handling to netif_receive_skb_list_internal()
- Dealt with changes to PFMEMALLOC setting APIs
* General cleanup of code and comments.
* Skipped function calls for empty lists at various points in the stack
(patch #9).
* Added listified Generic XDP handling (patches 10-12), though it doesn't
seem to help (see above).
* Extended testing to cover software firewalls / netfilter etc.

[1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
[2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

+360 -16
drivers/net/ethernet/sfc/efx.c (+12)

···
 static int efx_process_channel(struct efx_channel *channel, int budget)
 {
         struct efx_tx_queue *tx_queue;
+        struct list_head rx_list;
         int spent;
 
         if (unlikely(!channel->enabled))
                 return 0;
+
+        /* Prepare the batch receive list */
+        EFX_WARN_ON_PARANOID(channel->rx_list != NULL);
+        INIT_LIST_HEAD(&rx_list);
+        channel->rx_list = &rx_list;
 
         efx_for_each_channel_tx_queue(tx_queue, channel) {
                 tx_queue->pkts_compl = 0;
···
                                 tx_queue->pkts_compl, tx_queue->bytes_compl);
                 }
         }
+
+        /* Receive any packets we queued up */
+        netif_receive_skb_list(channel->rx_list);
+        channel->rx_list = NULL;
 
         return spent;
 }
···
                 if (rc)
                         goto fail;
         }
+
+        channel->rx_list = NULL;
 
         return 0;
drivers/net/ethernet/sfc/net_driver.h (+3)

···
  * __efx_rx_packet(), or zero if there is none
  * @rx_pkt_index: Ring index of first buffer for next packet to be delivered
  *      by __efx_rx_packet(), if @rx_pkt_n_frags != 0
+ * @rx_list: list of SKBs from current RX, awaiting processing
  * @rx_queue: RX queue for this channel
  * @tx_queue: TX queues for this channel
  * @sync_events_state: Current state of sync events on this channel
···
         unsigned int rx_pkt_n_frags;
         unsigned int rx_pkt_index;
+
+        struct list_head *rx_list;
 
         struct efx_rx_queue rx_queue;
         struct efx_tx_queue tx_queue[EFX_TXQ_TYPES];
drivers/net/ethernet/sfc/rx.c (+6 -1)

···
                 return;
 
         /* Pass the packet up */
-        netif_receive_skb(skb);
+        if (channel->rx_list != NULL)
+                /* Add to list, will pass up later */
+                list_add_tail(&skb->list, channel->rx_list);
+        else
+                /* No list, so pass it up now */
+                netif_receive_skb(skb);
 }
 
 /* Handle a received packet.  Second half: Touches packet payload. */
include/linux/list.h (+30)

···
         __list_cut_position(list, head, entry);
 }
 
+/**
+ * list_cut_before - cut a list into two, before given entry
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ *
+ * This helper moves the initial part of @head, up to but
+ * excluding @entry, from @head to @list.  You should pass
+ * in @entry an element you know is on @head.  @list should
+ * be an empty list or a list you do not care about losing
+ * its data.
+ * If @entry == @head, all entries on @head are moved to
+ * @list.
+ */
+static inline void list_cut_before(struct list_head *list,
+                                   struct list_head *head,
+                                   struct list_head *entry)
+{
+        if (head->next == entry) {
+                INIT_LIST_HEAD(list);
+                return;
+        }
+        list->next = head->next;
+        list->next->prev = list;
+        list->prev = entry->prev;
+        list->prev->next = list;
+        head->next = entry;
+        entry->prev = head;
+}
+
 static inline void __list_splice(const struct list_head *list,
                                  struct list_head *prev,
                                  struct list_head *next)
include/linux/netdevice.h (+4)

···
                                          struct net_device *,
                                          struct packet_type *,
                                          struct net_device *);
+        void                    (*list_func) (struct list_head *,
+                                              struct packet_type *,
+                                              struct net_device *);
         bool                    (*id_match)(struct packet_type *ptype,
                                             struct sock *sk);
         void                    *af_packet_priv;
···
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
+void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
include/linux/netfilter.h (+22)

···
         return ret;
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+             struct list_head *head, struct net_device *in, struct net_device *out,
+             int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+        struct sk_buff *skb, *next;
+
+        list_for_each_entry_safe(skb, next, head, list) {
+                int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
+                if (ret != 1)
+                        list_del(&skb->list);
+        }
+}
+
 /* Call setsockopt() */
 int nf_setsockopt(struct sock *sk, u_int8_t pf, int optval, char __user *opt,
                   unsigned int len);
···
              int (*okfn)(struct net *, struct sock *, struct sk_buff *))
 {
         return okfn(net, sk, skb);
+}
+
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+             struct list_head *head, struct net_device *in, struct net_device *out,
+             int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+        /* nothing to do */
 }
 
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
include/net/ip.h (+2)

···
                          struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
            struct net_device *orig_dev);
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+                 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
include/trace/events/net.h (+7)

···
         TP_ARGS(skb)
 );
 
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_list_entry,
+
+        TP_PROTO(const struct sk_buff *skb),
+
+        TP_ARGS(skb)
+);
+
 DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_entry,
 
         TP_PROTO(const struct sk_buff *skb),
net/core/dev.c (+168 -6)

···
         return 0;
 }
 
-static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
+                                    struct packet_type **ppt_prev)
 {
         struct packet_type *ptype, *pt_prev;
         rx_handler_func_t *rx_handler;
···
         if (pt_prev) {
                 if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
                         goto drop;
-                else
-                        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+                *ppt_prev = pt_prev;
         } else {
 drop:
                 if (!deliver_exact)
···
         }
 
 out:
+        return ret;
+}
+
+static int __netif_receive_skb_one_core(struct sk_buff *skb, bool pfmemalloc)
+{
+        struct net_device *orig_dev = skb->dev;
+        struct packet_type *pt_prev = NULL;
+        int ret;
+
+        ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+        if (pt_prev)
+                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
         return ret;
 }
···
         int ret;
 
         rcu_read_lock();
-        ret = __netif_receive_skb_core(skb, false);
+        ret = __netif_receive_skb_one_core(skb, false);
         rcu_read_unlock();
 
         return ret;
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
+
+static inline void __netif_receive_skb_list_ptype(struct list_head *head,
+                                                  struct packet_type *pt_prev,
+                                                  struct net_device *orig_dev)
+{
+        struct sk_buff *skb, *next;
+
+        if (!pt_prev)
+                return;
+        if (list_empty(head))
+                return;
+        if (pt_prev->list_func != NULL)
+                pt_prev->list_func(head, pt_prev, orig_dev);
+        else
+                list_for_each_entry_safe(skb, next, head, list)
+                        pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+}
+
+static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+{
+        /* Fast-path assumptions:
+         * - There is no RX handler.
+         * - Only one packet_type matches.
+         * If either of these fails, we will end up doing some per-packet
+         * processing in-line, then handling the 'last ptype' for the whole
+         * sublist.  This can't cause out-of-order delivery to any single
+         * ptype, because the 'last ptype' must be constant across the
+         * sublist, and all other ptypes are handled per-packet.
+         */
+        /* Current (common) ptype of sublist */
+        struct packet_type *pt_curr = NULL;
+        /* Current (common) orig_dev of sublist */
+        struct net_device *od_curr = NULL;
+        struct list_head sublist;
+        struct sk_buff *skb, *next;
+
+        list_for_each_entry_safe(skb, next, head, list) {
+                struct net_device *orig_dev = skb->dev;
+                struct packet_type *pt_prev = NULL;
+
+                __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+                if (pt_curr != pt_prev || od_curr != orig_dev) {
+                        /* dispatch old sublist */
+                        list_cut_before(&sublist, head, &skb->list);
+                        __netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
+                        /* start new sublist */
+                        pt_curr = pt_prev;
+                        od_curr = orig_dev;
+                }
+        }
+
+        /* dispatch final sublist */
+        __netif_receive_skb_list_ptype(head, pt_curr, od_curr);
+}
 
 static int __netif_receive_skb(struct sk_buff *skb)
 {
···
                  * context down to all allocation sites.
                  */
                 noreclaim_flag = memalloc_noreclaim_save();
-                ret = __netif_receive_skb_core(skb, true);
+                ret = __netif_receive_skb_one_core(skb, true);
                 memalloc_noreclaim_restore(noreclaim_flag);
         } else
-                ret = __netif_receive_skb_core(skb, false);
+                ret = __netif_receive_skb_one_core(skb, false);
 
         return ret;
+}
+
+static void __netif_receive_skb_list(struct list_head *head)
+{
+        unsigned long noreclaim_flag = 0;
+        struct sk_buff *skb, *next;
+        bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
+
+        list_for_each_entry_safe(skb, next, head, list) {
+                if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
+                        struct list_head sublist;
+
+                        /* Handle the previous sublist */
+                        list_cut_before(&sublist, head, &skb->list);
+                        if (!list_empty(&sublist))
+                                __netif_receive_skb_list_core(&sublist, pfmemalloc);
+                        pfmemalloc = !pfmemalloc;
+                        /* See comments in __netif_receive_skb */
+                        if (pfmemalloc)
+                                noreclaim_flag = memalloc_noreclaim_save();
+                        else
+                                memalloc_noreclaim_restore(noreclaim_flag);
+                }
+        }
+        /* Handle the remaining sublist */
+        if (!list_empty(head))
+                __netif_receive_skb_list_core(head, pfmemalloc);
+        /* Restore pflags */
+        if (pfmemalloc)
+                memalloc_noreclaim_restore(noreclaim_flag);
 }
 
 static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
···
         return ret;
 }
 
+static void netif_receive_skb_list_internal(struct list_head *head)
+{
+        struct bpf_prog *xdp_prog = NULL;
+        struct sk_buff *skb, *next;
+
+        list_for_each_entry_safe(skb, next, head, list) {
+                net_timestamp_check(netdev_tstamp_prequeue, skb);
+                if (skb_defer_rx_timestamp(skb))
+                        /* Handled, remove from list */
+                        list_del(&skb->list);
+        }
+
+        if (static_branch_unlikely(&generic_xdp_needed_key)) {
+                preempt_disable();
+                rcu_read_lock();
+                list_for_each_entry_safe(skb, next, head, list) {
+                        xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+                        if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
+                                /* Dropped, remove from list */
+                                list_del(&skb->list);
+                }
+                rcu_read_unlock();
+                preempt_enable();
+        }
+
+        rcu_read_lock();
+#ifdef CONFIG_RPS
+        if (static_key_false(&rps_needed)) {
+                list_for_each_entry_safe(skb, next, head, list) {
+                        struct rps_dev_flow voidflow, *rflow = &voidflow;
+                        int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+
+                        if (cpu >= 0) {
+                                enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+                                /* Handled, remove from list */
+                                list_del(&skb->list);
+                        }
+                }
+        }
+#endif
+        __netif_receive_skb_list(head);
+        rcu_read_unlock();
+}
+
 /**
  *      netif_receive_skb - process receive buffer from network
  *      @skb: buffer to process
···
         return netif_receive_skb_internal(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb);
+
+/**
+ *      netif_receive_skb_list - process many receive buffers from network
+ *      @head: list of skbs to process.
+ *
+ *      Since return value of netif_receive_skb() is normally ignored, and
+ *      wouldn't be meaningful for a list, this function returns void.
+ *
+ *      This function may only be called from softirq context and interrupts
+ *      should be enabled.
+ */
+void netif_receive_skb_list(struct list_head *head)
+{
+        struct sk_buff *skb;
+
+        if (list_empty(head))
+                return;
+        list_for_each_entry(skb, head, list)
+                trace_netif_receive_skb_list_entry(skb);
+        netif_receive_skb_list_internal(head);
+}
+EXPORT_SYMBOL(netif_receive_skb_list);
 
 DEFINE_PER_CPU(struct work_struct, flush_works);
net/ipv4/af_inet.c (+1)

···
 static struct packet_type ip_packet_type __read_mostly = {
         .type = cpu_to_be16(ETH_P_IP),
         .func = ip_rcv,
+        .list_func = ip_list_rcv,
 };
 
 static int __init inet_init(void)
net/ipv4/ip_input.c (+105 -9)

···
         return true;
 }
 
-static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int ip_rcv_finish_core(struct net *net, struct sock *sk,
+                              struct sk_buff *skb)
 {
         const struct iphdr *iph = ip_hdr(skb);
         int (*edemux)(struct sk_buff *skb);
···
                 goto drop;
         }
 
-        return dst_input(skb);
+        return NET_RX_SUCCESS;
 
 drop:
         kfree_skb(skb);
···
         goto drop;
 }
 
+static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+        int ret = ip_rcv_finish_core(net, sk, skb);
+
+        if (ret != NET_RX_DROP)
+                ret = dst_input(skb);
+        return ret;
+}
+
 /*
  *      Main IP Receive routine.
  */
-int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 {
         const struct iphdr *iph;
-        struct net *net;
         u32 len;
 
         /* When the interface is in promisc. mode, drop all the crap
···
                 goto drop;
 
 
-        net = dev_net(dev);
         __IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
 
         skb = skb_share_check(skb, GFP_ATOMIC);
···
         /* Must drop socket now because of tproxy.
          */
         skb_orphan(skb);
 
-        return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
-                       net, NULL, skb, dev, NULL,
-                       ip_rcv_finish);
+        return skb;
 
 csum_error:
         __IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
···
 drop:
         kfree_skb(skb);
 out:
-        return NET_RX_DROP;
+        return NULL;
+}
+
+/*
+ * IP receive entry point
+ */
+int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
+           struct net_device *orig_dev)
+{
+        struct net *net = dev_net(dev);
+
+        skb = ip_rcv_core(skb, net);
+        if (skb == NULL)
+                return NET_RX_DROP;
+        return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+                       net, NULL, skb, dev, NULL,
+                       ip_rcv_finish);
+}
+
+static void ip_sublist_rcv_finish(struct list_head *head)
+{
+        struct sk_buff *skb, *next;
+
+        list_for_each_entry_safe(skb, next, head, list)
+                dst_input(skb);
+}
+
+static void ip_list_rcv_finish(struct net *net, struct sock *sk,
+                               struct list_head *head)
+{
+        struct dst_entry *curr_dst = NULL;
+        struct sk_buff *skb, *next;
+        struct list_head sublist;
+
+        list_for_each_entry_safe(skb, next, head, list) {
+                struct dst_entry *dst;
+
+                if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
+                        continue;
+
+                dst = skb_dst(skb);
+                if (curr_dst != dst) {
+                        /* dispatch old sublist */
+                        list_cut_before(&sublist, head, &skb->list);
+                        if (!list_empty(&sublist))
+                                ip_sublist_rcv_finish(&sublist);
+                        /* start new sublist */
+                        curr_dst = dst;
+                }
+        }
+        /* dispatch final sublist */
+        ip_sublist_rcv_finish(head);
+}
+
+static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+                           struct net *net)
+{
+        NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
+                     head, dev, NULL, ip_rcv_finish);
+        ip_list_rcv_finish(net, NULL, head);
+}
+
+/* Receive a list of IP packets */
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+                 struct net_device *orig_dev)
+{
+        struct net_device *curr_dev = NULL;
+        struct net *curr_net = NULL;
+        struct sk_buff *skb, *next;
+        struct list_head sublist;
+
+        list_for_each_entry_safe(skb, next, head, list) {
+                struct net_device *dev = skb->dev;
+                struct net *net = dev_net(dev);
+
+                skb = ip_rcv_core(skb, net);
+                if (skb == NULL)
+                        continue;
+
+                if (curr_dev != dev || curr_net != net) {
+                        /* dispatch old sublist */
+                        list_cut_before(&sublist, head, &skb->list);
+                        if (!list_empty(&sublist))
+                                ip_sublist_rcv(&sublist, dev, net);
+                        /* start new sublist */
+                        curr_dev = dev;
+                        curr_net = net;
+                }
+        }
+        /* dispatch final sublist */
+        ip_sublist_rcv(head, curr_dev, curr_net);
 }