Part 1 | Part 2 | Part 3 | Part 4 | Part 5
Overview
This post picks up where part 3 left off, beginning with a description of Receive Packet Steering (RPS): what it is and how to configure it. That is followed by a walk through the network stack describing how packets are processed according to the RPS settings, the packet backlog queue, the beginning of the IP layer, and netfilter.
Receive Packet Steering
We have seen that device drivers register NAPI poll instances. Each NAPI poller instance is executed in the context of a kernel thread called a softirq, of which there is one per CPU. The softirq for the CPU that the hardware interrupt handler runs on is woken up / used in the hardware interrupt handler.
Thus, a single CPU processes the hardware interrupt and polls the network layer to process the incoming data.
Some NICs support multiple queues at the hardware level. This means that incoming packets can be DMA'd to separate receive rings, each ring having its own hardware interrupt that is delivered to indicate data is available. Each of these hardware interrupts would schedule NAPI poll instances to run on the associated CPUs.
This allows multiple processors to handle hardware interrupts and poll the network layer.
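As a quick sanity check, you can see whether your NIC exposes multiple receive queues, and which CPUs their interrupts land on, by reading /proc/interrupts. The interface name eth0 below is just an example; your driver's IRQ naming may differ:

egrep 'CPU|eth0' /proc/interrupts

Each receive queue's IRQ appears on its own line with one counter column per CPU, so you can see at a glance how interrupt handling is distributed.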
Receive Packet Steering (RPS) is a software implementation of what hardware-enabled multi-queue NICs provide. It allows multiple CPUs to process incoming packets even if the NIC only supports a single receive queue in hardware.
RPS works by generating a hash for incoming data to determine which CPU should process it. The data is then queued on that CPU's per-CPU receive network backlog to be processed. An Inter-processor Interrupt (IPI) is delivered to the CPU that owns the backlog. This triggers the remote CPU to start processing the backlog if it is not currently processing packets.
`netif_receive_skb` will either continue sending the data up the network stack, or hand it over to RPS for processing on a different CPU.
Enabling RPS
For RPS to work, it must be enabled in the kernel configuration (CONFIG_RPS; it is enabled on Ubuntu's Linux 3.13.0 kernel), and a bitmask describing which CPUs should process packets for a given interface and RX queue must be set.
The bitmasks to modify live in /sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus.
So, for eth0 and receive queue 0, you would modify /sys/class/net/eth0/queues/rx-0/rps_cpus with a hexadecimal number indicating which CPUs should process packets from eth0's receive queue 0.
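For example, a minimal sketch that lets CPUs 0-3 process packets arriving on eth0's receive queue 0 (eth0 is just an example device; the hexadecimal value f sets the bits for the first four CPUs):

echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus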
Back to `netif_receive_skb`.
netif_receive_skb
As a reminder, `netif_receive_skb` is called from `napi_skb_finish` in the softirq context of the NAPI poller that the device driver registered.
`netif_receive_skb` will either attempt to use RPS (as described above) or continue passing the data up the network stack.
Let's examine the second path first: passing data up the stack when RPS is disabled.
netif_receive_skb without RPS
`netif_receive_skb` calls `__netif_receive_skb`, which does some bookkeeping before calling `__netif_receive_skb_core` to move the data along up the network stack toward the protocol layers.
__netif_receive_skb_core
This function hands the skb off to the protocol layers in this piece of code (net/core/dev.c:3628):
type = skb->protocol;
list_for_each_entry_rcu(ptype,
		&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
	if (ptype->type == type &&
	    (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
	     ptype->dev == orig_dev)) {
		if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}
}

We'll take a look at exactly how this code delivers data to the protocol layers below, but first, let's see what happens when RPS is enabled.
netif_receive_skb with RPS
If RPS is enabled, `netif_receive_skb` computes which CPU's backlog it should queue the data on. It does this by using `get_rps_cpu` (defined in net/core/dev.c:2980):
int cpu = get_rps_cpu(skb->dev, skb, &rflow);

if (cpu >= 0) {
	ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
	rcu_read_unlock();
	return ret;
}

enqueue_to_backlog
This function first gets a pointer to the remote CPU's `softnet_data` structure, which contains a pointer to a NAPI poller.
Next, the queue length of the remote CPU's `input_pkt_queue` is checked:
qlen = skb_queue_len(&sd->input_pkt_queue);
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
	if (skb_queue_len(&sd->input_pkt_queue)) {

The queue length is first checked against `netdev_max_backlog`. If the queue length is larger than this backlog value, the data is dropped and the drop is counted against the remote CPU.
You can prevent drops by increasing `netdev_max_backlog`:
sysctl -w net.core.netdev_max_backlog=3000
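Before raising this value, you may want to confirm that drops are actually happening. The per-CPU drop counter is exposed in /proc/net/softnet_stat, where each row corresponds to a CPU and the second hexadecimal field counts packets dropped because the backlog queue was full:

cat /proc/net/softnet_stat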
If the queue length is not too large, the code next checks if the flow limit has been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS's bitmap) in `/proc/sys/net/core/flow_limit_cpu_bitmap`.
Once you have enabled per-CPU flow limits, you can also adjust the size of the flow limit hash table by modifying the sysctl `net.core.flow_limit_table_len`.
You can learn more about flow limits in the Documentation/networking/scaling.txt file.
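Putting the two flow limit knobs together, a minimal sketch (the bitmask f enabling CPUs 0-3 and the table length 8192 are just example values):

echo f > /proc/sys/net/core/flow_limit_cpu_bitmap
sysctl -w net.core.flow_limit_table_len=8192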
Assuming that the flow limit has not been reached, `enqueue_to_backlog` then checks if the backlog queue already has data queued on it.
If so, the data is queued:
if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
	__skb_queue_tail(&sd->input_pkt_queue, skb);
	input_queue_tail_incr_save(sd, qtail);
	rps_unlock(sd);
	local_irq_restore(flags);
	return NET_RX_SUCCESS;
}

If the queue is empty, the NAPI poller for the backlog queue is started first:
/* Schedule NAPI for backlog device
 * We can use non atomic operation since we own the queue lock
 */
if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
	if (!rps_ipi_queued(sd))
		____napi_schedule(sd, &sd->backlog);
}
goto enqueue;

The `goto` at the bottom brings execution back up to the code block above it, queuing the data to the backlog.
The backlog queue NAPI poller
The per-CPU backlog queue plugs into NAPI the same way a device driver does: a poll function is provided that is used to process packets from the softirq context.
This NAPI struct is provided during initialization of the networking system. From `net_dev_init` in net/core/dev.c:6952:
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
sd->backlog.gro_list = NULL;
sd->backlog.gro_count = 0;
The backlog NAPI structure differs from the device driver NAPI structure in that the `weight` parameter is adjustable; drivers hardcode their NAPI weights (most hardcode it to 64, as seen in e1000e).
To adjust the backlog poller's NAPI weight, modify `/proc/sys/net/core/dev_weight`.
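For example, via sysctl (the value 128 is just an illustrative choice, double the 64 that drivers like e1000e hardcode):

sysctl -w net.core.dev_weight=128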
The poll function for the backlog is called `process_backlog` and, similar to the e1000e function `e1000e_poll`, it is called from the softirq context.
process_backlog
The `process_backlog` function (net/core/dev.c:4097) is a loop which runs until either its weight (specified in `/proc/sys/net/core/dev_weight`) has been consumed or no more data remains on the backlog.
Each piece of data on the backlog queue is removed from the backlog queue and passed on to `__netif_receive_skb`. As explained earlier for the non-RPS case, data passed to this function eventually reaches the protocol layers after some bookkeeping.
Similar to device driver NAPI implementations, the `process_backlog` code disables its poller if the total weight will not be used. The poller is restarted by the call to `____napi_schedule` from `enqueue_to_backlog`, as described above.
The function returns the amount of work done, which `net_rx_action` (described previously) will subtract from the budget (which can be adjusted with the `net.core.netdev_budget` sysctl, also described previously).
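For reference, both knobs can be read and set with sysctl (3000 is just an example value):

sysctl net.core.netdev_budget
sysctl -w net.core.netdev_budget=3000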
__netif_receive_skb_core delivers data to the protocol layers
`__netif_receive_skb_core` delivers the data to the protocol layers. It does this by obtaining the protocol field from the skb and iterating over a list of deliver functions registered for that protocol type.
This happens in the piece of code shown earlier:
type = skb->protocol;
list_for_each_entry_rcu(ptype,
		&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
	if (ptype->type == type &&
	    (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
	     ptype->dev == orig_dev)) {
		if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}
}

The `ptype_base` identifier is defined in net/core/dev.c:146 as a hash table of lists:
struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
Each protocol layer adds a `struct packet_type` to a list at a given slot in the hash table.
The slot in the hash table is computed by `ptype_head`:
static inline struct list_head *ptype_head(const struct packet_type *pt)
{
	if (pt->type == htons(ETH_P_ALL))
		return &ptype_all;
	else
		return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

The protocol layers call `dev_add_pack` to add themselves to the list.
The IP protocol layer
The IP protocol layer plugs itself into the `ptype_base` hash table so that data will be delivered up to it from the lower layers.
This happens in the function `inet_init` in net/ipv4/af_inet.c:1815:
dev_add_pack(&ip_packet_type);
This registers the IP packet type structure, which is defined as follows:
static struct packet_type ip_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IP),
	.func = ip_rcv,
};

`__netif_receive_skb_core` calls `deliver_skb` (as seen in the above section). This function (net/core/dev.c:1712):
static inline int deliver_skb(struct sk_buff *skb,
			      struct packet_type *pt_prev,
			      struct net_device *orig_dev)
{
	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
		return -ENOMEM;
	atomic_inc(&skb->users);
	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

In the case of IP, the `ip_rcv` function is called.
ip_rcv
The `ip_rcv` function is pretty straightforward at a high level. There are several integrity checks to ensure the data is valid; statistics counters are bumped as well.
`ip_rcv` ends by passing the packet to `ip_rcv_finish` by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on (net/ipv4/ip_input.c:453):
return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
netfilter
`NF_HOOK_THRESH` is pretty simple. It calls down to `nf_hook_thresh` and, on success, calls `okfn`, which in our case is `ip_rcv_finish` (include/linux/netfilter.h:175):
static inline int
NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct sk_buff *skb,
	       struct net_device *in, struct net_device *out,
	       int (*okfn)(struct sk_buff *), int thresh)
{
	int ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, thresh);
	if (ret == 1)
		ret = okfn(skb);
	return ret;
}

The `nf_hook_thresh` function continues the descent toward iptables. It begins by determining whether there are any netfilter hooks registered for the netfilter protocol family and hook type passed in.
In our example above, the protocol family is `NFPROTO_IPV4` and the hook type is `NF_INET_PRE_ROUTING`:
/**
 * nf_hook_thresh - call a netfilter hook
 *
 * Returns 1 if the hook has allowed the packet to pass. The function
 * okfn must be invoked by the caller in this case. Any other return
 * value indicates the packet has been consumed by the hook.
 */
static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
				 struct sk_buff *skb,
				 struct net_device *indev,
				 struct net_device *outdev,
				 int (*okfn)(struct sk_buff *), int thresh)
{
	if (nf_hooks_active(pf, hook))
		return nf_hook_slow(pf, hook, skb, indev, outdev, okfn, thresh);
	return 1;
}

This function calls `nf_hooks_active`, which checks whether any hooks are registered in the `nf_hooks` table (include/linux/netfilter.h:114):
static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
{
	return !list_empty(&nf_hooks[pf][hook]);
}

If a hook is registered, `nf_hook_slow` is called to descend further into iptables.
nf_hook_slow
`nf_hook_slow` walks the list of hooks in the `nf_hooks` table for the given protocol family and hook type, calling `nf_iterate` for each entry on the hook list.
`nf_iterate` in turn calls the hook function associated with each entry on the hook list, which returns a "verdict" on the packet.
iptables tables
`iptables` registers hook functions for each of its packet matching tables: filter, nat, mangle, raw, and security.
In our example, we are interested in the `NF_INET_PRE_ROUTING` chains, which live in the `nat` table.
Indeed, the struct containing the pointer to the hook function that is registered with netfilter can be found in net/ipv4/netfilter/iptable_nat.c:251:
static struct nf_hook_ops nf_nat_ipv4_ops[] __read_mostly = {
	/* Before packet filtering, change destination */
	{
		.hook		= nf_nat_ipv4_in,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_PRE_ROUTING,
		.priority	= NF_IP_PRI_NAT_DST,
	},

which is registered from `iptable_nat_init` (net/ipv4/netfilter/iptable_nat.c:316):
err = nf_register_hooks(nf_nat_ipv4_ops, ARRAY_SIZE(nf_nat_ipv4_ops));
if (err < 0)
	goto err2;
In our example above at the IP layer, packets will be handed to `nf_nat_ipv4_in` as they descend further into iptables via the `nf_hook_slow` function described in the previous section.
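If you want to see which rules are attached to the chain that `nf_nat_ipv4_in` services in this example, you can list the nat table's PREROUTING chain:

iptables -t nat -nvL PREROUTING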
nf_nat_ipv4_in
`nf_nat_ipv4_in` passes the packet on to `nf_nat_ipv4_fn`, which begins by obtaining conntrack information for the packet:
struct nf_conn *ct;
enum ip_conntrack_info ctinfo;

/* slightly abbreviated example code */
ct = nf_ct_get(skb, &ctinfo);
If the packet in question is a packet for a new connection, `nf_nat_rule_find` is called (net/ipv4/netfilter/iptable_nat.c:117):
case IP_CT_NEW:
	/* Seen it before? This can happen for loopback, retrans,
	 * or local packets.
	 */
	if (!nf_nat_initialized(ct, maniptype)) {
		unsigned int ret;

		ret = nf_nat_rule_find(skb, ops->hooknum, in, out, ct);
		if (ret != NF_ACCEPT)
			return ret;

And, finally, `nf_nat_rule_find` calls `ipt_do_table`, which enters the iptables subsystem proper. This is as far as we'll descend into the netfilter and iptables systems, as they are complex enough to warrant their own lengthy documents.
The return value of the `ipt_do_table` function will either be:
- not NF_ACCEPT, in which case it is returned immediately, or
- NF_ACCEPT, in which case `nf_nat_ipv4_fn` calls `nf_nat_packet` to handle the packet and returns either NF_ACCEPT or NF_DROP.
unwinding the return value
In both cases, whatever the return value of `ipt_do_table`, the final value from `nf_nat_ipv4_fn` is returned back up through all of the functions described above until it reaches `NF_HOOK_THRESH`:
- `nf_nat_ipv4_fn`'s return value is returned to `nf_nat_ipv4_in`,
- which returns to `nf_iterate`,
- which returns to `nf_hook_slow`,
- which returns to `nf_hook_thresh`,
- which returns to `NF_HOOK_THRESH`.
`NF_HOOK_THRESH` checks the return value and, if it is `NF_ACCEPT` (1), calls the function pointed to by `okfn`.
In our example, `okfn` is `ip_rcv_finish`, which will do a bit of processing and hand the packet up to the next protocol layer.
