The Linux networking stack from the ground up, part 4


Part 1 | Part 2 | Part 3 | Part 4 | Part 5

Overview

This post will pick up where part 3 left off, beginning with a description of Receive Packet Steering (RPS): what it is and how to configure it. This is followed by a walk through the network stack describing how packets are processed according to the RPS settings, the packet backlog queue, the beginning of the IP layer, and netfilter.

Receive Packet Steering

We have seen that device drivers register NAPI poll instances. Each NAPI poller instance runs in the context of a kernel thread called a softirq, of which there is one per CPU. The softirq for the CPU that the hardware interrupt handler runs on is woken up from the hardware interrupt handler itself.

Thus, a single CPU both handles the hardware interrupt and polls the network layer to process the incoming data.

Some NICs support multiple queues at the hardware level. This means that incoming packets can be DMA'd to separate receive rings, with each ring receiving its own hardware interrupt to indicate that data is available. Each of these hardware interrupts would schedule NAPI poll instances to run on their associated CPUs.

This allows multiple processors to handle hardware interrupts and poll the network layer.

Receive Packet Steering (RPS) is a software implementation of hardware-enabled NIC multiqueue. It allows multiple CPUs to process incoming packets even if the NIC only supports a single receive queue in hardware.

RPS works by generating a hash for incoming data to determine which CPU should process it. The data is then queued to that CPU's per-CPU receive network backlog to be processed. An Inter-processor Interrupt (IPI) is delivered to the CPU that owns the backlog, which kicks off backlog processing on the remote CPU if it is not processing packets already.
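To make the CPU selection concrete, here is a minimal userspace sketch of the idea. This is not the kernel's actual code: the names and the example bitmap are illustrative, and the real get_rps_cpu also consults flow tables and handles details omitted here.

 /* Sketch only: shows how a 32-bit flow hash can be scaled onto the
  * set of CPUs enabled in rps_cpus, using a multiply-and-shift
  * instead of a modulo. All names here are illustrative. */
 #include <stdio.h>
 #include <stdint.h>

 /* Hypothetical rps_cpus bitmask of 0x6: CPUs 1 and 2 are enabled. */
 static const int rps_map[] = { 1, 2 };
 static const uint32_t rps_map_len = 2;

 static int pick_cpu(uint32_t flow_hash)
 {
         /* Scale the hash onto [0, rps_map_len) without a division. */
         return rps_map[((uint64_t)flow_hash * rps_map_len) >> 32];
 }

 int main(void)
 {
         uint32_t hashes[] = { 0x12345678u, 0xdeadbeefu, 0x0000ffffu };

         for (int i = 0; i < 3; i++)
                 printf("flow hash 0x%08x -> CPU %d\n",
                        (unsigned)hashes[i], pick_cpu(hashes[i]));
         return 0;
 }

Because the hash is computed from the flow (for example, the addresses and ports of a connection), all packets of the same flow land on the same CPU.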

netif_receive_skb will either continue sending the data up the network stack, or hand it over to RPS for processing on a different CPU.

Enabling RPS

For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for Linux kernel 3.13.0), along with a bitmask describing which CPUs should process packets for a given interface and RX queue.

The bitmasks to modify can be found in /sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus.

So, for eth0 and receive queue 0, you would modify /sys/class/net/eth0/queues/rx-0/rps_cpus with a hexadecimal number indicating which CPUs should process packets from eth0's receive queue 0. In the bitmask, bit n enables CPU n, so for example writing 6 (binary 110) enables CPUs 1 and 2.
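The following is a minimal C sketch of doing that write programmatically, equivalent to echoing the value into the file from a shell; the interface name, queue, and mask value are examples only:

 /* Sketch: write an RPS CPU bitmask to sysfs. Equivalent to
  *   echo 6 > /sys/class/net/eth0/queues/rx-0/rps_cpus
  * The interface name, queue, and mask here are examples only. */
 #include <stdio.h>

 int main(void)
 {
         FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

         if (!f) {
                 perror("fopen");
                 return 1;
         }

         /* Hex bitmask: bit n enables CPU n, so 6 enables CPUs 1 and 2. */
         fprintf(f, "6\n");
         fclose(f);
         return 0;
 }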

Back to netif_receive_skb.

netif_receive_skb

Recall that the netif_receive_skb function is called from napi_skb_finish, in the softirq context of the NAPI poller registered by the device driver.

netif_receive_skb will either attempt to use RPS (as described above) or continue sending the data up the network stack.

Let's examine the second path first: sending the data up the stack when RPS is disabled.

netif_receive_skb without RPS

netif_receive_skb calls __netif_receive_skb, which does some accounting before calling __netif_receive_skb_core to move the data along up the network stack toward the protocol layers.

__netif_receive_skb_core

This function hands the skb up to the protocol layers in this piece of code (net/core/dev.c:3628):

 type = skb->protocol;
 list_for_each_entry_rcu(ptype,
                 &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
         if (ptype->type == type &&
             (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
              ptype->dev == orig_dev)) {
                 if (pt_prev)
                         ret = deliver_skb(skb, pt_prev, orig_dev);
                 pt_prev = ptype;
         }
 }

We'll look at exactly how this code delivers data to the protocol layers below, but first, let's see what happens when RPS is enabled.

netif_receive_skb with RPS

If RPS is enabled, netif_receive_skb computes which CPU's backlog it should queue the data to. It does this by using get_rps_cpu (defined in net/core/dev.c:2980):

 int cpu = get_rps_cpu(skb->dev, skb, &rflow);

 if (cpu >= 0) {
         ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
         rcu_read_unlock();
         return ret;
 }

enqueue_to_backlog

This function first gets a pointer to the remote CPU's softnet_data structure, which contains a pointer to a NAPI poller.

Next, the queue length of the remote CPU's input_pkt_queue is checked:

 qlen = skb_queue_len(&sd->input_pkt_queue);
 if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
         if (skb_queue_len(&sd->input_pkt_queue)) {

The queue length is first compared against netdev_max_backlog. If the queue length is larger than this backlog value, the data is dropped and the drop is counted against the remote CPU.

You can prevent drops by increasing netdev_max_backlog:

 sysctl -w net.core.netdev_max_backlog=3000

If the queue length is not too large, the code next checks whether the flow limit has been reached. By default, flow limits are disabled. To enable them, you must specify a bitmap (similar to the RPS bitmap) in /proc/sys/net/core/flow_limit_cpu_bitmap.

Once you have enabled flow limits per CPU, you can also adjust the size of the flow limit hash table by modifying the sysctl net.core.flow_limit_table_len.

You can learn more about flow limits in the Documentation/networking/scaling.txt file.

Assuming that the flow limit has not been reached, enqueue_to_backlog then checks whether the backlog queue already has data queued.

If so, the data is queued:

 if (skb_queue_len(&sd->input_pkt_queue)) {
 enqueue:
         __skb_queue_tail(&sd->input_pkt_queue, skb);
         input_queue_tail_incr_save(sd, qtail);
         rps_unlock(sd);
         local_irq_restore(flags);
         return NET_RX_SUCCESS;
 }

If the queue is empty, the NAPI poller for the backlog queue is started first:

 /* Schedule NAPI for backlog device
  * We can use non atomic operation since we own the queue lock
  */
 if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
         if (!rps_ipi_queued(sd))
                 ____napi_schedule(sd, &sd->backlog);
 }
 goto enqueue;

The goto at the bottom sends execution back up to the code block above it, queuing the data to the backlog.

The backlog queue NAPI poller

The per-CPU backlog queue plugs into NAPI the same way a device driver does: a poll function is provided that is used to process packets from the softirq context.

This NAPI struct is provided during initialization of the networking system. From net_dev_init in net/core/dev.c:6952:

 sd->backlog.poll = process_backlog;
 sd->backlog.weight = weight_p;
 sd->backlog.gro_list = NULL;
 sd->backlog.gro_count = 0;

The backlog NAPI structure differs from the device driver NAPI structure in that its weight parameter is adjustable. Drivers hardcode their weight values (most hardcode it to 64, as seen in e1000e).

To adjust the backlog poller's NAPI weight, modify /proc/sys/net/core/dev_weight.

The poll function for the backlog is process_backlog, and, similar to e1000e's e1000e_poll, it is called from the softirq context.

process_backlog

process_backlog (net/core/dev.c:4097) is a loop that runs until either its weight (specified in /proc/sys/net/core/dev_weight) has been consumed or no more data remains on the backlog.

Each piece of data on the backlog queue is removed from the queue and passed on to __netif_receive_skb. As explained above in the non-RPS case, data passed to this function eventually reaches the protocol layers after some accounting.

Similar to device driver NAPI implementations, the process_backlog code disables its poller if the total weight will not be used. The poller is restarted by the call to ____napi_schedule from enqueue_to_backlog, as described above.

The function returns the amount of work done, which net_rx_action (described above) subtracts from the budget (which is adjusted with net.core.netdev_budget, as described above).
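As a rough illustration, here is a hedged userspace sketch of the weight-and-budget pattern process_backlog participates in. It is not kernel code; all names and values are made up to show the shape of the loop:

 /* Sketch of a weight-limited poll loop: dequeue packets until either
  * the weight is exhausted or the queue is empty, and report the work
  * done so the caller can charge it against its budget. */
 #include <stdio.h>

 #define DEV_WEIGHT 64   /* analogous to /proc/sys/net/core/dev_weight */

 static int backlog_len = 150;  /* pretend packets are waiting */

 /* Returns 1 if a packet was dequeued and processed, 0 if queue empty. */
 static int process_one_packet(void)
 {
         if (backlog_len == 0)
                 return 0;
         backlog_len--;      /* stand-in for __netif_receive_skb() */
         return 1;
 }

 static int process_backlog_sketch(int weight)
 {
         int work = 0;

         while (work < weight && process_one_packet())
                 work++;
         return work;   /* the caller subtracts this from its budget */
 }

 int main(void)
 {
         int budget = 300;  /* analogous to net.core.netdev_budget */

         while (budget > 0 && backlog_len > 0)
                 budget -= process_backlog_sketch(DEV_WEIGHT);
         printf("remaining budget: %d, remaining backlog: %d\n",
                budget, backlog_len);
         return 0;
 }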

__netif_receive_skb_core delivers data to the protocol layers

__netif_receive_skb_core handles delivering the data to the protocol layers. It does this by obtaining the protocol field from the skb and iterating over the list of deliver functions registered for that protocol type.

This is what is happening in this piece of code (shown above as well):

 type = skb->protocol;
 list_for_each_entry_rcu(ptype,
                 &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
         if (ptype->type == type &&
             (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
              ptype->dev == orig_dev)) {
                 if (pt_prev)
                         ret = deliver_skb(skb, pt_prev, orig_dev);
                 pt_prev = ptype;
         }
 }

The ptype_base identifier is defined in net/core/dev.c:146 as a hash table of lists:

 struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;

Each protocol layer adds a struct packet_type to a list at a specific slot in the hash table.

The slot in the hash table is computed by ptype_head:

 static inline struct list_head *ptype_head(const struct packet_type *pt)
 {
         if (pt->type == htons(ETH_P_ALL))
                 return &ptype_all;
         else
                 return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
 }

The protocol layers call dev_add_pack to add themselves to the list. For example, ETH_P_IP is 0x0800, and 0x0800 & PTYPE_HASH_MASK (15) is 0, so IP's packet_type is added to the list at slot 0.
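To see what registration looks like from the caller's side, here is a hedged sketch of a kernel module registering its own packet_type with dev_add_pack, modeled on the real ip_packet_type shown in the next section. The module and function names are made up; only dev_add_pack, dev_remove_pack, and the func signature are real kernel APIs:

 /* Sketch: a module registering a packet_type handler. ETH_P_ALL taps
  * see every packet; a real protocol would register its own EtherType.
  * Module and function names are illustrative. */
 #include <linux/module.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>
 #include <linux/if_ether.h>

 static int sample_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
 {
         /* Called from __netif_receive_skb_core via deliver_skb. */
         pr_info("sample_rcv: %u byte packet on %s\n", skb->len, dev->name);
         kfree_skb(skb);   /* release the reference deliver_skb took */
         return 0;
 }

 static struct packet_type sample_packet_type __read_mostly = {
         .type = cpu_to_be16(ETH_P_ALL),
         .func = sample_rcv,
 };

 static int __init sample_init(void)
 {
         dev_add_pack(&sample_packet_type);
         return 0;
 }

 static void __exit sample_exit(void)
 {
         dev_remove_pack(&sample_packet_type);
 }

 module_init(sample_init);
 module_exit(sample_exit);
 MODULE_LICENSE("GPL");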

The IP layer

The IP protocol layer attaches itself to the ptype_base hash table so that data will be delivered up to it from the lower layers.

This happens in inet_init in net/ipv4/af_inet.c:1815:

 dev_add_pack(&ip_packet_type);

This registers the IP packet type structure, which is defined as follows:

 static struct packet_type ip_packet_type __read_mostly = {
         .type = cpu_to_be16(ETH_P_IP),
         .func = ip_rcv,
 };

__netif_receive_skb_core calls deliver_skb (as seen in the section above). This function (net/core/dev.c:1712):

 static inline int deliver_skb(struct sk_buff *skb,
                               struct packet_type *pt_prev,
                               struct net_device *orig_dev)
 {
         if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
                 return -ENOMEM;
         atomic_inc(&skb->users);
         return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }

In the case of IP, the ip_rcv function is called.

ip_rcv

The ip_rcv function is pretty simple at a high level. There are several integrity checks to ensure the data is valid, and statistics counters are bumped along the way.

ip_rcv ends by handing the packet off to ip_rcv_finish by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can examine the packet before it continues on (net/ipv4/ip_input.c:453):

 return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                ip_rcv_finish);

netfilter

NF_HOOK_THRESH is pretty simple. It calls down to nf_hook_thresh and, on success, calls okfn, which in our case is ip_rcv_finish (include/linux/netfilter.h:175):

 static inline int
 NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct sk_buff *skb,
                struct net_device *in, struct net_device *out,
                int (*okfn)(struct sk_buff *), int thresh)
 {
         int ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, thresh);
         if (ret == 1)
                 ret = okfn(skb);
         return ret;
 }

The nf_hook_thresh function continues the descent toward iptables. It begins by determining whether there are any netfilter hooks registered for the protocol family and hook chain type that were passed in.

In our example above, the protocol family is NFPROTO_IPV4 and the chain type is NF_INET_PRE_ROUTING:

 /**
  * nf_hook_thresh - call a netfilter hook
  *
  * Returns 1 if the hook has allowed the packet to pass.  The function
  * okfn must be invoked by the caller in this case.  Any other return
  * value indicates the packet has been consumed by the hook.
  */
 static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
                                  struct sk_buff *skb,
                                  struct net_device *indev,
                                  struct net_device *outdev,
                                  int (*okfn)(struct sk_buff *), int thresh)
 {
         if (nf_hooks_active(pf, hook))
                 return nf_hook_slow(pf, hook, skb, indev, outdev, okfn, thresh);
         return 1;
 }

This function calls nf_hooks_active, which checks whether any hooks are registered in the nf_hooks table (include/linux/netfilter.h:114):

 static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
 {
         return !list_empty(&nf_hooks[pf][hook]);
 }

If any hooks are registered, nf_hook_slow is called to descend further toward iptables.

nf_hook_slow

nf_hook_slow walks the list of hooks in the nf_hooks table for the given protocol family and chain type, calling nf_iterate for each entry in the hook list.

nf_iterate, in turn, calls the hook function associated with each entry on the hook list, which returns a "verdict" on the packet.
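For comparison, here is a hedged sketch of registering a hook of your own at NF_INET_PRE_ROUTING, mirroring the nf_nat_ipv4_ops registration shown in the next section. It uses the 3.13-era hook function signature, which has changed in later kernels; the module and function names are made up:

 /* Sketch: registering a netfilter hook at NF_INET_PRE_ROUTING.
  * 3.13-era hook signature; names are illustrative. */
 #include <linux/module.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/skbuff.h>

 static unsigned int sample_hook(const struct nf_hook_ops *ops,
                                 struct sk_buff *skb,
                                 const struct net_device *in,
                                 const struct net_device *out,
                                 int (*okfn)(struct sk_buff *))
 {
         /* Return a verdict. NF_ACCEPT lets the packet continue, so
          * that okfn (ip_rcv_finish in our example) is eventually
          * invoked. NF_DROP would consume the packet here. */
         return NF_ACCEPT;
 }

 static struct nf_hook_ops sample_ops __read_mostly = {
         .hook     = sample_hook,
         .owner    = THIS_MODULE,
         .pf       = NFPROTO_IPV4,
         .hooknum  = NF_INET_PRE_ROUTING,
         .priority = NF_IP_PRI_FIRST,
 };

 static int __init sample_init(void)
 {
         return nf_register_hook(&sample_ops);
 }

 static void __exit sample_exit(void)
 {
         nf_unregister_hook(&sample_ops);
 }

 module_init(sample_init);
 module_exit(sample_exit);
 MODULE_LICENSE("GPL");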

iptables tables

iptables registers hook functions for each of its packet matching tables: filter, nat, mangle, raw, and security.

In our example, we are interested in the NF_INET_PRE_ROUTING chains, which are found in the nat table.

Indeed, the struct with the pointer to the hook function that is registered with netfilter can be found in net/ipv4/netfilter/iptable_nat.c:251:

 static struct nf_hook_ops nf_nat_ipv4_ops[] __read_mostly = {
         /* Before packet filtering, change destination */
         {
                 .hook           = nf_nat_ipv4_in,
                 .owner          = THIS_MODULE,
                 .pf             = NFPROTO_IPV4,
                 .hooknum        = NF_INET_PRE_ROUTING,
                 .priority       = NF_IP_PRI_NAT_DST,
         },

This is registered in iptable_nat_init (net/ipv4/netfilter/iptable_nat.c:316):

 err = nf_register_hooks(nf_nat_ipv4_ops, ARRAY_SIZE(nf_nat_ipv4_ops));
 if (err < 0)
         goto err2;

In our example above from the IP layer, packets will be handed off to nf_nat_ipv4_in to continue their descent into iptables, via the nf_hook_slow function described in the previous section.

nf_nat_ipv4_in

nf_nat_ipv4_in passes the packet on to nf_nat_ipv4_fn, which begins by obtaining the conntrack information for the packet:

 struct nf_conn *ct;
 enum ip_conntrack_info ctinfo;

 /* slightly abbreviated code sample */
 ct = nf_ct_get(skb, &ctinfo);

If the packet in question is a packet for a new connection, nf_nat_rule_find is called (net/ipv4/netfilter/iptable_nat.c:117):

 case IP_CT_NEW:
         /* Seen it before?  This can happen for loopback, retrans,
          * or local packets.
          */
         if (!nf_nat_initialized(ct, maniptype)) {
                 unsigned int ret;

                 ret = nf_nat_rule_find(skb, ops->hooknum, in, out, ct);
                 if (ret != NF_ACCEPT)
                         return ret;

And, finally, nf_nat_rule_find calls ipt_do_table, entering the iptables subsystem proper. That is as far as we'll go into the netfilter and iptables systems, as they are complex enough to warrant their own multi-page documents.

The return value of the ipt_do_table function will either:

  • not be NF_ACCEPT, in which case it is returned immediately, or
  • be NF_ACCEPT, causing nf_nat_ipv4_fn to call nf_nat_packet for packet handling and return either NF_ACCEPT or NF_DROP.

Unwinding the return value

In either case, the return value of ipt_do_table becomes the final return value of nf_nat_ipv4_fn and is passed back up through all the functions described above until it reaches NF_HOOK_THRESH:

  1. nf_nat_ipv4_fn's return value is returned to nf_nat_ipv4_in,
  2. which returns to nf_iterate,
  3. which returns to nf_hook_slow,
  4. which returns to nf_hook_thresh,
  5. which returns to NF_HOOK_THRESH.

NF_HOOK_THRESH checks the return value and, if it is NF_ACCEPT (1), calls the function pointed to by okfn.

In our example, okfn is ip_rcv_finish, which will do a bit of processing and then pass the packet up to the next protocol layer.
