Part 1 | Part 2 | Part 3 | Part 4 | Part 5
Overview
This post picks up where part 3 left off. It begins by describing Receive Packet Steering (RPS): what it is and how to configure it. It then follows the network stack to examine how packets are processed according to the RPS settings, the per-CPU backlog queue, the beginning of the IP protocol layer, and netfilter.
Receive Packet Steering
We have seen that device drivers register NAPI poll instances. Each NAPI poller instance runs in the context of a kernel thread called a softirq, of which there is one per CPU. The kernel thread for the CPU that the hardware interrupt handler runs on is woken up from the hardware interrupt handler.

Thus, a single CPU processes the hardware interrupt and polls the network layer to process the incoming data.
Some NICs support multiple queues at the hardware level. This means that incoming packets can be DMA'd to separate receive rings, with each ring having its own hardware interrupt delivered to indicate that data is available. Each of these hardware interrupts would schedule NAPI poll instances to run on the associated CPUs.

This allows multiple CPUs to handle hardware interrupts and poll the network layer.
Receive Packet Steering (RPS) is a software implementation of what hardware-enabled multiqueue NICs provide. It allows multiple CPUs to process incoming packets even if the NIC only supports a single receive queue in hardware.
RPS works by generating a hash for an incoming piece of data to determine which CPU should process it. The data is then queued to that CPU's receive network backlog to be processed. An Inter-processor Interrupt (IPI) is delivered to the CPU that owns the backlog, which kick-starts backlog processing on the remote CPU if it is not currently processing packets.
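To make the steering step concrete, here is a small user-space sketch of the idea. It is illustrative only, not kernel code; the names are hypothetical, but the kernel's get_rps_cpu does scale the 32-bit flow hash over the length of the configured CPU map (rather than taking a modulo), as modeled here:

```c
#include <stdint.h>
#include <stdio.h>

/* Pick a CPU for a flow: scale the 32-bit hash over the map length,
 * mirroring the kernel's ((u64)hash * len) >> 32 trick. */
static unsigned int pick_rps_cpu(uint32_t flow_hash,
                                 const unsigned int *cpu_map,
                                 unsigned int map_len)
{
    return cpu_map[((uint64_t)flow_hash * map_len) >> 32];
}

int main(void)
{
    unsigned int cpus[] = { 0, 1, 2, 3 }; /* CPUs enabled in rps_cpus */

    printf("flow 0xdeadbeef -> CPU %u\n",
           pick_rps_cpu(0xdeadbeef, cpus, 4));
    return 0;
}
```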
netif_receive_skb will either continue sending the data up the network stack, or hand it over to RPS for processing on a different CPU.
Enabling RPS
For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for Linux kernel 3.13.0), along with a bitmask describing which CPUs should process packets for a given interface and RX queue.

The bitmasks to modify are found in /sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus.

So, for eth0 and receive queue 0, you would modify /sys/class/net/eth0/queues/rx-0/rps_cpus with a hexadecimal number indicating which CPUs should process packets from eth0's receive queue 0.
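For example, a small C program could write that mask. The mask value 0x3 (CPUs 0 and 1) and the eth0 path are assumptions chosen for illustration; a plain echo from a root shell to the same file works just as well:

```c
#include <stdio.h>

int main(void)
{
    /* bit 0 = CPU 0, bit 1 = CPU 1, so 0x3 enables CPUs 0 and 1 */
    FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "3\n");
    return fclose(f) ? 1 : 0;
}
```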
Back to netif_receive_skb
As a reminder, the netif_receive_skb function is called from napi_skb_finish, in the softirq context of the NAPI poller registered by the device driver.
netif_receive_skb will either attempt to use RPS (as described above) or continue sending the data up the network stack. Let's examine the second path first: data continuing up the stack when RPS is disabled.
netif_receive_skb without RPS
netif_receive_skb calls __netif_receive_skb, which does some bookkeeping prior to calling __netif_receive_skb_core to move the data along up toward the protocol layers.
__netif_receive_skb_core
This function hands the skb off to the protocol layers in this piece of code (net/core/dev.c:3628):

```c
type = skb->protocol;
list_for_each_entry_rcu(ptype,
        &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
    if (ptype->type == type &&
        (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
         ptype->dev == orig_dev)) {
        if (pt_prev)
            ret = deliver_skb(skb, pt_prev, orig_dev);
        pt_prev = ptype;
    }
}
```
We'll look at exactly how this code delivers data to the protocol layers below, but first, let's see what happens when RPS is enabled.
netif_receive_skb with RPS
If RPS is enabled, netif_receive_skb computes which CPU's backlog it should queue the data to. It does this by using get_rps_cpu (defined in net/core/dev.c:2980):

```c
int cpu = get_rps_cpu(skb->dev, skb, &rflow);

if (cpu >= 0) {
    ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
    rcu_read_unlock();
    return ret;
}
```
enqueue_to_backlog
This function first gets a pointer to the remote CPU's softnet_data structure, which contains a pointer to a NAPI poller.

Next, the queue length of the remote CPU's input_pkt_queue is checked:

```c
qlen = skb_queue_len(&sd->input_pkt_queue);
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
    if (skb_queue_len(&sd->input_pkt_queue)) {
```
The length is first compared to netdev_max_backlog. If the queue length is larger than this backlog value, the data is dropped and the drop is counted against the remote CPU.

You can prevent drops by increasing netdev_max_backlog:

```
sysctl -w net.core.netdev_max_backlog=3000
```
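Drops at this point show up in /proc/net/softnet_stat, one line per CPU. Below is a minimal sketch for reading the drop counter; the assumption that the second hex column is the drop count matches the 3.13-era softnet_seq_show, but verify against your kernel version:

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/softnet_stat", "r");
    char line[256];
    int cpu = 0;

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        unsigned int processed, dropped;

        /* first column: packets processed; second: packets dropped */
        if (sscanf(line, "%x %x", &processed, &dropped) == 2)
            printf("cpu %d: processed %u, dropped %u\n",
                   cpu, processed, dropped);
        cpu++;
    }
    fclose(f);
    return 0;
}
```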
If the queue length is not too large, the code next checks if the flow limit has been reached. By default, flow limits are disabled. In order to enable them, you must specify a bitmap (similar to the RPS bitmap) in /proc/sys/net/core/flow_limit_cpu_bitmap.

Once you have enabled flow limits per CPU, you can also adjust the size of the flow limit hash table by modifying the sysctl net.core.flow_limit_table_len.

You can read more about flow limits in the Documentation/networking/scaling.txt file.
Assuming that the flow limit has not been reached, enqueue_to_backlog then checks if the backlog queue already has data queued to it. If so, the data is queued:

```c
if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
    __skb_queue_tail(&sd->input_pkt_queue, skb);
    input_queue_tail_incr_save(sd, qtail);
    rps_unlock(sd);
    local_irq_restore(flags);
    return NET_RX_SUCCESS;
}
```
If the queue is empty, the NAPI poller for the backlog queue is started first:

```c
/* Schedule NAPI for backlog device
 * We can use non atomic operation since we own the queue lock
 */
if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
    if (!rps_ipi_queued(sd))
        ____napi_schedule(sd, &sd->backlog);
}
goto enqueue;
```
The goto at the bottom brings execution back up to the code block above it, queuing the data to the backlog.
backlog queue NAPI poller
The per-CPU backlog queue plugs into NAPI the same way a device driver does: a poll function is provided that is used to process packets from the softirq context. This NAPI struct is set up during initialization of the networking system, in net_dev_init (net/core/dev.c:6952):

```c
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
sd->backlog.gro_list = NULL;
sd->backlog.gro_count = 0;
```
The backlog NAPI structure differs from a device driver's NAPI structure in that its weight parameter is adjustable. Drivers hardcode their values (most hardcode it to 64, as seen in e1000e).

To adjust the backlog NAPI poller's weight, modify /proc/sys/net/core/dev_weight.
The poll function for the backlog is called process_backlog and, similar to e1000e's function e1000e_poll, it is called from the softirq context.
process_backlog
The process_backlog function (net/core/dev.c:4097) is a loop which runs until its weight (specified in `/proc/sys/net/core/dev_weight`) has been consumed or no more data remains on the backlog.

Each piece of data on the backlog queue is removed from it and passed on to __netif_receive_skb. As explained above for the no-RPS case, the data passed to this function eventually reaches the protocol layers after some bookkeeping.
Similar to device driver NAPI implementations, the process_backlog code disables its poller if the total weight will not be used. The poller is restarted by the call to ____napi_schedule from enqueue_to_backlog, as described above.
The function returns the amount of work done, which net_rx_action (described above) will subtract from the budget (which is adjusted with net.core.netdev_budget, as described above).
__netif_receive_skb_core delivers data to the protocol layers
The __netif_receive_skb_core function delivers data to the protocol layers. It does this by obtaining the protocol field from the skb and iterating across the list of deliver functions registered for that protocol type.

This happens in this piece of code (shown above):

```c
type = skb->protocol;
list_for_each_entry_rcu(ptype,
        &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
    if (ptype->type == type &&
        (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
         ptype->dev == orig_dev)) {
        if (pt_prev)
            ret = deliver_skb(skb, pt_prev, orig_dev);
        pt_prev = ptype;
    }
}
```
The ptype_base identifier is defined at net/core/dev.c:146 as a hash table of lists:

```c
struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
```
Each protocol layer adds a struct packet_type to a list at a given slot in the hash table.

The slot in the hash table is computed by ptype_head:

```c
static inline struct list_head *ptype_head(const struct packet_type *pt)
{
    if (pt->type == htons(ETH_P_ALL))
        return &ptype_all;
    else
        return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}
```
Protocol layers call dev_add_pack to add themselves to the list.
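This registration-plus-dispatch pattern can be modeled in user space as follows. This is a simplified sketch with hypothetical names, keeping one handler per hash slot instead of the kernel's per-slot lists:

```c
#include <stdio.h>

#define PTYPE_HASH_SIZE 16
#define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)
#define ETH_P_IP 0x0800

typedef void (*pkt_handler)(const char *pkt);

static pkt_handler ptype_table[PTYPE_HASH_SIZE];

/* Like dev_add_pack: hash the protocol type into a slot. */
static void add_pack_model(unsigned short type, pkt_handler fn)
{
    ptype_table[type & PTYPE_HASH_MASK] = fn;
}

static void ip_rcv_model(const char *pkt)
{
    printf("IP layer received: %s\n", pkt);
}

int main(void)
{
    pkt_handler fn;

    add_pack_model(ETH_P_IP, ip_rcv_model);

    /* Delivery: look up the handler by protocol type and call it. */
    fn = ptype_table[ETH_P_IP & PTYPE_HASH_MASK];
    if (fn)
        fn("example packet");
    return 0;
}
```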
IP protocol layer
The IP protocol layer plugs itself into the ptype_base hash table so that data will be delivered to it from the lower layers.

This happens in inet_init (net/ipv4/af_inet.c:1815):

```c
dev_add_pack(&ip_packet_type);
```

This registers the IP packet type structure, defined as:

```c
static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};
```
__netif_receive_skb_core calls deliver_skb (as seen in the section above), which invokes the protocol's registered function (net/core/dev.c:1712):

```c
static inline int deliver_skb(struct sk_buff *skb,
                              struct packet_type *pt_prev,
                              struct net_device *orig_dev)
{
    if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
        return -ENOMEM;
    atomic_inc(&skb->users);
    return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}
```
In the case of IP, the ip_rcv function is called.
ip_rcv
The ip_rcv function is pretty straightforward at a high level. There are several integrity checks to ensure the data is valid, and statistics counters are bumped as well.

ip_rcv ends by passing the packet to ip_rcv_finish by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on (net/ipv4/ip_input.c:453):

```c
return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
               ip_rcv_finish);
```
netfilter
The NF_HOOK_THRESH function is pretty simple. It calls down to nf_hook_thresh and, on success, calls okfn, which in our case is ip_rcv_finish (include/linux/netfilter.h:175):

```c
static inline int NF_HOOK_THRESH(uint8_t pf, unsigned int hook,
                                 struct sk_buff *skb,
                                 struct net_device *in,
                                 struct net_device *out,
                                 int (*okfn)(struct sk_buff *), int thresh)
{
    int ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, thresh);
    if (ret == 1)
        ret = okfn(skb);
    return ret;
}
```

The nf_hook_thresh function continues the descent into iptables. It begins by determining whether there are any netfilter hooks registered for the netfilter protocol family and chain type that were passed in.
In our example above, the protocol family is NFPROTO_IPV4 and the chain type is NF_INET_PRE_ROUTING:

```c
/**
 * nf_hook_thresh - call a netfilter hook
 *
 * Returns 1 if the hook has allowed the packet to pass. The function
 * okfn must be invoked by the caller in this case. Any other return
 * value indicates the packet has been consumed by the hook.
 */
static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
                                 struct sk_buff *skb,
                                 struct net_device *indev,
                                 struct net_device *outdev,
                                 int (*okfn)(struct sk_buff *), int thresh)
{
    if (nf_hooks_active(pf, hook))
        return nf_hook_slow(pf, hook, skb, indev, outdev, okfn, thresh);
    return 1;
}
```
This function calls nf_hooks_active, which examines the nf_hooks table (include/linux/netfilter.h:114):

```c
static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
{
    return !list_empty(&nf_hooks[pf][hook]);
}
```
And if there is a hook registered, nf_hook_slow is called to descend further into iptables.
nf_hook_slow
nf_hook_slow iterates through the list of hooks in the nf_hooks table for the given protocol family and chain type, calling nf_iterate for each entry on the hook list.
nf_iterate, in turn, calls the hook function associated with each entry on the list, which returns a "verdict" about the packet.
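The iterate-until-verdict pattern can be sketched in user space as follows. This is a model with hypothetical names, not the kernel's nf_iterate, which also handles cases like NF_REPEAT and queueing:

```c
#include <stdio.h>

enum verdict { V_DROP, V_ACCEPT }; /* stand-ins for NF_DROP/NF_ACCEPT */

typedef enum verdict (*hook_fn)(const char *pkt);

/* Walk the hook list in order; any verdict other than accept stops
 * the walk and is returned to the caller. */
static enum verdict iterate_hooks(hook_fn *hooks, int n, const char *pkt)
{
    for (int i = 0; i < n; i++) {
        enum verdict v = hooks[i](pkt);

        if (v != V_ACCEPT)
            return v;
    }
    return V_ACCEPT; /* the caller may now invoke okfn */
}

static enum verdict accept_all(const char *pkt)
{
    (void)pkt;
    return V_ACCEPT;
}

int main(void)
{
    hook_fn hooks[] = { accept_all, accept_all };

    printf("verdict: %d\n", iterate_hooks(hooks, 2, "example packet"));
    return 0;
}
```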
iptables

iptables registers hook functions for each of the packet matching tables: filter, nat, mangle, raw, and security.
In our example, we are interested in the NF_INET_PRE_ROUTING chains, which are in the nat table.

Indeed, the struct with the hook function pointer that is registered with netfilter can be found in net/ipv4/netfilter/iptable_nat.c:251:
```c
static struct nf_hook_ops nf_nat_ipv4_ops[] __read_mostly = {
    /* Before packet filtering, change destination */
    {
        .hook     = nf_nat_ipv4_in,
        .owner    = THIS_MODULE,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_NAT_DST,
    },
```
which is registered in iptable_nat_init (net/ipv4/netfilter/iptable_nat.c:316):

```c
err = nf_register_hooks(nf_nat_ipv4_ops, ARRAY_SIZE(nf_nat_ipv4_ops));
if (err < 0)
    goto err2;
```
In our IP layer example above, packets will be handed to nf_nat_ipv4_in as they descend further into iptables via the nf_hook_slow function described in the previous section.
nf_nat_ipv4_in
nf_nat_ipv4_in passes the packet on to nf_nat_ipv4_fn, which begins by obtaining the conntrack information for the packet:

```c
struct nf_conn *ct;
enum ip_conntrack_info ctinfo;

/* slightly abbreviated code sample */
ct = nf_ct_get(skb, &ctinfo);
```
If the packet being examined belongs to a new connection, nf_nat_rule_find is called (net/ipv4/netfilter/iptable_nat.c:117):

```c
case IP_CT_NEW:
    /* Seen it before? This can happen for loopback, retrans,
     * or local packets.
     */
    if (!nf_nat_initialized(ct, maniptype)) {
        unsigned int ret;

        ret = nf_nat_rule_find(skb, ops->hooknum, in, out, ct);
        if (ret != NF_ACCEPT)
            return ret;
```
And finally, nf_nat_rule_find calls ipt_do_table, entering the iptables subsystem proper. This is as deep as we'll go into netfilter and iptables, as they are complex enough to warrant their own multi-page documents.
The return value of ipt_do_table will either:

- not be NF_ACCEPT, in which case it is returned immediately, or
- be NF_ACCEPT, causing nf_nat_ipv4_fn to call nf_nat_packet for packet handling and return either NF_ACCEPT or NF_DROP (see the sketch below).
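A tiny sketch of this branch, with hypothetical names standing in for the kernel functions:

```c
#include <stdio.h>

enum { MODEL_NF_DROP, MODEL_NF_ACCEPT };

static int nat_packet_model(void)
{
    return MODEL_NF_ACCEPT; /* stand-in for nf_nat_packet */
}

/* Models nf_nat_ipv4_fn's handling of ipt_do_table's verdict. */
static int nat_fn_model(int table_verdict)
{
    if (table_verdict != MODEL_NF_ACCEPT)
        return table_verdict;  /* anything else is returned immediately */
    return nat_packet_model(); /* NF_ACCEPT or NF_DROP */
}

int main(void)
{
    printf("verdict: %d\n", nat_fn_model(MODEL_NF_ACCEPT));
    return 0;
}
```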
Unwinding the return value
In either case, the return value of ipt_do_table becomes the final value of nf_nat_ipv4_fn, which is returned back up through all the functions described above until it reaches NF_HOOK_THRESH:

- nf_nat_ipv4_fn's return value is returned from nf_nat_ipv4_in,
- which returns to nf_iterate,
- which returns to nf_hook_slow,
- which returns to nf_hook_thresh,
- which returns to NF_HOOK_THRESH.
NF_HOOK_THRESH checks the return value and, if it is NF_ACCEPT (1), calls the function pointed to by okfn.
In our example, okfn is ip_rcv_finish, which will do some processing of its own and pass the packet up to the next protocol layer.