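S5P6818 Driver Series (26): Network Drivers
Network drivers are one of the three major classes of Linux drivers. Linux has very capable networking support, and embedded Linux systems use it frequently. Earlier chapters covered character device drivers and block device drivers; this chapter looks at Linux network device drivers.
Network driver framework
The net_device structure
The Linux kernel uses the net_device structure to represent one specific network device, and net_device is the heart of every network driver. The core work of a network driver is to initialize the members of net_device and then register the initialized net_device with the kernel. net_device is defined in include/linux/netdevice.h. It is a very large structure, shown below: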
/**
 * struct net_device - The DEVICE structure.
 *
 * Actually, this whole structure is a big mistake. It mixes I/O
 * data with strictly "high-level" data, and it has to know about
 * almost every data structure used in the INET module.
 *
 * @name: This is the first field of the "visible" part of this structure
 *     (i.e. as seen by users in the "Space.c" file). It is the name
 *     of the interface.
 *
 * @name_node: Name hashlist node
 * @ifalias: SNMP alias
 * @mem_end: Shared memory end
 * @mem_start: Shared memory start
 * @base_addr: Device I/O address
 * @irq: Device IRQ number
 *
 * @state: Generic network queuing layer state, see netdev_state_t
 * @dev_list: The global list of network devices
 * @napi_list: List entry used for polling NAPI devices
 * @unreg_list: List entry when we are unregistering the
 *     device; see the function unregister_netdev
 * @close_list: List entry used when we are closing the device
 * @ptype_all: Device-specific packet handlers for all protocols
 * @ptype_specific: Device-specific, protocol-specific packet handlers
 *
 * @adj_list: Directly linked devices, like slaves for bonding
 * @features: Currently active device features
 * @hw_features: User-changeable features
 *
 * @wanted_features: User-requested features
 * @vlan_features: Mask of features inheritable by VLAN devices
 *
 * @hw_enc_features: Mask of features inherited by encapsulating devices
 *     This field indicates what encapsulation
 *     offloads the hardware is capable of doing,
 *     and drivers will need to set them appropriately.
 *
 * @mpls_features: Mask of features inheritable by MPLS
 * @gso_partial_features: value(s) from NETIF_F_GSO\*
 *
 * @ifindex: interface index
 * @group: The group the device belongs to
 *
 * @stats: Statistics struct, which was left as a legacy, use
 *     rtnl_link_stats64 instead
 *
 * @core_stats: core networking counters,
 *     do not use this in drivers
 * @carrier_up_count: Number of times the carrier has been up
 * @carrier_down_count: Number of times the carrier has been down
 *
 * @wireless_handlers: List of functions to handle Wireless Extensions,
 *     instead of ioctl,
 *     see <net/iw_handler.h> for details.
 * @wireless_data: Instance data managed by the core of wireless extensions
 *
 * @netdev_ops: Includes several pointers to callbacks,
 *     if one wants to override the ndo_*() functions
 * @ethtool_ops: Management operations
 * @l3mdev_ops: Layer 3 master device operations
 * @ndisc_ops: Includes callbacks for different IPv6 neighbour
 *     discovery handling. Necessary for e.g. 6LoWPAN.
 * @xfrmdev_ops: Transformation offload operations
 * @tlsdev_ops: Transport Layer Security offload operations
 * @header_ops: Includes callbacks for creating,parsing,caching,etc
 *     of Layer 2 headers.
 *
 * @flags: Interface flags (a la BSD)
 * @priv_flags: Like 'flags' but invisible to userspace,
 *     see if.h for the definitions
 * @gflags: Global flags ( kept as legacy )
 * @padded: How much padding added by alloc_netdev()
 * @operstate: RFC2863 operstate
 * @link_mode: Mapping policy to operstate
 * @if_port: Selectable AUI, TP, ...
 * @dma: DMA channel
 * @mtu: Interface MTU value
 * @min_mtu: Interface Minimum MTU value
 * @max_mtu: Interface Maximum MTU value
 * @type: Interface hardware type
 * @hard_header_len: Maximum hardware header length.
 * @min_header_len: Minimum hardware header length
 *
 * @needed_headroom: Extra headroom the hardware may need, but not in all
 *     cases can this be guaranteed
 * @needed_tailroom: Extra tailroom the hardware may need, but not in all
 *     cases can this be guaranteed. Some cases also use
 *     LL_MAX_HEADER instead to allocate the skb
 *
 * interface address info:
 *
 * @perm_addr: Permanent hw address
 * @addr_assign_type: Hw address assignment type
 * @addr_len: Hardware address length
 * @upper_level: Maximum depth level of upper devices.
 * @lower_level: Maximum depth level of lower devices.
 * @neigh_priv_len: Used in neigh_alloc()
 * @dev_id: Used to differentiate devices that share
 *     the same link layer address
 * @dev_port: Used to differentiate devices that share
 *     the same function
 * @addr_list_lock: XXX: need comments on this one
 * @name_assign_type: network interface name assignment type
 * @uc_promisc: Counter that indicates promiscuous mode
 *     has been enabled due to the need to listen to
 *     additional unicast addresses in a device that
 *     does not implement ndo_set_rx_mode()
 * @uc: unicast mac addresses
 * @mc: multicast mac addresses
 * @dev_addrs: list of device hw addresses
 * @queues_kset: Group of all Kobjects in the Tx and RX queues
 * @promiscuity: Number of times the NIC is told to work in
 *     promiscuous mode; if it becomes 0 the NIC will
 *     exit promiscuous mode
 * @allmulti: Counter, enables or disables allmulticast mode
 *
 * @vlan_info: VLAN info
 * @dsa_ptr: dsa specific data
 * @tipc_ptr: TIPC specific data
 * @atalk_ptr: AppleTalk link
 * @ip_ptr: IPv4 specific data
 * @ip6_ptr: IPv6 specific data
 * @ax25_ptr: AX.25 specific data
 * @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
 * @ieee802154_ptr: IEEE 802.15.4 low-rate Wireless Personal Area Network
 *     device struct
 * @mpls_ptr: mpls_dev struct pointer
 * @mctp_ptr: MCTP specific data
 *
 * @dev_addr: Hw address (before bcast,
 *     because most packets are unicast)
 *
 * @_rx: Array of RX queues
 * @num_rx_queues: Number of RX queues
 *     allocated at register_netdev() time
 * @real_num_rx_queues: Number of RX queues currently active in device
 * @xdp_prog: XDP sockets filter program pointer
 * @gro_flush_timeout: timeout for GRO layer in NAPI
 * @napi_defer_hard_irqs: If not zero, provides a counter that would
 *     allow to avoid NIC hard IRQ, on busy queues.
 *
 * @rx_handler: handler for received packets
 * @rx_handler_data: XXX: need comments on this one
 * @miniq_ingress: ingress/clsact qdisc specific data for
 *     ingress processing
 * @ingress_queue: XXX: need comments on this one
 * @nf_hooks_ingress: netfilter hooks executed for ingress packets
 * @broadcast: hw bcast address
 *
 * @rx_cpu_rmap: CPU reverse-mapping for RX completion interrupts,
 *     indexed by RX queue number. Assigned by driver.
 *     This must only be set if the ndo_rx_flow_steer
 *     operation is defined
 * @index_hlist: Device index hash chain
 *
 * @_tx: Array of TX queues
 * @num_tx_queues: Number of TX queues allocated at alloc_netdev_mq() time
 * @real_num_tx_queues: Number of TX queues currently active in device
 * @qdisc: Root qdisc from userspace point of view
 * @tx_queue_len: Max frames per queue allowed
 * @tx_global_lock: XXX: need comments on this one
 * @xdp_bulkq: XDP device bulk queue
 * @xps_maps: all CPUs/RXQs maps for XPS device
 *
 * @xps_maps: XXX: need comments on this one
 * @miniq_egress: clsact qdisc specific data for
 *     egress processing
 * @nf_hooks_egress: netfilter hooks executed for egress packets
 * @qdisc_hash: qdisc hash table
 * @watchdog_timeo: Represents the timeout that is used by
 *     the watchdog (see dev_watchdog())
 * @watchdog_timer: List of timers
 *
 * @proto_down_reason: reason a netdev interface is held down
 * @pcpu_refcnt: Number of references to this device
 * @dev_refcnt: Number of references to this device
 * @refcnt_tracker: Tracker directory for tracked references to this device
 * @todo_list: Delayed register/unregister
 * @link_watch_list: XXX: need comments on this one
 *
 * @reg_state: Register/unregister state machine
 * @dismantle: Device is going to be freed
 * @rtnl_link_state: This enum represents the phases of creating
 *     a new link
 *
 * @needs_free_netdev: Should unregister perform free_netdev?
 * @priv_destructor: Called from unregister
 * @npinfo: XXX: need comments on this one
 * @nd_net: Network namespace this network device is inside
 *
 * @ml_priv: Mid-layer private
 * @ml_priv_type: Mid-layer private type
 * @lstats: Loopback statistics
 * @tstats: Tunnel statistics
 * @dstats: Dummy statistics
 * @vstats: Virtual ethernet statistics
 *
 * @garp_port: GARP
 * @mrp_port: MRP
 *
 * @dm_private: Drop monitor private
 *
 * @dev: Class/net/name entry
 * @sysfs_groups: Space for optional device, statistics and wireless
 *     sysfs groups
 *
 * @sysfs_rx_queue_group: Space for optional per-rx queue attributes
 * @rtnl_link_ops: Rtnl_link_ops
 *
 * @gso_max_size: Maximum size of generic segmentation offload
 * @tso_max_size: Device (as in HW) limit on the max TSO request size
 * @gso_max_segs: Maximum number of segments that can be passed to the
 *     NIC for GSO
 * @tso_max_segs: Device (as in HW) limit on the max TSO segment count
 *
 * @dcbnl_ops: Data Center Bridging netlink ops
 * @num_tc: Number of traffic classes in the net device
 * @tc_to_txq: XXX: need comments on this one
 * @prio_tc_map: XXX: need comments on this one
 *
 * @fcoe_ddp_xid: Max exchange id for FCoE LRO by ddp
 *
 * @priomap: XXX: need comments on this one
 * @phydev: Physical device may attach itself
 *     for hardware timestamping
 * @sfp_bus: attached &struct sfp_bus structure.
 *
 * @qdisc_tx_busylock: lockdep class annotating Qdisc->busylock spinlock
 *
 * @proto_down: protocol port state information can be sent to the
 *     switch driver and used to set the phys state of the
 *     switch port.
 *
 * @wol_enabled: Wake-on-LAN is enabled
 *
 * @threaded: napi threaded mode is enabled
 *
 * @net_notifier_list: List of per-net netdev notifier block
 *     that follow this device when it is moved
 *     to another network namespace.
 *
 * @macsec_ops: MACsec offloading ops
 *
 * @udp_tunnel_nic_info: static structure describing the UDP tunnel
 *     offload capabilities of the device
 * @udp_tunnel_nic: UDP tunnel offload state
 * @xdp_state: stores info on attached XDP BPF programs
 *
 * @nested_level: Used as a parameter of spin_lock_nested() of
 *     dev->addr_list_lock.
 * @unlink_list: As netif_addr_lock() can be called recursively,
 *     keep a list of interfaces to be deleted.
 * @gro_max_size: Maximum size of aggregated packet in generic
 *     receive offload (GRO)
 *
 * @dev_addr_shadow: Copy of @dev_addr to catch direct writes.
 * @linkwatch_dev_tracker: refcount tracker used by linkwatch.
 * @watchdog_dev_tracker: refcount tracker used by watchdog.
 * @dev_registered_tracker: tracker for reference held while
 *     registered
 * @offload_xstats_l3: L3 HW stats for this netdevice.
 *
 * FIXME: cleanup struct net_device such that network protocol info
 * moves out.
 */
struct net_device {
	char name[IFNAMSIZ];
	struct netdev_name_node *name_node;
	struct dev_ifalias __rcu *ifalias;
	/*
	 * I/O specific fields
	 * FIXME: Merge these and struct ifmap into one
	 */
	unsigned long mem_end;
	unsigned long mem_start;
	unsigned long base_addr;

	/*
	 * Some hardware also needs these fields (state,dev_list,
	 * napi_list,unreg_list,close_list) but they are not
	 * part of the usual set specified in Space.c.
	 */
	unsigned long state;

	struct list_head dev_list;
	struct list_head napi_list;
	struct list_head unreg_list;
	struct list_head close_list;
	struct list_head ptype_all;
	struct list_head ptype_specific;

	struct {
		struct list_head upper;
		struct list_head lower;
	} adj_list;

	/* Read-mostly cache-line for fast-path access */
	unsigned int flags;
	unsigned long long priv_flags;
	const struct net_device_ops *netdev_ops;
	int ifindex;
	unsigned short gflags;
	unsigned short hard_header_len;

	/* Note : dev->mtu is often read without holding a lock.
	 * Writers usually hold RTNL.
	 * It is recommended to use READ_ONCE() to annotate the reads,
	 * and to use WRITE_ONCE() to annotate the writes.
	 */
	unsigned int mtu;
	unsigned short needed_headroom;
	unsigned short needed_tailroom;

	netdev_features_t features;
	netdev_features_t hw_features;
	netdev_features_t wanted_features;
	netdev_features_t vlan_features;
	netdev_features_t hw_enc_features;
	netdev_features_t mpls_features;
	netdev_features_t gso_partial_features;

	unsigned int min_mtu;
	unsigned int max_mtu;
	unsigned short type;
	unsigned char min_header_len;
	unsigned char name_assign_type;

	int group;

	struct net_device_stats stats; /* not used by modern drivers */

	struct net_device_core_stats __percpu *core_stats;

	/* Stats to monitor link on/off, flapping */
	atomic_t carrier_up_count;
	atomic_t carrier_down_count;

#ifdef CONFIG_WIRELESS_EXT
	const struct iw_handler_def *wireless_handlers;
	struct iw_public_data *wireless_data;
#endif
	const struct ethtool_ops *ethtool_ops;
#ifdef CONFIG_NET_L3_MASTER_DEV
	const struct l3mdev_ops *l3mdev_ops;
#endif
#if IS_ENABLED(CONFIG_IPV6)
	const struct ndisc_ops *ndisc_ops;
#endif
#ifdef CONFIG_XFRM_OFFLOAD
	const struct xfrmdev_ops *xfrmdev_ops;
#endif
#if IS_ENABLED(CONFIG_TLS_DEVICE)
	const struct tlsdev_ops *tlsdev_ops;
#endif
	const struct header_ops *header_ops;

	unsigned char operstate;
	unsigned char link_mode;
	unsigned char if_port;
	unsigned char dma;

	/* Interface address info. */
	unsigned char perm_addr[MAX_ADDR_LEN];
	unsigned char addr_assign_type;
	unsigned char addr_len;
	unsigned char upper_level;
	unsigned char lower_level;
	unsigned short neigh_priv_len;
	unsigned short dev_id;
	unsigned short dev_port;
	unsigned short padded;

	spinlock_t addr_list_lock;
	int irq;

	struct netdev_hw_addr_list uc;
	struct netdev_hw_addr_list mc;
	struct netdev_hw_addr_list dev_addrs;

#ifdef CONFIG_SYSFS
	struct kset *queues_kset;
#endif
#ifdef CONFIG_LOCKDEP
	struct list_head unlink_list;
#endif
	unsigned int promiscuity;
	unsigned int allmulti;
	bool uc_promisc;
#ifdef CONFIG_LOCKDEP
	unsigned char nested_level;
#endif

	/* Protocol-specific pointers */
	struct in_device __rcu *ip_ptr;
	struct inet6_dev __rcu *ip6_ptr;
#if IS_ENABLED(CONFIG_VLAN_8021Q)
	struct vlan_info __rcu *vlan_info;
#endif
#if IS_ENABLED(CONFIG_NET_DSA)
	struct dsa_port *dsa_ptr;
#endif
#if IS_ENABLED(CONFIG_TIPC)
	struct tipc_bearer __rcu *tipc_ptr;
#endif
#if IS_ENABLED(CONFIG_ATALK)
	void *atalk_ptr;
#endif
#if IS_ENABLED(CONFIG_AX25)
	void *ax25_ptr;
#endif
#if IS_ENABLED(CONFIG_CFG80211)
	struct wireless_dev *ieee80211_ptr;
#endif
#if IS_ENABLED(CONFIG_IEEE802154) || IS_ENABLED(CONFIG_6LOWPAN)
	struct wpan_dev *ieee802154_ptr;
#endif
#if IS_ENABLED(CONFIG_MPLS_ROUTING)
	struct mpls_dev __rcu *mpls_ptr;
#endif
#if IS_ENABLED(CONFIG_MCTP)
	struct mctp_dev __rcu *mctp_ptr;
#endif

	/*
	 * Cache lines mostly used on receive path (including eth_type_trans())
	 */
	/* Interface address info used in eth_type_trans() */
	const unsigned char *dev_addr;

	struct netdev_rx_queue *_rx;
	unsigned int num_rx_queues;
	unsigned int real_num_rx_queues;

	struct bpf_prog __rcu *xdp_prog;
	unsigned long gro_flush_timeout;
	int napi_defer_hard_irqs;
#define GRO_LEGACY_MAX_SIZE 65536u
/* TCP minimal MSS is 8 (TCP_MIN_GSO_SIZE),
 * and shinfo->gso_segs is a 16bit field.
 */
#define GRO_MAX_SIZE (8 * 65535u)
	unsigned int gro_max_size;
	rx_handler_func_t __rcu *rx_handler;
	void __rcu *rx_handler_data;

#ifdef CONFIG_NET_CLS_ACT
	struct mini_Qdisc __rcu *miniq_ingress;
#endif
	struct netdev_queue __rcu *ingress_queue;
#ifdef CONFIG_NETFILTER_INGRESS
	struct nf_hook_entries __rcu *nf_hooks_ingress;
#endif

	unsigned char broadcast[MAX_ADDR_LEN];
#ifdef CONFIG_RFS_ACCEL
	struct cpu_rmap *rx_cpu_rmap;
#endif
	struct hlist_node index_hlist;

	/*
	 * Cache lines mostly used on transmit path
	 */
	struct netdev_queue *_tx ____cacheline_aligned_in_smp;
	unsigned int num_tx_queues;
	unsigned int real_num_tx_queues;
	struct Qdisc __rcu *qdisc;
	unsigned int tx_queue_len;
	spinlock_t tx_global_lock;

	struct xdp_dev_bulk_queue __percpu *xdp_bulkq;

#ifdef CONFIG_XPS
	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
#endif
#ifdef CONFIG_NET_CLS_ACT
	struct mini_Qdisc __rcu *miniq_egress;
#endif
#ifdef CONFIG_NETFILTER_EGRESS
	struct nf_hook_entries __rcu *nf_hooks_egress;
#endif

#ifdef CONFIG_NET_SCHED
	DECLARE_HASHTABLE (qdisc_hash, 4);
#endif
	/* These may be needed for future network-power-down code. */
	struct timer_list watchdog_timer;
	int watchdog_timeo;

	u32 proto_down_reason;

	struct list_head todo_list;

#ifdef CONFIG_PCPU_DEV_REFCNT
	int __percpu *pcpu_refcnt;
#else
	refcount_t dev_refcnt;
#endif
	struct ref_tracker_dir refcnt_tracker;

	struct list_head link_watch_list;

	enum { NETREG_UNINITIALIZED=0,
	       NETREG_REGISTERED,	/* completed register_netdevice */
	       NETREG_UNREGISTERING,	/* called unregister_netdevice */
	       NETREG_UNREGISTERED,	/* completed unregister todo */
	       NETREG_RELEASED,		/* called free_netdev */
	       NETREG_DUMMY,		/* dummy device for NAPI poll */
	} reg_state:8;

	bool dismantle;

	enum {
		RTNL_LINK_INITIALIZED,
		RTNL_LINK_INITIALIZING,
	} rtnl_link_state:16;

	bool needs_free_netdev;
	void (*priv_destructor)(struct net_device *dev);

#ifdef CONFIG_NETPOLL
	struct netpoll_info __rcu *npinfo;
#endif

	possible_net_t nd_net;

	/* mid-layer private */
	void *ml_priv;
	enum netdev_ml_priv_type ml_priv_type;

	union {
		struct pcpu_lstats __percpu *lstats;
		struct pcpu_sw_netstats __percpu *tstats;
		struct pcpu_dstats __percpu *dstats;
	};

#if IS_ENABLED(CONFIG_GARP)
	struct garp_port __rcu *garp_port;
#endif
#if IS_ENABLED(CONFIG_MRP)
	struct mrp_port __rcu *mrp_port;
#endif
#if IS_ENABLED(CONFIG_NET_DROP_MONITOR)
	struct dm_hw_stat_delta __rcu *dm_private;
#endif
	struct device dev;
	const struct attribute_group *sysfs_groups[4];
	const struct attribute_group *sysfs_rx_queue_group;

	const struct rtnl_link_ops *rtnl_link_ops;

	/* for setting kernel sock attribute on TCP connection setup */
#define GSO_MAX_SEGS 65535u
#define GSO_LEGACY_MAX_SIZE 65536u
/* TCP minimal MSS is 8 (TCP_MIN_GSO_SIZE),
 * and shinfo->gso_segs is a 16bit field.
 */
#define GSO_MAX_SIZE (8 * GSO_MAX_SEGS)

	unsigned int gso_max_size;
#define TSO_LEGACY_MAX_SIZE 65536
#define TSO_MAX_SIZE UINT_MAX
	unsigned int tso_max_size;
	u16 gso_max_segs;
#define TSO_MAX_SEGS U16_MAX
	u16 tso_max_segs;

#ifdef CONFIG_DCB
	const struct dcbnl_rtnl_ops *dcbnl_ops;
#endif
	s16 num_tc;
	struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
	u8 prio_tc_map[TC_BITMASK + 1];

#if IS_ENABLED(CONFIG_FCOE)
	unsigned int fcoe_ddp_xid;
#endif
#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
	struct netprio_map __rcu *priomap;
#endif
	struct phy_device *phydev;
	struct sfp_bus *sfp_bus;
	struct lock_class_key *qdisc_tx_busylock;
	bool proto_down;
	unsigned wol_enabled:1;
	unsigned threaded:1;

	struct list_head net_notifier_list;

#if IS_ENABLED(CONFIG_MACSEC)
	/* MACsec management functions */
	const struct macsec_ops *macsec_ops;
#endif
	const struct udp_tunnel_nic_info *udp_tunnel_nic_info;
	struct udp_tunnel_nic *udp_tunnel_nic;

	/* protected by rtnl_lock */
	struct bpf_xdp_entity xdp_state[__MAX_XDP_MODE];

	u8 dev_addr_shadow[MAX_ADDR_LEN];
	netdevice_tracker linkwatch_dev_tracker;
	netdevice_tracker watchdog_dev_tracker;
	netdevice_tracker dev_registered_tracker;
	struct rtnl_hw_stats64 *offload_xstats_l3;
};
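struct net_device is one of the central data structures of the Linux networking subsystem. It represents one network interface (an Ethernet card such as eth0, a Wi-Fi interface such as wlan0, the loopback interface lo, and so on) and holds the information needed to manage and operate that device, from hardware configuration to protocol state to packet queues.
Its key members and their roles are as follows:
Basic identification and configuration:
name[IFNAMSIZ]: the interface name (e.g. "eth0").
name_node, ifindex: the name hash-list node and the interface index, which identify the device uniquely within the system.
irq: the interrupt number used by the device.
mem_end, mem_start, base_addr: the end and start of the device's shared memory region and its I/O address (rarely used directly by modern PCI/PCIe devices).
mtu, min_mtu, max_mtu: the maximum transmission unit in bytes, together with the smallest and largest values it may be set to.
type: the hardware type (e.g. ARPHRD_ETHER for Ethernet).
flags, priv_flags: interface flags (e.g. IFF_UP means the interface is up, IFF_PROMISC means promiscuous mode). priv_flags holds kernel-internal flags that are not visible to user space.
operstate, link_mode: the operational link state (RFC 2863) and the policy that maps onto it.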
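The relationship between mtu, min_mtu and max_mtu can be sketched in user space. This is a minimal sketch, not kernel code: mock_mtu_dev and mock_change_mtu are hypothetical names, and the sketch assumes that a max_mtu of 0 means "no upper limit".

```c
#include <errno.h>

/* Hypothetical user-space stand-in for the MTU fields of net_device. */
struct mock_mtu_dev {
    unsigned int mtu;
    unsigned int min_mtu;
    unsigned int max_mtu;   /* assumption: 0 means "no upper limit" */
};

/* Accept new_mtu only if it lies in [min_mtu, max_mtu]; otherwise
 * return -EINVAL and leave the current MTU unchanged. */
int mock_change_mtu(struct mock_mtu_dev *dev, int new_mtu)
{
    if (new_mtu < 0 || (unsigned int)new_mtu < dev->min_mtu)
        return -EINVAL;
    if (dev->max_mtu > 0 && (unsigned int)new_mtu > dev->max_mtu)
        return -EINVAL;
    dev->mtu = (unsigned int)new_mtu;
    return 0;
}
```

With min_mtu = 68 and max_mtu = 1500, a request for 1400 is accepted, while requests for 67 or 9000 are rejected and mtu keeps its previous value.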
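Address information:
perm_addr: the permanent hardware MAC address.
dev_addr: the MAC address currently in use (usually equal to perm_addr, but it can be changed).
addr_len: the hardware address length (6 for Ethernet).
addr_assign_type: how the MAC address was assigned (e.g. permanent, random, set by the user).
broadcast: the hardware broadcast address.
uc, mc, dev_addrs: lists holding the device's unicast (uc), multicast (mc) and device hardware (dev_addrs) addresses.
promiscuity, allmulti, uc_promisc: counters that track whether promiscuous mode (receive every frame) and all-multicast mode are enabled.
dev_addr_shadow: a copy of dev_addr used to catch direct writes to it.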
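Because promiscuity is a counter rather than a boolean, several users can request promiscuous mode independently, and the interface leaves that mode only when the count returns to 0. The following user-space sketch illustrates the idea; mock_promisc_dev, mock_set_promiscuity and MOCK_IFF_PROMISC are hypothetical names, not kernel APIs.

```c
#include <errno.h>

#define MOCK_IFF_PROMISC 0x100u   /* hypothetical flag bit for this sketch */

/* Hypothetical stand-in for the two net_device fields involved. */
struct mock_promisc_dev {
    unsigned int flags;
    unsigned int promiscuity;
};

/* inc is +1 when a user starts needing promiscuous mode and -1 when it
 * stops. The flag is set while at least one user remains. */
int mock_set_promiscuity(struct mock_promisc_dev *dev, int inc)
{
    if (inc != 1 && inc != -1)
        return -EINVAL;
    if (inc < 0 && dev->promiscuity == 0)
        return -EINVAL;           /* unbalanced release */

    if (inc > 0)
        dev->promiscuity++;
    else
        dev->promiscuity--;

    if (dev->promiscuity > 0)
        dev->flags |= MOCK_IFF_PROMISC;
    else
        dev->flags &= ~MOCK_IFF_PROMISC;
    return 0;
}
```

Two requests followed by one release leave the flag set; only the second release clears it.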
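Queues and packet processing (the performance-critical part):
_rx, num_rx_queues, real_num_rx_queues: the array of receive queues, the number of queues allocated, and the number currently active (used by multi-queue NICs with RSS/RPS).
_tx, num_tx_queues, real_num_tx_queues: the array of transmit queues, the number of queues allocated, and the number currently active.
qdisc: the root queuing discipline used for traffic control (QoS) on this device (e.g. pfifo_fast).
xdp_prog: pointer to the attached XDP (eXpress Data Path) eBPF program, used for high-performance packet processing at the driver level.
miniq_ingress, miniq_egress: pointers to the small clsact Qdisc structures for the ingress and egress directions (used for eBPF classification and actions).
rx_handler, rx_handler_data: an optional receive handler and its data (commonly used by bridging, MACVLAN and similar).
gro_max_size, gso_max_size, gso_max_segs, tso_max_size, tso_max_segs: limits on the maximum packet size and segment count for GRO (Generic Receive Offload), GSO (Generic Segmentation Offload) and TSO (TCP Segmentation Offload).
gro_flush_timeout, napi_defer_hard_irqs: tuning parameters for NAPI (New API) polling, used to reduce interrupt overhead.
Features:
features, hw_features, wanted_features, vlan_features, hw_enc_features, mpls_features, gso_partial_features: bit masks of type netdev_features_t. They record, respectively, the features currently active, the features the user may change, the features the user has requested, and the features that VLAN, encapsulating and MPLS devices may inherit. Examples are checksum offload (NETIF_F_HW_CSUM), TSO (NETIF_F_TSO) and GRO (NETIF_F_GRO).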
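How the three main masks interact can be sketched in user space: bits outside hw_features are not user-changeable and keep their current value, while changeable bits follow the user's request. This is a simplified sketch with hypothetical names (mock_feat_dev, mock_wanted, mock_request, MOCK_F_*); it assumes the driver applies no further constraints, whereas a real driver may adjust the result in ndo_fix_features.

```c
#include <stdint.h>

typedef uint64_t mock_features_t;   /* stand-in for netdev_features_t */

/* Hypothetical feature bits for this sketch only. */
#define MOCK_F_HW_CSUM (1ULL << 0)
#define MOCK_F_TSO     (1ULL << 1)
#define MOCK_F_GRO     (1ULL << 2)

struct mock_feat_dev {
    mock_features_t features;         /* currently active */
    mock_features_t hw_features;      /* user-changeable */
    mock_features_t wanted_features;  /* user-requested */
};

/* Fixed bits stay as they are; changeable bits follow the request. */
mock_features_t mock_wanted(const struct mock_feat_dev *dev)
{
    return (dev->features & ~dev->hw_features) | dev->wanted_features;
}

/* Record a user request. Bits outside hw_features are silently ignored. */
void mock_request(struct mock_feat_dev *dev, mock_features_t mask, int enable)
{
    mask &= dev->hw_features;
    if (enable)
        dev->wanted_features |= mask;
    else
        dev->wanted_features &= ~mask;
    dev->features = mock_wanted(dev);
}
```

In this model a device with GRO always on and TSO changeable lets the user toggle TSO freely, while an attempt to turn GRO off has no effect.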
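Operation tables (ops, the key to implementing a driver or protocol):
netdev_ops: the core operation table. It holds pointers to the functions a driver implements, such as opening the device (ndo_open), stopping it (ndo_stop), transmitting a packet (ndo_start_xmit), reporting statistics (ndo_get_stats64), changing the MTU (ndo_change_mtu) and setting the MAC address (ndo_set_mac_address). Most of these hooks are optional; ndo_start_xmit is required.
ethtool_ops: pointer to the ethtool operations, used to query and set driver- or hardware-specific parameters (speed, duplex, loopback tests, register dumps and so on).
header_ops: function pointers for creating, parsing and caching Layer 2 headers (such as the Ethernet header).
l3mdev_ops, ndisc_ops, xfrmdev_ops, tlsdev_ops, macsec_ops, dcbnl_ops: operation tables for specific protocols or features (Layer 3 master devices, IPv6 neighbour discovery, IPsec offload, TLS offload, MACsec, Data Center Bridging).
udp_tunnel_nic_info, udp_tunnel_nic: configuration and state for UDP tunnel (e.g. VXLAN, GENEVE) offload.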
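The pattern behind these operation tables is that the driver fills in only the hooks it needs and the core checks each optional pointer before calling it. Below is a user-space sketch of that pattern under stated assumptions: mock_netdev_ops, mock_ops_dev, mock_dev_open, mock_dev_set_mtu, demo_open and demo_netdev_ops are hypothetical names, and the table is reduced to three hooks.

```c
#include <stddef.h>

struct mock_ops_dev;

/* Hypothetical, heavily reduced stand-in for struct net_device_ops. */
struct mock_netdev_ops {
    int (*ndo_open)(struct mock_ops_dev *dev);
    int (*ndo_stop)(struct mock_ops_dev *dev);
    int (*ndo_change_mtu)(struct mock_ops_dev *dev, int new_mtu);
};

struct mock_ops_dev {
    const struct mock_netdev_ops *netdev_ops;
    int up;
    int mtu;
    int open_calls;   /* bookkeeping for the demo driver below */
};

/* Core side: bring the interface up, calling ndo_open only if provided. */
int mock_dev_open(struct mock_ops_dev *dev)
{
    const struct mock_netdev_ops *ops = dev->netdev_ops;
    int ret = 0;

    if (dev->up)
        return 0;             /* already up, nothing to do */
    if (ops->ndo_open)
        ret = ops->ndo_open(dev);
    if (ret == 0)
        dev->up = 1;
    return ret;
}

/* Core side: fall back to a plain assignment when the hook is absent. */
int mock_dev_set_mtu(struct mock_ops_dev *dev, int new_mtu)
{
    const struct mock_netdev_ops *ops = dev->netdev_ops;

    if (ops->ndo_change_mtu)
        return ops->ndo_change_mtu(dev, new_mtu);
    dev->mtu = new_mtu;
    return 0;
}

/* Driver side: implements only the hook it needs. */
int demo_open(struct mock_ops_dev *dev)
{
    dev->open_calls++;
    return 0;
}

const struct mock_netdev_ops demo_netdev_ops = {
    .ndo_open = demo_open,
};
```

A device that points netdev_ops at demo_netdev_ops has its demo_open called once on the first mock_dev_open, and mock_dev_set_mtu simply stores the new value because ndo_change_mtu was left NULL.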
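Protocol-specific data (pointers):
ip_ptr (struct in_device): IPv4 configuration and state for this interface (IPv4 addresses, per-interface settings, ARP/neighbour parameters, multicast state).
ip6_ptr (struct inet6_dev): IPv6 configuration and state (IPv6 addresses, neighbour discovery, MLD and so on).
vlan_info: VLAN configuration.
dsa_ptr: Distributed Switch Architecture (DSA) data.
ieee80211_ptr (struct wireless_dev): wireless (IEEE 802.11) device data.
ieee802154_ptr: IEEE 802.15.4 (low-rate WPAN) device data.
mpls_ptr, mctp_ptr, tipc_ptr, atalk_ptr, ax25_ptr: private data pointers for other protocols (MPLS, MCTP, TIPC, AppleTalk, AX.25).
Statistics:
stats (struct net_device_stats): the legacy statistics structure, holding received/transmitted packet counts, byte counts, error counts and so on. Implementing ndo_get_stats64 and returning rtnl_link_stats64 is the more modern approach.
core_stats: counters used internally by the core networking layer (drivers should not use them).
carrier_up_count, carrier_down_count: counters of carrier (physical link) up and down transitions.
lstats, tstats, dstats, vstats: pointers to statistics structures for particular device types (loopback, tunnel, dummy, virtual Ethernet). Note that vstats is named in the kernel-doc comment, but the union in the listing above contains only lstats, tstats and dstats.
offload_xstats_l3: Layer 3 hardware offload statistics.
Device state and lifetime management:
state: generic network queuing layer state (__LINK_STATE_* flags).
reg_state: the registration state machine (NETREG_*), which tracks the device through registration and unregistration.
rtnl_link_state: the rtnetlink link creation phase.
dismantle: flag indicating that the device is about to be freed.
needs_free_netdev: whether unregistering should also call free_netdev to release the structure.
priv_destructor: cleanup function called when the device is unregistered.
dev_refcnt, pcpu_refcnt, refcnt_tracker: the reference count and its tracker, which manage the lifetime of the device's memory.
watchdog_timer, watchdog_timeo: the watchdog timer and its timeout, used to detect a stalled transmit queue.
linkwatch_dev_tracker, watchdog_dev_tracker, dev_registered_tracker: trackers for the references held by specific subsystems (link watch, watchdog, registration).
todo_list, link_watch_list: the delayed register/unregister list and the link state watch list.
Other important members:
adj_list: lists of adjacent upper devices (e.g. a VLAN device or a bridge or bonding master stacked on this device) and lower devices (e.g. bonding slaves).
ptype_all, ptype_specific: lists of device-specific packet handlers for all protocols and for specific protocols.
dev (struct device): the embedded device-model structure, used for sysfs representation, power management, DMA mapping and so on.
sysfs_groups, sysfs_rx_queue_group: the attribute groups the device exposes in sysfs.
queues_kset: the set of kobjects for all Tx/Rx queues.
nd_net (possible_net_t): the network namespace the device belongs to.
ml_priv, ml_priv_type: private data for a mid-layer and a tag identifying its type.
proto_down: flag indicating that a protocol (rather than the physical link) is holding the port down.
wol_enabled: whether Wake-on-LAN is enabled.
threaded: whether NAPI runs in threaded mode.
xps_maps: XPS (Transmit Packet Steering) CPU/RXQ maps, used to steer transmissions across multiple queues.
net_notifier_list: list of per-namespace netdev notifier blocks that follow the device when it is moved to another network namespace.
xdp_state: information about the attached XDP programs, one entry per XDP mode (bpf_xdp_entity).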
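Allocating a net_device
The first step in writing a network driver is to allocate a net_device with alloc_netdev. alloc_netdev is a macro, defined as follows:
#define alloc_netdev(sizeof_priv, name, name_assign_type, setup) \
	alloc_netdev_mqs(sizeof_priv, name, name_assign_type, setup, 1, 1)
As the definition shows, alloc_netdev is really a call to alloc_netdev_mqs, whose prototype is:
struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
				    unsigned char name_assign_type,
				    void (*setup)(struct net_device *),
				    unsigned int txqs, unsigned int rxqs);
The parameters and return value are:
sizeof_priv: size of the driver's private data area.
name: the device name.
name_assign_type: where the name came from (for example NET_NAME_UNKNOWN).
setup: callback invoked during allocation to initialize the new net_device.
txqs: number of transmit queues to allocate.
rxqs: number of receive queues to allocate.
Return value: a pointer to the allocated net_device on success, NULL on failure.
There are many kinds of network devices, not only Ethernet. The Linux kernel supports many network interface types, such as Fiber Distributed Data Interface (FDDI), Ethernet, infrared (IrDA), High-Performance Parallel Interface (HIPPI) and CAN. For the different device types the kernel provides wrappers on top of alloc_netdev. For Ethernet, the subject of this chapter, the allocation helper is alloc_etherdev, which is also a macro: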
#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
#define alloc_etherdev_mq(sizeof_priv, count) alloc_etherdev_mqs(sizeof_priv, count, count)
As you can see, alloc_etherdev ultimately relies on alloc_etherdev_mqs, which is a thin wrapper around alloc_netdev_mqs:
struct net_device *alloc_etherdev_mqs(int sizeof_priv, unsigned int txqs, unsigned int rxqs)
{
	return alloc_netdev_mqs(sizeof_priv, "eth%d", NET_NAME_UNKNOWN,
				ether_setup, txqs, rxqs);
}
alloc_etherdev_mqs calls alloc_netdev_mqs to allocate the net_device. Note that the interface name is set to "eth%d", a format string; this is where names such as "eth0" and "eth1", which you see after logging in to the Linux system on the development board, come from. Likewise, the setup function for Ethernet is ether_setup. Different kinds of network devices use different setup functions; for CAN, for example, it is can_setup. ether_setup performs the initial setup of the net_device:
void ether_setup(struct net_device *dev)
{
	dev->header_ops = &eth_header_ops;
	dev->type = ARPHRD_ETHER;
	dev->hard_header_len = ETH_HLEN;
	dev->min_header_len = ETH_HLEN;
	dev->mtu = ETH_DATA_LEN;
	dev->min_mtu = ETH_MIN_MTU;
	dev->max_mtu = ETH_DATA_LEN;
	dev->addr_len = ETH_ALEN;
	dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
	dev->flags = IFF_BROADCAST|IFF_MULTICAST;
	dev->priv_flags |= IFF_TX_SKB_SHARING;

	eth_broadcast_addr(dev->broadcast);
}
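The allocation step described above can be modelled in user space to show what sizeof_priv and setup are for: one allocation holds both the device structure and the driver's private area, and the setup callback fills in type-specific defaults. This is a sketch under stated assumptions, not the kernel implementation. All mock_* and demo_* names are hypothetical, and the 32-byte alignment of the private area is an assumption made for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_IFNAMSIZ     16
#define MOCK_NETDEV_ALIGN 32   /* assumption: alignment of the private area */
#define MOCK_ALIGN_UP(x, a) ((((x) + (a) - 1) / (a)) * (a))

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct mock_alloc_dev {
    char name[MOCK_IFNAMSIZ];
    unsigned int mtu;
};

/* Example private data a driver might keep per device. */
struct demo_priv {
    int link_up;
    unsigned long tx_packets;
};

/* The private area starts right after the aligned device structure. */
void *mock_netdev_priv(struct mock_alloc_dev *dev)
{
    return (char *)dev + MOCK_ALIGN_UP(sizeof(*dev), MOCK_NETDEV_ALIGN);
}

/* Allocate the device plus sizeof_priv bytes in one zeroed block, store
 * the name template unchanged, then let setup() fill in the defaults. */
struct mock_alloc_dev *mock_alloc_netdev(size_t sizeof_priv, const char *name,
                                         void (*setup)(struct mock_alloc_dev *))
{
    size_t total = MOCK_ALIGN_UP(sizeof(struct mock_alloc_dev),
                                 MOCK_NETDEV_ALIGN) + sizeof_priv;
    struct mock_alloc_dev *dev = calloc(1, total);

    if (!dev)
        return NULL;
    snprintf(dev->name, sizeof(dev->name), "%s", name);
    if (setup)
        setup(dev);
    return dev;
}

/* Type-specific defaults, in the spirit of ether_setup. */
void mock_ether_setup(struct mock_alloc_dev *dev)
{
    dev->mtu = 1500;
}
```

A driver would call mock_alloc_netdev(sizeof(struct demo_priv), "eth%d", mock_ether_setup) and then reach its own state through mock_netdev_priv(dev), releasing everything later with a single free(dev).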
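Freeing a net_device
When the network driver is removed, the net_device allocated earlier must be released. This is done with free_netdev, whose prototype is:
void free_netdev(struct net_device *dev)
The parameters and return value are:
dev: pointer to the net_device to free.
Return value: none.
Registering a net_device
Once the net_device has been allocated and initialized, it must be registered with the kernel using register_netdev, whose prototype is:
int register_netdev(struct net_device *dev)
The parameters and return value are:
dev: pointer to the net_device to register.
Return value: 0 on success, a negative value on failure.
Unregistering a net_device
Registration has a counterpart: a net_device is unregistered with unregister_netdev, whose prototype is:
void unregister_netdev(struct net_device *dev)
The parameters and return value are:
dev: pointer to the net_device to unregister.
Return value: none.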
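The four calls above pair up: allocate then register when the driver loads, unregister then free when it is removed, and free the allocation if registration fails. The user-space sketch below models that ordering and also shows how an "eth%d" template can resolve to the first unused unit number at registration time. All mock_* and demo_* names are hypothetical stand-ins, the registry is limited to four devices, and only the "eth%d" template is supported.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_IFNAMSIZ 16
#define MOCK_MAX_DEVS 4

struct mock_life_dev {
    char name[MOCK_IFNAMSIZ];
    int registered;
};

static struct mock_life_dev *mock_registry[MOCK_MAX_DEVS];

struct mock_life_dev *mock_alloc_etherdev(void)
{
    struct mock_life_dev *dev = calloc(1, sizeof(*dev));

    if (dev)
        strcpy(dev->name, "eth%d");   /* template, resolved on register */
    return dev;
}

/* Return 1 if a registered device already uses this name. */
static int mock_name_in_use(const char *name)
{
    for (int i = 0; i < MOCK_MAX_DEVS; i++)
        if (mock_registry[i] && strcmp(mock_registry[i]->name, name) == 0)
            return 1;
    return 0;
}

/* Find a free slot, give the device the first unused "ethN" name. */
int mock_register_netdev(struct mock_life_dev *dev)
{
    char candidate[MOCK_IFNAMSIZ];
    int slot = -1;

    for (int i = 0; i < MOCK_MAX_DEVS; i++)
        if (!mock_registry[i]) { slot = i; break; }
    if (slot < 0)
        return -ENFILE;

    for (int unit = 0; unit < MOCK_MAX_DEVS; unit++) {
        snprintf(candidate, sizeof(candidate), "eth%d", unit);
        if (!mock_name_in_use(candidate)) {
            strcpy(dev->name, candidate);
            mock_registry[slot] = dev;
            dev->registered = 1;
            return 0;
        }
    }
    return -ENFILE;
}

void mock_unregister_netdev(struct mock_life_dev *dev)
{
    for (int i = 0; i < MOCK_MAX_DEVS; i++)
        if (mock_registry[i] == dev)
            mock_registry[i] = NULL;
    dev->registered = 0;
}

void mock_free_netdev(struct mock_life_dev *dev)
{
    free(dev);
}

/* Driver load: allocate, register, undo the allocation on failure. */
struct mock_life_dev *demo_probe(void)
{
    struct mock_life_dev *dev = mock_alloc_etherdev();

    if (!dev)
        return NULL;
    if (mock_register_netdev(dev) < 0) {
        mock_free_netdev(dev);
        return NULL;
    }
    return dev;
}

/* Driver removal: unregister first, then free. */
void demo_remove(struct mock_life_dev *dev)
{
    mock_unregister_netdev(dev);
    mock_free_netdev(dev);
}
```

In this model two probes yield "eth0" and "eth1"; removing the first and probing again reuses "eth0".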
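The net_device_ops structure
net_device has one particularly important member, netdev_ops, a pointer to a net_device_ops structure. This is the operation table of the network device. net_device_ops is defined in include/linux/netdevice.h. Its members are functions whose names start with "ndo_", and it is the driver author's job to implement them. Not all of them have to be implemented; a driver implements only the ones it needs. The structure is documented as follows: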
/*
 * This structure defines the management hooks for network devices.
 * The following hooks can be defined; unless noted otherwise, they are
 * optional and can be filled with a null pointer.
 *
 * int (*ndo_init)(struct net_device *dev);
 *     This function is called once when a network device is registered.
 *     The network device can use this for any late stage initialization
 *     or semantic validation. It can fail with an error code which will
 *     be propagated back to register_netdev.
 *
 * void (*ndo_uninit)(struct net_device *dev);
 *     This function is called when device is unregistered or when registration
 *     fails. It is not called if init fails.
 *
 * int (*ndo_open)(struct net_device *dev);
 *     This function is called when a network device transitions to the up
 *     state.
 *
 * int (*ndo_stop)(struct net_device *dev);
 *     This function is called when a network device transitions to the down
 *     state.
 *
 * netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb,
 *                               struct net_device *dev);
 *     Called when a packet needs to be transmitted.
 *     Returns NETDEV_TX_OK. Can return NETDEV_TX_BUSY, but you should stop
 *     the queue before that can happen; it's for obsolete devices and weird
 *     corner cases, but the stack really does a non-trivial amount
 *     of useless work if you return NETDEV_TX_BUSY.
 *     Required; cannot be NULL.
 *
 * netdev_features_t (*ndo_features_check)(struct sk_buff *skb,
 *                                         struct net_device *dev
 *                                         netdev_features_t features);
 *     Called by core transmit path to determine if device is capable of
 *     performing offload operations on a given packet. This is to give
 *     the device an opportunity to implement any restrictions that cannot
 *     be otherwise expressed by feature flags. The check is called with
 *     the set of features that the stack has calculated and it returns
 *     those the driver believes to be appropriate.
 *
 * u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
 *                         struct net_device *sb_dev);
 *     Called to decide which queue to use when device supports multiple
 *     transmit queues.
 *
 * void (*ndo_change_rx_flags)(struct net_device *dev, int flags);
 *     This function is called to allow device receiver to make
 *     changes to configuration when multicast or promiscuous is enabled.
 *
 * void (*ndo_set_rx_mode)(struct net_device *dev);
 *     This function is called device changes address list filtering.
 *     If driver handles unicast address filtering, it should set
 *     IFF_UNICAST_FLT in its priv_flags.
 *
 * int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
 *     This function is called when the Media Access Control address
 *     needs to be changed. If this interface is not defined, the
 *     MAC address can not be changed.
 *
 * int (*ndo_validate_addr)(struct net_device *dev);
 *     Test if Media Access Control address is valid for the device.
 *
 * int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
 *     Old-style ioctl entry point. This is used internally by the
 *     appletalk and ieee802154 subsystems but is no longer called by
 *     the device ioctl handler.
 *
 * int (*ndo_siocbond)(struct net_device *dev, struct ifreq *ifr, int cmd);
 *     Used by the bonding driver for its device specific ioctls:
 *     SIOCBONDENSLAVE, SIOCBONDRELEASE, SIOCBONDSETHWADDR, SIOCBONDCHANGEACTIVE,
 *     SIOCBONDSLAVEINFOQUERY, and SIOCBONDINFOQUERY
 *
 * * int (*ndo_eth_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
 *     Called for ethernet specific ioctls: SIOCGMIIPHY, SIOCGMIIREG,
 *     SIOCSMIIREG, SIOCSHWTSTAMP and SIOCGHWTSTAMP.
 *
 * int (*ndo_set_config)(struct net_device *dev, struct ifmap *map);
 *     Used to set network devices bus interface parameters. This interface
 *     is retained for legacy reasons; new devices should use the bus
 *     interface (PCI) for low level management.
 *
 * int (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
 *     Called when a user wants to change the Maximum Transfer Unit
 *     of a device.
 *
 * void (*ndo_tx_timeout)(struct net_device *dev, unsigned int txqueue);
 *     Callback used when the transmitter has not made any progress
 *     for dev->watchdog ticks.
 *
 * void (*ndo_get_stats64)(struct net_device *dev,
 *                         struct rtnl_link_stats64 *storage);
 * struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
 *     Called when a user wants to get the network device usage
 *     statistics. Drivers must do one of the following:
 *     1. Define @ndo_get_stats64 to fill in a zero-initialised
 *        rtnl_link_stats64 structure passed by the caller.
 *     2. Define @ndo_get_stats to update a net_device_stats structure
 *        (which should normally be dev->stats) and return a pointer to
 *        it. The structure may be changed asynchronously only if each
 *        field is written atomically.
 *     3. Update dev->stats asynchronously and atomically, and define
 *        neither operation.
 *
 * bool (*ndo_has_offload_stats)(const struct net_device *dev, int attr_id)
 *     Return true if this device supports offload stats of this attr_id.
 *
 * int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
 *                              void *attr_data)
 *     Get statistics for offload operations by attr_id. Write it into the
 *     attr_data pointer.
 *
 * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid);
 *     If device supports VLAN filtering this function is called when a
 *     VLAN id is registered.
 *
 * int (*ndo_vlan_rx_kill_vid)(struct net_device *dev, __be16 proto, u16 vid);
 *     If device supports VLAN filtering this function is called when a
 *     VLAN id is unregistered.
 *
 * void (*ndo_poll_controller)(struct net_device *dev);
 *
 *     SR-IOV management functions.
 * int (*ndo_set_vf_mac)(struct net_device *dev, int vf, u8* mac);
 * int (*ndo_set_vf_vlan)(struct net_device *dev, int vf, u16 vlan,
 *                        u8 qos, __be16 proto);
 * int (*ndo_set_vf_rate)(struct net_device *dev, int vf, int min_tx_rate,
 *                        int max_tx_rate);
 * int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting);
 * int (*ndo_set_vf_trust)(struct net_device *dev, int vf, bool setting);
 * int (*ndo_get_vf_config)(struct net_device *dev,
 *                          int vf, struct ifla_vf_info *ivf);
 * int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state);
 * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
 *                        struct nlattr *port[]);
 *
 *     Enable or disable the VF ability to query its RSS Redirection Table and
 *     Hash Key. This is needed since on some devices VF share this information
 *     with PF and querying it may introduce a theoretical security risk.
 * int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool setting);
 * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
 * int (*ndo_setup_tc)(struct net_device *dev, enum tc_setup_type type,
 *                     void *type_data);
 *     Called to setup any 'tc' scheduler, classifier or action on @dev.
 *     This is always called from the stack with the rtnl lock held and netif
 *     tx queues stopped. This allows the netdevice to perform queue
 *     management safely.
 *
 *     Fiber Channel over Ethernet (FCoE) offload functions.
 * int (*ndo_fcoe_enable)(struct net_device *dev);
 *     Called when the FCoE protocol stack wants to start using LLD for FCoE
 *     so the underlying device can perform whatever needed configuration or
 *     initialization to support acceleration of FCoE traffic.
 *
 * int (*ndo_fcoe_disable)(struct net_device *dev);
 *     Called when the FCoE protocol stack wants to stop using LLD for FCoE
 *     so the underlying device can perform whatever needed clean-ups to
 *     stop supporting acceleration of FCoE traffic.
 *
 * int (*ndo_fcoe_ddp_setup)(struct net_device *dev, u16 xid,
 *                           struct scatterlist *sgl, unsigned int sgc);
 *     Called when the FCoE Initiator wants to initialize an I/O that
 *     is a possible candidate for Direct Data Placement (DDP). The LLD can
 *     perform necessary setup and returns 1 to indicate the device is set up
 *     successfully to perform DDP on this I/O, otherwise this returns 0.
 *
 * int (*ndo_fcoe_ddp_done)(struct net_device *dev, u16 xid);
 *     Called when the FCoE Initiator/Target is done with the DDPed I/O as
 *     indicated by the FC exchange id 'xid', so the underlying device can
 *     clean up and reuse resources for later DDP requests.
 *
 * int (*ndo_fcoe_ddp_target)(struct net_device *dev, u16 xid,
 *                            struct scatterlist *sgl, unsigned int sgc);
 *     Called when the FCoE Target wants to initialize an I/O that
 *     is a possible candidate for Direct Data Placement (DDP). The LLD can
 *     perform necessary setup and returns 1 to indicate the device is set up
 *     successfully to perform DDP on this I/O, otherwise this returns 0.
 *
 * int (*ndo_fcoe_get_hbainfo)(struct net_device *dev,
 *                             struct netdev_fcoe_hbainfo *hbainfo);
 *     Called when the FCoE Protocol stack wants information on the underlying
 *     device. This information is utilized by the FCoE protocol stack to
 *     register attributes with Fiber Channel management service as per the
 *     FC-GS Fabric Device Management Information(FDMI) specification.
 *
 * int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type);
 *     Called when the underlying device wants to override default World Wide
 *     Name (WWN) generation mechanism in FCoE protocol stack to pass its own
 *     World Wide Port Name (WWPN) or World Wide Node Name (WWNN) to the FCoE
 *     protocol stack to use.
 *
 *     RFS acceleration.
 * int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb,
 *                          u16 rxq_index, u32 flow_id);
 *     Set hardware filter for RFS. rxq_index is the target queue index;
 *     flow_id is a flow ID to be passed to rps_may_expire_flow() later.
 *     Return the filter ID on success, or a negative error code.
 *
 *     Slave management functions (for bridge, bonding, etc).
 * int (*ndo_add_slave)(struct net_device *dev, struct net_device *slave_dev);
 *     Called to make another netdev an underling.
 *
 * int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev);
 *     Called to release previously enslaved netdev.
 *
 * struct net_device *(*ndo_get_xmit_slave)(struct net_device *dev,
 *                                          struct sk_buff *skb,
 *                                          bool all_slaves);
 *     Get the xmit slave of master device. If all_slaves is true, function
 *     assume all the slaves can transmit.
 *
 *     Feature/offload setting functions.
 * netdev_features_t (*ndo_fix_features)(struct net_device *dev,
 *                                       netdev_features_t features);
 *     Adjusts the requested feature flags according to device-specific
 *     constraints, and returns the resulting flags. Must not modify
 *     the device state.
 *
 * int (*ndo_set_features)(struct net_device *dev, netdev_features_t features);
 *     Called to update device configuration to new features. Passed
 *     feature set might be less than what was returned by ndo_fix_features()).
 *     Must return >0 or -errno if it changed dev->features itself.
 *
 * int (*ndo_fdb_add)(struct ndmsg *ndm, struct nlattr *tb[],
 *                    struct net_device *dev,
 *                    const unsigned char *addr, u16 vid, u16 flags,
 *                    struct netlink_ext_ack *extack);
 *     Adds an FDB entry to dev for addr.
 * int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[],
 *                    struct net_device *dev,
 *                    const unsigned char *addr, u16 vid)
 *     Deletes the FDB entry from dev coresponding to addr.
 * int (*ndo_fdb_del_bulk)(struct ndmsg *ndm, struct nlattr *tb[],
 *                         struct net_device *dev,
 *                         u16 vid,
 *                         struct netlink_ext_ack *extack);
 * int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb,
 *                     struct net_device *dev, struct net_device *filter_dev,
 *                     int *idx)
 *     Used to add FDB entries to dump requests. Implementers should add
 *     entries to skb and update idx with the number of entries.
 *
 * int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh,
 *                           u16 flags, struct netlink_ext_ack *extack)
 * int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq,
 *                           struct net_device *dev, u32 filter_mask,
 *                           int nlflags)
 * int (*ndo_bridge_dellink)(struct net_device *dev, struct nlmsghdr *nlh,
 *                           u16 flags);
 *
 * int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier);
 *     Called to change device carrier. Soft-devices (like dummy, team, etc)
 *     which do not represent real hardware may define this to allow their
 *     userspace components to manage their virtual carrier state. Devices
 *     that determine carrier state from physical hardware properties (eg
 *     network cables) or protocol-dependent mechanisms (eg
 *     USB_CDC_NOTIFY_NETWORK_CONNECTION) should NOT implement this function.
 *
 * int (*ndo_get_phys_port_id)(struct net_device *dev,
 *                             struct netdev_phys_item_id *ppid);
 *     Called to get ID of physical port of this device.
If driver does* not implement this, it is assumed that the hw is not able to have* multiple net devices on single physical port.** int (*ndo_get_port_parent_id)(struct net_device *dev,* struct netdev_phys_item_id *ppid)* Called to get the parent ID of the physical port of this device.** void* (*ndo_dfwd_add_station)(struct net_device *pdev,* struct net_device *dev)* Called by upper layer devices to accelerate switching or other* station functionality into hardware. 'pdev is the lowerdev* to use for the offload and 'dev' is the net device that will* back the offload. Returns a pointer to the private structure* the upper layer will maintain.* void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv)* Called by upper layer device to delete the station created* by 'ndo_dfwd_add_station'. 'pdev' is the net device backing* the station and priv is the structure returned by the add* operation.* int (*ndo_set_tx_maxrate)(struct net_device *dev,* int queue_index, u32 maxrate);* Called when a user wants to set a max-rate limitation of specific* TX queue.* int (*ndo_get_iflink)(const struct net_device *dev);* Called to get the iflink value of this device.* int (*ndo_fill_metadata_dst)(struct net_device *dev, struct sk_buff *skb);* This function is used to get egress tunnel information for given skb.* This is useful for retrieving outer tunnel header parameters while* sampling packet.* void (*ndo_set_rx_headroom)(struct net_device *dev, int needed_headroom);* This function is used to specify the headroom that the skb must* consider when allocation skb during packet reception. Setting* appropriate rx headroom value allows avoiding skb head copy on* forward. Setting a negative value resets the rx headroom to the* default value.* int (*ndo_bpf)(struct net_device *dev, struct netdev_bpf *bpf);* This function is used to set or query state related to XDP on the* netdevice and manage BPF offload. 
See definition of* enum bpf_netdev_command for details.* int (*ndo_xdp_xmit)(struct net_device *dev, int n, struct xdp_frame **xdp,* u32 flags);* This function is used to submit @n XDP packets for transmit on a* netdevice. Returns number of frames successfully transmitted, frames* that got dropped are freed/returned via xdp_return_frame().* Returns negative number, means general error invoking ndo, meaning* no frames were xmit'ed and core-caller will free all frames.* struct net_device *(*ndo_xdp_get_xmit_slave)(struct net_device *dev,* struct xdp_buff *xdp);* Get the xmit slave of master device based on the xdp_buff.* int (*ndo_xsk_wakeup)(struct net_device *dev, u32 queue_id, u32 flags);* This function is used to wake up the softirq, ksoftirqd or kthread* responsible for sending and/or receiving packets on a specific* queue id bound to an AF_XDP socket. The flags field specifies if* only RX, only Tx, or both should be woken up using the flags* XDP_WAKEUP_RX and XDP_WAKEUP_TX.* struct devlink_port *(*ndo_get_devlink_port)(struct net_device *dev);* Get devlink port instance associated with a given netdev.* Called with a reference on the netdevice and devlink locks only,* rtnl_lock is not held.* int (*ndo_tunnel_ctl)(struct net_device *dev, struct ip_tunnel_parm *p,* int cmd);* Add, change, delete or get information on an IPv4 tunnel.* struct net_device *(*ndo_get_peer_dev)(struct net_device *dev);* If a device is paired with a peer device, return the peer instance.* The caller must be under RCU read context.* int (*ndo_fill_forward_path)(struct net_device_path_ctx *ctx, struct net_device_path *path);* Get the forwarding path to reach the real device from the HW destination address* ktime_t (*ndo_get_tstamp)(struct net_device *dev,* const struct skb_shared_hwtstamps *hwtstamps,* bool cycles);* Get hardware timestamp based on normal/adjustable time or free running* cycle counter. This function is required if physical clock supports a* free running cycle counter.*/
struct net_device_ops {
	int (*ndo_init)(struct net_device *dev);
	void (*ndo_uninit)(struct net_device *dev);
	int (*ndo_open)(struct net_device *dev);
	int (*ndo_stop)(struct net_device *dev);
	netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb, struct net_device *dev);
	netdev_features_t (*ndo_features_check)(struct sk_buff *skb, struct net_device *dev,
						netdev_features_t features);
	u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
				struct net_device *sb_dev);
	void (*ndo_change_rx_flags)(struct net_device *dev, int flags);
	void (*ndo_set_rx_mode)(struct net_device *dev);
	int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
	int (*ndo_validate_addr)(struct net_device *dev);
	int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
	int (*ndo_eth_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
	int (*ndo_siocbond)(struct net_device *dev, struct ifreq *ifr, int cmd);
	int (*ndo_siocwandev)(struct net_device *dev, struct if_settings *ifs);
	int (*ndo_siocdevprivate)(struct net_device *dev, struct ifreq *ifr,
				  void __user *data, int cmd);
	int (*ndo_set_config)(struct net_device *dev, struct ifmap *map);
	int (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
	int (*ndo_neigh_setup)(struct net_device *dev, struct neigh_parms *);
	void (*ndo_tx_timeout)(struct net_device *dev, unsigned int txqueue);
	void (*ndo_get_stats64)(struct net_device *dev, struct rtnl_link_stats64 *storage);
	bool (*ndo_has_offload_stats)(const struct net_device *dev, int attr_id);
	int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
				     void *attr_data);
	struct net_device_stats *(*ndo_get_stats)(struct net_device *dev);
	int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid);
	int (*ndo_vlan_rx_kill_vid)(struct net_device *dev, __be16 proto, u16 vid);
#ifdef CONFIG_NET_POLL_CONTROLLER
	void (*ndo_poll_controller)(struct net_device *dev);
	int (*ndo_netpoll_setup)(struct net_device *dev, struct netpoll_info *info);
	void (*ndo_netpoll_cleanup)(struct net_device *dev);
#endif
	int (*ndo_set_vf_mac)(struct net_device *dev, int queue, u8 *mac);
	int (*ndo_set_vf_vlan)(struct net_device *dev, int queue, u16 vlan,
			       u8 qos, __be16 proto);
	int (*ndo_set_vf_rate)(struct net_device *dev, int vf, int min_tx_rate,
			       int max_tx_rate);
	int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting);
	int (*ndo_set_vf_trust)(struct net_device *dev, int vf, bool setting);
	int (*ndo_get_vf_config)(struct net_device *dev, int vf, struct ifla_vf_info *ivf);
	int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state);
	int (*ndo_get_vf_stats)(struct net_device *dev, int vf,
				struct ifla_vf_stats *vf_stats);
	int (*ndo_set_vf_port)(struct net_device *dev, int vf, struct nlattr *port[]);
	int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
	int (*ndo_get_vf_guid)(struct net_device *dev, int vf,
			       struct ifla_vf_guid *node_guid,
			       struct ifla_vf_guid *port_guid);
	int (*ndo_set_vf_guid)(struct net_device *dev, int vf, u64 guid, int guid_type);
	int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool setting);
	int (*ndo_setup_tc)(struct net_device *dev, enum tc_setup_type type,
			    void *type_data);
#if IS_ENABLED(CONFIG_FCOE)
	int (*ndo_fcoe_enable)(struct net_device *dev);
	int (*ndo_fcoe_disable)(struct net_device *dev);
	int (*ndo_fcoe_ddp_setup)(struct net_device *dev, u16 xid,
				  struct scatterlist *sgl, unsigned int sgc);
	int (*ndo_fcoe_ddp_done)(struct net_device *dev, u16 xid);
	int (*ndo_fcoe_ddp_target)(struct net_device *dev, u16 xid,
				   struct scatterlist *sgl, unsigned int sgc);
	int (*ndo_fcoe_get_hbainfo)(struct net_device *dev,
				    struct netdev_fcoe_hbainfo *hbainfo);
#endif
#if IS_ENABLED(CONFIG_LIBFCOE)
#define NETDEV_FCOE_WWNN 0
#define NETDEV_FCOE_WWPN 1
	int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type);
#endif
#ifdef CONFIG_RFS_ACCEL
	int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb,
				 u16 rxq_index, u32 flow_id);
#endif
	int (*ndo_add_slave)(struct net_device *dev, struct net_device *slave_dev,
			     struct netlink_ext_ack *extack);
	int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev);
	struct net_device *(*ndo_get_xmit_slave)(struct net_device *dev,
						 struct sk_buff *skb, bool all_slaves);
	struct net_device *(*ndo_sk_get_lower_dev)(struct net_device *dev, struct sock *sk);
	netdev_features_t (*ndo_fix_features)(struct net_device *dev,
					      netdev_features_t features);
	int (*ndo_set_features)(struct net_device *dev, netdev_features_t features);
	int (*ndo_neigh_construct)(struct net_device *dev, struct neighbour *n);
	void (*ndo_neigh_destroy)(struct net_device *dev, struct neighbour *n);
	int (*ndo_fdb_add)(struct ndmsg *ndm, struct nlattr *tb[],
			   struct net_device *dev, const unsigned char *addr,
			   u16 vid, u16 flags, struct netlink_ext_ack *extack);
	int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[],
			   struct net_device *dev, const unsigned char *addr,
			   u16 vid, struct netlink_ext_ack *extack);
	int (*ndo_fdb_del_bulk)(struct ndmsg *ndm, struct nlattr *tb[],
				struct net_device *dev, u16 vid,
				struct netlink_ext_ack *extack);
	int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb,
			    struct net_device *dev, struct net_device *filter_dev,
			    int *idx);
	int (*ndo_fdb_get)(struct sk_buff *skb, struct nlattr *tb[],
			   struct net_device *dev, const unsigned char *addr,
			   u16 vid, u32 portid, u32 seq,
			   struct netlink_ext_ack *extack);
	int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh,
				  u16 flags, struct netlink_ext_ack *extack);
	int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq,
				  struct net_device *dev, u32 filter_mask,
				  int nlflags);
	int (*ndo_bridge_dellink)(struct net_device *dev, struct nlmsghdr *nlh,
				  u16 flags);
	int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier);
	int (*ndo_get_phys_port_id)(struct net_device *dev,
				    struct netdev_phys_item_id *ppid);
	int (*ndo_get_port_parent_id)(struct net_device *dev,
				      struct netdev_phys_item_id *ppid);
	int (*ndo_get_phys_port_name)(struct net_device *dev, char *name, size_t len);
	void *(*ndo_dfwd_add_station)(struct net_device *pdev, struct net_device *dev);
	void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv);
	int (*ndo_set_tx_maxrate)(struct net_device *dev, int queue_index, u32 maxrate);
	int (*ndo_get_iflink)(const struct net_device *dev);
	int (*ndo_fill_metadata_dst)(struct net_device *dev, struct sk_buff *skb);
	void (*ndo_set_rx_headroom)(struct net_device *dev, int needed_headroom);
	int (*ndo_bpf)(struct net_device *dev, struct netdev_bpf *bpf);
	int (*ndo_xdp_xmit)(struct net_device *dev, int n, struct xdp_frame **xdp,
			    u32 flags);
	struct net_device *(*ndo_xdp_get_xmit_slave)(struct net_device *dev,
						     struct xdp_buff *xdp);
	int (*ndo_xsk_wakeup)(struct net_device *dev, u32 queue_id, u32 flags);
	struct devlink_port *(*ndo_get_devlink_port)(struct net_device *dev);
	int (*ndo_tunnel_ctl)(struct net_device *dev, struct ip_tunnel_parm *p, int cmd);
	struct net_device *(*ndo_get_peer_dev)(struct net_device *dev);
	int (*ndo_fill_forward_path)(struct net_device_path_ctx *ctx,
				     struct net_device_path *path);
	ktime_t (*ndo_get_tstamp)(struct net_device *dev,
				  const struct skb_shared_hwtstamps *hwtstamps,
				  bool cycles);
};
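struct net_device_ops is one of the key data structures in the Linux kernel networking subsystem. It is a table of function pointers: callbacks that a network device driver implements, a few of them effectively mandatory and most of them optional. The kernel calls these functions to interact with the actual hardware: initializing and configuring the device, sending and receiving data, and managing device state.
Core role: it is the contract, or interface specification, between the driver and the kernel network stack. When the kernel needs to operate on a network device (bring it up, transmit a packet, change the MTU, and so on), it calls the corresponding function in the structure pointed to by the netdev_ops member of net_device.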
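To make the "contract" idea concrete, here is a minimal user-space sketch of a function-pointer table. Everything in it (toy_dev, toy_dev_ops, core_dev_open, core_dev_set_mtu) is a hypothetical stand-in invented for illustration, not a kernel type or API; it only shows the dispatch pattern, including how a "core" can fall back to a default when an optional callback is left NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel types. */
struct toy_dev;

struct toy_dev_ops {
	int (*open)(struct toy_dev *dev);                    /* models ndo_open */
	int (*stop)(struct toy_dev *dev);                    /* models ndo_stop */
	int (*change_mtu)(struct toy_dev *dev, int new_mtu); /* optional, models ndo_change_mtu */
};

struct toy_dev {
	const struct toy_dev_ops *ops; /* models net_device.netdev_ops */
	int running;
	int mtu;
};

/* "Driver" side: implements only the callbacks it needs. */
int toy_open(struct toy_dev *dev) { dev->running = 1; return 0; }
int toy_stop(struct toy_dev *dev) { dev->running = 0; return 0; }

const struct toy_dev_ops toy_ops = {
	.open = toy_open,
	.stop = toy_stop,
	/* .change_mtu is left NULL: optional callback not implemented */
};

/* "Core" side: knows nothing about the driver except its ops table. */
int core_dev_open(struct toy_dev *dev)
{
	return dev->ops->open ? dev->ops->open(dev) : 0;
}

int core_dev_set_mtu(struct toy_dev *dev, int new_mtu)
{
	if (dev->ops->change_mtu)
		return dev->ops->change_mtu(dev, new_mtu);
	dev->mtu = new_mtu; /* default behaviour when no callback is supplied */
	return 0;
}
```

A real driver fills in a const struct net_device_ops the same way, with designated initializers, and points net_device.netdev_ops at it before registering the device.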
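Device lifecycle management:
ndo_init: called once when the device is registered, for late initialization or validation.
ndo_uninit: called when the device is unregistered or when registration fails, for cleanup.
ndo_open: called when the device is brought up (ifconfig eth0 up).
ndo_stop: called when the device is brought down (ifconfig eth0 down).
Packet transmission (Transmit - TX):
ndo_start_xmit: (required) the most important function in the table. The kernel calls it whenever a packet (struct sk_buff *skb) has to go out through the device, and the driver must implement it to place the packet on a hardware queue or send it directly. The return value reports the transmit status: NETDEV_TX_OK, or NETDEV_TX_BUSY, which should be used sparingly.
ndo_features_check: called on the transmit path so the driver can check, packet by packet, which hardware offloads (such as TSO, UFO, GSO) can actually be applied, and mask out those that cannot.
ndo_select_queue: for devices with multiple transmit queues, decides which queue a packet is placed on.
ndo_xdp_xmit: used in XDP (eXpress Data Path) mode to transmit XDP frames (xdp_frame) directly from the driver.
ndo_xdp_get_xmit_slave: in XDP mode, returns the slave device a master device should transmit through.
ndo_set_tx_maxrate: sets a maximum rate limit on a specific transmit queue.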
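The advice to use NETDEV_TX_BUSY sparingly is easier to see with a toy transmit ring. The sketch below is a user-space model under stated assumptions: toy_tx_ring, toy_start_xmit and toy_tx_complete are hypothetical names, the ring size is arbitrary, and the queue_stopped flag only imitates what a driver would do with netif_stop_queue() and netif_wake_queue(). The two enum values match the kernel's NETDEV_TX_OK and NETDEV_TX_BUSY.

```c
#include <assert.h>

/* Same numeric values as the kernel's enum netdev_tx, redefined for user space. */
enum toy_tx { TOY_TX_OK = 0x00, TOY_TX_BUSY = 0x10 };

#define TOY_RING_SIZE 4 /* hypothetical number of TX descriptors */

struct toy_tx_ring {
	const void *slot[TOY_RING_SIZE];
	unsigned int head;  /* next slot the driver fills */
	unsigned int tail;  /* next slot the "hardware" completes */
	int queue_stopped;  /* imitates netif_stop_queue()/netif_wake_queue() */
};

unsigned int toy_ring_used(const struct toy_tx_ring *r)
{
	return r->head - r->tail;
}

/* Models ndo_start_xmit: queue one packet, stop the queue when the ring fills. */
enum toy_tx toy_start_xmit(struct toy_tx_ring *r, const void *pkt)
{
	if (toy_ring_used(r) == TOY_RING_SIZE) {
		/* Only reached if the stack ignored the stopped queue. */
		r->queue_stopped = 1;
		return TOY_TX_BUSY;
	}
	r->slot[r->head % TOY_RING_SIZE] = pkt;
	r->head++;
	if (toy_ring_used(r) == TOY_RING_SIZE)
		r->queue_stopped = 1; /* stop early so BUSY is never needed */
	return TOY_TX_OK;
}

/* Models the TX-complete interrupt: free one slot and wake the queue. */
void toy_tx_complete(struct toy_tx_ring *r)
{
	if (toy_ring_used(r) == 0)
		return;
	r->slot[r->tail % TOY_RING_SIZE] = 0;
	r->tail++;
	r->queue_stopped = 0;
}
```

The design point is that the driver stops the queue as soon as the last free slot is taken, so in normal operation the stack never calls it with a full ring and the BUSY path stays a safety net.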
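Packet reception (Receive - RX) configuration:
ndo_set_rx_mode: called when the device's receive filtering changes (promiscuous mode PROMISC or multicast MULTICAST switched on or off, or the unicast filter list modified). The driver programs the hardware filters to match the new configuration.
ndo_change_rx_flags: called when receive flags (mainly IFF_PROMISC and IFF_ALLMULTI) change. Most hardware drivers implement ndo_set_rx_mode instead.
ndo_set_rx_headroom: sets the extra headroom reserved in front of received packets (skb), which avoids copying the skb head when the packet is later forwarded.
ndo_rx_flow_steer: (for RFS, Receive Flow Steering) installs a hardware filter that steers the packets of a given flow to a specified RX queue.
ndo_xsk_wakeup: (for AF_XDP) wakes up the softirq or thread that handles packets on a given queue bound to an AF_XDP socket.
Address management:
ndo_set_mac_address: sets the device's MAC address.
ndo_validate_addr: checks that the MAC address currently configured on the device is valid.
ndo_fdb_add: (typically for bridging) adds a forwarding database (FDB) entry, that is, a mapping from a MAC address to a port/VLAN.
ndo_fdb_del: deletes an FDB entry.
ndo_fdb_dump: dumps the FDB contents (for example for bridge fdb show).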
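Configuration and state:
ndo_do_ioctl / ndo_eth_ioctl / ndo_siocbond: handle device-specific ioctl commands (the traditional configuration path, partly superseded by ethtool and netlink).
ndo_change_mtu: changes the device's maximum transmission unit (MTU).
ndo_set_config: (legacy) sets low-level bus interface parameters; modern drivers use the bus interface (PCI and so on) for this.
ndo_change_carrier: (mainly for virtual devices) sets the device's virtual carrier state (on/off).
ndo_fix_features: adjusts the feature flags (netdev_features_t) requested by the kernel to what the hardware can actually support.
ndo_set_features: applies the adjusted feature set to the hardware configuration.
ndo_tunnel_ctl: manages IPv4 tunnels (add, delete, change, get information).
Statistics:
ndo_get_stats64 / ndo_get_stats: return the device's network statistics (packets and bytes sent and received, error counts, and so on). Modern drivers should implement ndo_get_stats64 and fill in a struct rtnl_link_stats64.
ndo_has_offload_stats / ndo_get_offload_stats: report whether hardware offload statistics exist for a given attribute, and fetch them.
ndo_get_vf_stats: (for SR-IOV) returns the statistics of a virtual function (VF).
ndo_get_tstamp: returns a hardware timestamp.
Error and timeout handling:
ndo_tx_timeout: called when a transmit queue has made no progress for watchdog_timeo (the hardware may be stuck).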
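Advanced features (VLAN, Bridge, Bonding, SR-IOV, FCoE, XDP/bpf):
ndo_vlan_rx_add_vid / ndo_vlan_rx_kill_vid: manage the VLAN IDs the device knows about (for hardware VLAN filtering/acceleration).
ndo_bridge_setlink / ndo_bridge_getlink / ndo_bridge_dellink: implement the bridge-related netlink operations.
ndo_add_slave / ndo_del_slave: used by bonding, bridges and similar to add or remove a managed slave device.
ndo_get_xmit_slave: in bonding or load-balancing setups, selects the slave device used to transmit a given packet.
ndo_set_vf_mac / ndo_set_vf_vlan / ndo_set_vf_spoofchk / ndo_get_vf_config / etc: manage the configuration of SR-IOV virtual functions (VF): MAC, VLAN, spoof checking, rate limits, link state, and so on.
ndo_fcoe_enable / ndo_fcoe_disable / ndo_fcoe_ddp_setup / etc: (FCoE, Fiber Channel over Ethernet) FCoE offload operations.
ndo_bpf: the main entry point for XDP and BPF offload. It is used to set or query XDP-related state on the device and to manage BPF offload.
ndo_setup_tc: sets up a traffic control (tc) scheduler, classifier or action on the device (for example for hardware offload).
Others:
ndo_get_phys_port_id / ndo_get_port_parent_id / ndo_get_phys_port_name: return the ID, parent ID or name of the physical port (for multi-port devices).
ndo_get_iflink: returns the iflink value of the device (commonly used by virtual devices to refer to their underlying device).
ndo_fill_metadata_dst: retrieves the egress tunnel information for an skb (useful for reading outer tunnel header parameters when sampling packets).
ndo_dfwd_add_station / ndo_dfwd_del_station: used by upper-layer devices to offload switching or other station functionality into hardware.
ndo_get_devlink_port: returns the associated devlink_port instance (for the devlink device management framework).
ndo_get_peer_dev: returns the paired peer device (for example the other end of a veth pair).
ndo_fill_forward_path: returns the forwarding path to the real device (for hardware forwarding offload).
ndo_poll_controller / ndo_netpoll_setup / ndo_netpoll_cleanup: (for netpoll users such as netconsole) let the device be polled while interrupts are disabled.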
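The sk_buff structure
Networking is layered: a given layer does not need to care how the layers beneath it work, it only has to package the data to be sent or received according to its protocol. Packaged data is handed down for transmission with dev_queue_xmit, and received data is handed up with netif_rx. Let us look at these two functions in turn.
1. The dev_queue_xmit function
This function sends network data. It is defined in include/linux/netdevice.h, with the following prototype:
static inline int dev_queue_xmit(struct sk_buff *skb)
The parameter and return value are:
skb: the data to send, as a pointer to an sk_buff. sk_buff is a very important structure in Linux network drivers: network data is always carried in an sk_buff. On transmit, each protocol layer adds its own header to the sk_buff, and the driver at the bottom finally sends out the data it contains. Reception is the reverse: the driver wraps the raw received data in an sk_buff and passes it up, each layer strips its own header, and the remaining payload is delivered to the user.
Return value: 0 on success, a negative errno on failure; a positive queueing-discipline code such as NET_XMIT_DROP may also be returned.
dev_queue_xmit is too long to analyze in detail here. In the end it hands the packet to the ndo_start_xmit function in the net_device_ops table, which is the function the driver author implements. The overall flow is shown in the figure: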
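2. The netif_rx function
Upper layers receive data through netif_rx, while the raw network data itself is usually picked up by the driver through polling, interrupts, or NAPI. netif_rx is defined in net/core/dev.c, with the following prototype:
int netif_rx(struct sk_buff *skb)
The parameter and return value are:
skb: the sk_buff holding the received data.
Return value: NET_RX_SUCCESS on success, NET_RX_DROP if the packet was dropped.
Let us now focus on sk_buff itself. sk_buff is a central data structure in Linux networking, used to manage packets being received or sent. It is defined in include/linux/skbuff.h, with the following contents: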
/*** DOC: Basic sk_buff geometry** struct sk_buff itself is a metadata structure and does not hold any packet* data. All the data is held in associated buffers.** &sk_buff.head points to the main "head" buffer. The head buffer is divided* into two parts:** - data buffer, containing headers and sometimes payload;* this is the part of the skb operated on by the common helpers* such as skb_put() or skb_pull();* - shared info (struct skb_shared_info) which holds an array of pointers* to read-only data in the (page, offset, length) format.** Optionally &skb_shared_info.frag_list may point to another skb.** Basic diagram may look like this::** ---------------* | sk_buff |* ---------------* ,--------------------------- + head* / ,----------------- + data* / / ,----------- + tail* | | | , + end* | | | |* v v v v* -----------------------------------------------* | headroom | data | tailroom | skb_shared_info |* -----------------------------------------------* + [page frag]* + [page frag]* + [page frag]* + [page frag] ---------* + frag_list --> | sk_buff |* ---------**//*** struct sk_buff - socket buffer* @next: Next buffer in list* @prev: Previous buffer in list* @tstamp: Time we arrived/left* @skb_mstamp_ns: (aka @tstamp) earliest departure time; start point* for retransmit timer* @rbnode: RB tree node, alternative to next/prev for netem/tcp* @list: queue head* @ll_node: anchor in an llist (eg socket defer_list)* @sk: Socket we are owned by* @dev: Device we arrived on/are leaving by* @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL* @cb: Control buffer. Free for use by every layer. 
Put private vars here* @_skb_refdst: destination entry (with norefcount bit)* @sp: the security path, used for xfrm* @len: Length of actual data* @data_len: Data length* @mac_len: Length of link layer header* @hdr_len: writable header length of cloned skb* @csum: Checksum (must include start/offset pair)* @csum_start: Offset from skb->head where checksumming should start* @csum_offset: Offset from csum_start where checksum should be stored* @priority: Packet queueing priority* @ignore_df: allow local fragmentation* @cloned: Head may be cloned (check refcnt to be sure)* @ip_summed: Driver fed us an IP checksum* @nohdr: Payload reference only, must not modify header* @pkt_type: Packet class* @fclone: skbuff clone status* @ipvs_property: skbuff is owned by ipvs* @inner_protocol_type: whether the inner protocol is* ENCAP_TYPE_ETHER or ENCAP_TYPE_IPPROTO* @remcsum_offload: remote checksum offload is enabled* @offload_fwd_mark: Packet was L2-forwarded in hardware* @offload_l3_fwd_mark: Packet was L3-forwarded in hardware* @tc_skip_classify: do not classify packet. 
set by IFB device* @tc_at_ingress: used within tc_classify to distinguish in/egress* @redirected: packet was redirected by packet classifier* @from_ingress: packet was redirected from the ingress path* @nf_skip_egress: packet shall skip nf egress - see netfilter_netdev.h* @peeked: this packet has been seen already, so stats have been* done for it, don't do them again* @nf_trace: netfilter packet trace flag* @protocol: Packet protocol from driver* @destructor: Destruct function* @tcp_tsorted_anchor: list structure for TCP (tp->tsorted_sent_queue)* @_sk_redir: socket redirection information for skmsg* @_nfct: Associated connection, if any (with nfctinfo bits)* @nf_bridge: Saved data about a bridged frame - see br_netfilter.c* @skb_iif: ifindex of device we arrived on* @tc_index: Traffic control index* @hash: the packet hash* @queue_mapping: Queue mapping for multiqueue devices* @head_frag: skb was allocated from page fragments,* not allocated by kmalloc() or vmalloc().* @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves* @pp_recycle: mark the packet for recycling instead of freeing (implies* page_pool support on driver)* @active_extensions: active extensions (skb_ext_id types)* @ndisc_nodetype: router type (from link layer)* @ooo_okay: allow the mapping of a socket to a queue to be changed* @l4_hash: indicate hash is a canonical 4-tuple hash over transport* ports.* @sw_hash: indicates hash was computed in software stack* @wifi_acked_valid: wifi_acked was set* @wifi_acked: whether frame was acked on wifi or not* @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS* @encapsulation: indicates the inner headers in the skbuff are valid* @encap_hdr_csum: software checksum is needed* @csum_valid: checksum is already valid* @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL* @csum_complete_sw: checksum was completed by software* @csum_level: indicates the number of consecutive checksums found in* the packet minus one that have been verified as* 
CHECKSUM_UNNECESSARY (max 3)* @scm_io_uring: SKB holds io_uring registered files* @dst_pending_confirm: need to confirm neighbour* @decrypted: Decrypted SKB* @slow_gro: state present at GRO time, slower prepare step required* @mono_delivery_time: When set, skb->tstamp has the* delivery_time in mono clock base (i.e. EDT). Otherwise, the* skb->tstamp has the (rcv) timestamp at ingress and* delivery_time at egress.* @napi_id: id of the NAPI struct this skb came from* @sender_cpu: (aka @napi_id) source CPU in XPS* @alloc_cpu: CPU which did the skb allocation.* @secmark: security marking* @mark: Generic packet mark* @reserved_tailroom: (aka @mark) number of bytes of free space available* at the tail of an sk_buff* @vlan_present: VLAN tag is present* @vlan_proto: vlan encapsulation protocol* @vlan_tci: vlan tag control information* @inner_protocol: Protocol (encapsulation)* @inner_ipproto: (aka @inner_protocol) stores ipproto when* skb->inner_protocol_type == ENCAP_TYPE_IPPROTO;* @inner_transport_header: Inner transport layer header (encapsulation)* @inner_network_header: Network layer header (encapsulation)* @inner_mac_header: Link layer header (encapsulation)* @transport_header: Transport layer header* @network_header: Network layer header* @mac_header: Link layer header* @kcov_handle: KCOV remote handle for remote coverage collection* @tail: Tail pointer* @end: End pointer* @head: Head of buffer* @data: Data head pointer* @truesize: Buffer size* @users: User count - see {datagram,tcp}.c* @extensions: allocated extensions, valid if active_extensions is nonzero*/struct sk_buff {union {struct {/* These two members must be first to match sk_buff_head. 
*/struct sk_buff *next;struct sk_buff *prev;union {struct net_device *dev;/* Some protocols might use this space to store information,* while device pointer would be NULL.* UDP receive path is one user.*/unsigned long dev_scratch;};};struct rb_node rbnode; /* used in netem, ip4 defrag, and tcp stack */struct list_head list;struct llist_node ll_node;};struct sock *sk;union {ktime_t tstamp;u64 skb_mstamp_ns; /* earliest departure time */};/** This is the control buffer. It is free to use for every* layer. Please put your private variables there. If you* want to keep them across layers you have to do a skb_clone()* first. This is owned by whoever has the skb queued ATM.*/char cb[48] __aligned(8);union {struct {unsigned long _skb_refdst;void (*destructor)(struct sk_buff *skb);};struct list_head tcp_tsorted_anchor;
#ifdef CONFIG_NET_SOCK_MSGunsigned long _sk_redir;
#endif};#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)unsigned long _nfct;
#endifunsigned int len,data_len;__u16 mac_len,hdr_len;/* Following fields are _not_ copied in __copy_skb_header()* Note that queue_mapping is here mostly to fill a hole.*/__u16 queue_mapping;/* if you move cloned around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define CLONED_MASK (1 << 7)
#else
#define CLONED_MASK 1
#endif
#define CLONED_OFFSET offsetof(struct sk_buff, __cloned_offset)/* private: */__u8 __cloned_offset[0];/* public: */__u8 cloned:1,nohdr:1,fclone:2,peeked:1,head_frag:1,pfmemalloc:1,pp_recycle:1; /* page_pool recycle indicator */
#ifdef CONFIG_SKB_EXTENSIONS__u8 active_extensions;
#endif/* Fields enclosed in headers group are copied* using a single memcpy() in __copy_skb_header()*/struct_group(headers,/* private: */__u8 __pkt_type_offset[0];/* public: */__u8 pkt_type:3; /* see PKT_TYPE_MAX */__u8 ignore_df:1;__u8 nf_trace:1;__u8 ip_summed:2;__u8 ooo_okay:1;__u8 l4_hash:1;__u8 sw_hash:1;__u8 wifi_acked_valid:1;__u8 wifi_acked:1;__u8 no_fcs:1;/* Indicates the inner headers are valid in the skbuff. */__u8 encapsulation:1;__u8 encap_hdr_csum:1;__u8 csum_valid:1;/* private: */__u8 __pkt_vlan_present_offset[0];/* public: */__u8 vlan_present:1; /* See PKT_VLAN_PRESENT_BIT */__u8 csum_complete_sw:1;__u8 csum_level:2;__u8 dst_pending_confirm:1;__u8 mono_delivery_time:1; /* See SKB_MONO_DELIVERY_TIME_MASK */
#ifdef CONFIG_NET_CLS_ACT__u8 tc_skip_classify:1;__u8 tc_at_ingress:1; /* See TC_AT_INGRESS_MASK */
#endif
#ifdef CONFIG_IPV6_NDISC_NODETYPE__u8 ndisc_nodetype:2;
#endif__u8 ipvs_property:1;__u8 inner_protocol_type:1;__u8 remcsum_offload:1;
#ifdef CONFIG_NET_SWITCHDEV__u8 offload_fwd_mark:1;__u8 offload_l3_fwd_mark:1;
#endif__u8 redirected:1;
#ifdef CONFIG_NET_REDIRECT__u8 from_ingress:1;
#endif
#ifdef CONFIG_NETFILTER_SKIP_EGRESS__u8 nf_skip_egress:1;
#endif
#ifdef CONFIG_TLS_DEVICE__u8 decrypted:1;
#endif__u8 slow_gro:1;__u8 csum_not_inet:1;__u8 scm_io_uring:1;#ifdef CONFIG_NET_SCHED__u16 tc_index; /* traffic control index */
#endifunion {__wsum csum;struct {__u16 csum_start;__u16 csum_offset;};};__u32 priority;int skb_iif;__u32 hash;__be16 vlan_proto;__u16 vlan_tci;
#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)union {unsigned int napi_id;unsigned int sender_cpu;};
#endifu16 alloc_cpu;
#ifdef CONFIG_NETWORK_SECMARK__u32 secmark;
#endifunion {__u32 mark;__u32 reserved_tailroom;};union {__be16 inner_protocol;__u8 inner_ipproto;};__u16 inner_transport_header;__u16 inner_network_header;__u16 inner_mac_header;__be16 protocol;__u16 transport_header;__u16 network_header;__u16 mac_header;#ifdef CONFIG_KCOVu64 kcov_handle;
#endif); /* end headers group *//* These elements must be at the end, see alloc_skb() for details. */sk_buff_data_t tail;sk_buff_data_t end;unsigned char *head,*data;unsigned int truesize;refcount_t users;#ifdef CONFIG_SKB_EXTENSIONS/* only useable after checking ->active_extensions != 0 */struct skb_ext *extensions;
#endif
};
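struct sk_buff (usually called an SKB) is one of the most central data structures in the Linux kernel networking subsystem. It represents one network packet (or packet buffer), and as the packet travels through the protocol stack, from the device driver up to the socket layer, every operation revolves around this structure.
Core concept: metadata + data buffers
The SKB itself is metadata: it does not store the packet contents directly, it manages the data buffers through pointers.
The data lives in associated buffers:
Main buffer (pointed to by head): holds the linear data (headers and possibly part of the payload).
Page fragments (managed by skb_shared_info): point to scattered memory pages (used for large packets or zero copy).
Fragment list (frag_list): points to a chain of further sk_buffs (used for IP fragment reassembly and similar).
Key pointers and buffer layout
head: start of the allocated memory block.
data: start of the valid data for the current protocol layer (for example where the IP header begins).
tail: end of the valid data for the current protocol layer.
end: end of the allocated memory block.
skb_shared_info (located at the end of the main buffer): holds the page fragment array (frags[]) and the fragment list pointer (frag_list).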
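The relationship between the four pointers can be checked with a small user-space model. This is only a sketch: toy_skb and the toy_* helpers are hypothetical names, and the model uses plain pointers throughout, whereas the kernel may store tail and end as offsets (sk_buff_data_t). The helper names echo the kernel's skb_headroom(), skb_headlen(), skb_tailroom() and skb_reserve(), whose behaviour they imitate for the linear buffer only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the four buffer pointers. */
struct toy_skb {
	unsigned char *head, *data, *tail, *end;
};

/* A fresh buffer is empty: data and tail both sit at head. */
void toy_skb_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
	skb->head = skb->data = skb->tail = buf;
	skb->end = buf + size;
}

/* Free space in front of the data, available for prepending headers. */
size_t toy_headroom(const struct toy_skb *skb)
{
	return (size_t)(skb->data - skb->head);
}

/* Number of valid bytes in the linear area. */
size_t toy_headlen(const struct toy_skb *skb)
{
	return (size_t)(skb->tail - skb->data);
}

/* Free space behind the data, available for appending payload. */
size_t toy_tailroom(const struct toy_skb *skb)
{
	return (size_t)(skb->end - skb->tail);
}

/* Models skb_reserve(): shift data and tail together on an empty buffer. */
void toy_reserve(struct toy_skb *skb, size_t n)
{
	skb->data += n;
	skb->tail += n;
}
```

At any time headroom + linear length + tailroom equals the size of the main buffer, which is exactly the row "headroom | data | tailroom" in the diagram from the kernel comment.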
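Key fields of the structure (grouped by function)
List and queue management:
next, prev: link the SKB into a doubly linked list (such as a socket send/receive queue).
list: generic list linkage.
rbnode: red-black tree node (used by netem, IP fragment reassembly, and the TCP stack).
ll_node: node in a lock-free list (llist).
tcp_tsorted_anchor: anchor in TCP's time-sorted sent queue.
Ownership and association:
sk: the socket (struct sock) that owns this packet.
dev: the network device (struct net_device) the packet arrived on or is leaving by.
dev_scratch: alternate use of the dev field when dev would be NULL (the UDP receive path is one user).
Packet information:
len: total packet length (linear data + page fragment data + frag_list data).
data_len: length of the non-linear part only (page fragments and frag_list); the linear length is len - data_len.
truesize: total memory consumed by this SKB (the SKB structure itself plus all its data buffers).
users: reference count on the sk_buff structure itself; the SKB is freed when it drops to 0. The data buffer shared between clones made with skb_clone() is counted separately, in skb_shared_info.
cloned: the head buffer may be shared with a clone (check the reference count to be sure).
head_frag: the main buffer was allocated from page fragments (not kmalloc/vmalloc).
pfmemalloc: allocated from the PFMEMALLOC reserves (for network operations under memory pressure).
pp_recycle: marks the packet for page pool (page_pool) recycling instead of freeing.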
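Protocol header handling:
mac_header: offset of the link-layer header (such as the Ethernet header) in the buffer.
network_header: offset of the network-layer header (such as the IP header) in the buffer.
transport_header: offset of the transport-layer header (such as the TCP/UDP header) in the buffer.
inner_mac_header, inner_network_header, inner_transport_header: offsets of the inner protocol headers of encapsulated packets (tunnels such as VXLAN and GRE).
mac_len: length of the link-layer header.
hdr_len: writable header length of a cloned SKB.
protocol: packet protocol as reported by the driver (such as ETH_P_IP).
pkt_type: packet class (broadcast PACKET_BROADCAST, multicast PACKET_MULTICAST, addressed to this host PACKET_HOST, and so on).
vlan_present, vlan_proto, vlan_tci: VLAN tag information.
encapsulation: indicates that the inner headers are valid (tunneled packet).
inner_protocol / inner_ipproto: protocol of the encapsulated inner packet.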
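How a receive path might derive pkt_type from the destination MAC address can be sketched as follows. This is a user-space illustration loosely modeled on the kind of classification Ethernet receive code performs, not a copy of kernel code: toy_classify and enum toy_pkt_type are hypothetical names, while the numeric values match PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST and PACKET_OTHERHOST.

```c
#include <assert.h>
#include <string.h>

/* Same numeric values as PACKET_HOST..PACKET_OTHERHOST, redefined for user space. */
enum toy_pkt_type {
	TOY_HOST = 0,      /* addressed to this device */
	TOY_BROADCAST = 1, /* ff:ff:ff:ff:ff:ff */
	TOY_MULTICAST = 2, /* group bit set in the first octet */
	TOY_OTHERHOST = 3  /* unicast for someone else (seen in promiscuous mode) */
};

enum toy_pkt_type toy_classify(const unsigned char dst[6],
			       const unsigned char dev_addr[6])
{
	static const unsigned char bcast[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

	if (memcmp(dst, bcast, 6) == 0)
		return TOY_BROADCAST;
	if (dst[0] & 0x01) /* least significant bit of octet 0 is the group bit */
		return TOY_MULTICAST;
	if (memcmp(dst, dev_addr, 6) == 0)
		return TOY_HOST;
	return TOY_OTHERHOST;
}
```

Broadcast has to be tested before multicast, because the broadcast address also has the group bit set.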
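Checksums and offload:
csum: the checksum value.
csum_start: offset from skb->head at which checksumming should start.
csum_offset: offset from csum_start at which the checksum should be stored.
ip_summed: the checksum state of the packet, set by the driver on receive and by the stack on transmit (important):
CHECKSUM_NONE: the hardware did not verify the checksum; software must compute it.
CHECKSUM_UNNECESSARY: the hardware has already verified the checksum (the kernel does not need to compute it).
CHECKSUM_COMPLETE: the hardware computed a checksum over the whole received packet and the driver stored it in csum; the stack uses that value to verify the protocol checksum.
CHECKSUM_PARTIAL: on transmit, the checksum still has to be filled in by the device, starting at csum_start and stored at csum_start + csum_offset; the stack has already accounted for the pseudo-header.
csum_valid: the checksum is already known to be valid.
csum_complete_sw: the checksum was completed by software.
csum_level: the number of consecutive checksums in the packet, minus one, that were verified as CHECKSUM_UNNECESSARY (maximum 3; relevant for encapsulated packets).
remcsum_offload: remote checksum offload is enabled.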
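When a packet arrives with CHECKSUM_NONE, software has to run the Internet checksum itself. The function below is a plain, unoptimized user-space sketch of the RFC 1071 one's-complement sum; inet_checksum is a name chosen for this example and is not the kernel's implementation, which uses architecture-specific optimized routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement Internet checksum (RFC 1071) over len bytes.
 * Words are read in network byte order, so the result can be compared
 * directly with the big-endian value found in the header.
 * The 32-bit accumulator is sufficient for packet-sized inputs.
 */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)((data[0] << 8) | data[1]);
		data += 2;
		len -= 2;
	}
	if (len) /* odd trailing byte is padded with zero */
		sum += (uint32_t)(data[0] << 8);

	while (sum >> 16) /* fold the carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Running the function over a header whose checksum field is zero yields the value to store; running it over a header that already contains a correct checksum yields 0, which is how a receiver verifies it.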
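Feature and status flags (many bit-fields of the form __u8 xxx:1):
nohdr: payload reference only; the header must not be modified.
peeked: the packet has already been seen ("peeked"), so statistics have been recorded and must not be counted again.
ignore_df: allow local fragmentation (ignore the Don't Fragment flag in the IP header).
ooo_okay: the mapping of the socket to a transmit queue may be changed, because doing so cannot cause out-of-order delivery.
l4_hash, sw_hash: how the hash was produced (l4_hash: canonical 4-tuple hash over the transport ports; sw_hash: computed in the software stack).
redirected, from_ingress: the packet was redirected by the packet classifier, and whether it was redirected from the ingress path.
offload_fwd_mark, offload_l3_fwd_mark: the packet was L2/L3-forwarded in hardware.
tc_skip_classify, tc_at_ingress: TC (traffic control) state.
decrypted: (TLS) the packet has been decrypted.
mono_delivery_time: tstamp holds a delivery time in the monotonic clock base (EDT).
And so on; the structure contains dozens of such flag bits.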
路由、安全、QoS:
_skb_refdst: 目标路由条目缓存(带无引用计数标志)。
priority: 数据包排队优先级(用于 QoS)。
mark: 通用数据包标记(可由 iptables 等设置)。
secmark: 安全标记(SELinux 等)。
hash: 数据包哈希值(用于多队列、RPS 等)。
queue_mapping: 指定发送/接收队列索引。
tc_index: 流量控制索引。
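Timestamps and hardware association:
tstamp / skb_mstamp_ns: packet timestamp (receive time, or earliest departure time on transmit).
napi_id / sender_cpu: ID of the NAPI structure the packet came from, or the source CPU ID (used by XPS).
alloc_cpu: the CPU that allocated this SKB.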
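Extensions and protocol-specific fields:
active_extensions: flags for the active extensions.
extensions: pointer to the skb_ext structure that holds optional extension data (for example bridge netfilter information or the IPsec security path).
_nfct: (Netfilter) the associated connection tracking entry.
nf_bridge: (Netfilter) saved data about a bridged frame.
ipvs_property: the SKB is owned by IPVS.
scm_io_uring: the SKB holds io_uring registered files.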
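Control block (Control Buffer, cb):
A 48-byte buffer aligned to 8 bytes (char cb[48] __aligned(8)).
It is private scratch space for whichever protocol layer currently owns the SKB. TCP, UDP, IP and other protocols store temporary control information here (sequence numbers, state, and so on) while the packet passes through them. This is a key part of how an SKB is handed between layers efficiently.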
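A layer typically overlays its own small structure on cb and checks at compile time that it fits into the 48 bytes. The user-space sketch below illustrates the idea; toy_skb, toy_proto_cb and the toy_cb_* helpers are hypothetical names for this example. The kernel itself accesses cb through cast macros such as TCP_SKB_CB(); this sketch copies with memcpy instead so that it stays valid under user-space strict-aliasing rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in carrying only the control block (not the kernel struct). */
struct toy_skb {
    _Alignas(8) char cb[48];    /* same size and alignment as described above */
};

/* Hypothetical private state that one protocol layer wants to attach. */
struct toy_proto_cb {
    uint32_t seq;       /* first sequence number carried by the packet */
    uint32_t end_seq;   /* sequence number just past the payload */
    uint8_t  flags;     /* layer-specific flags */
};

/* Compile-time guarantee that the private state fits into cb. */
static_assert(sizeof(struct toy_proto_cb) <= sizeof(((struct toy_skb *)0)->cb),
              "private state must fit in cb");

/* Store the layer's private state into the SKB's control block. */
static void toy_cb_store(struct toy_skb *skb, const struct toy_proto_cb *src)
{
    memcpy(skb->cb, src, sizeof(*src));
}

/* Read the layer's private state back out of the control block. */
static struct toy_proto_cb toy_cb_load(const struct toy_skb *skb)
{
    struct toy_proto_cb out;

    memcpy(&out, skb->cb, sizeof(out));
    return out;
}
```

Because every layer reuses the same 48 bytes, whatever one layer stores there should be treated as gone once the SKB has been handed to another layer.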
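struct sk_buff is the foundation of the Linux network stack's performance:
Zero-copy optimization: page fragments (page frags) and frag_list avoid copying large amounts of memory.
Efficient header handling: moving the data and tail pointers is enough to add or strip a protocol header (no payload is moved).
Hardware offload support: flags and fields such as ip_summed and csum_level support checksum, segmentation (TSO/GSO), encryption and other hardware acceleration.
Per-layer private state: the cb control block lets each protocol layer keep private state with the packet, avoiding global lookups.
Flexible data layout: a linear buffer plus scattered fragments accommodates packets of any size and origin.
Statistics and tracing: many fields exist for performance statistics, QoS, security, filtering and debugging.
Reference counting and cloning: users and cloned support efficient packet sharing (for example mirroring and tapping).
Integration with kernel infrastructure: lists, queues, trees, reference counts, and memory allocation (kmalloc, page_pool).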
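The kernel provides a set of functions for operating on and managing sk_buff. Here are some of the common APIs:
1. Allocating an sk_buff
An sk_buff must be allocated before it can be used. Start with alloc_skb, which is defined in include/linux/skbuff.h. Its prototype is:
static inline struct sk_buff *alloc_skb(unsigned int size, gfp_t priority)
The parameters and return value are:
size: the size to allocate, that is, the size of the skb data area.
priority: a GFP mask such as GFP_KERNEL or GFP_ATOMIC.
Return value: the address of the allocated sk_buff on success, NULL on failure.
Network device drivers commonly use netdev_alloc_skb to allocate a receive sk_buff for a particular device. It is also defined in include/linux/skbuff.h, and its prototype is:
static inline struct sk_buff *netdev_alloc_skb(struct net_device *dev, unsigned int length)
The parameters and return value are:
dev: the device the sk_buff is allocated for.
length: the size to allocate.
Return value: the address of the allocated sk_buff on success, NULL on failure.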
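2. Freeing an sk_buff
An sk_buff must be freed once it is no longer needed. The function for this is kfree_skb, declared in include/linux/skbuff.h and implemented in net/core/skbuff.c (newer kernels implement it as an inline wrapper around kfree_skb_reason). Its prototype is:
void kfree_skb(struct sk_buff *skb)
The parameters and return value are:
skb: the sk_buff to free.
Return value: none.
For network devices it is better to use the following call:
void dev_kfree_skb (struct sk_buff *skb)
It takes a single argument, skb, the sk_buff to free. In the kernel dev_kfree_skb is a macro that expands to consume_skb.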
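Freeing is tied to the users reference count: each holder drops its reference, and the memory is released only when the last reference goes away. The following user-space sketch models that behaviour with hypothetical toy_* functions. It is not the kernel implementation, which uses an atomic refcount_t and also handles fragments, destructors and shared data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in: a reference count plus one data buffer. */
struct toy_skb {
    unsigned int users;     /* number of holders of this toy_skb */
    unsigned char *head;    /* data buffer */
};

/* Allocate a toy_skb with a data area of 'size' bytes; one reference held. */
static struct toy_skb *toy_alloc_skb(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->users = 1;
    return skb;
}

/* Take an additional reference (models sharing the same sk_buff). */
static struct toy_skb *toy_skb_get(struct toy_skb *skb)
{
    skb->users++;
    return skb;
}

/* Drop one reference; returns true only if the memory was actually freed. */
static bool toy_kfree_skb(struct toy_skb *skb)
{
    if (!skb)
        return false;
    assert(skb->users > 0);
    if (--skb->users > 0)
        return false;       /* someone else still holds a reference */
    free(skb->head);
    free(skb);
    return true;
}
```

Every path that took a reference has to drop it exactly once: a missing free leaks the buffer, and an extra free releases memory another holder is still using.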
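3. skb_put, skb_push, skb_pull and skb_reserve
These four functions adjust an sk_buff. First, skb_put extends the data area at the tail: it moves the sk_buff's tail back by n bytes, so the sk_buff's len grows by n bytes. Its prototype is shown below (in newer kernels skb_put, skb_push and skb_pull return void * instead of unsigned char *):
unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
The parameters and return value are:
skb: the sk_buff to operate on.
len: the number of bytes to add.
Return value: the address of the first byte of the newly added area.
The data area before and after skb_put is shown in the figure:
skb_push extends the data area at the head of the sk_buff. Its prototype is:
unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
The parameters and return value are:
skb: the sk_buff to operate on.
len: the number of bytes to add.
Return value: the new start address of the data area after the extension.
The data area before and after skb_push is shown in the figure:
skb_pull removes data from the start of the sk_buff's data area. Its prototype is:
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
The parameters and return value are:
skb: the sk_buff to operate on.
len: the number of bytes to remove.
Return value: the new start address of the data area after the removal.
The data area before and after skb_pull is shown in the figure:
skb_reserve adjusts the headroom of the buffer. It simply moves the sk_buff's data and tail back by n bytes together, so it is meant to be used on an empty sk_buff. Its prototype is:
static inline void skb_reserve(struct sk_buff *skb, int len)
The parameters and return value are:
skb: the sk_buff to operate on.
len: the amount of headroom to add.
Return value: none.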
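The pointer arithmetic behind these four functions can be sketched in user space. The toy model below keeps only head, data, tail, end and len, and mirrors the behaviour described above. All toy_* names are hypothetical, and the real kernel functions do more (for example, they handle non-linear data and trigger a kernel panic on overrun instead of using assert).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the buffer pointers of an sk_buff (not the kernel struct). */
struct toy_skb {
    unsigned char *head;    /* start of the buffer */
    unsigned char *data;    /* start of the valid data */
    unsigned char *tail;    /* end of the valid data */
    unsigned char *end;     /* end of the buffer */
    unsigned int len;       /* number of valid data bytes */
};

/* Start with an empty data area at the very beginning of 'buf'. */
static void toy_skb_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Reserve headroom: move data and tail back together (empty skb only). */
static void toy_skb_reserve(struct toy_skb *skb, unsigned int len)
{
    assert(skb->len == 0 && len <= (size_t)(skb->end - skb->tail));
    skb->data += len;
    skb->tail += len;
}

/* Extend the data area at the tail; returns the start of the new area. */
static unsigned char *toy_skb_put(struct toy_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert(len <= (size_t)(skb->end - skb->tail));
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Extend the data area at the head (into the headroom); returns new data. */
static unsigned char *toy_skb_push(struct toy_skb *skb, unsigned int len)
{
    assert(len <= (size_t)(skb->data - skb->head));
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

/* Remove bytes from the head; returns new data, or NULL if len is too large. */
static unsigned char *toy_skb_pull(struct toy_skb *skb, unsigned int len)
{
    if (len > skb->len)
        return NULL;
    skb->len -= len;
    skb->data += len;
    return skb->data;
}
```

A typical sequence is reserve (make headroom), put (append payload), push (prepend a header on transmit) and pull (strip a header on receive). None of these steps copies the payload.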
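The NAPI network processing mechanism
Anyone who has worked with microcontrollers knows that communication interfaces such as I2C, SPI and Ethernet can receive data in two ways: polling or interrupts. Network reception in Linux likewise supports both polling and interrupts. The advantage of interrupts is fast response: when traffic is light, data is handled promptly. But when traffic is heavy and consists mostly of short frames, interrupts fire so often that a large share of CPU time is spent on interrupt handling itself.
Polling is the opposite: its response is less timely than an interrupt, but it does not burn excessive CPU time when handling large amounts of data.
Building on these two approaches, Linux introduced another way to receive network data: NAPI (New API), an efficient network processing technique. The core idea of NAPI is not to read all network data from interrupts. Instead, an interrupt wakes up the receive routine, and the receive routine then handles the data by polling (POLL). This improves receive efficiency for short packets and reduces the time spent in interrupt handling.
NAPI is now widely used in Linux network drivers. This section briefly explains how to use it in a driver. The kernel represents a NAPI instance with struct napi_struct, and one must be initialized before NAPI can be used.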
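1. Initializing NAPI
First initialize a napi_struct instance with netif_napi_add. In the kernel versions where it takes a weight argument, as shown here, the function is defined in net/core/dev.c. Its prototype is:
void netif_napi_add(struct net_device *dev, struct napi_struct *napi, int (*poll)(struct napi_struct *, int), int weight)
The parameters and return value are:
dev: every NAPI instance must be associated with a network device; this parameter specifies that device.
napi: the NAPI instance to initialize.
poll: the polling function used by NAPI. This is the important part: receiving network data is normally done in this function.
weight: the default NAPI weight, normally NAPI_POLL_WEIGHT. (Since Linux 6.1, netif_napi_add no longer takes this argument and uses NAPI_POLL_WEIGHT; netif_napi_add_weight accepts an explicit weight.)
Return value: none.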
2. Deleting NAPI
To delete a NAPI instance, use netif_napi_del. Its prototype is:
void netif_napi_del(struct napi_struct *napi)
The parameters and return value are:
napi: the NAPI instance to delete.
Return value: none.
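3. Enabling NAPI
After initialization, a NAPI instance must be enabled before it can be used. This is done with napi_enable, whose prototype is:
inline void napi_enable(struct napi_struct *n)
The parameters and return value are:
n: the NAPI instance to enable.
Return value: none.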
4. Disabling NAPI
To disable a NAPI instance, use napi_disable. Its prototype is:
void napi_disable(struct napi_struct *n)
The parameters and return value are:
n: the NAPI instance to disable.
Return value: none.
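5. Checking whether NAPI can be scheduled
Use napi_schedule_prep to check whether a NAPI instance can be scheduled. Its prototype is:
inline bool napi_schedule_prep(struct napi_struct *n)
The parameters and return value are:
n: the NAPI instance to check.
Return value: true if it can be scheduled, false if it cannot.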
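6. Scheduling NAPI
If the instance can be scheduled, schedule it with __napi_schedule. Its prototype is:
void __napi_schedule(struct napi_struct *n)
The parameters and return value are:
n: the NAPI instance to schedule.
Return value: none.
Alternatively, napi_schedule does the work of both napi_schedule_prep and __napi_schedule in a single call.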
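7. Completing NAPI processing
When NAPI processing is finished, call napi_complete to mark it as complete. Its prototype is shown below (in newer kernels napi_complete returns bool and is a wrapper around napi_complete_done(n, 0)):
inline void napi_complete(struct napi_struct *n)
The parameters and return value are:
n: the NAPI instance whose processing is complete.
Return value: none.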
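Put together, the usual call order is: the interrupt handler checks and schedules NAPI (typically masking the device's receive interrupt), the poll function processes up to a budget of packets, and when it processes fewer packets than the budget it completes NAPI and re-enables the interrupt. The user-space simulation below is a sketch of that flow under these assumptions. The toy_nic structure and all toy_* functions are hypothetical models, not kernel code, and toy_run_softirq merely stands in for the kernel's softirq loop.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NAPI_POLL_WEIGHT 64     /* per-poll budget used by this model */

/* Simulated NIC plus the NAPI state that matters for the flow. */
struct toy_nic {
    int rx_pending;         /* packets waiting in the simulated RX ring */
    bool irq_enabled;       /* is the RX interrupt unmasked? */
    bool napi_scheduled;    /* models the NAPI "scheduled" state */
    int irq_count;          /* how many interrupts were taken */
    int rx_done;            /* how many packets were processed */
};

/* Models napi_schedule_prep: true only if NAPI was not already scheduled. */
static bool toy_napi_schedule_prep(struct toy_nic *nic)
{
    if (nic->napi_scheduled)
        return false;
    nic->napi_scheduled = true;
    return true;
}

/* Interrupt handler: does no packet processing, only schedules polling. */
static void toy_irq_handler(struct toy_nic *nic)
{
    nic->irq_count++;
    if (toy_napi_schedule_prep(nic))
        nic->irq_enabled = false;   /* mask RX interrupts while polling */
    /* A real driver would call __napi_schedule() at this point. */
}

/* Hardware side: a packet arrives; an interrupt fires only if unmasked. */
static void toy_packet_arrives(struct toy_nic *nic)
{
    nic->rx_pending++;
    if (nic->irq_enabled)
        toy_irq_handler(nic);
}

/* Poll callback: process at most 'budget' packets, return the work done. */
static int toy_poll(struct toy_nic *nic, int budget)
{
    int work = 0;

    assert(budget > 0);
    while (work < budget && nic->rx_pending > 0) {
        nic->rx_pending--;
        nic->rx_done++;
        work++;
    }
    if (work < budget) {
        nic->napi_scheduled = false;    /* models napi_complete */
        nic->irq_enabled = true;        /* unmask RX interrupts again */
    }
    return work;
}

/* Softirq stand-in: keep polling while NAPI is scheduled; count the polls. */
static int toy_run_softirq(struct toy_nic *nic)
{
    int polls = 0;

    while (nic->napi_scheduled) {
        toy_poll(nic, TOY_NAPI_POLL_WEIGHT);
        polls++;
    }
    return polls;
}
```

In this model a burst of 150 packets costs a single interrupt followed by three poll calls (64 + 64 + 22 packets), whereas a purely interrupt-driven design would take one interrupt per packet.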