当前位置：首页 > ai >正文

内核Sched调度关于find_idlest_cpu选核逻辑

ai 2025/9/2 16:25:24

`find_idlest_cpu`（考虑全局负载均衡）

目标：忽略亲和性，从全局视角寻找负载最低的 CPU（考虑 CPU 利用率、任务数、缓存热度等）。
实现：遍历所有可用 CPU，计算每个 CPU 的负载指标（如加权利用率），选择综合负载最低的核心。

一.函数逻辑实现

static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
7131  				  int cpu, int prev_cpu, int sd_flag)
7132  {
7133  	int new_cpu = cpu;// 初始检查// 检查当前调度域 sd 的 CPU 集合（sched_domain_span(sd)）是否与任务 p 允许运行的 CPU 集合（p->cpus_ptr）有交集。// 如果没有交集，说明任务 p 不允许在当前调度域中的任何 CPU 上运行，直接返回 prev_cpu
7134  
7135  	if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
7136  		return prev_cpu;
7137  
7138  	/*
7139  	 * We need task's util for cpu_util_without, sync it up to
7140  	 * prev_cpu's last_update_time.
7141  	 */// 同步任务的负载信息// 如果 sd_flag 不包含 SD_BALANCE_FORK 标志（即不是新创建的任务），则调用 sync_entity_load_avg 同步任务的负载信息。// 这是为了确保任务的负载统计信息是最新的，便于后续计算 CPU 的空闲程度
7142  	if (!(sd_flag & SD_BALANCE_FORK))
7143  		sync_entity_load_avg(&p->se);
7144  // 从当前调度域开始，逐层向下（子调度域）寻找最空闲的 CPU7145  	while (sd) {
7146  		struct sched_group *group;
7147  		struct sched_domain *tmp;
7148  		int weight;
7149  // 如果当前调度域 sd 的标志位 flags 不包含 sd_flag，说明该调度域不适合当前的调度场景
7150  		if (!(sd->flags & sd_flag)) {// 直接跳过当前调度域，进入其子调度域（sd->child）
7151  			sd = sd->child;
7152  			continue;
7153  		}
7154  // 查找最空闲的调度组// 调用 find_idlest_group 函数，在当前调度域 sd 中查找最空闲的调度组（sched_group）
7155  		group = find_idlest_group(sd, p, cpu);// 如果没有找到合适的调度组，直接跳过当前调度域，进入其子调度域
7156  		if (!group) {
7157  			sd = sd->child;
7158  			continue;
7159  		}
7160  // 查找最空闲的 CPU// 在找到的调度组中，调用 find_idlest_group_cpu 查找最空闲的 CPU
7161  		new_cpu = find_idlest_group_cpu(group, p, cpu);// 如果找到的 CPU 和当前 CPU 是同一个（new_cpu == cpu），说明没有更好的选择，继续在更低级别的调度域中尝试平衡
7162  		if (new_cpu == cpu) {
7163  			/* Now try balancing at a lower domain level of 'cpu': */
7164  			sd = sd->child;
7165  			continue;
7166  		}
7167  
7168  		/* Now try balancing at a lower domain level of 'new_cpu': */// 更新调度域并继续搜索// 将当前 CPU 更新为找到的最空闲 CPU（new_cpu）
7169  		cpu = new_cpu;// 记录当前调度域的权重（span_weight），然后清空 sd
7170  		weight = sd->span_weight;
7171  		sd = NULL;// 使用 for_each_domain 遍历新 CPU 所属的所有调度域，找到满足条件的下一个调度域
7172  		for_each_domain(cpu, tmp) {// 权重大于当前调度域（weight <= tmp->span_weight）
7173  			if (weight <= tmp->span_weight)
7174  				break;// 包含指定的标志位（tmp->flags & sd_flag）
7175  			if (tmp->flags & sd_flag)
7176  				sd = tmp;
7177  		}
7178  	}
7179  // 最终返回找到的最空闲的 CPU
7180  	return new_cpu;
7181  }

static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
int cpu, int prev_cpu, int sd_flag)

输入参数：
- sd: 当前的调度域（sched_domain），表示一组相关的 CPU。
- p: 要调度的任务（task_struct）。
- cpu: 当前考虑的 CPU。
- prev_cpu: 任务之前运行的 CPU。
- sd_flag: 标志位，用于控制调度行为（如是否是新创建的任务）。
返回值：返回一个整数，表示选择的最空闲的 CPU。

二.总结

这段代码的核心逻辑是通过逐层遍历调度域和调度组，找到最空闲的 CPU，并确保任务可以被合理地分配到该 CPU 上运行。主要步骤包括：

检查任务是否可以在当前调度域中运行。
同步任务的负载信息。
在调度域中查找最空闲的调度组和 CPU。
逐层深入子调度域，直到找到最优解或无法进一步优化。

附件知识

三.调度域（`sched_domain`）

struct sched_domain {
89  	/* These fields must be setup */
90  	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
91  	struct sched_domain __rcu *child;	/* bottom domain must be null terminated */
92  	struct sched_group *groups;	/* the balancing groups of the domain */
93  	unsigned long min_interval;	/* Minimum balance interval ms */
94  	unsigned long max_interval;	/* Maximum balance interval ms */
95  	unsigned int busy_factor;	/* less balancing by factor if busy */
96  	unsigned int imbalance_pct;	/* No balance until over watermark */
97  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
98  	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
99  
100  	int nohz_idle;			/* NOHZ IDLE status */
101  	int flags;			/* See SD_* */
102  	int level;
103  
104  	/* Runtime fields. */
105  	unsigned long last_balance;	/* init to jiffies. units in jiffies */
106  	unsigned int balance_interval;	/* initialise to 1. units in ms. */
107  	unsigned int nr_balance_failed; /* initialise to 0 */
108  
109  	/* idle_balance() stats */
110  	u64 max_newidle_lb_cost;
111  	unsigned long last_decay_max_lb_cost;
112  
113  	u64 avg_scan_cost;		/* select_idle_sibling */
114  
115  #ifdef CONFIG_SCHEDSTATS
116  	/* load_balance() stats */
117  	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
118  	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
119  	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
120  	unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
121  	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
122  	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
123  	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
124  	unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
125  
126  	/* Active load balancing */
127  	unsigned int alb_count;
128  	unsigned int alb_failed;
129  	unsigned int alb_pushed;
130  
131  	/* SD_BALANCE_EXEC stats */
132  	unsigned int sbe_count;
133  	unsigned int sbe_balanced;
134  	unsigned int sbe_pushed;
135  
136  	/* SD_BALANCE_FORK stats */
137  	unsigned int sbf_count;
138  	unsigned int sbf_balanced;
139  	unsigned int sbf_pushed;
140  
141  	/* try_to_wake_up() stats */
142  	unsigned int ttwu_wake_remote;
143  	unsigned int ttwu_move_affine;
144  	unsigned int ttwu_move_balance;
145  #endif
146  #ifdef CONFIG_SCHED_DEBUG
147  	char *name;
148  #endif
149  	union {
150  		void *private;		/* used during construction */
151  		struct rcu_head rcu;	/* used during destruction */
152  	};
153  	struct sched_domain_shared *shared;
154  
155  	unsigned int span_weight;
156  
157  	ANDROID_KABI_RESERVE(1);
158  	ANDROID_KABI_RESERVE(2);
159  
160  	/*
161  	 * Span of all CPUs in this domain.
162  	 *
163  	 * NOTE: this field is variable length. (Allocated dynamically
164  	 * by attaching extra space to the end of the structure,
165  	 * depending on how many CPUs the kernel has booted up with)
166  	 */
167  	unsigned long span[];
168  };

调度域（sched_domain）是 Linux 内核中调度器的核心概念之一，用于组织和管理 CPU 的层次结构。它的设计目的是帮助调度器在多核系统中高效地分配任务，同时平衡负载

1. 调度域的作用

调度域的主要作用是将系统中的 CPU 分组，并定义这些组之间的关系。通过这种分层结构，调度器可以在不同的粒度上进行负载均衡：

高层调度域：通常覆盖整个 NUMA 节点或更大的范围。
低层调度域：通常覆盖单个物理核心或其超线程。

调度域的设计使得调度器可以根据任务的需求和系统的拓扑结构，在不同层级上动态地调整任务分配。

2. 调度域的层次结构

调度域是一个树形结构，每一层调度域都有一个父节点（parent）和子节点（child）。以下是典型的调度域层次结构（从高到低）：

(1) NUMA 节点域

描述：最高层的调度域，对应非统一内存访问（NUMA）架构中的节点。
特点：
- 每个 NUMA 节点包含一组物理 CPU 和本地内存。
- 跨 NUMA 节点的任务迁移会带来较高的内存访问延迟。
用途：优先在同一个 NUMA 节点内分配任务。

(2) MC（Multi-Core）域

描述：对应一个多核处理器（Socket 或 Package）。
特点：
- 包含同一物理处理器上的所有核心。
- 核心之间共享 L3 缓存或其他资源。
用途：在同一个物理处理器内进行任务分配和负载均衡。

(3) SMT（Simultaneous Multi-Threading）域

描述：最低层的调度域，对应一个物理核心上的超线程（Hyper-Threading）。
特点：
- 超线程共享同一个物理核心的执行资源。
- 在超线程之间迁移任务的成本最低。
用途：在同一个物理核心内的超线程之间进行任务分配。

3. 调度域的关键字段

每个调度域（sched_domain）都有一些关键字段，用于描述其特性。以下是一些重要的字段及其含义：

(1) `span`

类型：cpumask_t
描述：表示该调度域覆盖的 CPU 集合。
用途：用于判断某个任务是否可以在该调度域中的 CPU 上运行。

(2) `parent` 和 `child`

类型：指向其他 sched_domain 的指针。
描述：
- parent：指向更高层的调度域。
- child：指向更低层的调度域。
用途：定义调度域的层次结构。

(3) `flags`

类型：位掩码（bitmask）
描述：表示该调度域的特性和行为标志。
常见标志：
- SD_BALANCE_NEWIDLE：在新任务创建时进行负载均衡。
- SD_BALANCE_EXEC：在任务首次执行时进行负载均衡。
- SD_BALANCE_FORK：在任务 fork 时进行负载均衡。
- SD_SHARE_PKG_RESOURCES：表示该域内的 CPU 共享某些硬件资源（如 L3 缓存）。

(4) `groups`

类型：指向 sched_group 的链表。
描述：表示该调度域内的调度组。
用途：调度器会在调度组之间选择最空闲的组来分配任务。

4. 调度组（`sched_group`）

struct sched_group {
1947  	struct sched_group	*next;			/* Must be a circular list */
1948  	atomic_t		ref;
1949  
1950  	unsigned int		group_weight;
1951  	unsigned int		cores;
1952  	struct sched_group_capacity *sgc;
1953  	int			asym_prefer_cpu;	/* CPU of highest priority in group */
1954  	int			flags;
1955  
1956  	/*
1957  	 * The CPUs this group covers.
1958  	 *
1959  	 * NOTE: this field is variable length. (Allocated dynamically
1960  	 * by attaching extra space to the end of the structure,
1961  	 * depending on how many CPUs the kernel has booted up with)
1962  	 */
1963  	unsigned long		cpumask[];
1964  };

调度组是调度域的子集，用于进一步细化 CPU 的分组。每个调度域包含多个调度组，调度组的关键字段包括：

(1) `cpumask`

类型：cpumask_t
描述：表示该调度组覆盖的 CPU 集合。
用途：用于判断某个任务是否可以在该调度组中的 CPU 上运行。

(2) `sgc`（`sched_group_capacity`）

类型：struct sched_group_capacity
描述：表示该调度组的容量信息，包括 CPU 的利用率、空闲程度等。
用途：用于计算调度组的负载，以便选择最空闲的组。

5. 调度域的工作流程

调度域的工作流程可以分为以下几个步骤：

(1) 初始化调度域

在系统启动时，调度器会根据 CPU 拓扑结构初始化调度域和调度组。
每个 CPU 都会有一个完整的调度域层次结构。

(2) 负载均衡

调度器会定期检查每个调度域的负载情况。
如果发现某个调度域内的负载不均衡，调度器会尝试将任务迁移到更空闲的 CPU。

(3) 任务分配

当需要为新任务分配 CPU 时，调度器会从最高层的调度域开始，逐层向下查找最空闲的 CPU。
查找过程涉及调度域、调度组和 CPU 的利用率计算。

6. 示例

假设一个系统有以下拓扑结构：

两个 NUMA 节点（Node0 和 Node1）。
每个 NUMA 节点有两个物理核心（Core0 和 Core1）。
每个物理核心有两个超线程（Thread0 和 Thread1）。

NUMA 域 (Node0)
├── MC 域 (Core0, Core1)
│ ├── SMT 域 (Thread0, Thread1)
│ └── SMT 域 (Thread0, Thread1)
└── MC 域 (Core2, Core3)
├── SMT 域 (Thread0, Thread1)
└── SMT 域 (Thread0, Thread1)
NUMA 域 (Node1)
├── MC 域 (Core4, Core5)
│ ├── SMT 域 (Thread0, Thread1)
│ └── SMT 域 (Thread0, Thread1)
└── MC 域 (Core6, Core7)
├── SMT 域 (Thread0, Thread1)
└── SMT 域 (Thread0, Thread1)

四.`sd->flags & sd_flag` 的判断逻辑

在代码中，sd->flags & sd_flag 的判断逻辑是用来决定当前调度域（sched_domain）是否符合特定的调度需求。

/*
7139  	 * We need task's util for cpu_util_without, sync it up to
7140  	 * prev_cpu's last_update_time.
7141  	 */
7142  	if (!(sd_flag & SD_BALANCE_FORK))
7143  		sync_entity_load_avg(&p->se);
7144  
7145  	while (sd) {
7146  		struct sched_group *group;
7147  		struct sched_domain *tmp;
7148  		int weight;
7149  
7150  		if (!(sd->flags & sd_flag)) {
7151  			sd = sd->child;
7152  			continue;
7153  		}
7154

1. `flags` 的作用

flags 是 sched_domain 结构体中的一个字段，用于描述该调度域的特性和行为。它是一个位掩码（bitmask），每一位表示一种特性或标志。常见的标志包括：

SD_BALANCE_NEWIDLE：允许在新任务创建时进行负载均衡。
SD_BALANCE_EXEC：允许在任务首次执行时进行负载均衡。
SD_BALANCE_FORK：允许在任务 fork 时进行负载均衡。
SD_SHARE_PKG_RESOURCES：表示调度域内的 CPU 共享某些硬件资源（如 L3 缓存）。

判断逻辑

if (!(sd->flags & sd_flag)) {sd = sd->child;continue;
}

sd_flag：
- 这是调用者传入的一个参数，表示当前调度场景需要满足的标志。
- 比如，如果 sd_flag 包含 SD_BALANCE_NEWIDLE，那么只有支持 SD_BALANCE_NEWIDLE 的调度域才会被考虑。
sd->flags & sd_flag：
- 这是一个按位与操作，用来检查 sd->flags 是否包含 sd_flag 中的所有标志。
- 如果 sd->flags 不包含 sd_flag 中的标志，则跳过当前调度域，进入其子调度域（sd->child）。

2. 遍历到 `child` 的逻辑

调度域是一个树形结构，每个调度域都有一个指向其子调度域的指针（child）。当当前调度域不满足条件时，代码会通过以下方式遍历到子调度域：

(1) 更新 `sd` 指针

sd = sd->child;

将 sd 指向其子调度域。
子调度域通常覆盖更小的范围（例如，从 NUMA 域到 MC 域，再到 SMT 域）。

(2) 继续循环

continue;

跳过当前调度域的剩余逻辑，直接进入下一次循环，处理新的 sd。

3. 示例：调度域层次结构

假设系统有以下调度域层次结构：

NUMA 域 (Node0)
├── MC 域 (Core0, Core1)
│ ├── SMT 域 (Thread0, Thread1)
│ └── SMT 域 (Thread0, Thread1)
└── MC 域 (Core2, Core3)
├── SMT 域 (Thread0, Thread1)
└── SMT 域 (Thread0, Thread1)

(1) 初始状态

当前调度域 sd 指向 NUMA 域。
假设 sd_flag 包含 SD_BALANCE_NEWIDLE。

(2) 检查 `flags`

如果 NUMA 域的 flags 不包含 SD_BALANCE_NEWIDLE，则执行：

sd = sd->child;
continue;

将 sd 指向 MC 域（NUMA 域的子调度域）。

(3) 再次检查

如果 MC 域的 flags 包含 SD_BALANCE_NEWIDLE，则继续处理该调度域。
如果 MC 域的 flags 不包含 SD_BALANCE_NEWIDLE，则再次执行：

sd = sd->child;
continue;

将 sd 指向 SMT 域（MC 域的子调度域）。

4. 遍历的终止条件

遍历会在以下情况下终止：

找到符合条件的调度域：
- 如果当前调度域的 flags 包含 sd_flag，则进入后续逻辑（如查找最空闲的调度组和 CPU）。
到达叶子节点：
- 如果当前调度域没有子调度域（即 sd->child == NULL），则遍历结束。

5. 总结

flags 的判断：
- 通过按位与操作（sd->flags & sd_flag）检查调度域是否符合特定的调度需求。
遍历到 child 的逻辑：
- 如果当前调度域不符合条件，则将 sd 指向其子调度域，并继续循环。
- 这种逐层遍历的方式确保了调度器可以在不同粒度上寻找最合适的 CPU。

查看全文

http://www.xdnf.cn/news/19297.html