MySQL 8.0 窗口函数详解:让数据分析更简单高效
在日常的数据分析工作中,我们经常需要对数据进行分组排序、计算移动平均值、统计累计求和等操作。在MySQL 8.0之前,这类需求通常需要编写复杂的子查询或连接查询才能实现。而MySQL 8.0引入的窗口函数(Window Functions)极大地简化了这类操作,让数据分析变得更加简单高效。
本文将通过通俗易懂的方式,带你全面了解MySQL 8.0中的窗口函数,包括聚合类、排名类和跨行类窗口函数的使用方法。
什么是窗口函数?
窗口函数是一种特殊的SQL函数,它能够在不改变原有行数的情况下,对查询结果的某个"窗口"(一组相关的行)进行计算,并为每一行返回一个值。
可以把窗口函数想象成:你有一张全班学生的成绩表,窗口函数允许你同时看到每个学生的成绩、他在班级中的排名、他与前一名同学的分数差等信息,而不需要改变原始数据表的行数。这个排名是新加的一行。
窗口函数的基本语法
SELECT 列1, 列2,窗口函数() OVER ([PARTITION BY 分区字段] [ORDER BY 排序字段][frame_clause]) AS 别名
FROM 表名;
PARTITION BY
:将数据分成多个分区(类似于GROUP BY,但不合并行)ORDER BY
:确定分区内数据的排序方式frame_clause
:定义窗口框架,即计算范围
一、聚合类窗口函数
聚合类窗口函数可以在保留所有行的同时,计算分组的聚合值。
1. SUM() OVER()
计算分区内的总和:
-- 计算每个部门的工资总额,同时显示每个员工的详细信息
SELECT employee_id,name,department,salary,SUM(salary) OVER (PARTITION BY department) AS dept_total_salary
FROM employees;
2. AVG() OVER()
计算分区内的平均值:
-- 计算每个部门的平均工资,同时显示每个员工的详细信息
SELECT employee_id,name,department,salary,AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary
FROM employees;
3. COUNT() OVER()
计算分区内的行数:
-- 计算每个部门的员工数量,同时显示每个员工的详细信息
SELECT employee_id,name,department,salary,COUNT(*) OVER (PARTITION BY department) AS dept_employee_count
FROM employees;
二、排名类窗口函数
排名类窗口函数用于为分区内的行分配排名、序号等。
1. ROW_NUMBER()
为分区内的每一行分配一个唯一的序号:1,2,3,4,5,6,7
-- 为每个部门的员工按工资从高到低编号
SELECT employee_id,name,department,salary,ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_dept
FROM employees;
2. RANK()
计算分区内的排名,相同值会有相同排名,并跳过后续排名:1,1,3,3,5,6,7
-- 计算每个部门内的工资排名(允许并列)
SELECT employee_id,name,department,salary,RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
FROM employees;
3. DENSE_RANK()
计算分区内的排名,相同值有相同排名,但不跳过后续排名:1,1,2,2,3,4,5,6,6
-- 计算每个部门内的工资密集排名(允许并列但不跳号)
SELECT employee_id,name,department,salary,DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank
FROM employees;
4. NTILE()
将分区内的数据分成指定数量的组:
-- 将每个部门的员工按工资高低分成4个组
SELECT employee_id,name,department,salary,NTILE(4) OVER (PARTITION BY department ORDER BY salary DESC) as quartile
FROM employees;
三、跨行类窗口函数
跨行类窗口函数可以访问分区内其他行的数据。
1. LAG()
访问分区中当前行之前的数据:
-- 查看每位员工和上一名员工的工资差异
SELECT employee_id,name,department,salary,LAG(salary) OVER (PARTITION BY department ORDER BY salary DESC) as prev_salary,salary - LAG(salary) OVER (PARTITION BY department ORDER BY salary DESC) as diff_from_prev
FROM employees;
2. LEAD()
访问分区中当前行之后的数据:
-- 查看每位员工和下一名员工的工资差异
SELECT employee_id,name,department,salary,LEAD(salary) OVER (PARTITION BY department ORDER BY salary DESC) as next_salary,salary - LEAD(salary) OVER (PARTITION BY department ORDER BY salary DESC) as diff_from_next
FROM employees;
3. FIRST_VALUE()
获取分区内第一行的值:
-- 查看每位员工与部门最高工资的差异
SELECT employee_id,name,department,salary,FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESC) as highest_salary,salary - FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESC) as diff_from_highest
FROM employees;
4. LAST_VALUE()
获取分区内最后一行的值:
-- 查看每位员工与部门最低工资的差异
SELECT employee_id,name,department,salary,LAST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESCRANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lowest_salary,salary - LAST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESCRANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as diff_from_lowest
FROM employees;
注意:使用LAST_VALUE()时需要特别注意窗口框架的定义,否则可能得不到预期结果。
窗口框架(Window Frame)详解
窗口框架定义了窗口函数计算时使用的行范围,常用的有两种:
ROWS模式:基于物理行偏移
RANGE模式:基于逻辑值偏移
示例:计算移动平均值
-- 计算每个员工与前2行、当前行、后2行共5行的平均工资
SELECT employee_id,name,department,salary,AVG(salary) OVER (PARTITION BY department ORDER BY salary DESCROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) as moving_avg
FROM employees;
实际应用场景
场景1:计算同比/环比增长率
-- 计算每月销售额与上月相比的增长率
SELECT year,month,sales,LAG(sales) OVER (ORDER BY year, month) as prev_month_sales,(sales - LAG(sales) OVER (ORDER BY year, month)) / LAG(sales) OVER (ORDER BY year, month) * 100 as growth_rate
FROM monthly_sales;
场景2:获取Top N记录
-- 获取每个部门工资前三名的员工
WITH ranked_employees AS (SELECT employee_id,name,department,salary,DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_deptFROM employees
)
SELECT * FROM ranked_employees WHERE rank_in_dept <= 3;
场景3:计算累计百分比
-- 计算每个部门工资的累计百分比
SELECT employee_id,name,department,salary,SUM(salary) OVER (PARTITION BY department ORDER BY salary DESC) as running_total,SUM(salary) OVER (PARTITION BY department) as dept_total,SUM(salary) OVER (PARTITION BY department ORDER BY salary DESC) / SUM(salary) OVER (PARTITION BY department) * 100 as cumulative_percent
FROM employees;
性能优化建议
合理使用索引:为PARTITION BY和ORDER BY涉及的字段创建索引
避免过度使用窗口函数:在数据量大的表中,窗口函数可能影响性能
使用CTE(公用表表达式):将复杂查询分解为多个步骤,提高可读性和性能
限制窗口大小:对于移动平均等计算,尽量限制窗口框架的范围
总结
MySQL 8.0的窗口函数为数据分析提供了强大而灵活的工具,使我们能够在保留原始行细节的同时,进行各种复杂的计算和分析。通过本文的学习,你应该已经掌握了:
窗口函数的基本概念和语法结构
聚合类、排名类和跨行类窗口函数的使用方法
窗口框架的定义和使用场景
窗口函数在实际业务中的应用示例
窗口函数的学习曲线可能稍陡,但一旦掌握,将极大地提升你的数据处理能力和工作效率。建议在实际工作中多加练习,逐步掌握这些强大的功能。
希望本文对你理解和使用MySQL 8.0窗口函数有所帮助!如有任何疑问,欢迎留言讨论。