【Hive 开发进阶】窗口函数深度解析:OVER/NTILE/RANK 实战案例与行转列高级技巧
一、窗口函数 OVER 详解
窗口函数用于在分组内进行数据排名、聚合计算等操作,语法格式:
函数名() over([partition by 分组字段] [order by 排序字段] [window子句])
案例:员工信息与部门平均工资
create table emp (id int,dept string,salary double
);select id, dept, salary,avg(salary) over(partition by dept) as 部门平均工资
from emp;
二、序列函数
1. NTILE(数据分桶)
-- 将消费记录按金额分成2桶
select id, amount,ntile(2) over(order by amount desc) as 桶号
from sales;
2. LAG 与 LEAD(前后行数据获取)
-- 查询顾客上次购买时间
select user_id, order_time,lag(order_time, 1, '无历史记录') over(partition by user_id order by order_time) as 上次购买时间
from orders;
3. FIRST_VALUE 与 LAST_VALUE
-- 获取每个班级最高分和最低分
select class, score,first_value(score) over(partition by class order by score desc) as 最高分,last_value(score) over(partition by class order by score) as 最低分
from stu_score;
三、排名函数
函数 | 特点 |
---|---|
ROW_NUMBER() | 顺序排名,无重复 |
RANK() | 并列排名,有空位 |
DENSE_RANK() | 并列排名,无空位 |
案例:班级成绩排名
select id, class, score,row_number() over(partition by class order by score desc) as rn_row,rank() over(partition by class order by score desc) as rn_rank,dense_rank() over(partition by class order by score desc) as rn_dense
from stu_score;
部分输出结果:
id | class | score | rn_row | rn_rank | rn_dense |
---|---|---|---|---|---|
1 | A | 95 | 1 | 1 | 1 |
2 | A | 95 | 2 | 1 | 1 |
3 | A | 90 | 3 | 3 | 2 |
四、窗口子句(WINDOW 子句)
-- 计算当前行与前两行的消费总和
select user_id, order_time, amount,sum(amount) over(partition by user_id order by order_time rows between 2 preceding and current row) as 近3次消费总额
from orders;
五、综合实战:行转列进阶
-- 需求:将课程表转换为选修/未选修状态(中文显示)
select id,if(array_contains(collect_set(course), 'a'), '选修', '未选修') as a,if(array_contains(collect_set(course), 'b'), '选修', '未选修') as b
from courses group by id;
输出结果:
id | a | b |
---|---|---|
1 | 选修 | 选修 |
2 | 选修 | 未选修 |
3 | 选修 | 选修 |