Big Data Learning (125) - Hive Data Analysis

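1. Consecutive logins (variant)
  • Problem
    Find users who logged in for exactly 3 consecutive days (longer streaks do not count).
    Table schema: user_logs(user_id, login_date)

  • Reference answer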

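    WITH ranked_logs AS (
      SELECT user_id,
             login_date,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
      FROM user_logs
    ),
    consecutive_groups AS (
      -- Consecutive dates minus their row number share the same anchor date (grp).
      -- Assumes one row per user per day; deduplicate first otherwise.
      -- MySQL syntax; the Hive equivalent is DATE_SUB(login_date, rn).
      SELECT user_id,
             DATE_SUB(login_date, INTERVAL rn DAY) AS grp,
             MIN(login_date) AS start_date,
             MAX(login_date) AS end_date,
             COUNT(*) AS days
      FROM ranked_logs
      GROUP BY user_id, grp
    )
    SELECT user_id, start_date, end_date
    FROM consecutive_groups
    WHERE days = 3;  -- exactly 3: a longer streak forms one larger group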
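
To sanity-check the query on a small data set, the same grouping idea can be mirrored in plain Python. This is a minimal sketch rather than part of the original answer: the function name `exact_streaks` and the sample rows are made up for illustration, and it assumes duplicate logins on the same day should be collapsed first.

```python
from datetime import date, timedelta
from itertools import groupby


def exact_streaks(logs, length=3):
    """Return (user_id, start, end) for login streaks of exactly `length` days.

    `logs` is an iterable of (user_id, login_date) pairs (hypothetical helper).
    """
    by_user = {}
    for user_id, day in logs:
        by_user.setdefault(user_id, set()).add(day)  # the set drops same-day duplicates

    result = []
    for user_id in sorted(by_user):
        days = sorted(by_user[user_id])
        # Same trick as the SQL: date minus row number is constant within a streak.
        keyed = [(d - timedelta(days=i), d) for i, d in enumerate(days, start=1)]
        for _, group in groupby(keyed, key=lambda pair: pair[0]):
            streak = [d for _, d in group]
            if len(streak) == length:
                result.append((user_id, streak[0], streak[-1]))
    return result


# Illustrative data: u1 has a 3-day streak, u2 has a 4-day streak.
sample = [
    ("u1", date(2024, 1, 1)), ("u1", date(2024, 1, 2)), ("u1", date(2024, 1, 3)),
    ("u2", date(2024, 1, 1)), ("u2", date(2024, 1, 2)),
    ("u2", date(2024, 1, 3)), ("u2", date(2024, 1, 4)),
]
print(exact_streaks(sample))
```

Only u1 is returned here, because u2's four-day streak is longer than the requested length.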
    
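2. Consecutive days without login
  • Problem
    Find each user's longest run of consecutive days without a login (assume the table records login dates only).
    Table schema: user_logs(user_id, login_date)

  • Reference answer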

    WITH next_logs AS (
      SELECT user_id,
             login_date,
             LEAD(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS next_login
      FROM user_logs
    )
    -- Days strictly between two adjacent logins = date difference minus 1.
    SELECT user_id,
           MAX(DATEDIFF(next_login, login_date) - 1) AS max_consecutive_missing
    FROM next_logs
    WHERE next_login IS NOT NULL
    GROUP BY user_id;
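
The expected result for a small data set can be worked out independently in Python, which helps when checking the query. The sketch below is an illustration only: `max_missing_days` and the sample rows are hypothetical, and same-day duplicates are assumed to count once.

```python
from datetime import date


def max_missing_days(logs):
    """Map each user to the longest run of login-free days between two logins."""
    by_user = {}
    for user_id, day in logs:
        by_user.setdefault(user_id, set()).add(day)

    result = {}
    for user_id, days in by_user.items():
        ordered = sorted(days)
        # Pair each login with the next one, like LEAD() in the query.
        gaps = [(nxt - cur).days - 1 for cur, nxt in zip(ordered, ordered[1:])]
        if gaps:  # a user with a single login date has no gap, as in the SQL
            result[user_id] = max(gaps)
    return result


# Illustrative data: u1 skips Jan 3 to Jan 9, u2 logs in only once.
sample = [
    ("u1", date(2024, 1, 1)), ("u1", date(2024, 1, 2)), ("u1", date(2024, 1, 10)),
    ("u2", date(2024, 3, 5)),
]
print(max_missing_days(sample))
```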
    

II. Advanced Window Function Applications

3. Moving average
  • Problem
    Compute each user's average spend over the most recent 7 days (sliding window).
    Table schema: orders(user_id, order_date, amount)

  • Reference answer

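    SELECT user_id,
           order_date,
           -- RANGE frame: rows dated within the 6 days before order_date, plus order_date itself.
           -- MySQL writes the offset as INTERVAL 6 DAY (number and unit unquoted).
           AVG(amount) OVER (
             PARTITION BY user_id
             ORDER BY order_date
             RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
           ) AS rolling_7day_avg
    FROM orders;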
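
A RANGE frame selects rows by date distance, not by row count, so days without orders simply contribute nothing to the average. The following Python sketch spells out that rule on a few made-up rows. The function name `rolling_7day_avg` and the data are assumptions for illustration.

```python
from datetime import date, timedelta


def rolling_7day_avg(orders):
    """For each (user_id, order_date, amount) row, average the user's amounts
    dated from 6 days before order_date through order_date itself."""
    rows = sorted(orders, key=lambda r: (r[0], r[1]))
    out = []
    for user_id, day, _amount in rows:
        window = [
            a for u, d, a in rows
            if u == user_id and day - timedelta(days=6) <= d <= day
        ]
        out.append((user_id, day, sum(window) / len(window)))
    return out


# Illustrative data with a gap between Jan 3 and Jan 8.
sample = [
    ("u1", date(2024, 1, 1), 10),
    ("u1", date(2024, 1, 3), 20),
    ("u1", date(2024, 1, 8), 30),
    ("u1", date(2024, 1, 9), 40),
]
for row in rolling_7day_avg(sample):
    print(row)
```

On Jan 8 the window reaches back to Jan 2, so the Jan 1 order is no longer included even though it is only two rows earlier. A `ROWS BETWEEN 6 PRECEDING` frame would have kept it.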
    
4. Growth rate
  • Problem
    Compute the month-over-month growth rate of each user's monthly spend.
    Table schema: orders(user_id, order_date, amount)

  • Reference answer

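    WITH monthly_sales AS (
      SELECT user_id,
             DATE_FORMAT(order_date, '%Y-%m') AS month,
             SUM(amount) AS total_amount
      FROM orders
      GROUP BY user_id, month
    )
    SELECT user_id,
           month,
           total_amount,
           -- Growth % = (current / previous - 1) * 100; NULL for a user's first month.
           -- LAG picks the previous month that has orders, which may not be the adjacent calendar month.
           (total_amount / LAG(total_amount) OVER (PARTITION BY user_id ORDER BY month) - 1) * 100 AS growth_rate
    FROM monthly_sales;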
    

III. Time Series Analysis

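5. Filling in missing dates
  • Problem
    Generate each user's daily login status (0 = not logged in, 1 = logged in), including the missing dates.
    Table schema: user_logs(user_id, login_date)

  • Reference answer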

    WITH date_range AS (
      SELECT user_id,
             MIN(login_date) AS start_date,
             MAX(login_date) AS end_date
      FROM user_logs
      GROUP BY user_id
    ),
    all_dates AS (
      SELECT dr.user_id,
             d.calendar_date
      FROM date_range dr
      CROSS JOIN (
        -- Number generator: 4 x 4 rows give n = 0..15, so only the 16 most recent days are covered.
        -- Cross join more digit tables to cover a longer range.
        SELECT CURDATE() - INTERVAL n DAY AS calendar_date
        FROM (
          SELECT @row := @row + 1 AS n
          FROM (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3) t1,
               (SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3) t2,
               (SELECT @row := -1) t3
        ) t
      ) d
      WHERE d.calendar_date BETWEEN dr.start_date AND dr.end_date
    )
    SELECT ad.user_id,
           ad.calendar_date,
           IF(ul.login_date IS NULL, 0, 1) AS is_logged_in
    FROM all_dates ad
    LEFT JOIN user_logs ul
      ON ad.user_id = ul.user_id AND ad.calendar_date = ul.login_date;
    
6. Periodicity detection
  • Problem
    Find users who log in on a fixed day of every week (for example, every Monday).
    Table schema: user_logs(user_id, login_date)

  • Reference answer

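    WITH day_of_week AS (
      SELECT user_id,
             login_date,
             DAYOFWEEK(login_date) AS dow,  -- 1 = Sunday ... 7 = Saturday
             YEARWEEK(login_date) AS yw     -- keeps weeks of different years apart
      FROM user_logs
    ),
    user_weeks AS (
      -- Assumption: "every week" means every week in which the user logged in at all.
      SELECT user_id, COUNT(DISTINCT yw) AS total_weeks
      FROM day_of_week
      GROUP BY user_id
    )
    SELECT d.user_id,
           d.dow,
           COUNT(DISTINCT d.yw) AS weeks_count,
           COUNT(*) AS login_count
    FROM day_of_week d
    JOIN user_weeks uw ON d.user_id = uw.user_id
    GROUP BY d.user_id, d.dow, uw.total_weeks
    HAVING weeks_count = uw.total_weeks; -- logged in on this weekday in every active week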
    

IV. Complex Business Scenarios

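7. Purchase interval analysis
  • Problem
    Compute each user's average purchase interval and find the users whose interval exceeds 30 days.
    Table schema: orders(user_id, order_date)

  • Reference answer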

    WITH order_intervals AS (
      SELECT user_id,
             order_date,
             DATEDIFF(order_date, LAG(order_date) OVER (PARTITION BY user_id ORDER BY order_date)) AS days_since_last
      FROM orders
    )
    SELECT user_id,
           AVG(days_since_last) AS avg_interval
    FROM order_intervals
    WHERE days_since_last IS NOT NULL  -- drop each user's first order, which has no previous one
    GROUP BY user_id
    HAVING avg_interval > 30;
    
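8. Active / churned user analysis
  • Problem
    Label each user's monthly status (active = logged in during the month, churned = no login for 3 consecutive months).
    Table schema: user_logs(user_id, login_date)

  • Reference answer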

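    WITH months AS (
      -- One row per user per month with at least one login
      SELECT user_id,
             DATE_FORMAT(login_date, '%Y-%m-01') AS month_start,
             MAX(login_date) AS last_login
      FROM user_logs
      GROUP BY user_id, month_start
    ),
    status AS (
      SELECT m.user_id,
             m.month_start,
             m.last_login,
             LEAD(m.month_start) OVER (PARTITION BY m.user_id ORDER BY m.month_start) AS next_active_month
      FROM months m
    )
    SELECT user_id,
           DATE_FORMAT(month_start, '%Y-%m') AS month,
           -- 'churned' marks the last active month before a gap of at least 3 full months.
           -- Assumption: for a user's latest month, today is the reference date.
           CASE WHEN TIMESTAMPDIFF(MONTH, month_start, COALESCE(next_active_month, CURDATE())) > 3
                THEN 'churned'
                ELSE 'active'
           END AS status
    FROM status;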
    

V. Advanced Challenges

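9. Longest consecutive event chain
  • Problem
    Find each user's longest consecutive event chain (for example consecutive likes or comments, where the event type stays the same).
    Table schema: events(user_id, event_time, event_type)

  • Reference answer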

    WITH ranked_events AS (
      SELECT user_id,
             event_time,
             event_type,
             -- Gaps and islands: the difference between the two row numbers stays
             -- constant while the same event_type repeats without interruption.
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time)
               - ROW_NUMBER() OVER (PARTITION BY user_id, event_type ORDER BY event_time) AS grp
      FROM events
    ),
    event_groups AS (
      SELECT user_id,
             event_type,
             grp,
             COUNT(*) AS chain_length
      FROM ranked_events
      GROUP BY user_id, event_type, grp
    )
    SELECT user_id,
           event_type,
           MAX(chain_length) AS max_chain
    FROM event_groups
    GROUP BY user_id, event_type;
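
A plain-Python reference for the intended result is handy when testing this kind of query. A chain here is a run of events of one type with no other event type in between, ordered by time. The helper name `longest_chains` and the sample events are hypothetical, and the sketch assumes event times are unique per user.

```python
from datetime import datetime
from itertools import groupby


def longest_chains(events):
    """Return {(user_id, event_type): length of the longest uninterrupted run}
    for (user_id, event_time, event_type) rows."""
    best = {}
    rows = sorted(events, key=lambda e: (e[0], e[1]))
    # groupby splits the time-ordered stream wherever the user or the type changes.
    for (user_id, event_type), run in groupby(rows, key=lambda e: (e[0], e[2])):
        length = sum(1 for _ in run)
        key = (user_id, event_type)
        best[key] = max(best.get(key, 0), length)
    return best


# Illustrative data: like, like, comment, like, like, like
types = ["like", "like", "comment", "like", "like", "like"]
sample = [("u1", datetime(2024, 1, 1, 10, 0, s), t) for s, t in enumerate(types)]
print(longest_chains(sample))
```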
    
10. Session identification
  • Problem
    Group user actions into sessions (assume a session gap of 30 minutes).
    Table schema: actions(user_id, action_time, action_type)

  • Reference answer

    WITH time_diff AS (
      SELECT user_id,
             action_time,
             action_type,
             -- TIMESTAMPDIFF returns whole minutes (a gap of 30 min 59 s counts as 30)
             TIMESTAMPDIFF(MINUTE, LAG(action_time) OVER (PARTITION BY user_id ORDER BY action_time), action_time) AS minutes_since_last
      FROM actions
    ),
    session_markers AS (
      SELECT user_id,
             action_time,
             action_type,
             -- A new session starts at the user's first action or after a gap of more than 30 minutes
             IF(minutes_since_last > 30 OR minutes_since_last IS NULL, 1, 0) AS new_session
      FROM time_diff
    ),
    sessions AS (
      SELECT user_id,
             action_time,
             action_type,
             -- Running total of session starts = per-user session number
             SUM(new_session) OVER (PARTITION BY user_id ORDER BY action_time) AS session_id
      FROM session_markers
    )
    SELECT * FROM sessions;
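
The flag-then-running-sum idea can also be traced step by step in Python. The sketch below is illustrative only: `assign_sessions` and the sample actions are made up, and it compares exact time differences, so it can differ from the whole-minute `TIMESTAMPDIFF` result for gaps between 30 and 31 minutes.

```python
from datetime import datetime, timedelta


def assign_sessions(actions, gap=timedelta(minutes=30)):
    """Attach a per-user session number to (user_id, action_time, action_type) rows."""
    out = []
    last_seen = {}   # user_id -> time of that user's previous action
    session_no = {}  # user_id -> current session number
    for user_id, ts, action in sorted(actions, key=lambda a: (a[0], a[1])):
        prev = last_seen.get(user_id)
        # First action, or a pause longer than the gap, opens a new session.
        if prev is None or ts - prev > gap:
            session_no[user_id] = session_no.get(user_id, 0) + 1
        last_seen[user_id] = ts
        out.append((user_id, ts, action, session_no[user_id]))
    return out


# Illustrative data: a 31-minute pause before the third action.
sample = [
    ("u1", datetime(2024, 1, 1, 10, 0), "view"),
    ("u1", datetime(2024, 1, 1, 10, 20), "click"),
    ("u1", datetime(2024, 1, 1, 10, 51), "view"),
    ("u1", datetime(2024, 1, 1, 11, 21), "buy"),
]
for row in assign_sessions(sample):
    print(row)
```

The last action comes exactly 30 minutes after the previous one, so it stays in the same session. Only a gap strictly greater than 30 minutes starts a new one.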
    
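  1. Simulate data by hand first: create a test table, insert a small amount of data, and verify that the logic is correct.
  2. Compare different approaches: for consecutive-value problems, for example, try several implementations such as LEAD() and DATE_SUB + ROW_NUMBER.
  3. Watch the boundary conditions: handle NULLs, multiple records on the same day, and ranges that cross a month or year boundary.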
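
One lightweight way to follow the first and third tips is an in-memory SQLite database driven from Python. The sketch below is an assumption-laden illustration, not the article's method: it needs SQLite 3.25 or newer for window functions, and because SQLite has no `DATE_SUB`, the grouping key is built with SQLite's `date()` modifier instead. The test rows deliberately include a same-day duplicate, a streak that crosses a year boundary, and one that crosses the end of February.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logs (user_id TEXT, login_date TEXT)")
conn.executemany(
    "INSERT INTO user_logs VALUES (?, ?)",
    [
        ("u1", "2023-12-30"), ("u1", "2023-12-31"), ("u1", "2024-01-01"),
        ("u1", "2024-01-01"),  # duplicate record on the same day
        ("u2", "2024-02-28"), ("u2", "2024-02-29"), ("u2", "2024-03-01"),
        ("u2", "2024-03-05"),
    ],
)

# Streaks via ROW_NUMBER, written in SQLite's dialect (an adaptation for testing).
query = """
WITH dedup AS (
  SELECT DISTINCT user_id, login_date FROM user_logs
),
ranked AS (
  SELECT user_id, login_date,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
  FROM dedup
)
SELECT user_id, MIN(login_date), MAX(login_date), COUNT(*) AS days
FROM ranked
GROUP BY user_id, date(login_date, '-' || rn || ' days')
ORDER BY user_id, MIN(login_date)
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each printed row is one streak with its start date, end date, and length, which makes it easy to compare against a hand-computed expectation before running the real query on production data.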