Hive PredicatePushDown 谓词下推规则的计算逻辑
1. PredicatePushDown 谓词下推
谓词下推的处理顺序是先处理子节点的操作,子节点都处理完,然后处理父节点。
select web_site_sk from (select web_site_sk,web_name from web_site where web_city='Pleasant Hill' )
t
where web_name <> 'site_1';
这个sql 生成的操作树是
TS[0]](web_site)||\/
Filter[1](web_city='Pleasant Hill')||\/
SELECT[2](web_site_sk,web_name)||\/
Filter[3](web_name <> 'site_1')||\/
SELECT[4](web_site_sk)||\/
FileSink[5](web_site_sk)
从 FileSink 开始,FileSink 使用 DefaultPPD,什么操作也没有。
SELECT 也使用DefaultPPD
Filter 使用 FilterPPD, 把条件 web_name <> ‘site_1’ 下推,并把Filter 对象放到 candidateFilterOps 中。
SELECT(web_site_sk,web_name) 把 predicate 传递下去。
Filter(web_city=‘Pleasant Hill’) 把当前的 predicate 和传递过来的 predicate 合并, 把当前 Filter 对象放到 candidateFilterOps。
TableScan 创建新的 filter,并把 candidateFilterOps 的 filter 删除。
优化后变为
TS[0](web_site)||\/
Filter[6](web_city='Pleasant Hill' and web_name <> 'site_1')||\/
SELECT[2](web_site_sk,web_name)||\/
SELECT[4](web_site_sk)||\/
FileSink(web_site_sk)
谓词下推是最常见和最有效的SQL优化方式。
以最常见的下推说明。
FilterPPD
从 where 的表达式中,决定哪些可以下推,并且和子节点下推过来的过滤条件合并。
select web_site_sk from (select web_site_sk,web_name from web_site where web_city='Pleasant Hill' )
t
where web_name <> 'site_1';
JoinPDD
基于 ANSI SQL 语法规则。Join 操作是从左往右计算。“a RIGHT OUTER JOIN b LEFT OUTER JOIN c INNER JOIN d” 解释为 “((a RIGHT OUTER JOIN b) LEFT OUTER JOIN c) INNER JOIN d”。
对于 inner join, 关联的两边都可以下推。
对于 right outer join,右边的可以下推。
对于 left outer join,左边的可以下推。
对于 full outer join,两边都不可以下推
以left join 解释为什么右表的条件不可以下推。
t1 表有字段 key,记录为
key1
key2
t2 表有字段 key,记录为
key1
两张表先左关联的结果是:
select t1.key,t2.key left join t2 on t1.key = t2.key;
t1.key t2.key
key1 key1
key2 null
因为 left outer join,左表关联不上的记录,也在结果里。
对于
select t1.key,t2.key
from t1
left join t2 on t1.key = t2.key
where t1.key='key1';
关联后过滤 t1.key='key1'
后的结果是
t1.key t2.key
key1 key1
所以这条语句等价于
select t1.key,t2.key
from (select key from t1 where t1.key='key1') t1
left join t2 on t1.key = t2.key;
但是 left outer join,右表关联不上的记录,不在最终结果里
如以下语句,
select t1.key,t2.key
from t1 left join t2
on t1.key = t2.key
where t2.key='key1';
不等价于
select t1.key,t2.key
from t1 left join
(select key from t2 where t2.key='key1') t2
on t1.key = t2.key
;
先关联后过滤的结果是
t1.key t2.key
key1 key1
先过滤后关联
t2 过滤后的结果是 key1。
和 t1 关联后的结果是。
t1.key t2.key
key1 key1
key2 null
结论,右表过滤后再关联,左边表的记录都存在。如果先关联再过滤,左表的记录可能会过滤掉。所以左关联右表的判断条件不能下推。
TableScan PPD
创建新的 filter,并把 candidateFilterOps 的 filter 删除。
Partition Table Function PPD(PTFPdd)
explain select * from (
select web_city,web_site_sk, rank() over (partition by web_city order by web_site_sk asc) rk
from web_site)t
where rk <= 2;
生成的执行计划如下:
把 rank_window_0 <= 2
推到PTF Operator
之后。
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0Fetch Operatorlimit:-1Stage-1Reducer 2File Output Operator [FS_8]Select Operator [SEL_4] (rows=10 width=375)Output:["_col0","_col1","_col2"]Filter Operator [FIL_9] (rows=10 width=375)predicate:(rank_window_0 <= 2)PTF Operator [PTF_3] (rows=32 width=375)Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col0 ASC NULLS FIRST","partition by:":"_col19"}]Select Operator [SEL_2] (rows=32 width=375)Output:["_col0","_col19"]<-Map 1 [SIMPLE_EDGE]SHUFFLE [RS_1]PartitionCols:web_cityTableScan [TS_0] (rows=32 width=99)tpcds_hdfs_orc_3@web_site,web_site,Tbl:COMPLETE,Col:COMPLETE,Output:["web_city","web_site_sk"]
不能下推的情形:
Limit 不能下推
select web_city, count(1) from web_site group by web_city limit 2;