Hive Tuning

Preface

This post covers some of the performance problems we commonly run into in day-to-day Hive queries.

Map join

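Because of how Hive executes a join (the earlier tables are buffered and the last table is streamed through the reducers), putting the small table on the left side of the join effectively reduces the amount of data that has to be held, which makes it a good basic habit. If we also enable map join, the small table is broadcast to every node and the join is done on the map side, which eliminates the shuffle.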

set hive.auto.convert.join = true;              -- enable automatic conversion to map join
set hive.mapjoin.smalltable.filesize = 2500000; -- size threshold (in bytes) for the small table that gets broadcast
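As a rough sketch of how these settings are used, the session below joins a large fact table to a small dimension table and inspects the plan. The table and column names (`orders`, `dim_city`, `order_id`, `city_id`, `city_name`) are hypothetical and only serve as an illustration. Note that the threshold is given in bytes; Hive's default for `hive.mapjoin.smalltable.filesize` is 25000000.

```sql
-- Hypothetical tables: orders (large fact table), dim_city (small dimension table).
set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize = 25000000;  -- bytes; this is the Hive default

-- If dim_city is below the threshold, the plan should show a
-- "Map Join Operator" instead of a reduce-side join.
explain
select o.order_id, c.city_name
from dim_city c
join orders o
  on o.city_id = c.city_id;
```

Checking the `explain` output before running the real query is a cheap way to confirm that the conversion actually happened.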

distinct

Deduplicating one or more columns

Hive implements distinct through group by, so the two queries below produce exactly the same execution plan:

select distinct a,b,func(c) as tt from xxx ;
select a,b,func(c) as tt from xxx group by a,b,func(c) ;

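If we can be sure that func() is injective (one-to-one, so different inputs always give different outputs), the query can equivalently be written as:

select a,b,func(c) as tt from xxx group by a,b,c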
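To make the condition concrete, here is a small sketch on a hypothetical table `t(a, b, c)` with a string column `c` (the names are assumptions for illustration). Prefixing a string is injective, so grouping by `c` is safe; `upper()` is not, because 'x' and 'X' map to the same value.

```sql
-- Injective: distinct values of c always give distinct values of concat('id_', c),
-- so grouping by c returns the same rows as grouping by the expression.
select a, b, concat('id_', c) as tt
from t
group by a, b, c;

-- Not injective: 'x' and 'X' both become 'X', so grouping by c could return
-- duplicate (a, b, tt) rows. Keep the expression in the group by here.
select a, b, upper(c) as tt
from t
group by a, b, upper(c);
```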

distinct inside an aggregate function

select dt,count(distinct userid) as uv 
from xxx
group by dt

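Written this way, all rows for a given dt end up in a single reducer, which has to deduplicate every userid on its own, so the query takes a long time to run. It can be optimized by deduplicating first and counting afterwards: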

-- count(userid) skips a NULL userid, matching count(distinct userid)
select dt,count(userid) as uv
from(
    select  distinct dt,userid
    from xxx 
    ) final
group by dt

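This spreads the first stage (the deduplication) across multiple reducers. In real workloads, however, most jobs compute many metrics over many dimensions, which is computationally heavy and often leads to problems such as data skew.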
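For the skew problem mentioned above, one commonly used knob is Hive's built-in skew handling for group by. The following is a minimal sketch under that assumption, not a recommendation for every workload; `xxx`, `dt`, and `userid` are the placeholders used earlier in this post, and `pv` is a hypothetical metric name.

```sql
set hive.map.aggr = true;            -- partial aggregation on the map side
set hive.groupby.skewindata = true;  -- two jobs: rows are first spread randomly across
                                     -- reducers for partial aggregation, then merged by key

select dt, userid, count(1) as pv
from xxx
group by dt, userid;
```

As far as we know, Hive rejects queries that apply distinct to several different columns while `hive.groupby.skewindata` is enabled, so manual rewrites of such queries can still be necessary.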

count(distinct) with different conditions

select dt 
, count(distinct userid) as tt 
, count(distinct if(a rlike 'xxx',userid,null)) as u1
, count(distinct if(b > xxx ,userid, null)) as u2
, count(distinct if(c = xxx , userid , null)) as u3
from xxx
group by dt

This can be solved by tagging each user with one flag per condition:

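select dt
, count(userid) as tt
, count(if(tag0=1,userid,null)) as u1
, count(if(tag1=1,userid,null)) as u2
, count(if(tag2=1,userid,null)) as u3
from(
    -- one row per (dt, userid); a flag is 1 if any of the user's rows matches the condition
    select dt
        , userid
        , max(if(a rlike 'xxx',1,0)) as tag0
        , max(if(b > xxx ,1,0)) as tag1
        , max(if(c = xxx ,1,0)) as tag2
    from xxx
    group by dt, userid
) t
group by dt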

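Multi-dimensional aggregation (group by xx, xx with cube)

with cube aggregates over every combination of the grouping columns (2^n combinations for n columns). Instead, maintain the grouping sets by hand and list only the combinations that are actually needed: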

group by semesterid , grade_type , gradeid , courseid  , seasonid , seasonname, teamid
grouping sets( (semesterid , grade_type , gradeid , courseid  , seasonid , seasonname, teamid )
  , (semesterid , grade_type , gradeid , courseid  , seasonid , seasonname)
  )
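As a small illustration of why this helps, `with cube` over two columns expands to four grouping sets, while a hand-written list keeps only the ones that are needed. The table `t` and the columns `a`, `b`, and `userid` are hypothetical.

```sql
-- "group by a, b with cube" is equivalent to listing all four combinations:
select a, b, count(userid) as cnt
from t
group by a, b
grouping sets ((a, b), (a), (b), ());

-- Keeping only the two levels that are actually reported:
select a, b, count(userid) as cnt
from t
group by a, b
grouping sets ((a, b), (a));
```

With seven grouping columns, as in the fragment above, a full cube would produce 128 combinations, so trimming the list can save a lot of work.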

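distinct aggregation over different columns

count(distinct userid)
, count(distinct deviceid)

This can be split into two queries that are computed separately (at the cost of loading the data twice) and then joined back together.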
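A minimal sketch of that split, assuming the source table is the `xxx` placeholder used throughout this post and that `dt` is never NULL (for example, a partition column). The aliases `uv`, `dv`, `t1`, `t2`, `u`, and `d` are names chosen here for illustration.

```sql
select u.dt, u.uv, d.dv
from (
    -- distinct users per day: deduplicate first, then count
    select dt, count(userid) as uv
    from (select dt, userid from xxx group by dt, userid) t1
    group by dt
) u
join (
    -- distinct devices per day, computed the same way
    select dt, count(deviceid) as dv
    from (select dt, deviceid from xxx group by dt, deviceid) t2
    group by dt
) d
  on u.dt = d.dt;
```

Each branch only has to deduplicate one column, so the work can be spread across reducers in the same way as the single-column rewrite shown earlier.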