Hive: inserting into a table is extremely slow

Published: 2020/11/19 04:11

Hive version: 2.1.1, Spark version: 1.6.0

Over the past few days I noticed an insert overwrite ... partition job running very slowly. It was using the Hive on Spark engine, which is normally much faster than MapReduce, yet this time it seemed several times slower: after more than an hour it still had not finished.

I pulled the SQL out and ran it manually with hive -f file.sql, and saw that the Spark stage status stayed at 0 with essentially no progress, as shown in List-1.

List-1

[xx@xxxx xx]# hive -f sql.sql
...
Query ID = root_20200807155008_80726145-e8f2-4f4e-8222-94083907a70c
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Spark Job = d5e51d11-0254-49e3-93c7-f1380a89b3d5
Running with YARN Application = application_1593752968338_0506
Kill Command = /usr/local/hadoop/bin/yarn application -kill application_1593752968338_0506
Query Hive on Spark job[0] stages: 0
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2020-08-07 15:50:47,501 Stage-0_0: 0(+2)/3
2020-08-07 15:50:50,530 Stage-0_0: 0(+2)/3
2020-08-07 15:50:53,555 Stage-0_0: 0(+2)/3
2020-08-07 15:50:56,582 Stage-0_0: 0(+2)/3
2020-08-07 15:50:57,590 Stage-0_0: 0(+3)/3
2020-08-07 15:51:00,620 Stage-0_0: 0(+3)/3
2020-08-07 15:51:03,641 Stage-0_0: 0(+3)/3
2020-08-07 15:51:06,662 Stage-0_0: 0(+3)/3
2020-08-07 15:51:09,680 Stage-0_0: 0(+3)/3
2020-08-07 15:51:12,700 Stage-0_0: 0(+3)/3
...

After more than an hour it was still in that state. Something was clearly wrong, so I searched for the problem right away; other people had run into it too, but no good solution turned up.

As a temporary measure I set MapReduce as the execution engine for this job with set hive.execution.engine=mr, instead of Spark; that got rid of the permanent hang, as in the sketch below.
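A minimal sketch of that session-scoped switch (only the set statement comes from the post; the commented-out query stands in for the real one):

    -- bypass the hanging Hive on Spark engine for this session only
    set hive.execution.engine=mr;
    -- ... the original INSERT OVERWRITE ... PARTITION (...) SELECT ... goes here ...
    -- optionally switch back once the job is done:
    set hive.execution.engine=spark;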

After that, Hive failed with another error, saying the maximum number of dynamic partitions per node had been exceeded, as in List-2.

List-2

[error output missing in the original post; Hive's standard message here reports that a node tried to create too many dynamic partitions, governed by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode]

So I also raised the partitions and partitions.pernode limits, as in List-3.

List-3

set hive.execution.engine=mr;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.max.dynamic.partitions=100000;
...
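For context, a minimal sketch of the kind of dynamic-partition insert that these limits gate; the table and column names are hypothetical, not from the post:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    -- hypothetical tables: events_src(event, load_id, dt) -> events_by_day partitioned by dt;
    -- the dynamic partition column (dt) must come last in the SELECT
    INSERT OVERWRITE TABLE events_by_day PARTITION (dt)
    SELECT event, load_id, dt FROM events_src;
    -- every distinct dt value becomes a partition; when one node has to create
    -- more than hive.exec.max.dynamic.partitions.pernode of them, the List-2 error fires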

Googling this problem turned it up in a Spark JIRA issue: it is a bug, and it was fixed in a later release.

That settled it for now, but MapReduce is still slow. The only real options are to upgrade the Hive/Spark versions or to patch the Spark source myself, so MapReduce serves as a stopgap.

Hive's INSERT is very slow

I have a table, stop_logs, in Hive. When I run an insert query over about 6,000 rows it takes 300 seconds, while if I run just the SELECT query it completes in 6 seconds. Why does the insert take so much time?

CREATE TABLE stop_logs (event STRING, loadId STRING) STORED AS SEQUENCEFILE;

The following takes 300 seconds:

INSERT INTO TABLE stop_logs SELECT i.event, i.loadId FROM event_logs i WHERE i.stopId IS NOT NULL;

The following query takes 6 seconds:

SELECT i.event, i.loadId FROM event_logs i WHERE i.stopId IS NOT NULL;

1 Answer:

Answer 0 (score: 3)

First, you need to understand how Hive processes your query:

When you execute select * from <tablename>, Hive fetches the entire data set straight from the files as a FetchTask rather than a MapReduce job; it simply dumps the data as-is without doing anything to it, much like hadoop dfs -text. Since no MapReduce job is run, it is much faster.
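As a side note (an assumption about configuration, not something the answer states): whether Hive converts a simple query into a FetchTask is controlled by hive.fetch.task.conversion, which can be inspected and changed per session:

    -- print the current value
    set hive.fetch.task.conversion;
    -- allowed values: none, minimal, more
    set hive.fetch.task.conversion=more;
    -- with 'more', simple SELECTs with projection, filtering and LIMIT
    -- can also bypass MapReduce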

When you use select a, b from <tablename>, Hive needs a MapReduce job, because it has to extract the "columns" from each row by parsing the file it loads.

When you use insert into table stop_logs select a, b from event_logs, the select statement runs first, triggering a MapReduce job because it must parse the "columns" out of each row of the loaded files; then, to insert into the other table (stop_logs), Hive launches another MapReduce task that writes the values into columns a and b of stop_logs, producing one new row per input row.

Another reason for the slowness: if hive.typecheck.on.insert is set to true, every value is validated, converted, and normalized to conform to its column type (Hive 0.12 onwards) when inserted into the table, which makes the insert run slower than the plain select.
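To see whether that setting is in effect, a quick sketch (in the Hive CLI, set with no value just echoes the current one):

    -- print the current value (defaults to true)
    set hive.typecheck.on.insert;
    -- skipping the check trades safety for insert speed
    set hive.typecheck.on.insert=false;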

I'm working on Hive with Spark as the execution engine. Tables are parquet uncompressed.

This statement returns data in a few seconds:

select * from mydb.src_table limit 100;

But when I do the following, the insert statement is extremely slow:

create table mydb.dest_table like mydb.src_table;
insert into mydb.dest_table select * from mydb.src_table limit 100;

I killed the insert query after 10 minutes. src_table is pretty big (2+ billion rows, several columns containing a lot of text), but I'm only getting 100 rows. I just don't understand how the select ... limit is so fast but the insert ... select ... limit is so slow.

The EXPLAIN for the select shows 1 stage. But for the insert it shows no less than 8 stages - what's going on?
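To reproduce that comparison, a sketch using the same two statements from the question:

    EXPLAIN select * from mydb.src_table limit 100;
    -- typically a single fetch stage with a limit
    EXPLAIN insert into mydb.dest_table select * from mydb.src_table limit 100;
    -- adds the distributed scan and global-limit work, plus the file-sink,
    -- move and stats stages that actually write into dest_table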

Any ideas?


Original article: http://www.directapkdownloader.com/d/2020101943122_6882_4254562322/home