java-hadoop-hive-user
Re: Hive query taking too much time

Subject: Re: Hive query taking too much time
Date: Tue, 06 Dec 2011 13:51:31 +0100
Hi,
In your case the total file size isn't the main factor reducing performance; the number of files is.
To test this, try merging those 2000+ files into one (or a few) big files,
then upload the result to HDFS and test Hive performance (it should be
noticeably better). If this works, you should think about merging those
files before or after loading them into HDFS.
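
As a rough illustration of the "merge before loading" option, here is a minimal sketch that concatenates many small CSV files into one local file. It assumes the files have no header row and share the same column layout; the function name merge_csv_files and its parameters are made up for this example and are not part of Hive or Hadoop.

```python
import glob
import os
import shutil


def merge_csv_files(src_dir, dest_path, pattern="*.csv"):
    """Concatenate every file matching `pattern` in `src_dir` into `dest_path`.

    Assumption: the files are header-less CSVs with identical columns.
    `dest_path` should live outside `src_dir` so it is never picked up as input.
    Returns the number of files merged.
    """
    paths = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with open(dest_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
                # If the file did not end with a newline, add one so its last
                # row does not run into the first row of the next file.
                if src.tell() > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
    return len(paths)
```

The merged file can then be uploaded once (for example with hadoop fs -put, or with a single LOAD DATA LOCAL INPATH statement) instead of 2000+ times, so the count query has far fewer input files to open.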
The second issue is counts. Try to observe how your jobs use mappers and
reducers; my experience is that simple count() jobs may be stuck on
one reducer (the one that does all the counting) for a long time. I have not
resolved this issue, but it was not significant in my case.
Setting mapred.reduce.tasks=xyz doesn't change that behavior, but, for
example, using GROUP BY with COUNT works much faster.
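
One plausible way to picture this is a toy model of how keys are spread over reducers. This is only a sketch and not Hive's actual partitioner: it assumes a reducer is picked by hashing the grouping key, and the key names and column values below are hypothetical.

```python
import zlib


def reducers_used(keys, num_reducers):
    """Return the set of reducer indices that receive at least one key.

    Toy model: reducer = crc32(key) mod num_reducers. Real partitioners
    differ, but the effect of having a single key is the same.
    """
    return {zlib.crc32(k.encode("utf-8")) % num_reducers for k in keys}


# A global count(*) has one implicit group, modeled here as one constant key.
global_count_keys = ["<all rows>"] * 1000

# A GROUP BY over a hypothetical column with 50 distinct values.
group_by_keys = ["value-%d" % (i % 50) for i in range(1000)]

print(len(reducers_used(global_count_keys, 8)))  # a single reducer gets everything
print(len(reducers_used(group_by_keys, 8)))      # work is spread over several reducers
```

In this model, raising the number of reducers cannot help the global count, because there is only one key to route; a grouped count has many keys and can use the extra reducers.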
I hope this helps.
--
Wojciech Langiewicz
On 06.12.2011 12:00, Savant, Keshav wrote:
Hi All,
My setup is
hadoop-0.20.203.0
hive-0.7.1
I have a 5-node cluster in total: 4 data nodes and 1 namenode (which
also acts as the secondary namenode). On the namenode I have set up Hive with
HiveDerbyServerMode to support multiple Hive server connections.
I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
statements. The total number of files is 2624 and their combined size
is only 713 MB, which is very little from the perspective of Hadoop, which
can handle TBs of data very easily.
The problem is that when I run a simple count query (i.e. select count(*)
from a_table), it takes too long to execute.
For instance, it takes almost 17 minutes to execute that query when the
table has 950,000 rows. I understand that this is far too long for
a query over such a small amount of data.
This is only a dev environment; in the production environment the number
of files and their combined size will move into the millions and GBs
respectively.
On analyzing the logs on all the datanodes and the namenode/secondary
namenode, I do not find any errors in them.
I have also tried setting mapred.reduce.tasks to a fixed number, but the
number of reducers always remains 1, while the number of maps is determined
by Hive alone.
Any suggestions as to what I am doing wrong, or how I can improve the
performance of Hive queries? Any suggestion or pointer is highly
appreciated.
Keshav
