In your case the total file size isn't the main factor reducing performance;
the number of files is.
To test this, try merging those 2000+ files into one (or a few) big files,
then upload the result to HDFS and test Hive performance (it should be
noticeably better). If this works, you should think about merging those
files before or after loading them into HDFS.
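
As a rough illustration of the "merge before loading" option, here is a minimal Python sketch that concatenates many small CSV files into one local file. The function name, the paths, and the "*.csv" pattern are hypothetical, and it assumes the files have no header rows and share the same column layout.

```python
import glob
import os


def merge_csv_files(input_dir, output_path, pattern="*.csv"):
    """Concatenate all files matching `pattern` in `input_dir` into `output_path`.

    Assumes the CSV files have no header row and the same columns.
    Returns the number of files merged.
    """
    out_abs = os.path.abspath(output_path)
    # Collect inputs first and skip the output file in case it lives in input_dir.
    paths = [p for p in sorted(glob.glob(os.path.join(input_dir, pattern)))
             if os.path.abspath(p) != out_abs]
    with open(output_path, "wb") as out:
        for path in paths:
            last = b"\n"
            with open(path, "rb") as src:
                # Copy in 1 MB chunks so large inputs are not held in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # Terminate the final row if the file did not end with a newline,
            # so rows from consecutive files do not run together.
            if last != b"\n":
                out.write(b"\n")
    return len(paths)
```

The merged file could then be loaded with a single 'LOAD DATA' statement instead of 2624 separate ones.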
The second issue is the counts. Try to observe how your jobs use mappers and
reducers; my experience is that simple count() jobs can be stuck on
one reducer (the one that does all the counting) for a long time. I have not
resolved this issue, but it was not significant in my case.
Setting mapred.reduce.tasks=xyz doesn't change that behavior, but, for
example, using GROUP BY with COUNT works much faster.
I hope this helps.
On 06.12.2011 12:00, Savant, Keshav wrote:
My setup is as follows:
I have a 5-node cluster in total: 4 data nodes and 1 namenode (which
also acts as the secondary namenode). On the namenode I have set up Hive in
HiveDerbyServerMode to support multiple Hive server connections.
I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
statements; the total number of files is 2624 and their combined size
is only 713 MB, which is very little from the Hadoop perspective, given
that it can handle TBs of data very easily.
The problem is that when I run a simple count query (i.e. select count(*)
from a_table), it takes far too long to execute.
For instance, it takes almost 17 minutes to execute this query when the
table has 950,000 rows. I understand that this is far too long for
a query over such a small amount of data.
This is only a dev environment; in the production environment the number
of files and their combined size will run into millions and GBs.
On analyzing the logs on all the datanodes and namenode/secondary
namenode I do not find any error in them.
I have also tried setting mapred.reduce.tasks to a fixed number, but the
number of reducers always remains 1, while the number of maps is determined by
Any suggestion what I am doing wrong, or how can I improve the
performance of hive queries? Any suggestion or pointer is highly