Hi,

Even the simplest Hive queries seem to consume 100% CPU. This is on a
small 4-node cluster. The machines are pretty beefy (16 cores per
machine, tons of RAM, 16 M+R maximum tasks configured, 1GB RAM for
mapred.child.java.opts, etc). One example is a simple query like
"select count(1) from events", where the events table has daily
partitions of log files in gzipped format. While this is probably too
generic a question and there is a bunch of investigation we need to do,
are there any specific areas for me to look at? Has anyone seen
anything like this before? Also, are there any tools or easy options
to profile Hive query execution?

Thanks in advance,
Vijay

  • Viral Bajaria at Feb 3, 2011 at 10:32 pm
    Hey Vijay,

    You can go to the MapReduce (JobTracker) UI, which normally runs on port
    50030 of the JobTracker host (often the same machine as the namenode), and
    see how many map tasks were created for your submitted query.

    You said that the events table has daily partitions, but your example
    query does not prune partitions with a WHERE clause. So I have the
    following questions:
    1) How big is the table (you can just run hadoop dfs -dus
    <hdfs-dir-for-table>)? How many partitions does it have?
    2) Do you really intend to count the number of events across all days?
    3) Could you build a query that computes over 1-5 day(s) and persists the
    data in a separate table for consumption later on?
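
A minimal sketch of what question 3 could look like, written as a small Python helper that builds the HiveQL string. Everything beyond the `events` table name is an assumption made for illustration: the partition column `dt`, its ISO date string values, and the target table `daily_event_counts` do not appear in the thread.

```python
from datetime import date


def build_daily_count_query(start: date, end: date,
                            source: str = "events",
                            target: str = "daily_event_counts",
                            partition_col: str = "dt") -> str:
    """Build a HiveQL statement that counts rows per day over a bounded
    partition range and writes the result to a separate summary table.

    Hypothetical names: `daily_event_counts` and `dt` are placeholders,
    not taken from the original posts.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    return (
        f"INSERT OVERWRITE TABLE {target} "
        f"SELECT {partition_col}, COUNT(1) FROM {source} "
        # Filtering on the partition column lets Hive skip other partitions.
        f"WHERE {partition_col} >= '{start.isoformat()}' "
        f"AND {partition_col} <= '{end.isoformat()}' "
        f"GROUP BY {partition_col}"
    )


# Example: a five-day window, matching the "1-5 day(s)" suggestion.
print(build_daily_count_query(date(2011, 1, 30), date(2011, 2, 3)))
```

The printed statement could then be passed to Hive, for example through the CLI, once the placeholder names are replaced with the real ones.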

    Based on your node configuration, I am just guessing that the amount of
    data to process is too large, hence the high CPU.

    Thanks,
    Viral
  • Vijay at Feb 3, 2011 at 11:50 pm
    Sorry, I should've given more details.

    The query was limited by a partition range; I just omitted the WHERE
    clause in the mail.
    The table is not that big. For each day, there is one gzipped file.
    The largest file is about 250MB (close to 2GB uncompressed).
    I did intend to count; that was just a test, since I wanted to run
    a query that did the most minimal logic/processing.

    Here's a test I just ran. The query gets count(1) for 8 days. It
    spawned 8 maps as expected. The maps ran for anywhere from 42 to 69
    seconds (which may or may not be right; I need to check that). It
    spawned only one reduce task. The reducer ran for 117 seconds, which
    seems long for this query.
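
As a rough sanity check on the map times reported above, the per-map throughput can be estimated from the figures in the post. This is only a sketch under stated assumptions: it treats every file as being as large as the biggest one (about 250MB gzipped, close to 2GB uncompressed), which overstates the work for smaller days, and it assumes each gzipped file is read by a single map task, consistent with 8 daily files yielding 8 maps. The function name is illustrative.

```python
# Rough per-map throughput from the numbers in the post above.
# Assumption: each map reads one gzipped file of the largest reported size.


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Return megabytes processed per second for one map task."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


COMPRESSED_MB = 250.0     # largest gzipped file, per the post
UNCOMPRESSED_MB = 2048.0  # "close to 2GB" uncompressed, used as an upper bound

for seconds in (42, 69):  # fastest and slowest map durations reported
    compressed = throughput_mb_per_s(COMPRESSED_MB, seconds)
    uncompressed = throughput_mb_per_s(UNCOMPRESSED_MB, seconds)
    print(f"{seconds}s map: {compressed:.1f} MB/s compressed, "
          f"{uncompressed:.1f} MB/s uncompressed")
```

Timing a plain decompression of one of the files on a cluster node would give a baseline to compare these estimates against.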

Discussion Overview
group: user @
categories: hive, hadoop
posted: Feb 3, '11 at 8:49p
active: Feb 3, '11 at 11:50p
posts: 3
users: 2
website: hive.apache.org

2 users in discussion

Vijay: 2 posts Viral Bajaria: 1 post
