FAQ
Greetings,

I'm new to Impala and I'm basically using it for learning purposes. I've
installed the VM and run some queries on Impala, but I have trouble
understanding the Impala architecture. I understand the concepts of the query
planner, coordinator, and execution engine, and their use as a replacement for
MapReduce, but I can't recognize their functions in a simple query. For
example, when I run an EXPLAIN query:

SELECT ss_item_sk, COUNT(ss_item_sk)
FROM store_sales
GROUP BY ss_item_sk
ORDER BY COUNT(ss_item_sk) DESC
LIMIT 10


PLAN FRAGMENT 0
   PARTITION: UNPARTITIONED

   6:TOP-N
      order by: <slot 2> DESC
      limit: 10
      tuple ids: 1
   5:EXCHANGE
      tuple ids: 1

PLAN FRAGMENT 1
   PARTITION: HASH_PARTITIONED: <slot 1>

   STREAM DATA SINK
     EXCHANGE ID: 5
     UNPARTITIONED

   2:TOP-N
      order by: <slot 2> DESC
      limit: 10
      tuple ids: 1
   4:AGGREGATE
      output: SUM(<slot 2>)
      group by: <slot 1>
      tuple ids: 1
   3:EXCHANGE
      tuple ids: 1

PLAN FRAGMENT 2
   PARTITION: RANDOM

   STREAM DATA SINK
     EXCHANGE ID: 3
     HASH_PARTITIONED: <slot 1>

   1:AGGREGATE
      output: COUNT(ss_item_sk)
      group by: ss_item_sk
      tuple ids: 1
   0:SCAN HDFS
      table=default.store_sales #partitions=1 size=370.45MB
      tuple ids: 0


So, my guess is: the query planner creates 3 plan phases. In the first phase
we scan HDFS for the data and return the counted and grouped data to the
second phase. In that phase, the data is limited to only 10 lines (rows), and
it's presented to the user in the third phase. But I can't understand the
terms (like partition, exchange id, hash_partitioned, etc.) or why we need so
many steps (0-6) to execute the query.

Thanks in advance
Xan


  • Alan Choi at May 13, 2013 at 9:24 pm
    Hi Xan,

    Yes, you got it right. We're still preparing the documentation on the
    EXPLAIN output, but let me clarify some of the terms here using your plan:

    PLAN FRAGMENT 0
       PARTITION: UNPARTITIONED *<-- PARTITION refers to the data partition of
    the plan fragment. UNPARTITIONED means that all data goes to a single node.*

       6:TOP-N
          order by: <slot 2> DESC
          limit: 10
          tuple ids: 1
       5:EXCHANGE
          tuple ids: 1

    PLAN FRAGMENT 1
       PARTITION: HASH_PARTITIONED: <slot 1> *<-- HASH_PARTITIONED here means
    that the data is partitioned by ss_item_sk (slot 1).*

       STREAM DATA SINK
         EXCHANGE ID: 5 *<-- This plan fragment's output goes to plan node
    5:EXCHANGE.*
         UNPARTITIONED *<-- The partition of the output stream. This output
    stream is unpartitioned, meaning that all data will go to a single node.
    Note that its destination, plan fragment 0, is also UNPARTITIONED.*

       2:TOP-N
          order by: <slot 2> DESC
          limit: 10
          tuple ids: 1
       4:AGGREGATE
          output: SUM(<slot 2>)
          group by: <slot 1>
          tuple ids: 1
       3:EXCHANGE
          tuple ids: 1

    PLAN FRAGMENT 2
       PARTITION: RANDOM *<-- Data in this plan fragment comes from scanning
    HDFS, so it is not partitioned in any particular way.*

       STREAM DATA SINK
         EXCHANGE ID: 3
         HASH_PARTITIONED: <slot 1>

       1:AGGREGATE
          output: COUNT(ss_item_sk)
          group by: ss_item_sk
          tuple ids: 1
       0:SCAN HDFS
          table=default.store_sales #partitions=1 size=370.45MB
          tuple ids: 0
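    The data flow through the three fragments can be sketched end to end. Below
    is an illustrative Python simulation (not Impala code; the function and
    variable names are invented) of the plan: per-node partial COUNTs, a
    hash-partitioned merge with SUM plus a local top-10, and a final
    coordinator top-10:

```python
from collections import Counter, defaultdict
import heapq

def run_query(node_rows, num_nodes=3, limit=10):
    # PLAN FRAGMENT 2 (PARTITION: RANDOM): each scanning node reads its HDFS
    # blocks (0:SCAN HDFS) and pre-aggregates locally
    # (1:AGGREGATE, COUNT(ss_item_sk)), then hash-partitions the partial
    # counts by key (STREAM DATA SINK, HASH_PARTITIONED: <slot 1>).
    inboxes = [[] for _ in range(num_nodes)]   # 3:EXCHANGE receive queues
    for rows in node_rows:                     # one list of ss_item_sk per node
        for key, cnt in Counter(rows).items():
            inboxes[hash(key) % num_nodes].append((key, cnt))

    # PLAN FRAGMENT 1 (HASH_PARTITIONED): each receiving node sums the partial
    # counts for its keys (4:AGGREGATE, output: SUM(<slot 2>)) and keeps only
    # its local top-N (2:TOP-N) before streaming to the coordinator
    # (5:EXCHANGE).
    candidates = []
    for inbox in inboxes:
        merged = defaultdict(int)
        for key, cnt in inbox:
            merged[key] += cnt
        candidates += heapq.nlargest(limit, merged.items(), key=lambda kv: kv[1])

    # PLAN FRAGMENT 0 (UNPARTITIONED): the coordinator merges the per-node
    # top-N lists into the final ORDER BY COUNT(...) DESC LIMIT 10 (6:TOP-N).
    return heapq.nlargest(limit, candidates, key=lambda kv: kv[1])
```

    Running top-N twice is safe because, after the hash exchange, each key's
    global count is fully known to exactly one node, so the true top 10 always
    survives the local cut.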

  • Xan McGregor at May 15, 2013 at 4:38 pm
    Now I understand the architecture much better. I hope the upcoming
    documentation on the EXPLAIN command will help everyone with the same questions.

    Thank you Alan.

    Xan.


Discussion Overview
group: impala-user
categories: hadoop
posted: May 11, '13 at 6:52p
active: May 15, '13 at 4:38p
posts: 3
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion
Xan McGregor: 2 posts
Alan Choi: 1 post


site design / logo © 2022 Grokbase