Hi all,

Just F.Y.I.

I evaluated Impala GA in our environment to see the performance of the Parquet
columnar format.
I posted the results on SlideShare.

http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-ga

Thanks,

suda

  • Jung-Yup Lee at May 1, 2013 at 10:03 pm
    Hi Suda,

    Thank you for your great work!

    Could you share more information about join types used in your test?

    If a broadcast join type was used in your additional experiments for
    testing the effect of join order, how about changing the join type from
    broadcast to partitioned join?

    I am curious about the reason for the performance degradation in your
    additional experiments.

    Thank you,

    Jung-Yup
  • Yukinori SUDA at May 2, 2013 at 10:04 am
    Hi Jung-Yup,

    Thank you for your comment.

    What query do I need to run in order to change the join type from
    broadcast to partitioned join?

    Thanks,

    suda

  • Jung-Yup Lee at May 2, 2013 at 9:49 pm
    > What query do I need to run in order to change the join type from
    > broadcast to partitioned join?

    The query on page 10 in your slides:

    SELECT
      sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
    FROM (SELECT sourceIP, destURL, adRevenue FROM uservisits_ps UV
          WHERE (datediff(UV.visitDate, '1999-01-01') >= 0 AND
                 datediff(UV.visitDate, '2000-01-01') <= 0)) NUV
    JOIN rankings_ps R ON (R.pageURL = NUV.destURL)
    GROUP BY sourceIP
    ORDER BY totalRevenue DESC
    LIMIT 1;

    How about adding a join hint to this query so that it uses a repartitioned
    (shuffle) join? How much does the latency differ between the query above
    and the hinted query?
    The hint looks like this: "... NUV JOIN [SHUFFLE] rankings_ps ..."
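
To compare the two join strategies side by side, the same query can be generated once per hint. The snippet below is only a sketch: `hinted_query` and `BASE_QUERY` are hypothetical names made up for illustration, and the table and column names are taken from the slide query quoted above. It only builds the SQL text; running it against Impala is left to whatever client you use.

```python
# Hypothetical helper for illustration; not part of Impala or any client library.
BASE_QUERY = """
SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM (SELECT sourceIP, destURL, adRevenue
      FROM uservisits_ps UV
      WHERE datediff(UV.visitDate, '1999-01-01') >= 0
        AND datediff(UV.visitDate, '2000-01-01') <= 0) NUV
JOIN {hint} rankings_ps R ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""

# Impala join hints as written in this thread: [SHUFFLE] requests a
# partitioned join, [BROADCAST] requests a broadcast join.
HINTS = {"shuffle": "[SHUFFLE]", "broadcast": "[BROADCAST]"}


def hinted_query(strategy):
    """Return the page-10 query text with the requested join hint inserted."""
    if strategy not in HINTS:
        raise ValueError(f"unknown join strategy: {strategy!r}")
    return BASE_QUERY.format(hint=HINTS[strategy]).strip()


if __name__ == "__main__":
    for name in ("shuffle", "broadcast"):
        print(f"-- {name}")
        print(hinted_query(name))
        print()
```

Timing both generated statements on the same data should show whether the join strategy alone explains the difference.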

    I am just curious about the reason for the performance degradation.
    Sorry for bothering you.
  • Yukinori SUDA at May 3, 2013 at 1:50 pm
    Hi Jung-Yup,

    Thank you for your explanation.

    I now understand joins in Impala better.
    For A JOIN B, I will use a partitioned join by default.
    If B is much smaller than A, I will try a broadcast join instead.

    Best regards,

    suda

    2013/5/3 Jung-Yup Lee <ljykjh@gmail.com>
    On Friday, May 3, 2013 11:29:52 AM UTC+9, Suda Yukinori wrote:

    Hi Jung-Yup and Alex,

    Thank you for your reply.

    I ran the queries and confirmed from impalad.INFO that the join type was
    correct.
    The results are below.
    I restarted the impalad process between each test.

    A. query on slide p.7 using partitioned join

    1st: Returned 1 row(s) in 12.44s
    2nd: Returned 1 row(s) in 10.74s
    3rd: Returned 1 row(s) in 10.74s
    4th: Returned 1 row(s) in 11.30s
    5th: Returned 1 row(s) in 11.51s

    B. query on slide p.10 using partitioned join

    1st: Returned 1 row(s) in 12.69s
    2nd: Returned 1 row(s) in 10.16s
    3rd: Returned 1 row(s) in 12.87s
    4th: Returned 1 row(s) in 10.09s
    5th: Returned 1 row(s) in 10.19s

    According to your results (B), the performance degradation on page 10
    seems to be caused by the high communication cost of broadcasting the
    RANKINGS_PS table to all nodes.
    I think the query planner in Impala may not yet calculate table
    cardinality accurately.
    Thank you for your help.

    C. query on slide p.7 using broadcast join

    1st: Returned 1 row(s) in 12.82s
    2nd: Returned 1 row(s) in 11.16s
    3rd: Returned 1 row(s) in 11.17s
    4th: Returned 1 row(s) in 11.17s
    5th: Returned 1 row(s) in 11.17s

    D. query on slide p.10 using broadcast join

    1st: Returned 1 row(s) in 34.82s
    2nd: Returned 1 row(s) in 27.21s
    3rd: Returned 1 row(s) in 26.73s
    4th: Returned 1 row(s) in 25.76s
    5th: Returned 1 row(s) in 26.71s
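
The four result sets above are easier to compare as averages. The snippet below is a small sketch that only redoes the arithmetic on the timings reported in this message; the labels and the `summarize` helper are made up for illustration.

```python
from statistics import mean

# Elapsed times in seconds, copied from results A-D above.
TIMINGS = {
    "A: p.7 query, partitioned join": [12.44, 10.74, 10.74, 11.30, 11.51],
    "B: p.10 query, partitioned join": [12.69, 10.16, 12.87, 10.09, 10.19],
    "C: p.7 query, broadcast join": [12.82, 11.16, 11.17, 11.17, 11.17],
    "D: p.10 query, broadcast join": [34.82, 27.21, 26.73, 25.76, 26.71],
}


def summarize(timings):
    """Return the mean elapsed time per test case."""
    return {label: mean(runs) for label, runs in timings.items()}


if __name__ == "__main__":
    for label, avg in summarize(TIMINGS).items():
        print(f"{label}: {avg:.2f}s average over 5 runs")
```

On these numbers, case D averages roughly 2.5 times the latency of case B, while A and C are within a fraction of a second of each other.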

    I think we should use partitioned join rather than broadcast join by
    default. For A JOIN B, comparing the data sent over the network:

    if (sizeof(A) + sizeof(B) > sizeof(B) * numOfNodes)
        broadcast join is better
    else
        partitioned join is better
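
The rule of thumb above can be written as a small function. This is only a sketch of the comparison stated in this message, assuming network transfer is the only cost that matters; it is not how Impala's planner actually decides, and `choose_join_strategy` is a hypothetical name.

```python
def choose_join_strategy(size_a, size_b, num_nodes):
    """Pick a join strategy for A JOIN B from a network-transfer estimate.

    Assumptions (from the rule of thumb in this thread, not from Impala):
    a partitioned join ships both tables once, a broadcast join ships
    table B to every node. Sizes may be in any consistent unit.
    """
    if size_a < 0 or size_b < 0 or num_nodes < 1:
        raise ValueError("sizes must be non-negative and num_nodes >= 1")

    partitioned_cost = size_a + size_b
    broadcast_cost = size_b * num_nodes

    # Broadcast only wins when copying B everywhere is cheaper than
    # shuffling both inputs.
    if partitioned_cost > broadcast_cost:
        return "broadcast"
    return "partitioned"


if __name__ == "__main__":
    # A large fact table joined to a tiny dimension table on 10 nodes.
    print(choose_join_strategy(size_a=1000, size_b=1, num_nodes=10))
    # Two tables of equal size on 10 nodes.
    print(choose_join_strategy(size_a=100, size_b=100, num_nodes=10))
```

Ties fall to the partitioned join, matching the `else` branch of the rule.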


    If you have any further questions, please feel free to reply.
    By the way, I updated the slides again because the results of the
    "additional experiments" were wrong.

    Thanks,

    suda



Discussion Overview
group: impala-user @
categories: hadoop
posted: May 1, '13 at 11:45a
active: May 3, '13 at 1:50p
posts: 5
users: 2
website: cloudera.com
irc: #hadoop
