We're profiling Impala and are curious as to the design of the Impala JDBC
driver. We're running on 4 EC2 m1.large instances with standard non-EBS
disks.

First table size | Query description         | Num rows | Execute time | Total next() time | First next() time
-----------------+---------------------------+----------+--------------+-------------------+------------------
5m               | row count of transactions | 1        | 0.411 secs   | 3.623 secs        | 3.592 secs
10m              | row count of transactions | 1        | 0.586 secs   | 11.523 secs       | 11.49 secs
5m               | calculated measure        | 100      | 0.407 secs   | 6.758 secs        | 6.698 secs
10m              | calculated measure        | 100      | 0.464 secs   | 15.888 secs       | 15.827 secs
5m               | complex join              | 63766    | 2.982 secs   | 52.049 secs       | 9.475 secs
10m              | complex join              | 63766    | 2.737 secs   | 1.00265 minutes   | 17.72 secs

Consider the following simple code, where the statement has been set up to
connect to jdbc:hive2://[ipaddress]:21050/;auth=noSasl...

    ResultSet results = statement.executeQuery(query);
    while (results.next()) {}

The call to execute the query typically returns very quickly (less than 5
seconds) for all Impala queries. The first call to next() accounts for the
large majority of the time (several seconds) compared with all subsequent
next() calls combined. For example, the first next() may take 6.6 seconds
to complete, while the remaining 99 calls may take only an additional
0.06 seconds.
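For reference, the three timings in the table (execute time, first next() time, total next() time) could be collected with something like the sketch below. It is only an illustration written against the standard java.sql interfaces; the QueryTimer and Timings names are hypothetical and are not part of any driver.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical timing helper; class and field names are illustrative only. */
final class QueryTimer {

    /** Elapsed times in nanoseconds, plus the number of rows fetched. */
    static final class Timings {
        final long executeNanos;
        final long firstNextNanos;
        final long totalNextNanos;
        final long rows;

        Timings(long executeNanos, long firstNextNanos, long totalNextNanos, long rows) {
            this.executeNanos = executeNanos;
            this.firstNextNanos = firstNextNanos;
            this.totalNextNanos = totalNextNanos;
            this.rows = rows;
        }
    }

    /** Runs the query, drains the result set, and records where the time went. */
    static Timings time(Statement statement, String query) throws SQLException {
        long start = System.nanoTime();
        try (ResultSet results = statement.executeQuery(query)) {
            long executed = System.nanoTime();

            // The first next() is timed on its own because it may block
            // until the server has a row ready.
            boolean hasRow = results.next();
            long firstDone = System.nanoTime();

            long rows = 0;
            while (hasRow) {
                rows++;
                hasRow = results.next();
            }
            long drained = System.nanoTime();

            return new Timings(executed - start, firstDone - executed,
                               drained - executed, rows);
        }
    }
}
```

Calling QueryTimer.time(statement, query) once per query and printing the fields divided by 1e9 would give values in seconds, comparable to the columns above. The total next() figure includes the first call.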

My assumption is that executeQuery() kicks off the initial query on the
Impala node, and the returned ResultSet is then a proxy object for accessing
the actual results. The first call to next() forces another server call. If
that call happens too soon and Impala hasn't finished processing the data,
the call effectively blocks until the query has finished. If the first call
to next() is delayed, I would expect execution to have completed already and
that first call to return much more quickly, but this doesn't seem to be the
case. In either case, successive calls to next() then hit the stored query
results, which return almost immediately.

Can anyone confirm if that line of thinking is correct for the driver or
provide me with the details?

Are there optimizations that occur for the next() call that pull back more
data at a time than a single row?

Is there any type of polling call that can be made to see if the results
are finished processing before getting locked into a synchronous call to
next()? If not, is a typical workflow then to make all your queries to
Impala on a separate thread to ensure you don't lock up your main thread
waiting for a long running Impala query?

Thanks in advance!


  • Scott Ruffing at Mar 27, 2013 at 7:26 pm
    Some additional information on the queries that may add to the conversation.

    row count query - select count(*) from TRANSACTIONS
    calculated query - select DISCOUNT / PRICE as PERCENT from TRANSACTIONS
    order by DISCOUNT_PERCENT DESC limit 100
    complex query (A variation of this where there are multiple joins)
    SELECT TABLE1.id, TABLE2.id, TABLE3.id, TABLE4.id, TABLE5.id
    FROM TRANSACTIONS join TABLE1 on (TABLE1.id = TRANSACTIONS.id)
    join TABLE2 on(TABLE2.id = TRANSACTIONS.id)
    join TABLE3 on (TABLE3.id = TRANSACTIONS.id)
    join TABLE4 on (TABLE4.id = TRANSACTIONS.id)
    join TABLE5 on (TABLE5.id = TRANSACTIONS.id)
    WHERE TABLE2.id = '33'
    AND TABLE3.id = 'SMALL'
    AND TABLE5.id = '2005'
    GROUP BY TABLE1.id, TABLE2.id, TABLE3.id, TABLE4.id, TABLE5.id

    I know from previous posts that top-n (order by x limit n) does affect
    query times. How does this show up in the Impala driver?

    Thanks again.
  • Alan Choi at Mar 27, 2013 at 8:29 pm
    Hi Scott,

    My comments are inlined. Hope it helps!

    Thanks,
    Alan

    On Wed, Mar 27, 2013 at 12:14 PM, Scott Ruffing wrote:

    > Can anyone confirm if that line of thinking is correct for the driver or
    > provide me with the details?

    Yes, you are correct. executeQuery() is async and next() will block until
    the result is ready. In your case, all the queries have a top-level blocking
    operator: count, order by, group by. Impala cannot return the first row
    until all the data has been read. Therefore, all of the time should be
    spent in the first next().

    > Are there optimizations that occur for the next() call that pull back more
    > data at a time than a single row?

    You can set the fetch size in the JDBC driver:
    http://docs.oracle.com/javase/6/docs/api/java/sql/Statement.html#setFetchSize(int)
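
As a rough illustration of the fetch-size suggestion, the sketch below sets the hint before executing and then drains the result set. Per the JDBC API documentation, setFetchSize is only a hint to the driver; how the driver used in this thread responds to it is not established here. The FetchSizeExample class and countRows method are hypothetical names.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical example of passing a fetch-size hint before running a query. */
final class FetchSizeExample {

    /** Runs the query with the given fetch-size hint and returns the row count. */
    static long countRows(Statement statement, String query, int fetchSize)
            throws SQLException {
        // Hint for how many rows to pull per round trip; a driver may ignore it.
        statement.setFetchSize(fetchSize);

        long rows = 0;
        try (ResultSet results = statement.executeQuery(query)) {
            while (results.next()) {
                rows++;
            }
        }
        return rows;
    }
}
```

A larger fetch size mainly matters for results with many rows, such as the 63766-row join; it would not be expected to change how long the first next() blocks on a query with a blocking operator.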

    > Is there any type of polling call that can be made to see if the results
    > are finished processing before getting locked into a synchronous call to
    > next()? If not, is a typical workflow then to make all your queries to
    > Impala on a separate thread to ensure you don't lock up your main thread
    > waiting for a long running Impala query?

    I don't think such polling exists in the JDBC spec at all. Using a separate
    thread sounds reasonable.
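
A minimal sketch of the separate-thread approach, assuming only the standard java.util.concurrent and java.sql APIs: the query and the blocking next() loop run on a worker thread, and the caller polls the returned Future instead of the driver. The BackgroundQuery class and fetchFirstColumn method are hypothetical names, and reading just the first column as a String is an arbitrary choice for illustration.

```java
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper that runs a query and drains its results off the calling thread. */
final class BackgroundQuery {

    // Single daemon worker so an unfinished query does not keep the JVM alive.
    private static final ExecutorService WORKER =
            Executors.newSingleThreadExecutor(runnable -> {
                Thread thread = new Thread(runnable, "query-worker");
                thread.setDaemon(true);
                return thread;
            });

    /** Submits the query; the returned Future can be polled with isDone(). */
    static Future<List<String>> fetchFirstColumn(Statement statement, String query) {
        return WORKER.submit(() -> {
            List<String> values = new ArrayList<>();
            try (ResultSet results = statement.executeQuery(query)) {
                // Any blocking in the first next() happens on the worker thread.
                while (results.next()) {
                    values.add(results.getString(1));
                }
            }
            return values;
        });
    }
}
```

The calling thread can check future.isDone() periodically, or call future.get(timeout, unit) to wait for a bounded time. The Statement should not be used from another thread while the worker is running the query.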

Group: impala-user (category: hadoop). Posted Mar 27, '13 at 7:15p; last
active Mar 27, '13 at 8:29p.