FAQ

On Wed, Mar 6, 2013 at 8:50 PM, Anil Kumar wrote:

Hi Wong,

1) Can impala-state-store delegate queries to impalads running
on other hosts? Is there any load balancing?
The state store does not perform query delegation. The client
(impala-shell) picks an impalad to submit the query to. Any impalad will
do. See `impala-shell --help`. The impalad that receives the query will
distribute the query to other nodes based on various factors, such as data
locality.
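For example, the target impalad can be picked explicitly with the `-i` flag (the hostname below is a placeholder; 21000 is impalad's default port):

```shell
# Point impala-shell at any impalad; that node becomes the query coordinator
# and distributes the work across the cluster.
impala-shell -i node2.example.com:21000 -q 'SELECT count(*) FROM bidemo.sales'
```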

2) When we deploy Impala in a cluster environment, do I need to load
the same data onto all data nodes, or only onto the master node?
Neither. The short answer is: simply put the data in HDFS and stop worrying.

You may have some misunderstanding about how HDFS works. You don't get to
control data placement; HDFS itself handles it. And Impala is
smart w.r.t. data locality, so you don't have to worry about it. (We can
talk about increasing the replication factor for hot files, but that's an
advanced topic.)
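Concretely, loading the data once and (optionally) hot-replicating a file might look like this; the paths are placeholders:

```shell
# Load the file into HDFS once; HDFS decides where the blocks live.
hadoop fs -put /local/sales.csv /user/hive/warehouse/bidemo.db/sales/

# The "advanced topic": raise replication for a hot file
# (-w waits until the target replication factor is reached).
hadoop fs -setrep -w 5 /user/hive/warehouse/bidemo.db/sales/sales.csv
```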

Cheers,
bc

Thanks in advance,

On Thursday, March 7, 2013 1:41:29 AM UTC+5:30, bc Wong wrote:

CM dramatically reduces your setup complexity. So I'd recommend that.

Adding impala-user for help with non-CM setup. First thing I'd check,
given the error message, is port conflict.
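A quick way to check whether the statestore webserver port (25010 in the error message) is already taken:

```shell
# Is anything already listening on 25010?
netstat -an | grep 25010
# Or, to see which process owns the port:
sudo lsof -i :25010
```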

Cheers,
bc

On Wed, Mar 6, 2013 at 3:22 AM, Anil Kumar wrote:

Hi All,

I am working with *Cloudera Impala*, with *MySQL* as the *metastore*. How do
I run *statestored* in Impala without Cloudera Manager?

I am able to run impalad on a single node successfully, and by
using impala-shell we are able to get the data from impalad.

Now we are trying to run the statestore for a 2-node cluster, and I am not
able to understand how this statestore works.


I understand the setup on a single node, but not
in a cluster environment.

Please explain all the necessary configuration steps to follow
when implementing it in a cluster environment.

Can anybody explain it clearly?

I got the error below:

E0306 21:45:45.764327 26084 statestored-main.cc:52] Could not start
webserver on port: 25010

I am stuck here.

Thanks in Advance.


  • bc Wong at Mar 7, 2013 at 8:49 am
    [bcc: impala-user]

    Since you're apparently not using Cloudera Manager, my first recommendation
    is for you to try it:
    https://ccp.cloudera.com/display/FREE45DOC/Cloudera+Manager+4.5+Free+Edition+Documentation.
    It takes care of basic setup problems like what you described.

    Cheers,
    bc
    On Wed, Mar 6, 2013 at 10:34 PM, Anil Kumar wrote:

    Hi Wong,

    Thank you for your suggestions; you clarified my doubts.

    I have 2 nodes in my cluster.

    One node I am treating as both the namenode and a datanode, and the second
    node as one more datanode.

    In this case it takes the master node as the namenode and the other node
    as a datanode, but it does not take the master as a datanode.

    I feel I am missing some configuration to make my master node a data
    node.


    If I want my master node to act as both namenode and datanode, what
    configuration changes do I need to make?

    Please help me with this.


    Note: If I have a one-node cluster, it takes that node as both namenode
    and datanode, but the problem appears when I add one more node to the
    cluster.


    Thanks in Advance,
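One common fix, sketched for CDH-style packaging (the file path, hostname, and service name here are assumptions, not taken from this thread): list the master in the slaves file, or start a DataNode on it directly.

```shell
# Assumption: CDH-style layout. Add the master to the slaves file so the
# HDFS start scripts also launch a DataNode there.
echo master.example.com >> /etc/hadoop/conf/slaves

# Or start the DataNode service on the master directly (CDH packaging).
sudo service hadoop-hdfs-datanode start
```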



  • Vikas Singh at Mar 20, 2013 at 4:01 pm
    Adding impala-user.

    As you have just installed, I am assuming you have Impala 0.6. If you want
    to use Java code for querying, you need to use the JDBC driver released by
    Cloudera. Instructions are here:
    https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+to+Work+with+JDBC

    But before moving forward, please make sure that Hive is set up correctly.
    Are you able to execute Hive queries in this setup? Once Hive is set up,
    the next step will be to start "impala-shell" and execute the same query
    using it. Once that is working, the next logical step will be to write the
    Java client (with the knowledge that any issue you then see is related to
    the Java client and not the Impala/Hive setup).
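The staged check described above might look like this (table and host names are placeholders):

```shell
# Step 1: confirm Hive itself works.
hive -e 'SELECT count(*) FROM bidemo.sales'

# Step 2: run the same query through Impala from the shell.
impala-shell -i impala-host.example.com:21000 -q 'SELECT count(*) FROM bidemo.sales'

# Step 3: only then move to the Java/JDBC client, so any remaining
# failure is isolated to the client code rather than the Impala/Hive setup.
```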

    Let us know if you need any help with this process. You should be able to
    find documentation related to all of these steps on the Cloudera site.

    Vikas
  • Anil Kumar at Mar 22, 2013 at 3:19 am
    Hi Udai,

    http://10.219.197.8:25000/queries.

    On this page I saw the result below.

    6983882f9f5a47b4:9545125340d8f7bc | select * from bidemo.sales | QUERY
    2013-03-22 08:22:47 | 0 / 1 ( 0%) | FINISHED | 100
    Query Profile:

    Query (id=6983882f9f5a47b4:9545125340d8f7bc):
    - PlanningTime: 77ms
    Query 6983882f9f5a47b4:9545125340d8f7bc:(605ms 0.00%)
    Aggregate Profile:
    Coordinator Fragment:(65ms 0.00%)
    - RowsProduced: 1.02K
    CodeGen:
    - CodegenTime: 0K clock cycles
    - CompileTime: 104ms
    - LoadTime: 197ms
    - ModuleFileSize: 37.04 KB
    EXCHANGE_NODE (id=1):(65ms 0.00%)
    - BytesReceived: 1.25 MB
    - ConvertRowBatchTime: 34K clock cycles
    - DeserializeRowBatchTimer: 2ms
    - MemoryUsed: 0.00
    - RowsReturned: 1.02K
    - RowsReturnedRate: 15.74 K/sec
    Averaged Fragment 1:
    split sizes: min: 0.00 , max: 19.25 MB, avg: 6.42 MB, stddev: 9.07 MB
    Fragment 1:

    Corresponding Logs:

    I0322 08:22:47.914163 30335 impala-beeswax-server.cc:138] query():
    query=select * from bidemo.sales
    I0322 08:22:47.931503 30335 impala-beeswax-server.cc:429] query: Query {
    01: query (string) = "select * from bidemo.sales",
    03: configuration (list) = list[1] {
    [0] = "",
    },
    04: hadoop_user (string) = "admin",
    }
    I0322 08:22:47.931627 30335 impala-beeswax-server.cc:444]
    TClientRequest.queryOptions: TQueryOptions {
    01: abort_on_error (bool) = false,
    02: max_errors (i32) = 0,
    03: disable_codegen (bool) = false,
    04: batch_size (i32) = 0,
    05: return_as_ascii (bool) = true,
    06: num_nodes (i32) = 0,
    07: max_scan_range_length (i64) = 0,
    08: num_scanner_threads (i32) = 0,
    09: max_io_buffers (i32) = 0,
    10: allow_unsupported_formats (bool) = false,
    11: default_order_by_limit (i64) = -1,
    12: debug_action (string) = "",
    }
    INFO0322 08:22:47.932000 Thread-31 com.cloudera.impala.service.Frontend]
    analyze query select * from bidemo.sales
    INFO0322 08:22:47.959000 Thread-31 com.cloudera.impala.service.Frontend]
    create plan
    INFO0322 08:22:47.959000 Thread-31 com.cloudera.impala.planner.Planner]
    create single-node plan
    INFO0322 08:22:47.959000 Thread-31 com.cloudera.impala.planner.Planner]
    create plan fragments
    INFO0322 08:22:47.960000 Thread-31 com.cloudera.impala.planner.Planner]
    finalize plan fragments
    INFO0322 08:22:47.960000 Thread-31
    com.cloudera.impala.planner.HdfsScanNode] collecting partitions for table
    sales
    INFO0322 08:22:47.961000 Thread-31 com.cloudera.impala.service.Frontend]
    get scan range locations
    INFO0322 08:22:47.981000 Thread-31 com.cloudera.impala.catalog.HdfsTable]
    loaded partiton PartitionBlockMetadata{#blocks=1, #filenames=1,
    totalStringLen=86}
    INFO0322 08:22:48.004000 Thread-31 com.cloudera.impala.catalog.HdfsTable]
    loaded disk ids for PartitionBlockMetadata{#blocks=1, #filenames=1,
    totalStringLen=86}
    INFO0322 08:22:48.005000 Thread-31 com.cloudera.impala.catalog.HdfsTable]
    block metadata cache: CacheStats{hitCount=13, missCount=10,
    loadSuccessCount=10, loadExceptionCount=0, totalLoadTime=583890510,
    evictionCount=4}
    INFO0322 08:22:48.006000 Thread-31 com.cloudera.impala.service.Frontend]
    create result set metadata
    INFO0322 08:22:48.006000 Thread-31 com.cloudera.impala.service.JniFrontend]
    Plan Fragment 0
    UNPARTITIONED
    EXCHANGE (1)
    TUPLE IDS: 0
    Plan Fragment 1
    RANDOM
    STREAM DATA SINK
    EXCHANGE ID: 1
    UNPARTITIONED
    SCAN HDFS table=bidemo.sales #partitions=1 size=19.25MB (0)
    TUPLE IDS: 0
    I0322 08:22:48.009865 30335 coordinator.cc:285] Exec()
    query_id=6983882f9f5a47b4:9545125340d8f7bc
    I0322 08:22:48.010149 30335 simple-scheduler.cc:168] SimpleScheduler
    assignment (data->backend): (10.219.197.9:50010 -> 10.219.197.9:22000),
    (10.219.197.10:50010 -> 10.219.197.10:22000), (10.219.197.8:50010 ->
    10.219.197.8:22000)
    I0322 08:22:48.010196 30335 simple-scheduler.cc:171] SimpleScheduler
    locality percentage 100% (3 out of 3)
    I0322 08:22:48.010329 30335 plan-fragment-executor.cc:80] Prepare():
    query_id=6983882f9f5a47b4:9545125340d8f7bc
    instance_id=6983882f9f5a47b4:9545125340d8f7bd
    I0322 08:22:48.234875 30335 plan-fragment-executor.cc:93] descriptor table
    for fragment=6983882f9f5a47b4:9545125340d8f7bd
    tuples:
    Tuple(id=0 size=168 slots=[Slot(id=0 type=STRING col=0 offset=56
    null=(offset=1 mask=10)), Slot(id=1 type=STRING col=1 offset=72
    null=(offset=1 mask=20)), Slot(id=2 type=STRING col=2 offset=88
    null=(offset=1 mask=40)), Slot(id=3 type=STRING col=3 offset=104
    null=(offset=1 mask=80)), Slot(id=4 type=STRING col=4 offset=120
    null=(offset=2 mask=1)), Slot(id=5 type=STRING col=5 offset=136
    null=(offset=2 mask=2)), Slot(id=6 type=STRING col=6 offset=152
    null=(offset=2 mask=4)), Slot(id=7 type=INT col=7 offset=4 null=(offset=0
    mask=1)), Slot(id=8 type=INT col=8 offset=8 null=(offset=0 mask=2)),
    Slot(id=9 type=INT col=9 offset=12 null=(offset=0 mask=4)), Slot(id=10
    type=INT col=10 offset=16 null=(offset=0 mask=8)), Slot(id=11 type=INT
    col=11 offset=20 null=(offset=0 mask=10)), Slot(id=12 type=INT col=12
    offset=24 null=(offset=0 mask=20)), Slot(id=13 type=INT col=13 offset=28
    null=(offset=0 mask=40)), Slot(id=14 type=INT col=14 offset=32
    null=(offset=0 mask=80)), Slot(id=15 type=INT col=15 offset=36
    null=(offset=1 mask=1)), Slot(id=16 type=INT col=16 offset=40
    null=(offset=1 mask=2)), Slot(id=17 type=INT col=17 offset=44
    null=(offset=1 mask=4)), Slot(id=18 type=INT col=18 offset=48
    null=(offset=1 mask=8))])
    I0322 08:22:48.339468 30335 coordinator.cc:377] starting 3 backends for
    query 6983882f9f5a47b4:9545125340d8f7bc
    I0322 08:22:48.339807 6264 impala-server.cc:1327] ExecPlanFragment()
    instance_id=6983882f9f5a47b4:9545125340d8f7bf coord=10.219.197.8:22000
    backend#=1
    I0322 08:22:48.339838 6264 plan-fragment-executor.cc:80] Prepare():
    query_id=6983882f9f5a47b4:9545125340d8f7bc
    instance_id=6983882f9f5a47b4:9545125340d8f7bf
    I0322 08:22:48.343637 6264 plan-fragment-executor.cc:93] descriptor table
    for fragment=6983882f9f5a47b4:9545125340d8f7bf
    tuples:
    Tuple(id=0 size=168 slots=[Slot(id=0 type=STRING col=0 offset=56
    null=(offset=1 mask=10)), Slot(id=1 type=STRING col=1 offset=72
    null=(offset=1 mask=20)), Slot(id=2 type=STRING col=2 offset=88
    null=(offset=1 mask=40)), Slot(id=3 type=STRING col=3 offset=104
    null=(offset=1 mask=80)), Slot(id=4 type=STRING col=4 offset=120
    null=(offset=2 mask=1)), Slot(id=5 type=STRING col=5 offset=136
    null=(offset=2 mask=2)), Slot(id=6 type=STRING col=6 offset=152
    null=(offset=2 mask=4)), Slot(id=7 type=INT col=7 offset=4 null=(offset=0
    mask=1)), Slot(id=8 type=INT col=8 offset=8 null=(offset=0 mask=2)),
    Slot(id=9 type=INT col=9 offset=12 null=(offset=0 mask=4)), Slot(id=10
    type=INT col=10 offset=16 null=(offset=0 mask=8)), Slot(id=11 type=INT
    col=11 offset=20 null=(offset=0 mask=10)), Slot(id=12 type=INT col=12
    offset=24 null=(offset=0 mask=20)), Slot(id=13 type=INT col=13 offset=28
    null=(offset=0 mask=40)), Slot(id=14 type=INT col=14 offset=32
    null=(offset=0 mask=80)), Slot(id=15 type=INT col=15 offset=36
    null=(offset=1 mask=1)), Slot(id=16 type=INT col=16 offset=40
    null=(offset=1 mask=2)), Slot(id=17 type=INT col=17 offset=44
    null=(offset=1 mask=4)), Slot(id=18 type=INT col=18 offset=48
    null=(offset=1 mask=8))])
    I0322 08:22:48.551887 4092 coordinator.cc:1003] Backend 2 completed, 2
    remaining: query_id=6983882f9f5a47b4:9545125340d8f7bc
    I0322 08:22:48.552000 4092 coordinator.cc:1012]
    query_id=6983882f9f5a47b4:9545125340d8f7bc: first in-progress backend:
    10.219.197.8:22000
    I0322 08:22:48.552098 12795 plan-fragment-executor.cc:207] Open():
    instance_id=6983882f9f5a47b4:9545125340d8f7bd
    I0322 08:22:48.552136 30481 coordinator.cc:1003] Backend 0 completed, 1
    remaining: query_id=6983882f9f5a47b4:9545125340d8f7bc
    I0322 08:22:48.552253 30481 coordinator.cc:1012]
    query_id=6983882f9f5a47b4:9545125340d8f7bc: first in-progress backend:
    10.219.197.8:22000
    I0322 08:22:48.552397 12794 plan-fragment-executor.cc:207] Open():
    instance_id=6983882f9f5a47b4:9545125340d8f7bf
    I0322 08:22:49.116104 30335 impala-beeswax-server.cc:272]
    get_results_metadata(): query_id=6983882f9f5a47b4:9545125340d8f7bc
    I0322 08:22:51.563602 4252 client-cache.cc:68] GetClient(): creating
    client for 10.219.197.8:22000.

    Could you please tell me how to increase the performance of Impala?

    Thanks,
    Anil.
  • Greg Rahn at Mar 22, 2013 at 4:15 am
    [removing scm-users]

    This issue seems to be one that is coming up frequently (the third time
    that I've seen in the past few days). The issue in the OP was noting the
    difference in execution between these two:

    hive> select * from bidemo.sales;
    Time taken: 10.969 seconds

    impala> select * from bidemo.sales;
    Returned 391435 row(s) in 14.17s

    The reason that hive is faster than impala on "select * from table [limit
    N]" queries seems to be related to the fact that hive does not run a Map
    Reduce job for this query -- it is executed directly by the hive shell
    which talks directly to HDFS. As of impala 0.6, this query is run in
    distributed mode which is not very efficient since the only thing the
    parallel workers do is feed 100% of their rows to the coordinator who then
    returns them all to the client. The overhead of running this query in
    "parallel mode" is a measurable portion of the time for such short running
    queries on small data sets (usually accessing just a single HDFS block of
    data). The other thing that comes into play here is how efficiently the
    client shell prints to the terminal, and perhaps there is a slight
    advantage to the hive shell in this case.
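One way to take terminal printing out of such a comparison is plain output redirection (not from the thread; the timings still include row transfer to the client):

```shell
# Discard the rows so only query execution and transfer are timed.
time hive -e 'select * from bidemo.sales' > /dev/null
time impala-shell -q 'select * from bidemo.sales' > /dev/null
```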

    I've filed IMPALA-165 to see if there are optimizations that can be made in
    a future release of impala for this scenario.

Discussion Overview
group: impala-user
categories: hadoop
posted: Mar 7, '13 at 5:15a
active: Mar 22, '13 at 4:15a
posts: 5
users: 4
website: cloudera.com
irc: #hadoop
