FAQ
Hello Experts:

I want to install Impala on a cluster of five machines. I read the
documentation on cloudera.com and used a yum installation instead of Cloudera
Manager. Now I have a question: should I install Impala on every DataNode,
and should I start the Impala service on every node where it is installed?

For now I have started statestored and impalad on one node (cdh4-1, where
Hive and MySQL are also installed).
I find that 'show tables', 'describe' and 'select' work fine.
But then I hit a problem:

[cdh4-1:21000] > insert overwrite table aaa select * from pokes
Query: insert overwrite table aaa select * from pokes
Application Exception : Default TException.
Remote error
ERROR: Invalid query handle

In my impala.out file there are these logs:

I1214 11:00:16.475942 13701 impala-server.cc:1516] TClientRequest.queryOptions: TQueryOptions {
  01: abort_on_error (bool) = false,
  02: max_errors (i32) = 0,
  03: disable_codegen (bool) = false,
  04: batch_size (i32) = 0,
  05: return_as_ascii (bool) = true,
  06: num_nodes (i32) = 0,
  07: max_scan_range_length (i64) = 0,
  08: num_scanner_threads (i32) = 0,
  09: max_io_buffers (i32) = 0,
  10: allow_unsupported_formats (bool) = false,
  11: partition_agg (bool) = false,
}
I1214 11:00:16.476361 13701 impala-server.cc:863] query(): query=insert overwrite table aaa select * from pokes
I1214 11:00:16.527289 13701 coordinator.cc:219] Exec() query_id=b2c77859437c4384:afb2474eb31322b7
I1214 11:00:16.527417 13701 simple-scheduler.cc:171] SimpleScheduler assignment (data->backend): (192.168.79.57:50010 -> 127.0.0.1:22000), (192.168.79.59:50010 -> 127.0.0.1:22000), (192.168.79.60:50010 -> 127.0.0.1:22000)
I1214 11:00:16.527439 13701 simple-scheduler.cc:174] SimpleScheduler locality percentage 0% (0 out of 3)
I1214 11:00:16.527554 13701 coordinator.cc:308] starting 1 backends for query b2c77859437c4384:afb2474eb31322b7
I1214 11:00:16.528108 13785 impala-server.cc:1655] ExecPlanFragment() instance_id=b2c77859437c4384:afb2474eb31322b8 coord=127.0.0.1:22000 backend#=0
I1214 11:00:16.528167 13785 plan-fragment-executor.cc:82] Prepare(): query_id=b2c77859437c4384:afb2474eb31322b7 instance_id=b2c77859437c4384:afb2474eb31322b8
I1214 11:00:16.535728 13785 plan-fragment-executor.cc:95] descriptor table for fragment=b2c77859437c4384:afb2474eb31322b8
tuples:
Tuple(id=0 size=24 slots=[Slot(id=0 type=INT col=0 offset=4 null=(offset=0 mask=1)), Slot(id=1 type=STRING col=1 offset=8 null=(offset=0 mask=2))])
I1214 11:00:16.637187 13785 hdfs-table-sink.cc:81] Random seed: 39632637
I1214 11:00:16.637578 21794 plan-fragment-executor.cc:194] Open(): instance_id=b2c77859437c4384:afb2474eb31322b8
I1214 11:00:16.637971 21796 coordinator.cc:481] Coordinator waiting for backends to finish, 1 remaining
I1214 11:00:16.687690 21794 status.cc:36] Failed to open HDFS file for writing: hdfs://cdh4-1:9000/user/hive/warehouse/aaa/-5564346489813318780--5786484167880269128_281444116_dir/-5564346489813318780--5786484167880269128_927386285_data.0
Error(255): Unknown error 255
    @ 0x767901 (unknown)
    @ 0x82c660 (unknown)
    @ 0x82c7a4 (unknown)
    @ 0x831080 (unknown)
    @ 0x82d4fb (unknown)
    @ 0x72b423 (unknown)
    @ 0x72baf0 (unknown)
    @ 0x62f171 (unknown)
    @ 0x6331d1 (unknown)
    @ 0x7ff2fbc8ed97 (unknown)
    @ 0x7ff2fa124851 start_thread
    @ 0x7ff2f96d211d clone
I1214 11:00:16.688680 13785 progress-updater.cc:45] Query b2c77859437c4384:afb2474eb31322b7 100% Complete (1 out of 1)
I1214 11:00:16.688720 13785 coordinator.cc:742] Cancel() query_id=b2c77859437c4384:afb2474eb31322b7
I1214 11:00:16.689051 13785 coordinator.cc:1027] Final profile for query_id=b2c77859437c4384:afb2474eb31322b7
Query b2c77859437c4384:afb2474eb31322b7:(110ms 0.00%)
  Aggregate Profile:
  Averaged Fragment 0:(32ms 0.00%)
    completion times: min:51ms max:51ms mean: 51ms stddev:0
    execution rates: min:1.03 KB/sec max:1.03 KB/sec mean:1.03 KB/sec stddev:0.00 /sec
    split sizes: min: 54.00 B, max: 54.00 B, avg: 54.00 B, stddev: 0.00
    - RowsProduced: 10
    CodeGen:
      - CodegenTime: 2ms
      - CompileTime: 98ms
      - LoadTime: 7ms
      - ModuleFileSize: 40.11 KB
    HDFS_SCAN_NODE (id=0):(31ms 99.30%)
      - BytesRead: 54.00 B
      - DelimiterParseTime: 8K clock cycles
      - MaterializeTupleTime: 4K clock cycles
      - MemoryUsed: 0.00
      - PerDiskReadThroughput: 1.42 MB/sec
      - RowsReturned: 10
      - RowsReturnedRate: 312.00 /sec
      - ScanRangesComplete: 1
      - ScannerThreadsReadTime: 67K clock cycles
      - TotalReadThroughput: 0.00 /sec
  Fragment 0:
    Instance b2c77859437c4384:afb2474eb31322b8:(32ms 0.70%)
      Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/54
      - RowsProduced: 10
      CodeGen:
        - CodegenTime: 2ms
        - CompileTime: 98ms
        - LoadTime: 7ms
        - ModuleFileSize: 40.11 KB
      HDFS_SCAN_NODE (id=0):(31ms 99.30%)
        - BytesRead: 54.00 B
        - DelimiterParseTime: 8K clock cycles
        - MaterializeTupleTime: 4K clock cycles
        - MemoryUsed: 0.00
        - PerDiskReadThroughput: 1.42 MB/sec
        - RowsReturned: 10
        - RowsReturnedRate: 312.00 /sec
        - ScanRangesCom

I could not find a solution. Could you help me?
Thanks in advance for your response.

yyx


  • Yyx at Dec 17, 2012 at 2:12 am
    Hi Alan,

    First of all, thank you for your very helpful reply.

    But I am new to Impala, so I am not sure my Impala configuration is
    right.

    Here is my installation: I installed impala.x86_64 (manually) on every node
    in my cluster, then I started impalad and statestore on the NameNode. In
    your reply, you said that I should start impalad on all DataNodes,
    so in the end I will start impalad on every node (NN + DN) and statestore
    on the NameNode. Am I right? Thank you for your reply and best regards.
    On Friday, December 14, 2012 1:05:59 PM UTC+8, Alan wrote:

    Hi,

    Impala should be installed on all DataNodes, and all the impalads should be
    started.

    This error log:

    I1214 11:00:16.687690 21794 status.cc:36] Failed to open HDFS file for
    writing:
    hdfs://cdh4-1:9000/user/hive/warehouse/aaa/-5564346489813318780--5786484167880269128_281444116_dir/-5564346489813318780--5786484167880269128_927386285_data.0

    indicates that you probably have a file permission issue. Can you verify
    whether impala has write privileges to that directory?
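    If it is a permission problem, commands along these lines should show and
    fix it. This is just a sketch: it assumes impalad runs as the `impala`
    user and that `hdfs` is the HDFS superuser, so adjust the user names and
    paths to your setup.

```shell
# Check who owns the warehouse directory and the table directory
hadoop fs -ls /user/hive/warehouse
hadoop fs -ls /user/hive/warehouse/aaa

# If the impala user cannot write there, hand it the table directory...
sudo -u hdfs hadoop fs -chown -R impala /user/hive/warehouse/aaa

# ...or open up the warehouse with the sticky bit, like /tmp
sudo -u hdfs hadoop fs -chmod 1777 /user/hive/warehouse
```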

    Thanks,
    Alan

    On Thu, Dec 13, 2012 at 7:13 PM, yyx wrote:
  • Mark Brooks at Dec 17, 2012 at 2:32 am
    Hello,

    As described in the link here
    (https://ccp.cloudera.com/display/IMPALA10BETADOC/Installing+Impala), in
    the section Installation Process | Impala Components | impalad: "There
    should be one (impalad) daemon process running on each node in the cluster
    that has a data node." -- So yes, you should start the impalads on all
    of your DNs, but you should not start impalad on your NN. The only
    exception would be if, in a sandbox environment, you were running a NN and
    a DN on the same host -- but that should not be the case in anything other
    than a sandbox environment.
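    With a plain package install, the daemons are typically started via the
    init scripts. A sketch, assuming the CDH4 package service names
    impala-state-store and impala-server (adjust if your packages differ):

```shell
# On one node only (the statestore host):
sudo service impala-state-store start

# On every DataNode:
sudo service impala-server start
```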
  • YUXIN YAN at Dec 17, 2012 at 6:51 am
    Hello Mark,

    Thank you for your explanation. In my cluster I first started one impalad
    and one statestore on the NN, and it worked fine. Then I stopped that
    impalad and started one impalad on every DN, and they (all the impalad
    services) still work fine. Is that normal?

    I ask because when I start only one impalad on one DataNode, I can connect
    to it using impala-shell. Since with that one impalad I can get all the
    information in the Hive tables, why should I still start the other impalad
    services on the other DNs?

    Thank you for your explanation.

    Best regards!

    yyx

  • Joey Echeverria at Dec 17, 2012 at 11:31 am
    Hi,

    The impalad process is what actually executes the query. The way that
    Impala achieves its performance is by parallelizing as much of
    the query as possible, with query processing happening on the same
    nodes where the data lives. If you have a single impalad process on
    your NN, you'll be limited both by the network bandwidth to the NN and
    by the CPU of a single server. Not to mention the fact that you run
    the risk of overwhelming the NN machine, which could adversely affect
    the performance and stability of the NN process.

    See slide 10 of this presentation for some more information:

    http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/impala-real-time-queries-in-hadoop-webinar-slides.html

    -Joey


    --
    Joey Echeverria
    Principal Solutions Architect
    Cloudera, Inc.

  • YUXIN YAN at Dec 18, 2012 at 1:22 am
    Hi,

    Thanks for all your replies; I am very happy to learn more about Impala.
    Now I have another question:

    I want to use Java to get the results of an Impala 'select', but after
    reading the instructions on cloudera.com, I don't understand its
    explanation of using the Impala ODBC driver. Can anybody give me a more
    explicit guide to connecting to Impala from Java? Thanks, and sorry about
    my awful English.

    Best regards to all.

    yyx

  • Justin Erickson at Dec 18, 2012 at 6:32 pm
    For Java, you'll want a JDBC driver. This is not yet available, but it is
    under active development. If all goes well, we expect to have something at
    the end of next month.

    ODBC is a native-code driver for non-Java applications.

    Some people have been using a JDBC-ODBC bridge driver in the meantime, but
    I'd strongly encourage waiting for the JDBC driver.

  • Yyx at Dec 19, 2012 at 2:13 am
    Hi,

    Thank you for your advice. :)

    Best regards!

    yyx
  • Kenny Sabir at Dec 18, 2012 at 10:05 pm
    In the meantime, I wrote a little command-line wrapper for Java. It isn't pretty, but it does the job as an interim measure until the JDBC driver is available.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class ImpalaAPI {

        private static final String IMPALA_SHELL = "/usr/bin/impala-shell";
        private String host = "127.0.0.1";
        private int port = 21000;

        public void run(String sql, final RowInterpreter interpreter) {
            // Pass the SQL as a separate argument so it needs no shell quoting.
            final ProcessBuilder pb = new ProcessBuilder(
                    IMPALA_SHELL, "--impalad=" + host + ":" + port, "--query=" + sql);
            BufferedReader reader = null;
            BufferedReader errReader = null;
            try {
                Process p = pb.start();
                reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
                errReader = new BufferedReader(new InputStreamReader(p.getErrorStream()));
                String line;
                // Echo anything impala-shell wrote to stderr first.
                if (errReader.ready()) {
                    line = errReader.readLine();
                    while (line != null) {
                        System.out.println("ERR: " + line);
                        line = errReader.readLine();
                    }
                }
                line = reader.readLine();
                // If the first stdout line is an error, dump the rest and bail out.
                if (line != null && line.startsWith("ERROR:")) {
                    while (line != null) {
                        System.out.println(line);
                        line = reader.readLine();
                    }
                    return;
                }
                // Hand each result row to the caller's interpreter.
                while (line != null) {
                    interpreter.process(line);
                    line = reader.readLine();
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (reader != null) reader.close();
                    if (errReader != null) errReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        public ImpalaAPI setHost(String host) {
            this.host = host;
            return this;
        }

        public ImpalaAPI setPort(int port) {
            this.port = port;
            return this;
        }
    }

    public interface RowInterpreter {

        boolean process(String line) throws IOException;

    }
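    The wrapper above just shells out to impala-shell, so the one-off command
    it builds is equivalent to running this by hand (host and query are
    placeholders from this thread):

```shell
impala-shell --impalad=cdh4-1:21000 --query="select * from pokes"
```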


    On 18/12/2012, at 12:22 PM, YUXIN YAN wrote:

    Hi,

    Thanks for all your replies, and i am very happy to get more about impala. And then there is a question here:

    I want to use java to get the impala 'select' results, but after i read the instructions in cloudera.com, i don't understand what it explains about using impala ODBC. So can anybody give me a more explicit guide about java connect impala, thanks. and Sorry about my awful english.....

    Best regards to all.

    yyx


    On Mon, Dec 17, 2012 at 7:31 PM, Joey Echeverria wrote:
    Hi,

    The impalad process is what actually executes the query. The way that
    Impala achieves its good performance is by parallelizing as much of
    the query as possible with query processing happening on the same
    nodes where the data lives. If you have a single impalad process on
    your NN, you'll be limited both by the network bandwidth to the NN and
    by the CPU of a single server. Not to mention the fact that you run
    the risk of overwhelming the NN machine which could adversely affect
    the performance and stability of the NN process.

    See slide 10 of this presentation for some more information:

    http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/impala-real-time-queries-in-hadoop-webinar-slides.html

    -Joey
    On Mon, Dec 17, 2012 at 1:51 AM, YUXIN YAN wrote:
    Hello Mark,

    Thank you for your explanation. In my cluster, I first started one impalad and one
    statestore on the NN, and it worked fine. Then I stopped that impalad and
    started one impalad on every DN; they (all the impalad services) still work fine. Is
    that normal?

    I have this question because when I start only one impalad on one datanode,
    I can connect to it using impala-shell. Since with that one I can get all the information
    in the Hive tables, why should I still start impalad services on the other DNs?

    Thank you for your explanation.

    Best regards!

    yyx

    On Mon, Dec 17, 2012 at 10:32 AM, Mark Brooks wrote:

    Hello,

    As described in the link here in the section Installation Process | Impala
    Components | impalad: "There should be one (impalad) daemon process
    running on each node in the cluster that has a data node." -- So yes you
    should start impalad on all of your DNs, but you should not start
    impalad on your NN. The only exception would be if in a sandbox environment
    you were running a NN and a DN on the same host -- but that should not be
    the case in anything other than a sandbox environment.

    On Sunday, December 16, 2012 6:12:48 PM UTC-8, yyx wrote:

    Hi Alan,

    First of all, thank you for your very helpful reply.

    But I am new to Impala, so I am not sure my
    configuration of Impala is right.

    Here is my installation: I installed impala.x86_64 (manually) on
    every node in my cluster, then I started impalad and the statestore on the
    namenode. In your reply, you said that I should start impalad on all
    datanodes, so in the end I will start impalad on every node (NN + DN) and
    the statestore on the namenode. Am I right? Thank you for your reply, and best
    regards.
    On Friday, December 14, 2012 1:05:59 PM UTC+8, Alan wrote:

    Hi,

    Impala should be installed on all data nodes, and every impalad should be
    started.

    This error log:

    I1214 11:00:16.687690 21794 status.cc:36] Failed to open HDFS file for
    writing:
    hdfs://cdh4-1:9000/user/hive/warehouse/aaa/-5564346489813318780--5786484167880269128_281444116_dir/-5564346489813318780--5786484167880269128_927386285_data.0

    indicates that you probably have a file permission issue. Can you verify
    that impala has write privileges to that directory?
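    A hedged sketch of how one might check this from the command line (the path comes from the log above; the `impala` user name and the use of `sudo -u hdfs` are assumptions about a typical CDH4 deployment, so adjust for your cluster):

    ```shell
    # Inspect ownership and permissions on the table directory.
    hdfs dfs -ls /user/hive/warehouse/aaa

    # If the user running impalad cannot write there, one option is to
    # grant it ownership (run as the HDFS superuser):
    sudo -u hdfs hdfs dfs -chown -R impala /user/hive/warehouse/aaa
    ```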

    Thanks,
    Alan
    On Thu, Dec 13, 2012 at 7:13 PM, yyx wrote:

    I1214 11:00:16.687690 21794 status.cc:36] Failed to open HDFS file for
    writing:
    hdfs://cdh4-1:9000/user/hive/warehouse/aaa/-5564346489813318780--5786484167880269128_281444116_dir/-5564346489813318780--5786484167880269128_927386285_data.0

    --
    Joey Echeverria
    Principal Solutions Architect
    Cloudera, Inc.

  • Yyx at Dec 19, 2012 at 2:16 am
    Hi,

    Thank you for your solution; now I can use it from both Java and Python. :)

    Best regards!

    yyx
    On Wednesday, December 19, 2012 6:05:57 AM UTC+8, Kenny Sabir wrote:

Discussion Overview
group: impala-user
categories: hadoop
posted: Dec 14, '12 at 3:13a
active: Dec 19, '12 at 2:16a
posts: 10
users: 5
website: cloudera.com
irc: #hadoop
