FAQ
Hi DK,
Answered your questions inline

Thanks,
Lenni
Software Engineer - Cloudera
On Wed, Feb 13, 2013 at 11:35 AM, DK wrote:

Hi All,

I am trying to install and set up Impala on a multi-machine cluster in
distributed mode. I have the following questions related to Hive:
1. I understand a Hive installation is mandatory on all nodes.
Hive is *not* required on all nodes. Currently, Impala does not support
DDL syntax (CREATE/ALTER DATABASE/TABLE), so you only need Hive installed on
nodes where you want to perform these operations.

2. Wondering if Hive needs to be set up on only one node with a MySQL
installation, or on every node.
You only need a single MySQL metastore installation. Every impalad
instance should be configured to point to the same metastore. Note that
other databases (such as PostgreSQL) can also be used as a Hive metastore;
there is no specific requirement for MySQL.

3. For 2, I am guessing one metastore is fine because all nodes connect to
the same Hadoop nodes and will have the same information.
Correct.


Thanks for your help in advance.
-DK
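
The advice above (a single metastore, with every impalad pointing to it) can be checked mechanically. Below is a minimal Python sketch, assuming each node names the metastore through the `hive.metastore.uris` property in a `hive-site.xml` file; your cluster may be configured differently (for example with a direct JDBC connection). The node names, the host `metastore-host`, and the helper functions are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical hive-site.xml template; the host name is an illustrative assumption.
HIVE_SITE_TEMPLATE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>{uri}</value>
  </property>
</configuration>"""

# Hypothetical configs as they might be collected from three nodes.
NODE_CONFIGS = {
    "node1": HIVE_SITE_TEMPLATE.format(uri="thrift://metastore-host:9083"),
    "node2": HIVE_SITE_TEMPLATE.format(uri="thrift://metastore-host:9083"),
    "node3": HIVE_SITE_TEMPLATE.format(uri="thrift://metastore-host:9083"),
}


def metastore_uri(hive_site_xml):
    """Return the hive.metastore.uris value from a hive-site.xml string, or None."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.uris":
            return prop.findtext("value")
    return None


def nodes_by_metastore(configs):
    """Group node names by the metastore URI their configuration points to."""
    groups = {}
    for node, xml_text in configs.items():
        groups.setdefault(metastore_uri(xml_text), []).append(node)
    return groups


groups = nodes_by_metastore(NODE_CONFIGS)
# Consistent means: exactly one URI is in use, and no node is missing the setting.
consistent = len(groups) == 1 and None not in groups
print("consistent" if consistent else f"mismatch: {groups}")
```

If the nodes disagree, `groups` shows which nodes point to which metastore, which is usually enough to find the stray configuration file.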

  • DK at Feb 14, 2013 at 7:48 pm
    Thanks, Lenni!
    This is helpful. I just wanted to confirm a full deployment in a distributed
    cluster:
    1 server - ResourceManager
    1 server - NameNode
    1 server - Secondary NameNode
    At least 1 server to host Hive and the MySQL server (this can be any
    machine, including any of the three above)
    The rest of the servers run CDH components and Impala
    Also, statestored needs to be running wherever queries are invoked.

    Do you think this is the right deployment plan?

    Thanks,
    DK
  • Lenni Kuff at Feb 15, 2013 at 4:21 pm
    Hi DK,
    First, there is no requirement that the machine running statestored is the
    same machine you run queries on. You can actually run queries targeting any
    impalad instance in your cluster. The state store can run on any machine in
    the cluster. However, we recommend that you run the state store on a
    different machine than the NameNode due to the NameNode's memory
    requirements. For optimal performance, it is also critical that there is an
    impalad instance installed and running on every DataNode in the cluster.

    Let me know if you have any additional questions.

    Thanks,
    Lenni
    Software Engineer - Cloudera
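
The two placement recommendations in this reply (keep statestored off the NameNode host, and run an impalad on every DataNode) can be written as a small check over a deployment plan. This is only a sketch: the host names, role labels, and the `check_plan` helper are hypothetical, and the plan is hand-written rather than read from a real cluster.

```python
# Hypothetical deployment plan: host name -> set of roles running on that host.
PLAN = {
    "master1": {"ResourceManager"},
    "master2": {"NameNode"},
    "master3": {"SecondaryNameNode", "Hive", "MySQL", "statestored"},
    "worker1": {"DataNode", "impalad"},
    "worker2": {"DataNode", "impalad"},
    "worker3": {"DataNode"},  # deliberately missing impalad
}


def check_plan(plan):
    """Return warnings for hosts that go against the two recommendations."""
    warnings = []
    for host, roles in sorted(plan.items()):
        # Every DataNode should also run an impalad.
        if "DataNode" in roles and "impalad" not in roles:
            warnings.append(f"{host}: DataNode without impalad")
        # The state store should not share a machine with the NameNode.
        if "statestored" in roles and "NameNode" in roles:
            warnings.append(f"{host}: statestored shares a host with the NameNode")
    return warnings


warnings = check_plan(PLAN)
for line in warnings:
    print(line)
# prints: worker3: DataNode without impalad
```

Note that nothing in the check ties statestored to the machine where queries are issued, since the reply says there is no such requirement.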

  • Jung-Yup Lee at Feb 15, 2013 at 11:08 pm
    You can actually run queries targeting any impalad instance in your
    cluster.

    Does this mean that a client should handle load balancing across the
    impalad instances itself when submitting a batch of queries?



    --

    Sincerely yours,

    Jung-Yup Lee
  • DK at Feb 18, 2013 at 4:04 am

    On Friday, February 15, 2013 3:08:36 PM UTC-8, Jung-Yup Lee wrote:
    You can actually run queries targeting any impalad instance in your
    cluster.

    Does this mean that a client should handle load balancing across the
    impalad instances itself when submitting a batch of queries?

    As I understand it, statestored is supposed to do exactly this, delegating
    each query to one of the impala daemons.
    Let's see what Lenni has to say.

    -DK
  • Henry Robinson at Feb 18, 2013 at 4:43 am
    On 17 February 2013 20:04, DK wrote:

    As I understand it, statestored is supposed to do exactly this, delegating
    each query to one of the impala daemons.

    That's not quite right - the statestore doesn't do any load balancing or
    scheduling. Its responsibility in current releases is limited to telling
    other Impala daemons about the membership of the Impala cluster, that is
    which impalads are running and where.

    Currently, it's the user's responsibility to load-balance around the
    cluster, although for many workloads this won't be as necessary and you can
    safely send queries to only one or two nodes.

    Henry

    --
    Henry Robinson
    Software Engineer
    Cloudera
    415-994-6679
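
Since the statestore only tracks cluster membership and load balancing is left to the user, one simple client-side approach is round-robin over the impalad instances. The sketch below only decides which daemon each query would go to; it does not connect to anything. The addresses and the `round_robin_assign` helper are hypothetical, and port 21000 is assumed here as the port clients connect to.

```python
import itertools

# Hypothetical impalad addresses; substitute the hosts in your own cluster.
IMPALADS = ["impalad1:21000", "impalad2:21000", "impalad3:21000"]


def round_robin_assign(queries, impalads):
    """Pair each query with an impalad address, cycling through the list in order."""
    if not impalads:
        raise ValueError("at least one impalad address is required")
    targets = itertools.cycle(impalads)
    return [(next(targets), query) for query in queries]


queries = ["SELECT 1", "SELECT 2", "SELECT 3", "SELECT 4"]
assignments = round_robin_assign(queries, IMPALADS)
for address, query in assignments:
    print(f"{address} <- {query}")
```

As noted in the reply, many workloads will not need this at all, and sending queries to one or two nodes is fine; a rotation like this only matters when a client submits enough queries for the coordinating node to become a bottleneck.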

Discussion Overview
group: impala-user @
categories: hadoop
posted: Feb 13, '13 at 7:42p
active: Feb 18, '13 at 4:43a
posts: 6
users: 4
website: cloudera.com
irc: #hadoop
