Grokbase Groups Hive user July 2009
FAQ
I have been following some threads on the Hadoop mailing list about
speeding up MR jobs. I have a few questions; I am sure I could find the
answers by digging into the source code, but I thought I could get a
quick answer here.

1. ADD JAR 'myfile.jar' uses the distributed cache, which has some
overhead. I know that if I create an auxlib directory under the Hive
root, those jars will be added to -libjars on startup. If I add my jar
to auxlib on all my nodes, will a UDF in the jar be available during
subsequent jobs? Or is it only necessary to add those jars to auxlib
on the node I start the job from?

2. Dealing with the entire Hive install: how much of the Hive install
really needs to be replicated on each datanode? If we used the
distributed cache for everything, jobs would have unneeded overhead,
but Hive would be 'installed on demand' from the client.

Thanks,
Edward


  • Zheng Shao at Jul 24, 2009 at 5:35 pm
    Hive only needs to be installed on the node that runs the Hive query.
    All the jars are sent to the Hadoop JobClient via -libjars; the
    code is in ExecDriver.java.
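    As a rough sketch of that client-side step (not Hive's actual code;
    the directory name here is illustrative): the launcher collects every
    jar in the aux directory into a comma-separated list and passes it as
    -libjars, which is what places the jars in the distributed cache.

    ```shell
    # Sketch: build the comma-separated -libjars value from an aux directory,
    # the way a client-side launcher would before submitting the job.
    AUX=auxlib_demo
    mkdir -p "$AUX"
    touch "$AUX/udf-a.jar" "$AUX/udf-b.jar"

    LIBJARS=$(ls "$AUX"/*.jar | paste -sd, -)
    echo "$LIBJARS"   # prints auxlib_demo/udf-a.jar,auxlib_demo/udf-b.jar
    # The launcher would then run something like:
    #   hadoop jar ... -libjars "$LIBJARS" ...
    ```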

    In Hadoop 0.17, I don't think there is a way to add a path to the
    classpath for a job (unless we put it in hadoop-env.sh and start
    the TaskTracker with that path). Are there any changes in later
    versions?



    Zheng


  • Edward Capriolo at Jul 24, 2009 at 5:45 pm

    Zheng,

    A thread from the Hadoop list piqued my interest; search for
    "hadoop jobs take long time to setup":

    http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3C7e536b1f0906281408n1c2484bfve6dc1ea339110e9d@mail.gmail.com%3E

    Can hive benefit?
    Edward
  • Edward Capriolo at Jul 30, 2009 at 6:55 pm

    Could we use something like this for a performance increase?
    Assuming the jars are present on all TaskTrackers, could we
    have an alternate invocation script such as bin/hive-local?
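    A minimal sketch of what such a bin/hive-local wrapper might look
    like (hypothetical script; the auxlib path and the assumption that
    every TaskTracker already has the jars are illustrative, not a real
    Hive feature):

    ```shell
    #!/bin/sh
    # Hypothetical bin/hive-local: assume the UDF jars already exist on
    # every TaskTracker node, so skip shipping them through the
    # distributed cache.
    AUX_DIR=${AUX_DIR:-/usr/lib/hive/auxlib}   # illustrative location

    # Put the local jars on the client classpath instead of uploading them.
    HADOOP_CLASSPATH="$AUX_DIR/*${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
    export HADOOP_CLASSPATH

    exec hive "$@"   # forward all arguments to the normal hive launcher
    ```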

    Edward
  • Zheng Shao at Jul 31, 2009 at 5:31 am
    I don't see a clear solution in that mailing thread: simply keeping
    a TaskTracker child JVM running longer won't solve the problem nicely,
    because tasks from different jobs should have different classpaths,
    and I guess that is only supported in later versions of Hadoop.

    One simple way is to add the jars to hadoop-env.sh (which adds
    them to the TaskTracker's classpath). This is not a clean
    solution, but it gives us the full performance gain no matter which
    Hadoop version we are using.
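    For concreteness, such a hadoop-env.sh fragment might look like the
    following (a sketch; the jar directory is an assumption, and the file
    must be edited on every TaskTracker node before restarting the daemon):

    ```shell
    # hadoop-env.sh fragment (on every TaskTracker node): append the
    # shared UDF jars to the daemon classpath so jobs no longer need
    # to ship them via the distributed cache.
    for jar in /usr/lib/hive/auxlib/*.jar; do
      HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
    done
    export HADOOP_CLASSPATH
    ```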

    I think a better solution would be to add an option,
    "mapred.local.classpath", to JobConf, specifying the path of the jars
    on the machines in the cluster. This would have to be done on the
    Hadoop side, at the beginning of the main function in TaskTracker.Child
    (and if TaskTracker.Child is reused, we would need to reset the
    classpath each time it runs a new task).

    What do you think?

    Zheng



    --
    Yours,
    Zheng

Discussion Overview
group: user @ hive
categories: hive, hadoop
posted: Jul 24, '09 at 4:41p
active: Jul 31, '09 at 5:31a
posts: 5
users: 2
website: hive.apache.org

2 users in discussion: Edward Capriolo (3 posts), Zheng Shao (2 posts)
