On Jul 31, 2009 at 5:31 am:
I don't see a clear solution in that mailing thread: simply keeping
a TaskTracker child JVM running longer won't solve the problem nicely,
because tasks from different jobs need different classpaths, and I
guess that is only supported in later versions of Hadoop.
One simple way to go is to add the jars in hadoop-env.sh (which will
add those jars to the TaskTracker's classpath). This is not a nice
solution, but it does give us the full performance gain no matter which
Hadoop version we are using.
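For example, the hadoop-env.sh route would look roughly like this (the jar path is just an illustration, not a standard location):

```shell
# conf/hadoop-env.sh on every node in the cluster.
# /usr/local/hive/lib/hive-exec.jar is an example path; adjust for your install.
export HADOOP_CLASSPATH=/usr/local/hive/lib/hive-exec.jar:$HADOOP_CLASSPATH

# The TaskTrackers must be restarted to pick up the new classpath.
```

The obvious downside is that every classpath change means touching all nodes and bouncing the daemons.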
I think a better solution would be to add an option
"mapred.local.classpath" to JobConf, which specifies the path of the
jars on the machines in the cluster. This should be done in Hadoop
land, at the beginning of the main function in TaskTracker.Child. (If
the TaskTracker.Child JVM is reused, then we need to reset the
classpath each time it runs a new task.)
What do you think?
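To make the proposal concrete, a job submission might look like the following. Note this is purely a sketch: "mapred.local.classpath" does not exist in any Hadoop release, and the class and jar names are made up.

```shell
# Hypothetical: point the tasks at jars that are already installed on
# every node, so nothing has to be shipped through the DistributedCache.
hadoop jar my-job.jar com.example.MyJob \
  -Dmapred.local.classpath=/usr/local/hive/lib/hive-exec.jar:/usr/local/hive/lib/hive-serde.jar \
  input output
```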
On Thu, Jul 30, 2009 at 11:54 AM, Edward Capriolo wrote:
On Fri, Jul 24, 2009 at 1:45 PM, Edward Capriolo wrote:
On Fri, Jul 24, 2009 at 1:36 PM, Zheng Shao wrote:
Hive only needs to be installed on the node that runs the Hive query.
All the jars will be sent to the Hadoop JobClient via -libjars. The
code is in ExecDriver.java.
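For reference, the same -libjars mechanism looks like this from the command line (class and jar names are made up for illustration; the job's main class must go through GenericOptionsParser or ToolRunner for -libjars to be honored):

```shell
# Each jar listed here is uploaded to the DistributedCache and added to
# the classpath of every task JVM -- for this job only.
hadoop jar my-job.jar com.example.MyJob \
  -libjars /path/to/my-udfs.jar,/path/to/more-deps.jar \
  input output
```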
In Hadoop 0.17, I don't think there is a way to add a path to the
classpath for a job (unless we put it in hadoop-env.sh and start the
TaskTracker with that path). Are there any changes in the later
versions?
On 7/24/09, Edward Capriolo wrote:
I have been following some threads on the Hadoop mailing list about
speeding up MR jobs. I have a few questions. I am sure I could find the
answers if I dug into the source code, but I thought I could get a
quicker answer here.
1. ADD JAR 'myfile.jar' uses the distributed cache. Using the
distributed cache has some overhead. I know that if I create an auxlib
directory under the Hive root, those jars will be added to libjars on
startup. If I add my jar to auxlib on all my nodes, will a UDF in the
jar be available during subsequent jobs? Or is it only necessary to add
those jars to the auxlib on the node I start the job from?
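For comparison, the per-session route from the Hive CLI looks like this (jar path and class name are made up):

```shell
hive> ADD JAR /path/to/my-udfs.jar;
hive> CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF';
hive> -- Every job launched from this session ships the jar via the
hive> -- distributed cache, which is the overhead in question.
hive> SELECT my_udf(col) FROM some_table;
```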
2. Dealing with the entire Hive install. How much of the Hive install
really needs to be replicated on each datanode? If we used the
distributed cache for everything, the jobs would have unneeded
overhead, but Hive would be 'installed on demand' from the client.
A thread from the Hadoop list piqued my interest. Search for
"hadoop jobs take long time to setup":
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3C7e536b1f0906281408n1c2484bfve6dc1ea339110e9d@mail.gmail.com%3E
Can hive benefit?
Could we use something like this for a performance increase? With the
assumption that the jars are present on all task trackers, could we
have an alternate invocation script such as bin/hive-local?