FAQ
Hi.

I wonder how one should improve the startup times of a Hadoop job. Some of my
jobs, which have a lot of dependencies in the form of many jar files, take a
long time to start in Hadoop, sometimes up to 2 minutes.
The input data volumes in these cases are negligible, so it seems that Hadoop
has a really high setup cost. Some overhead I can live with, but this seems too much.

Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
minutes to set it up... 20-30 seconds max would be a lot more reasonable.

Hints?

//Marcus


--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/


  • Tim robertson at Jun 28, 2009 at 7:54 pm
    How long does it take to start the code locally in a single thread?

    Can you reuse the JVM so it only starts once per node per job?
    conf.setNumTasksToExecutePerJvm(-1)

    Cheers,
    Tim
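
    For context, a hedged sketch of the JVM-reuse setting Tim mentions, in
    hadoop-site.xml form. The property name below is the 0.19-era equivalent of
    the API call; whether it is available depends on your Hadoop version:

    ```xml
    <!-- Reuse one child JVM for an unlimited number of tasks per job;
         equivalent to conf.setNumTasksToExecutePerJvm(-1) in the job code. -->
    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>-1</value> <!-- -1 = no limit on tasks per JVM -->
    </property>
    ```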



  • Marcus Herou at Jun 28, 2009 at 9:17 pm
    Hi.

    Running without a jobtracker makes the job start almost instantly.
    I think it is due to something with the classloader. I use a huge number of
    jar files (jobConf.set("tmpjars", "jar1.jar,jar2.jar")...), which need to be
    loaded every time, I guess.

    If I issue conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
    live forever then?

    Cheers

    //Marcus
  • Stuart White at Jun 28, 2009 at 9:21 pm
    Although I've never done it, I believe you could manually copy your jar
    files out to your cluster, somewhere on Hadoop's classpath; that would
    remove the need to copy them to the cluster at the start of each job.
  • Mikhail Bautin at Jun 28, 2009 at 9:26 pm
    This is the way we deal with this problem, too. We put our jar files on NFS,
    and the attached patch makes it possible to add those jar files to the
    tasktracker classpath through a configuration property.

    Thanks,
    Mikhail
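
    As an illustration of this configuration-property approach: the property
    name below is the one the thread's modified patch ends up using
    ("mapred.additional.class.path"), and the NFS paths are purely hypothetical.
    The tasktracker-side configuration might look like:

    ```xml
    <!-- Extra local/NFS jars appended to the task child JVM classpath by the
         patched TaskRunner; the paths below are examples only. -->
    <property>
      <name>mapred.additional.class.path</name>
      <value>/mnt/nfs/jars/deps-a.jar,/mnt/nfs/jars/deps-b.jar</value>
    </property>
    ```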
  • Marcus Herou at Jun 28, 2009 at 9:40 pm
    Makes sense... I will try both rsync and NFS, but I think rsync will beat NFS,
    since NFS can be slow as hell sometimes. But what the heck, we already have
    our maven2 repo on NFS, so why not :)

    Are you saying that this patch makes the client able to configure which
    "extra" local jar files to add to the classpath when firing up the
    TaskTracker child?

    To be explicit: do you confirm that using tmpjars the way I do is a costly,
    slow operation?

    To which branch do you apply the patch (we use 0.18.3)?

    Cheers

    //Marcus

  • Mikhail Bautin at Jun 28, 2009 at 10:08 pm
    Marcus,

    We currently use 0.20.0 but this patch just inserts 8 lines of code into
    TaskRunner.java, which could certainly be done with 0.18.3.

    Yes, this patch just appends additional jars to the child JVM classpath.

    I've never really used tmpjars myself, but if it involves uploading multiple
    jar files into HDFS every time a job is started, I can see how it could be
    really slow. On our ~80-job workflow this would have really slowed things down.

    Thanks,
    Mikhail
  • Marcus Herou at Jun 28, 2009 at 10:16 pm
    Hi.

    Just to be clear: is it the jobtracker that needs the patched code,
    or the tasktrackers?

    Kindly

    //Marcus
  • Mikhail Bautin at Jun 28, 2009 at 10:33 pm
    Marcus,

    The code that needs to be patched is in the tasktracker, because the
    tasktracker is what starts the child JVM that runs user code.

    Thanks,
    Mikhail
  • Marcus Herou at Jun 29, 2009 at 6:37 am
    Of course... Thanks for the help!

    Cheers

    //Marcus
  • Marcus Herou at Jul 7, 2009 at 9:49 pm
    Finally got to test this. It reduced the startup time from 40+ seconds to
    below 10 seconds, which is grrrrrrrrrrrrrrrrreat!

    However, we modified the patch slightly:

    String additionalClassPath = conf.get("mapred.additional.class.path");
    if (additionalClassPath != null) {
        String[] localfiles = additionalClassPath.split(",");
        for (int i = 0; i < localfiles.length; i++) {
            String localfile = localfiles[i].trim();
            LOG.info("Adding " + localfile);
            classPath.append(sep);
            classPath.append(localfile);
        }
    }
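
    Since the snippet above is only a fragment of TaskRunner.java, here is the
    same append logic as a self-contained, testable sketch. The class and method
    names are invented for illustration and are not part of the patch:

    ```java
    // Self-contained sketch of the classpath-append logic from the modified patch.
    // "mapred.additional.class.path" is the property name used above; class and
    // method names here are hypothetical.
    public class AdditionalClasspath {

        /** Appends each comma-separated entry of `additional` to `classPath`,
         *  trimming whitespace and skipping empty entries. */
        static String append(String classPath, String additional, String sep) {
            if (additional == null) {
                return classPath;
            }
            StringBuilder sb = new StringBuilder(classPath);
            for (String entry : additional.split(",")) {
                String trimmed = entry.trim();
                if (!trimmed.isEmpty()) {
                    sb.append(sep).append(trimmed);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // e.g. the job jar plus two shared jars, joined with the path separator
            System.out.println(append("job.jar", " jar1.jar , jar2.jar ", ":"));
            // prints: job.jar:jar1.jar:jar2.jar
        }
    }
    ```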

    Cheers

    //Marcus

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Jun 28, '09 at 7:44p
active: Jul 7, '09 at 9:49p
posts: 11
users: 4
website: hadoop.apache.org...
irc: #hadoop
