Hi,

I want to understand a basic concept in MR.

If a mapper creates an instance of some class (using the 'new' operator),
then that instance exists ONCE in the VM of this node.
For each node.
Correct?

Now,
what if, instead of using the 'new' operator, the instance is created using
reflection?
Is that valid in an MR job?
Will only one instance of that class exist on that node?

Thanks,


Eyal

Eyal Golan
egolan74@gmail.com

Visit: http://jvdrums.sourceforge.net/
LinkedIn: http://www.linkedin.com/in/egolan74
Skype: egolan74



  • Harsh J at Dec 30, 2011 at 12:51 pm
    Eyal,

    Yes, it is right to think of each Task attempt as one individual JVM running on its own on a node. Multiple slots mean multiple VMs running in parallel as well. And yes, using reflection to build your objects will work just fine -- it's all user-side Java code that gets executed.
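
To make the point about reflection concrete, here is a minimal sketch in plain Java (no Hadoop classes involved). `ReflectionDemo`, `Helper`, and `create` are hypothetical names invented for this illustration. It only shows that a reflective call yields an ordinary object of the same class as `new` would, and that every call produces a separate instance.

```java
public class ReflectionDemo {
    /** Hypothetical helper class; stands in for whatever the mapper would create. */
    public static class Helper {
        public String greet() {
            return "hello";
        }
    }

    /** Creates an object from a class name: the reflective counterpart of `new Helper()`. */
    public static Object create(String className) {
        try {
            Class<?> cls = Class.forName(className);
            // Requires an accessible no-argument constructor on the target class.
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Object viaNew = new Helper();
        Object viaReflection = create(Helper.class.getName());
        // Same class either way, but two distinct objects.
        System.out.println(viaReflection.getClass() == viaNew.getClass()); // prints "true"
        System.out.println(viaReflection != viaNew);                       // prints "true"
    }
}
```

Reflection therefore does not change how many instances exist; that depends only on how often the creating code runs.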
  • Eyal Golan at Dec 30, 2011 at 2:23 pm
    Great News !!
    Thanks for the info.

    So using reflection, I can inject different implementations of interfaces
    (services) into the mapper (or reducer).
    That way I can test a mapper (or reducer) just by reflecting a stub
    instead of the real implementation.
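
The stub-injection idea can be sketched roughly as follows, in plain Java without any Hadoop dependency. Everything here is an assumption made for illustration: the `LookupService` interface, its two implementations, the `SketchMapper` stand-in, and the `lookup.service.class` key. A real job would presumably read the class name from its job configuration rather than from a plain `Map`.

```java
import java.util.HashMap;
import java.util.Map;

public class InjectionDemo {
    /** Hypothetical service that the mapper depends on. */
    public interface LookupService {
        String lookup(String key);
    }

    /** Hypothetical production implementation. */
    public static class RealLookup implements LookupService {
        public String lookup(String key) {
            return "real:" + key;
        }
    }

    /** Hypothetical stub used in tests. */
    public static class StubLookup implements LookupService {
        public String lookup(String key) {
            return "stub:" + key;
        }
    }

    /** Simplified stand-in for a mapper; this is not a Hadoop class. */
    public static class SketchMapper {
        private LookupService service;

        /** Runs once before any record; picks the implementation by class name. */
        public void setup(Map<String, String> conf) {
            String name = conf.getOrDefault("lookup.service.class", RealLookup.class.getName());
            try {
                service = Class.forName(name)
                        .asSubclass(LookupService.class)
                        .getDeclaredConstructor()
                        .newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not create " + name, e);
            }
        }

        public String map(String record) {
            return service.lookup(record);
        }
    }

    public static void main(String[] args) {
        Map<String, String> testConf = new HashMap<>();
        testConf.put("lookup.service.class", StubLookup.class.getName());

        SketchMapper mapper = new SketchMapper();
        mapper.setup(testConf);
        System.out.println(mapper.map("k1")); // prints "stub:k1"
    }
}
```

Because the implementation is chosen by name, a test only has to change one configuration value to swap in the stub.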

    Thanks,




  • Anirudh at Dec 31, 2011 at 4:51 am
    Where are you creating this new object? If it is in the map function, then
    a new object will be created for each record in the split.
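
The difference between the two locations can be illustrated with a small plain-Java simulation. It does not use Hadoop; `Helper`, `PerRecordMapper`, and `PerTaskMapper` are hypothetical stand-ins, and the loop over a list of strings merely imitates a framework calling `map` once per record of a split.

```java
import java.util.Arrays;
import java.util.List;

public class InstanceCountDemo {
    /** Hypothetical stateless helper that counts how often it is constructed. */
    public static class Helper {
        static int created = 0;

        public Helper() {
            created++;
        }

        public String process(String record) {
            return record.toUpperCase();
        }
    }

    /** Creates the helper inside map(): one helper per record. */
    public static class PerRecordMapper {
        public String map(String record) {
            return new Helper().process(record);
        }
    }

    /** Creates the helper in setup(): one helper for the whole task. */
    public static class PerTaskMapper {
        private Helper helper;

        public void setup() {
            helper = new Helper();
        }

        public String map(String record) {
            return helper.process(record);
        }
    }

    public static int countPerRecord(List<String> split) {
        Helper.created = 0;
        PerRecordMapper mapper = new PerRecordMapper();
        for (String record : split) {
            mapper.map(record);
        }
        return Helper.created;
    }

    public static int countPerTask(List<String> split) {
        Helper.created = 0;
        PerTaskMapper mapper = new PerTaskMapper();
        mapper.setup();
        for (String record : split) {
            mapper.map(record);
        }
        return Helper.created;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("a", "b", "c");
        System.out.println(countPerRecord(split)); // prints "3"
        System.out.println(countPerTask(split));   // prints "1"
    }
}
```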

    You may also need to look at how the JVM reuse option works. I am not too sure
    about this, so you may want to check the code. My understanding is that even
    with JVM reuse enabled, a new Map task is created for every task, in which
    case the "new" operator will create another instance even if the statement
    is not in the map function.

  • Eyal Golan at Dec 31, 2011 at 9:29 am
    My idea is to create that object in the setup / configure method (depending
    on which Mapper / Reducer I inherit from).

    I don't understand the 'reuse' option you are referring to.
    How many map tasks will be created? One per split or one per VM (node)?
    Are you suggesting that although there would be one Mapper on the node,
    each 'new' operator (or reflective call) will create a new instance,
    thus making lots of instances?

    BTW,
    the helper classes I want to create are of course not going to be stateful.
    They are definitely 'helper' classes that hold some logic.

    Thanks,

    Eyal


  • Anirudh at Dec 31, 2011 at 11:36 am
    I just wanted to confirm where exactly you were planning to put the
    instantiation code, as it was not mentioned in your previous post. The
    location would have made a difference. As you are doing it in the setup of
    the mapper/reducer, you are good.

    I was referring to the Task JVM Reuse option:
    http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse

    It states that if the option to reuse the JVM is enabled, the same Task JVM
    will execute multiple tasks (i.e. map/reduce). I am not sure how this is
    implemented: whether a new Mapper/Reducer is created for each task or
    whether they too are reused.
    If a new instance is created each time, then the mapper/reducer and all
    its references will become eligible for garbage collection and you would be
    good. If the Mapper/Reducer instances are reused, then setup should be
    called again, creating another instance of your helper class.

    In my opinion the latter does not make sense, and the implementation would
    follow the former approach, i.e. creation of a new Mapper/Reducer
    for each Task. But it would be interesting to check.

    As the classes in question are (stateless) helper classes, you should not be
    affected in terms of functionality.

    I am not clear on one of your statements:
    *How many map tasks will be created? One per split or one per VM (node)?*
    *Are you suggesting that although there would be one Mapper in the node*...

    Have you configured your node to have a single slot for map/reduce tasks? If
    yes, then there will be one Mapper/Reducer task on the node. If not, there
    could be more than one mapper/reducer on the node, depending on several
    other parameters, e.g. the number of mapper/reducer slots allocated on the
    node, the number of input splits, etc. If the node is configured to run more
    than one Mapper/Reducer task, the scheduler may choose to run more than one
    task on the same node. The default is 2 Map & 2 Reduce tasks per node. And
    for each task a new JVM is launched unless the JVM reuse option is enabled.

    Thanks,
    Anirudh

  • Eyal Golan at Jan 1, 2012 at 7:40 am
    Thank you very much for the detailed explanation Anirudh.

    I think that my question about node / VM was due to some lack of knowledge
    (I'm just starting to learn the Hadoop environment).
    As for the configuration of the nodes and clusters, that is something I am
    not doing myself. We have a dedicated team for managing the Hadoop cluster
    and I'll ask them.

    I think that my question should have been: How many instances of the
    'helper' class will be created in a single VM?
    And, as I understand it, given that I am creating the helper in the setup /
    configure method, there would be one.
    And as long as it's stateless, I'm good.

    Thanks again,

    Eyal




  • Anirudh at Jan 1, 2012 at 12:19 pm
    No problems Eyal.

    On second thought, with JVM reuse the Mapper/Reducer instances
    should be reused, and setup should be called only once. This makes
    sense too, as JVM reuse applies within the same job.
    You should be good with class instantiation even if JVM reuse is
    enabled.

  • Harsh J at Jan 1, 2012 at 3:05 pm
    You are guaranteed one setup call for every single task attempt. This
    is regardless of JVM reuse being on or off. JVM reuse will cause no
    issues with what Eyal is attempting to do.
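
As a rough illustration of that guarantee, the following plain-Java simulation runs several task attempts one after another in the same JVM, the way a reused task JVM would. It is only a sketch: `SketchMapper`, `Helper`, and `runAttempts` are hypothetical names, no Hadoop code is involved, and creating a fresh mapper object for each attempt is an assumption of the sketch rather than something stated in this thread.

```java
import java.util.ArrayList;
import java.util.List;

public class JvmReuseDemo {
    /** Hypothetical stateless helper created during setup. */
    public static class Helper {
    }

    /** Simplified stand-in for a mapper; this is not a Hadoop class. */
    public static class SketchMapper {
        int setupCalls = 0;
        Helper helper;

        void setup() {
            setupCalls++;
            helper = new Helper();
        }
    }

    /**
     * Runs the given number of task attempts sequentially in this JVM and
     * returns the helper that each attempt created in its setup call.
     */
    public static List<Helper> runAttempts(int attempts) {
        List<Helper> helpers = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            SketchMapper mapper = new SketchMapper(); // assumption: fresh mapper per attempt
            mapper.setup();                           // exactly one setup call per attempt
            helpers.add(mapper.helper);
        }
        return helpers;
    }

    public static void main(String[] args) {
        // Three attempts sharing one JVM produce three helpers, one per attempt.
        System.out.println(runAttempts(3).size()); // prints "3"
    }
}
```

Under these assumptions a reused JVM ends up holding one helper per task attempt it has run, each owned by its own mapper, which is harmless as long as the helper is stateless.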
    On Sun, Jan 1, 2012 at 5:49 PM, Anirudh wrote:
    No problems Eyal.

    On  a second thought, for the JVM re-use the Mapper/Reducer instances should
    be re-used, and the setup should be called only once. This makes sense too
    as the JVM reuse is for the same job.
    You should be good with class instantiation even if the JVM reuse is
    enabled.

    On Sat, Dec 31, 2011 at 11:39 PM, Eyal Golan wrote:

    Thank you very much for the detailed explanation Anirudh.

    I think that my question about node / VM was due to some lack of knowledge
    (I'm just starting to learn the Hadoop environment).
    Regarding configuration of the nodes and clusters.
    This is something that I am not doing by myself. We have a dedicated team
    for managing the Hadoop cluster and I'll ask them.

    I think that my question should have been: How many instances of the
    'helper' class will be created in a single VM.
    And, as I understand, consider I am creating the helper in the setup /
    configure method, there would be one.
    And as long as it's stateless, I'm good.

    Thanks again,

    Eyal



    Eyal Golan
    egolan74@gmail.com

    Visit: http://jvdrums.sourceforge.net/
    LinkedIn: http://www.linkedin.com/in/egolan74
    Skype: egolan74

    P  Save a tree. Please don't print this e-mail unless it's really
    necessary


    On Sat, Dec 31, 2011 at 1:36 PM, Anirudh wrote:

    I just wanted to confirm where exactly you were planning to have the
    instantiation code, as it was not mentioned in your previous post. The
    location would have made difference. As you are doing it in the setup of
    mapper/reducer, you are good.

    I was referring to the Task JVM Reuse option:

    http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse

    It states that if the option to reuse JVM is enabled, the same Task JVM
    will execute multiple tasks(i.e. map/reduce). I am not sure how this is
    implemented, whether a new Mapper/Reducer is created for each task or they
    too are re-reused.
    If a new instance is created each time, then the mapper/reducer and  all
    its reference will be marked for garbage collection and you would be good.
    If the Mapper/Reducer instances are re-used then the setup should be
    called again creating another instance of your helper class.

    In my opinion the latter does not make sense, and the implementation
    would be according to the prior approach i.e. creation of a new
    Mapper/Reducer for each Task. But it would be interesting to check.

    --
    Harsh J
  • Anirudh at Jan 2, 2012 at 12:01 am
    Is there any specific reason why setup is called for every task attempt?
    From an optimization point of view, wouldn't it be good if setup were
    called only once in case of JVM reuse?
    I have not yet looked at the implementation. In case of JVM reuse, is the
    application Mapper instance reused, or is a new instance created for every
    task attempt?

    My suggestion for Eyal would be to have a static field initializer
    expression in the Mapper to create the helper class instance. This will
    ensure that the helper class will be instantiated when the Mapper class is
    loaded.
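
    To make the difference between the two options concrete, here is a minimal
    plain-Java sketch. It does not use the Hadoop API: SketchMapper and Helper
    are hypothetical stand-ins, and the loop only imitates several task
    attempts running one after another in a single reused JVM, each with a
    fresh mapper instance. The counters exist only to show how many helpers
    each approach builds.

```java
import java.util.concurrent.atomic.AtomicInteger;

class StaticVsSetupDemo {
    // Hypothetical stateless helper, standing in for the class being discussed.
    static class Helper {
        String process(String record) {
            return record.trim().toUpperCase();
        }
    }

    // Stand-in for a user Mapper; NOT the Hadoop Mapper class.
    static class SketchMapper {
        // Counters exist only to make the number of constructions visible.
        static final AtomicInteger STATIC_HELPERS = new AtomicInteger();
        static final AtomicInteger SETUP_HELPERS = new AtomicInteger();

        // Static field initializer: runs once, when the JVM initializes this class.
        static final Helper SHARED = newCounted(STATIC_HELPERS);

        // Per-instance helper, created in setup() once per simulated task attempt.
        private Helper perTask;

        private static Helper newCounted(AtomicInteger counter) {
            counter.incrementAndGet();
            return new Helper();
        }

        void setup() {
            perTask = newCounted(SETUP_HELPERS);
        }

        String map(String record) {
            return perTask.process(record);
        }
    }

    // Simulates `tasks` task attempts in one JVM, each with a fresh mapper instance.
    // Returns {helpers built by the static initializer, helpers built by setup() in this call}.
    static int[] run(int tasks) {
        int setupBefore = SketchMapper.SETUP_HELPERS.get();
        for (int i = 0; i < tasks; i++) {
            SketchMapper mapper = new SketchMapper();
            mapper.setup();
            mapper.map(" record " + i);
        }
        return new int[] {
            SketchMapper.STATIC_HELPERS.get(),
            SketchMapper.SETUP_HELPERS.get() - setupBefore
        };
    }

    public static void main(String[] args) {
        int[] counts = run(3);
        System.out.println("static=" + counts[0] + " setup=" + counts[1]);
    }
}
```

    In this simulation the static helper is built once per JVM, while setup()
    builds one helper per task attempt. If every task gets its own JVM, the
    two approaches end up creating the same number of helpers.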

  • Eyal Golan at Jan 2, 2012 at 10:18 am
    Thank you very much for the help.

    I am going to start working on it soon (a few days) and will probably have
    more questions :)


  • Harsh J at Jan 2, 2012 at 11:20 am
    Hello Anirudh,

    On 02-Jan-2012, at 5:31 AM, Anirudh wrote:

    > Any specific reason why setup is called for every task attempt. For optimization point of view, wouldn't it be good if the setup is called only once in case of JVM reuse.

    Note that the task setup/cleanup procedures are separate from the API hooks 'setup'/'cleanup'. The latter are guaranteed to the developer to be called once per split/partition, for pre-processing and post-processing.

    > I have not yet looked at the implementation, in case of JVM reuse is the application Mapper instance reused or a new instance is created for every task attempt?

    Reusing the instance wouldn't be good isolation-wise. Resetting a lot of other relevant parameters would be much more costly than reinitializing the whole task (but not the JVM). I think https://issues.apache.org/jira/browse/HADOOP-249, which introduced this, should help you.

    > My suggestion for Eyal would be to have a static field initializer expression in the Mapper to create the helper class instance. This will ensure that the helper class will be instantiated when the Mapper class is loaded.

    Yep, this is possible, surely, and is a good advantage of using JVM reuse.
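
    Since the 'setup' hook runs once per task attempt, it is a natural place
    for the reflection-based creation asked about at the start of the thread.
    Below is a minimal plain-Java sketch of that idea. It does not use the
    Hadoop API: the Map stands in for the job configuration, and the key
    "sketch.service.class", the RecordService interface and both
    implementations are hypothetical names invented for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

class ReflectiveSetupDemo {
    // Hypothetical service interface the mapper depends on.
    interface RecordService {
        String enrich(String record);
    }

    // Hypothetical production implementation.
    static class RealService implements RecordService {
        public String enrich(String record) {
            return "real:" + record;
        }
    }

    // Hypothetical stub that a unit test would inject instead.
    static class StubService implements RecordService {
        public String enrich(String record) {
            return "stub:" + record;
        }
    }

    // Stand-in for a user Mapper; NOT the Hadoop Mapper class.
    static class SketchMapper {
        // Hypothetical configuration key, not a real Hadoop property.
        static final String SERVICE_KEY = "sketch.service.class";

        private RecordService service;

        // Called once per task attempt; a plain Map stands in for the job configuration.
        void setup(Map<String, String> conf) {
            String className = conf.getOrDefault(SERVICE_KEY, RealService.class.getName());
            try {
                // Load the class by name and call its no-argument constructor.
                service = Class.forName(className)
                        .asSubclass(RecordService.class)
                        .getDeclaredConstructor()
                        .newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create " + className, e);
            }
        }

        String map(String record) {
            return service.enrich(record);
        }
    }

    // Runs one record through a mapper configured with the given class name (null = default).
    static String mapWith(String serviceClassName, String record) {
        Map<String, String> conf = new HashMap<>();
        if (serviceClassName != null) {
            conf.put(SketchMapper.SERVICE_KEY, serviceClassName);
        }
        SketchMapper mapper = new SketchMapper();
        mapper.setup(conf);
        return mapper.map(record);
    }

    public static void main(String[] args) {
        System.out.println(mapWith(null, "x"));
        System.out.println(mapWith(StubService.class.getName(), "x"));
    }
}
```

    A test only has to put the stub's class name into the configuration; the
    mapper code itself stays unchanged.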
  • Anirudh at Jan 9, 2012 at 11:36 pm
    Harsh,

    Appreciate you responding to the thread. Sorry, I got a bit held up with
    work, hence couldn't reply.
    I am not sure what you mean by the task setup/cleanup procedures.
    In my response I actually meant the latter, i.e. the API hooks
    'setup'/'cleanup'. If these hooks are called per split/partition, why do we
    need that? I cannot think of a case where we would need to do the setup
    again for the next split/partition, as only the input data changes. As it
    is, setup/cleanup cannot depend on the size of the partition/split. With
    the JVM reuse option, multiple splits could be treated as one large split,
    something like the init()/service() API of servlets. Do you think there
    could be any issues with this approach?


    *Quotes: Harsh
    That wouldn't be good isolation-wise. Resetting a lot of other relevant
    parameters would be much more costly than reinitializing the whole task
    (but not the JVM). I think https://issues.apache.org/jira/browse/HADOOP-249,
    which introduced this, should help you*

    This could be an answer to the question I asked earlier. In my opinion,
    reusing the same instance of the Mapper would be good from an optimization
    perspective, unless the MR framework mandates running each split/partition
    in isolation (even during JVM reuse). As I have not yet looked at the code,
    I cannot comment on whether the re-initialization you are talking about
    could be avoided in case of JVM reuse. Currently I am not sure which
    parameters need to be reset. I will surely look at the link that you
    provided and investigate further. Thanks for all your time.

    Anirudh
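
    For reference, the per-split lifecycle under discussion can be sketched in
    plain Java. This is only an illustration under stated assumptions, not
    Hadoop code: CountingMapper is a hypothetical stand-in, and its run method
    only imitates the general setup / map-per-record / cleanup shape. It shows
    one case where per-split hooks matter: a mapper that keeps per-split state
    and reports it in cleanup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PerSplitLifecycleDemo {
    // Stand-in for a user Mapper with per-split state; NOT the Hadoop Mapper class.
    static class CountingMapper {
        private int recordsSeen;           // state that should not leak between splits
        private final List<String> output; // collects what cleanup() reports

        CountingMapper(List<String> output) {
            this.output = output;
        }

        void setup() {
            recordsSeen = 0;
        }

        void map(String record) {
            recordsSeen++;
        }

        void cleanup() {
            output.add("records=" + recordsSeen);
        }

        // Lifecycle for one split: setup once, map per record, cleanup once.
        void run(List<String> split) {
            setup();
            for (String record : split) {
                map(record);
            }
            cleanup();
        }
    }

    // One fresh mapper per split, imitating one task per split inside a single JVM.
    static List<String> runTasks(List<List<String>> splits) {
        List<String> output = new ArrayList<>();
        for (List<String> split : splits) {
            new CountingMapper(output).run(split);
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));
        System.out.println(runTasks(splits));
    }
}
```

    If setup and cleanup ran only once per JVM, the two splits above would be
    reported as a single total rather than as one result per split.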



    On Mon, Jan 2, 2012 at 3:19 AM, Harsh J wrote:

    Hello Anirudh,

    On 02-Jan-2012, at 5:31 AM, Anirudh wrote:

    Any specific reason why setup is called for every task attempt. For
    optimization point of view, wouldnt it be good if the setup is called only
    once in case of JVM reuse.


    Note that the task setup/cleanup procedures are separate from the API
    hooks 'setup'/'cleanup'. The latter is guaranteed to the developer to be
    called per split/partition for pre-processing and post-processing execution.

    I have not yet looked at the implementation, in case of JVM reuse is the
    application Mapper instance reused or a new instance is created for every
    task attempt?


    That wouldn't be good isolation-wise. Resetting a lot of other relevant
    parameters would be much more costly than reinitializing the whole task
    (but not the JVM). I think
    https://issues.apache.org/jira/browse/HADOOP-249, which introduced this,
    should help you.


    My suggestion for Eyal would be to have a static field initializer
    expression in the Mapper to create the helper class instance. This will
    ensure that the helper class will be instantiated when the Mapper class is
    loaded.


    Yep this is possible, surely, and is a good advantage to using JVM reuse.
    On Sun, Jan 1, 2012 at 7:05 AM, Harsh J wrote:

    You are guaranteed one setup call for every single task attempt. This
    is regardless of JVM reuse being on or off. JVM reuse will cause no
    issues with what Eyal is attempting to do.
    On Sun, Jan 1, 2012 at 5:49 PM, Anirudh wrote:
    No problems Eyal.

    On a second thought, for the JVM re-use the Mapper/Reducer instances should
    be re-used, and the setup should be called only once. This makes sense too
    as the JVM reuse is for the same job.
    You should be good with class instantiation even if the JVM reuse is
    enabled.


    On Sat, Dec 31, 2011 at 11:39 PM, Eyal Golan <egolan74@gmail.com>
    wrote:
    Thank you very much for the detailed explanation Anirudh.

    I think that my question about node / VM was due to some lack of
    knowledge
    (I'm just starting to learn the Hadoop environment).
    Regarding configuration of the nodes and clusters.
    This is something that I am not doing by myself. We have a dedicated
    team
    for managing the Hadoop cluster and I'll ask them.

    I think that my question should have been: How many instances of the
    'helper' class will be created in a single VM.
    And, as I understand, consider I am creating the helper in the setup /
    configure method, there would be one.
    And as long as it's stateless, I'm good.

    Thanks again,

    Eyal



    Eyal Golan
    egolan74@gmail.com

    Visit: http://jvdrums.sourceforge.net/
    LinkedIn: http://www.linkedin.com/in/egolan74
    Skype: egolan74

    P Save a tree. Please don't print this e-mail unless it's really
    necessary



    On Sat, Dec 31, 2011 at 1:36 PM, Anirudh <techie.anirudh@gmail.com> wrote:
    I just wanted to confirm where exactly you were planning to put the
    instantiation code, as it was not mentioned in your previous post. The
    location would have made a difference. As you are doing it in the setup of
    the mapper/reducer, you are good.

    I was referring to the Task JVM Reuse option:
    http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Task+JVM+Reuse
    It states that if the option to reuse the JVM is enabled, the same Task JVM
    will execute multiple tasks (i.e. map/reduce). I am not sure how this is
    implemented: whether a new Mapper/Reducer is created for each task, or
    whether they too are re-used.
    If a new instance is created each time, then the mapper/reducer and all
    its references will become eligible for garbage collection and you would
    be good.
    If the Mapper/Reducer instances are re-used, then the setup should be
    called again, creating another instance of your helper class.

    In my opinion the latter does not make sense, and the implementation
    would follow the former approach, i.e. creation of a new
    Mapper/Reducer for each task. But it would be interesting to check.

    As the classes in question are stateless helper classes, you may not be
    affected in terms of functionality.

    I am not clear on one of your statements:

    How many map tasks will be created? One per split or one per VM (node)?
    Are you suggesting that although there would be one Mapper in the node...

    Have you configured your node to have a single slot for map/reduce tasks?
    If yes, then there will be one Mapper/Reducer task on the node. If not, there
    could be more than one mapper/reducer on the node, depending on several other
    parameters, e.g. the number of mapper/reducer slots allocated on the node and
    the number of input splits. If the node is configured to run more than one
    Mapper/Reducer task, the scheduler may choose to run more than one task on
    the same node. The default is 2 map and 2 reduce tasks per node, and for each
    task a new JVM is launched unless the JVM reuse option is enabled.

    Thanks,
    Anirudh


    On Sat, Dec 31, 2011 at 1:28 AM, Eyal Golan <egolan74@gmail.com> wrote:
    My idea is to create that class in the setup / configure method (depending
    on which Mapper / Reducer I inherit from).

    I don't understand the 'reuse' option you are referring to.
    How many map tasks will be created? One per split or one per VM (node)?
    Are you suggesting that although there would be one Mapper in the node,
    each new operator (or reflection call) will create a new instance,
    thus making lots of those instances?

    BTW,
    the helper classes I want to create are of course not going to be
    stateful. They are definitely 'helper' classes that have some logic.

    Thanks,

    Eyal




    On Sat, Dec 31, 2011 at 6:50 AM, Anirudh <techie.anirudh@gmail.com> wrote:
    Where are you creating this new class? If it is in the map function,
    then it will create a new object for each record in the split.

    Also, you may need to see how the JVM reuse option works. I am not too
    sure of this, and you may want to look at the code. If the option for JVM
    reuse is set, then my understanding is that for every task a new Map task
    would be created, and in that case the "new" operator will create another
    instance even if this statement is not in the map function.


    On Fri, Dec 30, 2011 at 6:22 AM, Eyal Golan <egolan74@gmail.com> wrote:
    Great News !!
    Thanks for the info.

    So using reflection, I can inject different implementations of
    interfaces (services) for the mapper (or reducer).
    And this way I can test a mapper (or reducer),
    just by reflecting a stub instead of a real implementation.
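
    A minimal sketch of that idea in plain Java, using only standard
    reflection calls. LookupService, RealLookup, StubLookup and create are
    hypothetical names; in a real job the class name would presumably come
    from the job configuration, which is an assumption and not shown here.

```java
public class ReflectiveInjectionSketch {
    /** The service interface the mapper (or reducer) would depend on. */
    public interface LookupService {
        String lookup(String key);
    }

    /** Stand-in for the real implementation. */
    public static class RealLookup implements LookupService {
        @Override
        public String lookup(String key) { return "real:" + key; }
    }

    /** Stub used in tests instead of the real implementation. */
    public static class StubLookup implements LookupService {
        @Override
        public String lookup(String key) { return "stub:" + key; }
    }

    /** Builds a LookupService from a class name via its no-argument constructor. */
    public static LookupService create(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(LookupService.class)   // fails fast on a wrong type
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // A test would pass the stub's name; production would pass the real one.
        LookupService service = create(StubLookup.class.getName());
        System.out.println(service.lookup("k1"));   // prints "stub:k1"
    }
}
```

    Calling create once from setup / configure keeps it to one instance per
    mapper, and swapping the class name is all a test needs to do.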

    Thanks,






    On Fri, Dec 30, 2011 at 2:50 PM, Harsh J <harsh@cloudera.com> wrote:
    Eyal,

    Yes, it is right to think of each Task attempt as one individual JVM
    running individually on any added Node. Multiple slots would mean
    multiple VMs in parallel as well. Yes, your use of reflection to build
    your objects will work just fine -- it's all user-side Java code that
    is executed.
    On 30-Dec-2011, at 4:42 PM, Eyal Golan wrote:

    Hi,

    I want to understand a basic concept in MR.

    If a mapper creates an instance of some class (using the 'new'
    operator), then the created class exists ONCE in the VM of this node.
    For each node.
    Correct?

    Now,
    what if instead of using the 'new' operator, the class is created
    using reflection?
    Is it valid in MR?
    Will only one instance of the created class exist on that node?

    Thanks,


    Eyal



    --
    Harsh J

Discussion Overview
group: mapreduce-user @
categories: hadoop
posted: Dec 30, '11 at 11:13a
active: Jan 9, '12 at 11:36p
posts: 13
users: 3
website: hadoop.apache.org...
irc: #hadoop

3 users in discussion

Eyal Golan: 5 posts Anirudh: 5 posts Harsh J: 3 posts
