FAQ
Hi,

I need to number all output records consecutively, like 1, 2, 3...

This is no problem with one reducer: make recordId an instance variable in
the Reducer class, and set conf.setNumReduceTasks(1).

However, my processing needs force a multi-reducer architecture, and a
single reducer becomes a bottleneck. Can I have a global variable shared by
all reducers, which would give each one the next consecutive recordId? In
the database world, this would be the unique auto-increment key. How can I
do it in MapReduce?

Thank you


  • Aaron Kimball at Oct 28, 2009 at 4:28 am
    There is no in-MapReduce mechanism for cross-task synchronization. You'll
    need to use something like Zookeeper for this, or another external database.
    Note that this will greatly complicate your life.

    If I were you, I'd try to either redesign my pipeline elsewhere to eliminate
    this need, or maybe get really clever. For example, do your numbers need to
    be sequential, or just unique?

    If the latter, then take the byte offset into the reducer's current output
    file and combine that with the reducer id (e.g.,
    <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're all
    building unique sequences. If the former... rethink your pipeline? :)

    - Aaron
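    Aaron's uniqueness trick can be sketched as follows. This is a minimal,
    hypothetical illustration, not code from the thread: the class and method
    names are invented, and the 4-digit zero-padding width is an assumption
    (it caps the job at 10,000 reducers).

```java
// Sketch of <current-byte-offset><zero-padded-reducer-id>: because the
// reducer id occupies a fixed number of trailing digits, ids from
// different reducers can never collide. Names and padding width are
// illustrative assumptions.
public class OffsetId {

  /** Builds <byte-offset><zero-padded-reducer-id> as one number. */
  public static long uniqueId(long byteOffset, int reducerId) {
    if (reducerId < 0 || reducerId >= 10000) {
      throw new IllegalArgumentException("reducerId must fit in 4 digits");
    }
    // Zero-pad the reducer id to a fixed width of 4 digits, then append.
    return Long.parseLong(byteOffset + String.format("%04d", reducerId));
  }

  public static void main(String[] args) {
    System.out.println(uniqueId(123456L, 7));   // 1234560007
    System.out.println(uniqueId(123456L, 42));  // 1234560042
  }
}
```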
  • Mark Kerzner at Oct 28, 2009 at 4:35 am
    Aaron, although your notes are not a ready solution, they are a great
    help.

    Thank you,
    Mark
  • Michael Klatt at Oct 28, 2009 at 5:41 pm
    I posted an approach to this using streaming, but if the environment variables are available
    through the standard Java interface, it may work for you too.

    http://www.mail-archive.com/core-user@hadoop.apache.org/msg09079.html

    You'll have to be able to tolerate some small gaps in the ids.

    Michael

  • Mark Kerzner at Oct 28, 2009 at 5:49 pm
    Michael,

    environment variables are available in Java, but the environment itself is
    not shared between instances. I read your code - you are solving exactly the
    same problem I am interested in - but I did not see how it works in a
    distributed environment.

    By the way, it occurs to me that JavaSpaces, a different approach to
    distributed computing that was eclipsed by Hadoop, could be used here! Just
    run one instance with GigaSpaces at all times, and you get your
    auto-increment for any number of jobs. It is well suited to concurrent
    processing and very fast.

    Thank you,
    Mark
  • Michael Klatt at Oct 28, 2009 at 5:58 pm
    Hi Mark,

    Each mapper (or reducer) has an environment variable "mapred_map_tasks" (or "mapred_reduce_tasks")
    which describes how many tasks the map or reduce job was split into. It also has a variable
    "mapred_task_id" which contains a unique identifier for the task. Using these two together, each
    task can generate a sequence of numbers that won't collide with other mappers (or reducers).

    For example, if a job has 20 (map or reduce) tasks, then task #1 will generate the sequence
    (assuming the first id is zero):

    0,20,40,60,80,100,120...

    and task #2 will generate

    1,21,41,61,81,101,121...

    and so on. This approach works in both the mapper and the reducer; you just have to look at
    different variables.

    Michael
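
    Michael's interleaving scheme can be sketched in plain Java. This is an
    illustrative sketch, not code from the thread: in a real Hadoop task the
    task number and task count would be parsed from the mapred_task_id and
    mapred_map_tasks / mapred_reduce_tasks variables he mentions; here they
    are constructor parameters. Note that his "task #1" corresponds to the
    0-based task index 0.

```java
// Task t of k tasks emits t, t+k, t+2k, ... - disjoint arithmetic
// progressions, so ids from different tasks never collide. Names are
// illustrative assumptions.
public class InterleavedIds {
  private final int taskNumber;  // 0-based index of this task
  private final int totalTasks;  // number of map (or reduce) tasks
  private long counter = 0;      // records emitted so far by this task

  public InterleavedIds(int taskNumber, int totalTasks) {
    this.taskNumber = taskNumber;
    this.totalTasks = totalTasks;
  }

  /** Next id in this task's progression. */
  public long nextId() {
    return taskNumber + (counter++) * totalTasks;
  }

  public static void main(String[] args) {
    InterleavedIds first = new InterleavedIds(0, 20);   // "task #1" in the email
    InterleavedIds second = new InterleavedIds(1, 20);  // "task #2"
    System.out.println(first.nextId() + "," + first.nextId() + "," + first.nextId());    // 0,20,40
    System.out.println(second.nextId() + "," + second.nextId() + "," + second.nextId()); // 1,21,41
  }
}
```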

  • Mark Kerzner at Oct 28, 2009 at 6:03 pm
    Oh, I see. Very smart.

    In my case, I need consecutive numbers with no gaps, and I need them in the
    order in which Hadoop sorted the maps. So I don't see how I could apply this
    approach, but thank you - it has been a great discussion, helpful for
    considering all the issues around this, and it brought up the idea of
    JavaSpaces, which I am going to check out.

    Mark
  • Brien colwell at Oct 28, 2009 at 6:08 pm
    Another approach is to initialize each map task with an ID (using
    JavaSpaces, something like Zookeeper, or some aspect of the input data)
    and then pack that with a map-local counter into a global ID. This
    assumes the number of map tasks is less than 2^10 and the number of
    records per mapper is less than 2^53. The packed global IDs are
    consecutive per map task. If globally consecutive IDs are needed, a
    second stage can create a histogram of map task ID -> number of records
    and use it to transform the global IDs into globally consecutive ones.
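
    Brien's bit-packing idea could look like the sketch below. It is an
    assumption-laden illustration, not code from the thread: the top 10 bits
    of a long hold the task ID and the low 53 bits hold the task-local
    counter, which is where the 2^10 and 2^53 limits come from (and 10 + 53 =
    63 bits keeps the result a positive long).

```java
// Pack (taskId, localCounter) into one 63-bit positive long:
// [10 bits taskId][53 bits counter]. Names are illustrative assumptions.
public class PackedIds {
  private static final int COUNTER_BITS = 53;

  public static long pack(int taskId, long localCounter) {
    if (taskId < 0 || taskId >= (1 << 10)) {
      throw new IllegalArgumentException("taskId must be < 2^10");
    }
    if (localCounter < 0 || localCounter >= (1L << COUNTER_BITS)) {
      throw new IllegalArgumentException("counter must be < 2^53");
    }
    return ((long) taskId << COUNTER_BITS) | localCounter;
  }

  // Inverse operations, e.g. for the second-stage histogram job.
  public static int taskIdOf(long id) { return (int) (id >>> COUNTER_BITS); }
  public static long counterOf(long id) { return id & ((1L << COUNTER_BITS) - 1); }

  public static void main(String[] args) {
    long id = pack(3, 42L);
    System.out.println(taskIdOf(id) + " " + counterOf(id)); // 3 42
  }
}
```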



  • Mark Kerzner at Oct 28, 2009 at 6:21 pm
    Brien,


    - I am on EC2; what would be the advantage of using Zookeeper over
    JavaSpaces? Either would have to be maintained by me, as neither is
    provided on EC2 directly.
    - "pack that with a map-local counter into a global ID" - do you mean
    just take the global counter and make the local instance counter equal
    to it?
    - 2^53 is quite sufficient for my purposes, but where does that number
    come from?
    - Looking at your last point, I see what I previously missed: I need
    numbers consecutive within each reducer, and then consecutive between
    reducers. I assume that the reducers are sorted. For example, if my
    records are sorted 1, 2, ..., 6, then one reducer would get maps 1, 2, 3
    and the other maps 4, 5, 6. If that's the case, I need to know how the
    reducers are sorted. Then I could simply run the second stage.

    Thank you,
    Mark

  • Edward Capriolo at Oct 28, 2009 at 7:23 pm

    My two cents here.

    It seems like what you want is an atomic auto-id, in order, with no
    gaps. In particular, the "no gaps" and "in order" requirements really
    constrain you, because they rule out using the map/reduce task IDs,
    hostname, etc. to generate distinct keys.

    If you have the "in order" and "no gaps" requirements, the best
    approach may just be to use one mapper/reducer. Take a step back and
    look at the big picture: no matter how much software you add, you are
    asking for a single, serialized, locking "IDFACTORY". Building a
    multi-node Zookeeper-locking "IDFACTORY" is probably going to be more
    work, and slower, than just doing it in a single-threaded mapper or
    reducer.
  • Mark Kerzner at Oct 28, 2009 at 7:30 pm
    I agree, Edward. One reducer is what I have now, and it works. It is, or
    may become, a bottleneck. When it does, I can switch to multiple
    reducers and add a second MR job to renumber. That way, I stick with the
    current setup until optimization becomes necessary, without introducing
    any changes or new software - for, as you say, that might become a pain.
    Follow the rule (which I got from Scott Meyers' Effective C++): avoid
    premature optimization.

    mark
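
    The second-stage renumbering mentioned above can be sketched as a prefix
    sum. This is an illustrative sketch with invented names: once each
    reduce task reports how many records it wrote (e.g. via Hadoop counters
    or Brien's histogram job), a running sum over those counts, taken in
    task order, gives each task the starting offset that turns its local
    0, 1, 2, ... numbering into globally consecutive IDs.

```java
// Given per-task record counts in task order, compute each task's starting
// global id: task t starts where task t-1 ended. Names are illustrative.
public class RenumberOffsets {

  /** offsets[t] = global id of the first record written by task t. */
  public static long[] startingOffsets(long[] recordsPerTask) {
    long[] offsets = new long[recordsPerTask.length];
    long total = 0;
    for (int t = 0; t < recordsPerTask.length; t++) {
      offsets[t] = total;            // start after all earlier tasks' records
      total += recordsPerTask[t];
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Tasks 0, 1, 2 wrote 3, 5, and 2 records: their starts are 0, 3, 8.
    long[] starts = startingOffsets(new long[]{3, 5, 2});
    System.out.println(starts[0] + " " + starts[1] + " " + starts[2]); // 0 3 8
  }
}
```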

Discussion Overview
group: common-user
categories: hadoop
posted: Oct 28, '09 at 3:55a
active: Oct 28, '09 at 7:30p
posts: 11
users: 5
website: hadoop.apache.org...
irc: #hadoop
