Hi all,

I have ensured that my mapper produces a unique key for every value it writes, and furthermore that each map() call writes only one value. Note that the value is a custom class for which I implement the Writable interface methods myself.

I realize it isn't very realistic to have (well, want) no combining done prior to reducing, but I'm still getting my feet wet.

When the reducer runs, I expected to see one reduce() call for every map() call, and I do. However, the value I get is a composite of the values from all the reduce() calls that came before it.

So, for example, the mapper gets data like this:

ID, Name, Type, Other stuff...
A000, Cream, Group, ...
B231, Led Zeppelin, Group, ...
A044, Liberace, Individual, ...


ID is the external key from the source data and is guaranteed to be unique.

When I map it, I create a container holding only that row's data and emit that container with the ID field as the key.

Since the key is always unique I expected the sort/shuffle step to never coalesce any two values. So I expected my reduce() method to be called once per mapped input row, and it is.

The problem is, as each row is processed, the reducer sees a set of cumulative value data instead of a container with a row of data in it. So the 'value' parameter to reduce always has the information from previous reduce steps.

For example, given the data above:

1st Reducer Call :
Key = A000
Value =
Container :
(object 1) : Name = Cream, Type = Group, MBID = A000, ...

2nd Reducer Call :
Key = B231
Value =
Container :
(object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ...
(object 2) : Name = Cream, Type = Group, MBID = A000, ...

So the second reduce call contains data from the first reduce call. Very strange! At a guess, I'd say the reducer is re-using the object when it reads the objects back from the mapping step. I dunno...

If anyone has any ideas, I'm open to suggestions. This is on Hadoop 0.20.2-cdh3u0.

Thanks!

R


  • Sudharsan Sampath at Sep 5, 2011 at 4:38 am
    Hi,

    I suspect it's something to do with your custom Writable. Do you have a
    clear() method on your container? If so, it should be called each time
    before the object is populated in readFields(), to avoid retaining
    previous values due to object reuse during the ser-de process.

    Thanks
    Sudhan S
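    [Editor's note] The advice above can be sketched as follows. This is a minimal, self-contained illustration of Hadoop's object-reuse pitfall: the framework hands the same value instance to readFields() for record after record, so any internal collection must be cleared first. It uses plain java.io streams instead of org.apache.hadoop.io.Writable so it runs without Hadoop on the classpath; the class and field names are hypothetical, not Rick's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a Hadoop custom value type. A real job would
// implement org.apache.hadoop.io.Writable with these same two methods.
class ContainerReuseDemo {

    static class Container {
        final List<String> names = new ArrayList<>();

        void write(DataOutput out) throws IOException {
            out.writeInt(names.size());
            for (String n : names) out.writeUTF(n);
        }

        // THE FIX: clear accumulated state before reading new fields.
        // Hadoop reuses one instance across records, so without the
        // clear() each deserialized record piles onto the previous one.
        void readFields(DataInput in) throws IOException {
            names.clear();                       // <-- omit this and values accumulate
            int n = in.readInt();
            for (int i = 0; i < n; i++) names.add(in.readUTF());
        }
    }

    // Serialize two single-name records, then deserialize BOTH into the
    // SAME Container instance, mimicking the framework's object reuse.
    static List<String> deserializeSecondRecord() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);

            Container c = new Container();
            c.names.add("Cream");
            c.write(out);                        // record 1
            c.names.clear();
            c.names.add("Led Zeppelin");
            c.write(out);                        // record 2

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()));
            Container reused = new Container();  // one instance, reused
            reused.readFields(in);               // reads record 1
            reused.readFields(in);               // reads record 2
            return reused.names;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // With the clear() in readFields(), prints [Led Zeppelin];
        // without it, it would print [Led Zeppelin, Cream].
        System.out.println(deserializeSecondRecord());
    }
}
```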



  • Rick Ross at Sep 5, 2011 at 5:15 am
    Thanks, but unless I misread you, that didn't do it. The object that I'm creating just has a couple of ArrayLists to gather up Name and Type objects.

    I suspect I need to extend ArrayWritable instead. I'll try that next.

    Cheers.

    R

  • Rick Ross at Sep 6, 2011 at 5:11 am
    I'm still poking around on this and I was wondering if there is a way to see the intermediate files that the mapper writes and the ones that the reducer reads. I might get some clues in there.

    Thanks

    R
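    [Editor's note] For inspecting the intermediate files: in 0.20.x the map-side output lands under mapred.local.dir on each tasktracker's local disk, and the framework normally deletes those files when the tasks complete. A hedged sketch, using the old mapred API's JobConf methods for preserving task files (MyJob is a placeholder class name, and this is not verified against cdh3u0 specifically):

```java
// Ask the framework to keep task-local files so they can be examined
// under mapred.local.dir after the tasks finish.
JobConf conf = new JobConf(MyJob.class);   // MyJob is hypothetical
conf.setKeepFailedTaskFiles(true);         // preserve files for failed tasks
conf.setKeepTaskFilesPattern(".*");        // preserve files for every task whose id matches
```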


  • Sudharsan Sampath at Sep 6, 2011 at 5:25 am
    Hi Rick,

    If possible, can you share the custom Writable that's configured as the
    value type for the reducer?

    Thanks
    Sudhan S

  • Sonal Goyal at Sep 6, 2011 at 5:26 am
    Could you share your mapper code and the container code? When your mapper
    emits the keys and values, do you print them out to verify they are
    correct, i.e., that the container only holds data specific to that ID?

    Best Regards,
    Sonal
    Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
    Nube Technologies <http://www.nubetech.co>

    <http://in.linkedin.com/in/sonalgoyal>





Discussion Overview
Group: mapreduce-user
Categories: hadoop
Posted: Sep 5, '11 at 12:42a
Active: Sep 6, '11 at 5:26a
Posts: 6
Users: 3
Website: hadoop.apache.org...
IRC: #hadoop
