Hello,
I am writing a little Hadoop program to index a (large) bunch of
text files joined together in a large XML file. The mapper executes some
basic text preprocessing and emits key-value pairs like:

(term,document_id) -> (section_of_the_document,positional frequency vector)

example

(apple,12) -> (title,[1,3])
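
In code, the map-side emission might look like this (a sketch only;
PairOfStringInt's constructor and the value encoding are assumptions):

    // Sketch of the map-side emission: one composite (term, docId) key per
    // term occurrence, with the section name and the positional frequency
    // vector (here a pre-built String) packed into a Text value.
    context.write(new PairOfStringInt(term, documentId),
                  new Text(section + ",[" + positionsVector + "]"));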

The reducer should bring together all entries for the same term and
create a posting list like:

apple -> (12,title,[1,3]) , (14,body,[2,5]) ...

... -> ...

To accomplish this I have created a custom class PairOfStringInt, which
implements WritableComparable, to hold the mapper's key; a custom
partitioner TermPartioner (https://gist.github.com/809793); and a Reducer
which should bring all values for the same key[1] into the same posting
list, as in the example.
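
A minimal sketch of the intended reduce side, assuming the grouping works
as described (class names and the value formatting are assumptions, apart
from PairOfStringInt itself):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: if keys are grouped by term only, a single
    // reduce() call sees every posting for one term. Hadoop reuses the key
    // object while iterating over the values, so the document id of the
    // current posting can be read from the key inside the loop.
    public class PostingListReducer
            extends Reducer<PairOfStringInt, Text, Text, Text> {

        @Override
        protected void reduce(PairOfStringInt key, Iterable<Text> values,
                Context context) throws IOException, InterruptedException {
            String term = key.getLeftElement();
            StringBuilder postings = new StringBuilder();
            for (Text sectionAndPositions : values) {
                if (postings.length() > 0) {
                    postings.append(" , ");
                }
                postings.append("(").append(key.getRightElement()).append(",")
                        .append(sectionAndPositions).append(")");
            }
            context.write(new Text(term), new Text(postings.toString()));
        }
    }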

Testing my system on a tiny dataset made up of two documents (with the
same content) I get:

minni [(1,body,[1,2])]
pippo [(1,body,[2,0,3])]
pluto [(1,body,[1,1])]
minni [(2,body,[1,2])]
pippo [(2,body,[1,0])]
pluto [(2,body,[1,1])]

The values for the same key are not brought together... Looking at the
secondary sort example, I also tried to implement a GroupingComparator
(https://gist.github.com/809803), set on the job using
job.setGroupingComparatorClass(GroupingComparator.class), but if I do so
I get this output:

minni
[(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]


A single key (the first one) with all the postings associated with
it... what am I missing?
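
For reference, a term-only grouping comparator of this kind typically
looks like the sketch below (not the gist's exact code): two keys compare
equal whenever their left elements (terms) match, regardless of document
id.

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Sketch of a grouping comparator that compares only the term, so that
    // all (term, docId) keys with the same term reach one reduce() call.
    public class GroupingComparator extends WritableComparator {

        protected GroupingComparator() {
            super(PairOfStringInt.class, true); // createInstances = true
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            PairOfStringInt p1 = (PairOfStringInt) a;
            PairOfStringInt p2 = (PairOfStringInt) b;
            return p1.getLeftElement().compareTo(p2.getLeftElement());
        }
    }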

Thanks for your time

Marco

[1] by "same key" I mean those who have the same left element


  • Harsh J at Feb 3, 2011 at 6:03 pm
    For a ValueGrouping comparator to work, your Partitioner must act in
    tandem with it. I do not know if you have implemented a custom
    hashCode() method for your Key class, but your partitioner should look
    like:

    return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;

    This will ensure that the to-be grouped data is actually partitioned
    properly too.
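
    Wrapped in a complete class, that might look like the following sketch
    (the value type is an assumption; the point is that only the term
    decides the partition):

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        // Partitions on the left element (term) alone, so every (term, docId)
        // key for a given term is routed to the same reduce task.
        public class TermPartitioner extends Partitioner<PairOfStringInt, Text> {

            @Override
            public int getPartition(PairOfStringInt key, Text value,
                    int numPartitions) {
                return (key.getLeftElement().hashCode() & Integer.MAX_VALUE)
                        % numPartitions;
            }
        }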

    The actual sorting (which ought to occur for the full composite key
    field-by-field, and is the only real 'sorter') would be handled by the
    compare() call of your Writable, if you are using a
    WritableComparable.
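
    For a PairOfStringInt, that full-key ordering might be implemented like
    this (a sketch; the field names are assumed):

        // Sketch of the composite-key ordering: term first, then document id.
        // This keeps equal terms adjacent after the shuffle, so a term-only
        // grouping comparator can then merge them into one reduce() call.
        @Override
        public int compareTo(PairOfStringInt other) {
            int cmp = this.leftElement.compareTo(other.leftElement);
            if (cmp != 0) {
                return cmp;
            }
            if (this.rightElement == other.rightElement) {
                return 0;
            }
            return (this.rightElement < other.rightElement) ? -1 : 1;
        }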
    --
    Harsh J
    www.harshj.com
  • Marco Didonna at Feb 4, 2011 at 9:12 am

    On 02/03/2011 07:02 PM, Harsh J wrote:
    > For a ValueGrouping comparator to work, your Partitioner must act in
    > tandem with it. I do not know if you have implemented a custom
    > hashCode() method for your Key class, but your partitioner should look
    > like:

    Yes I did, and it worked like this: "return leftElement.hashCode() +
    rightElement;"

    > return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;

    That was definitely a bug; the result is still the same, though :(

    > This will ensure that the to-be grouped data is actually partitioned
    > properly too.
    >
    > The actual sorting (which ought to occur for the full composite key
    > field-by-field, and is the only real 'sorter') would be handled by the
    > compare() call of your Writable, if you are using a
    > WritableComparable.

    I am using a WritableComparable... here's PairOfStringInt:
    https://gist.github.com/810905

    Thanks again
  • Marco Didonna at Feb 5, 2011 at 9:16 am

    I finally made it work: https://gist.github.com/809803. I use the
    GroupingComparator as the sort comparator, via
    job.setSortComparatorClass(GroupingComparator.class).

    I still do not understand what was wrong with the old version of the
    GroupingComparator, or at what point the keys are ordered according to
    the policy encoded in it.
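
    For reference, the two comparators act at different stages: the sort
    comparator orders the full composite keys during the shuffle's sort and
    merge, while the grouping comparator is consulted only on the reduce
    side, to decide whether two consecutive, already sorted keys belong to
    the same reduce() call. Using the term-only comparator for sorting also
    keeps equal terms adjacent, which is presumably why this version works,
    though the postings within a group are then no longer guaranteed to
    arrive in docId order. The usual secondary-sort wiring is sketched below
    (class names other than GroupingComparator are assumptions):

        // Typical secondary-sort wiring (sketch):
        job.setPartitionerClass(TermPartitioner.class);            // route by term only
        job.setSortComparatorClass(FullKeyComparator.class);       // order by (term, docId)
        job.setGroupingComparatorClass(GroupingComparator.class);  // group by term only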

    MD
