I am writing a small Hadoop program to index a bunch (a large bunch) of
text files joined together in one large XML file. The mapper performs some
basic text preprocessing and emits key-value pairs like:
(term,document_id) -> (section_of_the_document,positional frequency vector)
(apple,12) -> (title,[1,3])
The reducer should bring the same terms together and create a posting list:
apple -> (12,title,[1,3]) , (14,body,[2,5]) ...
... -> ...
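Ignoring Hadoop's types, the reduce step I'm after is just a join over the grouped values; a minimal plain-Java sketch (class and method names here are mine, not the actual code):

```java
import java.util.List;

// Plain-Java sketch of the intended reduce step: all values that share a
// term are concatenated into one posting list (Hadoop types omitted).
class PostingListBuilder {
    // Each value is assumed to arrive already formatted as
    // "(docId,section,[positions])".
    static String reduce(String term, List<String> values) {
        return term + " -> " + String.join(" , ", values);
    }
}
```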
To accomplish this I have created a custom class PairOfStringInt to hold
the mapper's key, which implements WritableComparable, a custom
partitioner TermPartioner (https://gist.github.com/809793), and a Reducer
which should gather all values for the same key into one posting list,
as in the example.
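For context, the ordering and partitioning I intend can be sketched without Hadoop: the composite key sorts on (term, docId), while the partitioner hashes the term alone, so every key for a given term reaches the same reducer. Names below are illustrative only; the real classes implement Hadoop's WritableComparable and Partitioner interfaces.

```java
// Plain-Java sketch of the composite key and term-only partitioning
// (illustrative names, not the actual PairOfStringInt / TermPartioner code).
class TermDocPair implements Comparable<TermDocPair> {
    final String term;
    final int docId;

    TermDocPair(String term, int docId) {
        this.term = term;
        this.docId = docId;
    }

    // Full sort order: term first, then document id.
    @Override
    public int compareTo(TermDocPair other) {
        int cmp = term.compareTo(other.term);
        return cmp != 0 ? cmp : Integer.compare(docId, other.docId);
    }

    // Partition on the term only, mirroring HashPartitioner's formula,
    // so every (term, *) key lands on the same reducer.
    int partition(int numReducers) {
        return (term.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```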
Testing my system on a tiny dataset made up of two documents (with the
same content), I find that values with the same key are not brought
together in the output. Following the secondary-sort example, I also
tried implementing a GroupComparator (https://gist.github.com/809803)
and setting it on the job with
job.setGroupingComparatorClass(GroupingComparator.class), but if I do so
the output contains a single key (the first one) with all postings
attached to it... what am I missing?
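For reference, here is what I expect the grouping comparator to achieve, sketched in plain Java (the real class would extend Hadoop's WritableComparator; keys are modeled as String arrays for simplicity):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of what the grouping comparator should do: after the shuffle
// sorts composite keys fully, consecutive keys that compare equal on the
// term alone are fed to the same reduce() call.
class GroupingSketch {
    // A key is modeled as {term, docId-as-string}; compare the term only.
    static final Comparator<String[]> GROUP_BY_TERM = Comparator.comparing(k -> k[0]);

    // Split a sorted key list into reduce groups, the way Hadoop does.
    static List<List<String[]>> groups(List<String[]> sortedKeys) {
        List<List<String[]>> out = new ArrayList<>();
        for (String[] key : sortedKeys) {
            if (out.isEmpty()
                    || GROUP_BY_TERM.compare(out.get(out.size() - 1).get(0), key) != 0) {
                out.add(new ArrayList<>());
            }
            out.get(out.size() - 1).add(key);
        }
        return out;
    }
}
```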
Thanks for your time
Edit: by "same key" I mean keys that have the same left element (the term).