Hello,
I am writing a little Hadoop program to index a bunch (a large bunch) of
text files joined together in one big XML file. The mapper performs some
basic text preprocessing and emits key-value pairs of the form:
(term,document_id) -> (section_of_the_document,positional frequency vector)
For example:
(apple,12) -> (title,[1,3])
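In code, each map call ends up doing something like this (just a sketch,
with the value shown as a plain Text for brevity, while the real job uses
a custom Writable; PairOfStringInt is my composite key class, described
below):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IndexMapper
            extends Mapper<LongWritable, Text, PairOfStringInt, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // ... basic text preprocessing of the XML record elided ...
            // e.g. "apple" in document 12, positions 1 and 3 of the title:
            context.write(new PairOfStringInt("apple", 12),
                          new Text("title,[1,3]"));
        }
    }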
The reducer should then bring all entries for the same term together and
build a posting list like:
apple -> (12,title,[1,3]) , (14,body,[2,5]) ...
... -> ...
To accomplish this I have created a custom class, PairOfStringInt, which
implements WritableComparable and holds the mapper's key; a custom
partitioner, TermPartioner (https://gist.github.com/809793); and a
Reducer that should bring all values for the same key[1] into the same
posting list, as in the example.
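Roughly, the key class looks like this (a simplified sketch, the gist has
the complete code):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Composite key (term, document_id): sorts by term, then by doc id.
    public class PairOfStringInt
            implements WritableComparable<PairOfStringInt> {
        private String left;  // the term
        private int right;    // the document id

        public PairOfStringInt() {}  // no-arg constructor needed by Hadoop

        public PairOfStringInt(String left, int right) {
            this.left = left;
            this.right = right;
        }

        public String getLeft() { return left; }
        public int getRight()   { return right; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(left);
            out.writeInt(right);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            left = in.readUTF();
            right = in.readInt();
        }

        @Override
        public int compareTo(PairOfStringInt other) {
            int cmp = left.compareTo(other.left);
            return cmp != 0 ? cmp : Integer.compare(right, other.right);
        }

        @Override
        public int hashCode() { return left.hashCode() * 31 + right; }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof PairOfStringInt)) return false;
            PairOfStringInt p = (PairOfStringInt) o;
            return left.equals(p.left) && right == p.right;
        }
    }

The partitioner routes every key with the same term to the same reducer,
regardless of the document id (again a sketch, with the value type shown
as Text):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Partition on the term (left element) only.
    public class TermPartioner extends Partitioner<PairOfStringInt, Text> {
        @Override
        public int getPartition(PairOfStringInt key, Text value,
                                int numPartitions) {
            return (key.getLeft().hashCode() & Integer.MAX_VALUE)
                    % numPartitions;
        }
    }

And the reducer just concatenates all values it receives for a key into
one posting list, along these lines:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Build "term -> (docId,section,[positions]) , ..." lines.
    public class PostingListReducer
            extends Reducer<PairOfStringInt, Text, Text, Text> {
        @Override
        protected void reduce(PairOfStringInt key, Iterable<Text> values,
                              Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text v : values) {
                if (postings.length() > 0) postings.append(" , ");
                // the doc id comes from the current key's right element
                postings.append("(").append(key.getRight()).append(",")
                        .append(v).append(")");
            }
            context.write(new Text(key.getLeft()),
                          new Text(postings.toString()));
        }
    }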
Testing my system on a tiny dataset made up of two documents (same
content), I get:
minni [(1,body,[1,2])]
pippo [(1,body,[2,0,3])]
pluto [(1,body,[1,1])]
minni [(2,body,[1,2])]
pippo [(2,body,[1,0])]
pluto [(2,body,[1,1])]
The values for the same key are not brought together... Looking at the
secondary sort example, I also tried to implement a GroupingComparator
(https://gist.github.com/809803), set on the job with
job.setGroupingComparatorClass(GroupingComparator.class), but if I do so
I get this output:
minni
[(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]
A single key (the first one) with all the postings associated with
it... What am I missing?
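For reference, the grouping comparator follows the usual secondary-sort
pattern and compares only the term (left element), roughly:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Group keys by term only, so one reduce() call should see all the
    // values for a given term, whatever the document id.
    public class GroupingComparator extends WritableComparator {
        protected GroupingComparator() {
            super(PairOfStringInt.class, true); // true => instantiate keys
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            PairOfStringInt p1 = (PairOfStringInt) a;
            PairOfStringInt p2 = (PairOfStringInt) b;
            return p1.getLeft().compareTo(p2.getLeft());
        }
    }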
Thanks for your time
Marco
[1] By "same key" I mean keys that have the same left element (the term).