Hi all,

I have a problem where I need to compare each input record to one or
more large files (the large files are loaded into memory). Which file
the records will need to be compared against varies depending on the
input record.

Currently I have it working, but I doubt it is the optimal way. What
I do is run a job before the main job, in which each record is
emitted one or more times, keyed by the file it will later be
compared against. I then use the reduce phase to sort the records by
file, so that the main job is not constantly swapping these large
files in and out of memory.
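To illustrate the idea (outside Hadoop, in plain Java): tagging each record with its target file and sorting on that tag means the main pass only has to load each large file once, at the points where the key changes. The record values and file names here are made up for the demonstration.

```java
import java.util.*;

public class TagAndSortDemo {
    public static void main(String[] args) {
        // Pre-job "map" phase: emit (fileId, record) pairs.
        // A record may appear more than once, under different file ids.
        List<String[]> tagged = new ArrayList<>();
        tagged.add(new String[]{"fileB", "rec1"});
        tagged.add(new String[]{"fileA", "rec2"});
        tagged.add(new String[]{"fileB", "rec3"});
        tagged.add(new String[]{"fileA", "rec4"});

        // Shuffle/sort phase: sorting by file id groups all records
        // destined for the same large file together.
        tagged.sort(Comparator.comparing(p -> p[0]));

        // Main job: load a large file only when the key changes.
        String loaded = null;
        int loads = 0;
        for (String[] pair : tagged) {
            if (!pair[0].equals(loaded)) {
                loaded = pair[0];  // load the large file into memory here
                loads++;
            }
            // ... compare pair[1] against the loaded file ...
        }
        // 4 records, 2 distinct files -> only 2 loads instead of up to 4.
        System.out.println("file loads: " + loads);
    }
}
```

In a real Hadoop job the sort comes for free from the shuffle when the file id is (part of) the map output key, which is what the pre-job above relies on.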

This method works well; however, the first job writes a lot of data
to HDFS and takes a long time to run. Given it's a relatively simple
task, I was wondering if there is a better way to do it? Comments
appreciated.

Kind Regards,

Shane

Discussion Overview
group: common-user
posted: Feb 22, '09 at 10:53p
users: Shane Butler (1 post)
