I have a job where my map step transforms files and throws out
malformed records, etc. Another step in this job is to perform
lookups based on certain fields in the records -- think parent
records from an RDBMS:
parent record:  objectid
child record:   grandparentid --- parentid --- objectid
For small lookup tables, I can obviously add hashes and other support
code to my map job to perform the lookups in memory. When there are
millions of possible lookup keys, though, what are the best practices
for doing lookups at that scale?
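To make the in-memory approach concrete, here is a minimal sketch of the map-side lookup idea in plain Java. In a real Hadoop job the table would be loaded once per task in the Mapper's configure()/setup() from a file shipped to each node via the DistributedCache; the class, field names, and record layout below are illustrative assumptions, not from any actual job.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a map-side lookup: the table is loaded once per task,
// then consulted for every record processed by map().
public class MapSideLookup {
    // parentId -> parent attributes; in Hadoop, populated in setup()
    // from a DistributedCache file on the local filesystem.
    private static final Map<String, String> parents = new HashMap<>();
    static {
        parents.put("p1", "ParentOne");   // stand-in data for the sketch
        parents.put("p2", "ParentTwo");
    }

    // Simulates the per-record work inside map(): split out the parent
    // key, look it up, and emit the enriched record -- or drop it as
    // malformed / unmatched.
    public static String enrich(String record) {
        String[] fields = record.split("\\|");
        if (fields.length < 2) return null;           // malformed record
        String parentAttrs = parents.get(fields[0]);  // the lookup itself
        if (parentAttrs == null) return null;         // no matching parent
        return record + "|" + parentAttrs;
    }
}
```

This works nicely while the whole table fits in a mapper's heap; the question below is what to do once it no longer does.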
Should these be data join jobs? Can Berkeley DBs or other
self-contained databases be distributed to each of the nodes?
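For reference, the "data join" alternative usually means a reduce-side join: both datasets are mapped to (joinKey, taggedRecord) pairs, the shuffle groups them by key, and the reducer pairs each parent's attributes with its child records. Below is a minimal plain-Java sketch of that reducer logic under assumed "P:"/"C:" tags and a pipe-delimited layout; a real implementation would live in a Reducer class and use Writable types.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reduce-side join: simulates one reduce() call, i.e.
// all tagged values that arrived for a single join key.
public class ReduceSideJoin {
    public static List<String> reduce(String key, List<String> tagged) {
        String parent = null;
        List<String> children = new ArrayList<>();
        for (String v : tagged) {
            if (v.startsWith("P:")) parent = v.substring(2);      // parent side
            else if (v.startsWith("C:")) children.add(v.substring(2)); // child side
        }
        List<String> out = new ArrayList<>();
        if (parent == null) return out;   // no parent row: drop orphans
        for (String c : children) out.add(key + "|" + c + "|" + parent);
        return out;
    }
}
```

The trade-off versus the map-side approach: no table has to fit in memory, at the cost of shuffling the full lookup dataset alongside the records on every run.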
What have your experiences been with these kinds of lookups?
(P.S. Big thanks to all the Hadoop folks -- I am having a blast
using the toolkit.)