Hi all,

I have a job where my map step transforms files and throws out
malformed records, etc. Another step in this job performs lookups
based on certain fields in the records. Think parent records from an
RDBMS.

Example:
objectid --- view details --- source
becomes
grandparentid --- parentid --- objectid --- view details --- source

For small lookups, I can obviously add hashes and other support code
to my map job to perform the lookups. When I have millions of
possible lookups, though, what are the best practices for doing
lookups of that size?
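
To make the small-lookup case concrete, here is a minimal sketch in Python (as one might use with a streaming-style mapper). The table contents, the `enrich` and `map_records` names, and the ` --- ` separator are assumptions taken from the example above, not part of any Hadoop API.

```python
# Hypothetical lookup tables. In a real job these would be loaded once
# per map task, not hard-coded.
PARENT_OF = {"obj1": "par1", "obj2": "par2"}
GRANDPARENT_OF = {"par1": "gp1", "par2": "gp1"}

SEP = " --- "  # field separator assumed from the example record layout


def enrich(line):
    """Return the enriched record, or None if the record is malformed
    or one of the lookups has no match."""
    fields = line.strip().split(SEP)
    if len(fields) != 3:
        return None  # malformed record: throw it out
    object_id, details, source = fields
    parent_id = PARENT_OF.get(object_id)
    if parent_id is None:
        return None
    grandparent_id = GRANDPARENT_OF.get(parent_id)
    if grandparent_id is None:
        return None
    return SEP.join([grandparent_id, parent_id, object_id, details, source])


def map_records(lines):
    """Map step: yield only the records that could be enriched."""
    for line in lines:
        out = enrich(line)
        if out is not None:
            yield out
```

This only works while both tables fit comfortably in the memory of every map task, which is exactly what breaks down with millions of keys.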

Should they be data join jobs? Can Berkeley DBs or other
self-contained DBs be distributed to each of the nodes?
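
For reference, the "data join job" idea can be sketched as a reduce-side join: both inputs are mapped to the same key, tagged by origin, and joined in the reducer. The Python below only simulates the map, shuffle, and reduce phases in memory; the function names, the `"P"`/`"V"` tags, and the tuple layouts are hypothetical and not taken from Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def map_parent(row):
    # Lookup-table row (object_id, parent_id), tagged "P".
    object_id, parent_id = row
    return (object_id, ("P", parent_id))


def map_view(row):
    # Data row (object_id, details, source), tagged "V".
    object_id, details, source = row
    return (object_id, ("V", (details, source)))


def reduce_join(object_id, tagged_values):
    # All values for one key arrive together; pair parents with views.
    parents = [v for tag, v in tagged_values if tag == "P"]
    views = [v for tag, v in tagged_values if tag == "V"]
    for parent_id in parents:
        for details, source in views:
            yield (parent_id, object_id, details, source)


def run_join(parent_rows, view_rows):
    pairs = [map_parent(r) for r in parent_rows]
    pairs += [map_view(r) for r in view_rows]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reduce_join(key, [v for _, v in group]))
    return out
```

Adding the grandparent id this way would take a second pass keyed on the parent id. Records with no matching parent are dropped here, which is an inner-join choice made for the sketch.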

What have your experiences been with these kinds of lookups?

-jason

(PS: big thanks to all the Hadoop folks. I am having a blast using
the toolkit.)


  • Arun C Murthy at Aug 20, 2007 at 4:24 am
    Jason,
    On Sat, Aug 18, 2007 at 10:53:39AM -0500, jason gessner wrote:

    > For small lookups, I can obviously add hashes and other support code
    > to my map job to perform the lookups. When I have millions of
    > possible lookups, though, what are the best practices for doing
    > lookups of that size?
    >
    > Should they be data join jobs? Can Berkeley DBs or other
    > self-contained DBs be distributed to each of the nodes?
    >
    > What have your experiences been with these kinds of lookups?
    There is a distributed file cache you can use to ship read-only lookup files to each node: http://lucene.apache.org/hadoop/api/org/apache/hadoop/filecache/DistributedCache.html

    Other than that, you could use an external database (MySQL/Oracle/BDB) or web service calls; the right choice is application specific. I do know folks doing lookups via web service calls.

    Arun
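
As a rough illustration of the "self-contained DB on each node" option, the sketch below builds a lookup file once and then queries it from local disk. It uses SQLite from the Python standard library as a stand-in for Berkeley DB. The table layout, the `build_lookup_db` and `LocalLookup` names, and the assumption that the file has already been copied to the node (for example by the distributed cache) are all hypothetical.

```python
import sqlite3


def build_lookup_db(path, rows):
    """Build the lookup file once, before the job runs.
    rows: iterable of (object_id, parent_id, grandparent_id)."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE parents ("
            "object_id TEXT PRIMARY KEY, parent_id TEXT, grandparent_id TEXT)"
        )
        conn.executemany("INSERT INTO parents VALUES (?, ?, ?)", rows)
    conn.close()


class LocalLookup:
    """Opened once per map task against the node-local copy of the file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)

    def ancestors(self, object_id):
        """Return (grandparent_id, parent_id), or None if the id is unknown."""
        return self.conn.execute(
            "SELECT grandparent_id, parent_id FROM parents WHERE object_id = ?",
            (object_id,),
        ).fetchone()

    def close(self):
        self.conn.close()
```

The primary key gives an indexed lookup on disk, so the table does not need to fit in each task's memory, at the cost of a disk read per record.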

Discussion Overview
group: common-user @
categories: hadoop
posted: Aug 18, '07 at 3:54p
active: Aug 20, '07 at 4:24a
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion

Jason gessner: 1 post Arun C Murthy: 1 post
