Hi Alan,

Thanks for your message.

The object is read-only once it is initialized - I do not need to modify
it. Essentially it is an object that lets me analyze/modify the data I am
mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
that if I run multiple mappers, this object gets replicated in the different
VMs and I run out of memory on the node. I pretty much need to have the full
object in memory to do my processing. It is possible (though quite
difficult) to keep it partially on disk and query it (like a Lucene store
implementation), but there is a significant performance hit. For example,
suppose I use the xlarge CPU instance at Amazon (8 CPUs, 8GB RAM). In this
scenario I can really only run 1 mapper per node even though there are 8
CPUs. But if the overhead of sharing the object (e.g. RMI) or persisting the
object (e.g. Lucene) makes access more than 8x slower than in-memory access,
then it is cheaper to run 1 mapper/node anyway. I tried sharing with
Terracotta and saw a roughly 600x slowdown versus in-memory access.

So ideally, if I could have all the mappers in the same VM, I could
create a singleton and still have multiple mappers access it at memory
speed.
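For reference, the kind of singleton I have in mind is just a lazily-initialized static holder - a minimal sketch (class, field, and key names are all hypothetical), which only helps if the mappers actually share one JVM, e.g. if the tasktracker is configured to reuse task JVMs:

```java
// Hypothetical sketch: a read-only structure loaded once per JVM and
// shared by every mapper running in that JVM.
public class SharedModel {
    private static volatile SharedModel instance;
    private final java.util.Map<String, double[]> data;

    private SharedModel() {
        data = new java.util.HashMap<String, double[]>();
        // In the real job this is where the 3-4GB structure would be
        // loaded, once per JVM rather than once per task.
        data.put("example", new double[] {1.0, 2.0});
    }

    public static SharedModel get() {
        // Double-checked locking so concurrent mappers initialize it only once.
        if (instance == null) {
            synchronized (SharedModel.class) {
                if (instance == null) {
                    instance = new SharedModel();
                }
            }
        }
        return instance;
    }

    public double[] lookup(String key) {
        return data.get(key);
    }
}
```

Every mapper would then call SharedModel.get() in its configure/setup step and do lookups at memory speed; the cost of loading is paid once per JVM.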

Please do let me know if I am looking at this correctly and if the above is
possible.

Thanks a lot for all your help.

Cheers,
Dev



On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho wrote:

It really depends on what type of data you are sharing, how you are looking
up the data, whether the data is read-write, and whether you care about
consistency. If you don't care about consistency, I suggest that you shove
the data into a BDB store (for key-value lookup) or a Lucene store, and copy
the data to all the nodes. That way all data access will be in-process, with
no GC problems, and you will get very fast results. BDB and Lucene both have
easy replication strategies.
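To illustrate the in-process, local-copy idea without pulling in BDB or Lucene, here is a rough sketch that stands in a memory-mapped file as the read-only store (the file layout - fixed-width 8-byte records - and class name are made up for illustration; a BDB key-value lookup would replace the manual offset arithmetic):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// Hypothetical read-only local store: data has been copied to the node
// beforehand, and lookups run in-process with no RPC hop.
public class LocalStore {
    private final MappedByteBuffer buf;

    public LocalStore(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            // A read-only mapping lets the OS share the backing pages across
            // processes, so several task JVMs on one node need not each hold
            // a private copy of the data in RAM.
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Fixed-width records: one long per record (hypothetical layout).
    public long get(int recordIndex) {
        return buf.getLong(recordIndex * 8);
    }
}
```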

If the data is read-write and you need consistency, you should probably
forget about MapReduce and just run everything on big iron.

Regards,
Alan Ho




----- Original Message ----
From: Devajyoti Sarkar <dsarkar@q-kk.com>
To: core-user@hadoop.apache.org
Sent: Thursday, October 2, 2008 8:41:04 PM
Subject: Sharing an object across mappers

I think each mapper/reducer runs in its own JVM, which makes it impossible
to share objects. I need to share a large object so that I can access it at
memory speeds across all the mappers. Is it possible to have all the
mappers run in the same VM? Or is there a way to do this across VMs at high
speed? I guess RMI and other such methods will be just too slow.

Thanks,
Dev




(Archived from the common-user@hadoop.apache.org list via Grokbase; posted Oct 3, 2008, thread of 7 posts by 4 users.)