---- Original message ----
Date: Thu, 16 Jun 2011 18:04:08 -0700
From: Ian Upright <ian@upright.net>
Subject: Re: large memory tasks
To: common-user@hadoop.apache.org

Aha! Thanks, now I get it. In my case, my values are fairly large (a few kilobytes), so I wonder how that might impact performance. (I've got several million entries with maybe 250 key/value pairs each.) That is a lot of data to sort, but then again, depending on how efficient your lookups are, it might be cheaper.

Just curious: with your particular dataset, have you actually benchmarked this process against a TC database that was replicated in your

Haven't fully compared with TC. A small-scale test indicated that TC was as fast as memcache and MongoDB. Of course, it depends on data and settings. Please keep in mind that if your values are big, TC also requires more resources to respond to queries. My experience was that if the stored values are small, a single TC server could handle more than 20,000 queries per second; if the values are large, throughput may drop to only 2,000 queries per second. It all depends on your application: if your values are very large, TC is a cheaper choice, because otherwise you would need heavy resources to prepare the dataset in HDFS.

Now I can easily see how your method would be drastically more efficient than using some slow (millisecond) lookup. However, if your lookups are down to the microseconds and use something like Tokyo Cabinet, which is also memory-cache friendly, I really wonder if you are winning or losing. TC should be dramatically faster than, say, a local memcache. Using something like a memcache network protocol (even locally) for each lookup adds an excessive amount of overhead.
I didn't observe anything dramatic, so I couldn't say TC is better than the others or vice versa. There are some significant results mentioned in the literature for different tools, but they may depend on specific settings.
It also springs to mind that you could make this more efficient by distributing the values, especially if they are large. I did. For example:

V1 -> File offset 1
V2 -> File offset 2
V3 -> File offset 3

Then your process,

K1 R1 K1 V1
K2 R2 K2 V2

which gives you

K1 R1 (File offset 1)
K2 R2 (File offset 2)

Then you distribute a simple file that contains all the values.

As you iterate over the sorted records, your process uses the offsets to dynamically pick the values off disk, using a FileChannel. This could dramatically reduce the amount of data required for the operation. It would also be friendly to each node's disk cache; if some values were much more common than others, I would think this could be a big win. Also, all of this could be done easily and efficiently in pure Java, and wouldn't require any external library like TokyoCabinet.
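A minimal sketch of that positional-read idea, assuming the (offset, length) of each value was recorded when the values file was written (file layout and names here are made up for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OffsetLookup {
    public static void main(String[] args) throws IOException {
        // Prepare a values file; in the real flow this is the distributed file,
        // and (offset, length) pairs travel with the sorted records instead of V.
        Path file = Files.createTempFile("values", ".bin");
        byte[] v1 = "value-one".getBytes("UTF-8");
        byte[] v2 = "value-two-larger".getBytes("UTF-8");
        Files.write(file, v1);
        Files.write(file, v2, StandardOpenOption.APPEND);
        long off2 = v1.length;            // offset of V2, known at prepare time

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(v2.length);
            ch.read(buf, off2);           // positional read; repeats hit the OS page cache
            System.out.println(new String(buf.array(), "UTF-8"));
        }
        Files.delete(file);
    }
}
```

The positional `read(ByteBuffer, long)` needs no seek and is safe to call from several threads on the same channel, which is what makes the hot values effectively free once the page cache is warm.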

Have you tried any experiments like that?
For the natural language problem I have, the frequencies of words can be modeled by Zipf's law, so only a very small number of words occur a lot. By exploring the Zipf curve I found an optimal frequency threshold, and I put all the key/value pairs above that threshold (<1% of the words) in the distributed cache and load them directly in memory. The other 99% of less frequent words I attach in HDFS, which saves a lot of storage. I applied a Bloom filter on top of this to decide whether to query a given key in memory or use the information attached in HDFS, which makes the process even faster. But for my problem, a pure HDFS-based approach without Zipf and the Bloom filter was already 20% faster than the memcache approach.
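The routing step described above could look roughly like this toy sketch (the hash functions and word lists are made up; a real Bloom filter uses several well-mixed hashes):

```java
import java.util.BitSet;

public class BloomRouting {
    static final int M = 1 << 12;                 // toy bit-array size
    static final BitSet bits = new BitSet(M);

    // Two cheap hash functions, for illustration only.
    static int h1(String s) { return Math.abs(s.hashCode() % M); }
    static int h2(String s) { return Math.abs((s + "#").hashCode() % M); }

    static void add(String key) { bits.set(h1(key)); bits.set(h2(key)); }

    static boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    public static void main(String[] args) {
        // The frequent (<1%) words that were loaded into memory.
        for (String hot : new String[] {"the", "of", "and"}) add(hot);

        // Routing: a filter hit means "probably in memory" (no false negatives,
        // rare false positives); a miss means "use the HDFS-attached data".
        for (String q : new String[] {"the", "zyzzyva"}) {
            System.out.println(q + " -> " + (mightContain(q) ? "memory" : "HDFS"));
        }
    }
}
```

Because a Bloom filter never gives false negatives, every in-memory key is routed to memory; the occasional false positive just means a wasted in-memory miss before falling back, so correctness only requires the HDFS path to work for any key.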
I'm still not convinced that it would be any faster than using something like Tokyo Cabinet, but it would be interesting to compare.

There is no lookup. The process is done by shuffle and sort (secondary sort for multiple keys) in Map/Reduce.

The key problem is to join your record files with the lookup table:
K1 R1 K1 V1
K2 R2 K2 V2
which gives you

K1 R1 V1
K2 R2 V2

You could either implement a map-side join or a reduce-side join to do this. For multiple keys in a record, you then need a secondary sort to preserve the order of the keys.
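Stripped of the Hadoop plumbing, the reduce-side join reduces to the following (the "R:"/"V:" source tags and inputs are invented for illustration; in a real job the mapper emits the tagged values and the framework does the shuffle):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {
    public static void main(String[] args) {
        // What a reducer sees after the shuffle: for each key, the tagged
        // values from both inputs ("R:" = record side, "V:" = lookup side).
        Map<String, List<String>> shuffled = new TreeMap<String, List<String>>();
        shuffled.put("K1", Arrays.asList("V:V1", "R:R1"));
        shuffled.put("K2", Arrays.asList("R:R2", "V:V2"));

        // Reducer logic: find the lookup value, attach it to each record.
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            String lookup = null;
            List<String> records = new ArrayList<String>();
            for (String tagged : e.getValue()) {
                if (tagged.startsWith("V:")) lookup = tagged.substring(2);
                else records.add(tagged.substring(2));
            }
            for (String r : records) {
                System.out.println(e.getKey() + " " + r + " " + lookup);
            }
        }
    }
}
```

The tag exists only so the reducer can tell which side each value came from; a secondary sort would additionally guarantee the lookup value arrives before the records.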

There is nothing broken; moreover, it is scalable just by tuning the Map/Reduce configuration parameters or adding more nodes. In contrast, to scale up a memcache lookup approach you would need to tune the memcache service (servers, bandwidth, connection pools, etc.), which is a bottleneck that cannot be tackled by Map/Reduce alone.
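For reference, the per-node parallelism and per-task heap that such tuning involves map onto settings like these (Hadoop 0.20-era property names; the values are only illustrative):

```xml
<!-- mapred-site.xml: example values, adjust to the node's cores and RAM -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```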
On 6/16/2011 12:19 PM, Ian Upright wrote:
Again, this is totally not making sense to me. If your lookups are being done ahead of time, then how are they being done? What process is doing all these lookups for R1, to get the values V1, V2, V3? Is this lookup process done in parallel or on the cluster?

If this is indeed faster, then I would argue there is something very broken about your lookup processes being done in the tasks.

Suppose you are looking up a value V of a key K, and V is required for an upcoming process. Suppose the data in the upcoming process has the form

R1 K1 K2 K3,

where R1 is the record number and K1 to K3 are the keys occurring in the record, which means in the lookup case you would query for V1, V2, V3.
Using an inner join you could attach all the V values for a single record and prepare the data like

R1 K1 K2 K3 V1 V2 V3

Then each record has the complete information for the next process, so you pay in storage for the efficiency. Even taking into account the time required for preparing the data, it is still faster than the look-up approach.
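The attach step amounts to appending the joined values to each record; a tiny standalone sketch (the lookup map and record are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class AttachValues {
    public static void main(String[] args) {
        // The joined-in lookup side (tiny here; in reality this is the big table).
        Map<String, String> lookup = new HashMap<String, String>();
        lookup.put("K1", "V1");
        lookup.put("K2", "V2");
        lookup.put("K3", "V3");

        // One input record: record number followed by its keys.
        String record = "R1 K1 K2 K3";
        String[] parts = record.split(" ");
        StringBuilder out = new StringBuilder(record);
        for (int i = 1; i < parts.length; i++) {   // skip the record number
            out.append(' ').append(lookup.get(parts[i]));
        }
        System.out.println(out);                   // R1 K1 K2 K3 V1 V2 V3
    }
}
```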

I have also tried TokyoCabinet; you need to compile and install some extensions to get it working, and sometimes getting things and APIs to work can be painful. If you don't need to update the lookup table, installing TC, memcache, or MongoDB locally on each node would be the most efficient solution, because all the look-ups are local.

On 6/15/2011 5:56 PM, Ian Upright wrote:
If the data set doesn't fit in working memory, but is still of a reasonable size (let's say a few hundred gigabytes), then I'd probably use something like this:


From reading the Hadoop docs (which I'm very new to), I might use DistributedCache to replicate that database around. My impression would be that this might be among the most efficient things one could do.
However, for my particular application, even using TokyoCabinet introduces too much inefficiency, and pure plain old memory-based lookups are by far the most efficient. (Not to mention that some of the lookups I'm doing are specialized trees that can't be done with TokyoCabinet or any typical DB, but that's beside the point.)

I'm having trouble understanding your more efficient method of using more data and HDFS, and how it could possibly be any more efficient than, say, the above approach.

How is increasing the size minimizing the lookups?

I had the same problem before: a big lookup table too large to load in memory.

I tried and compared the following approaches: an in-memory MySQL DB, a dedicated central memcache server, a dedicated central MongoDB server, and a local DB model (each node has its own MongoDB server).

The local DB model is the most efficient one. I believe the dedicated-server approach could be improved if the number of servers were increased and distributed; I only tried a single server.

But later I dropped the lookup-table approach. Instead, I attached the table information in HDFS (which could be considered an inner-join DB process), which significantly increases the size of the data sets but avoids the bottleneck of the table lookup. There is a trade-off: with no table lookups, the data to process is intensive (TB size); in contrast, a lookup table could save 90% of the data.
According to our experiments on a 30-node cluster, attaching information in HDFS is even 20% faster than the local DB model. When attaching information in HDFS, it is also easier to tune the configuration to further improve the efficiency.

On 6/15/2011 5:05 PM, GOEKE, MATTHEW (AG/1000) wrote:
Is the lookup table constant across each of the tasks?
You could try putting it into memcached:


-----Original Message-----
From: Ian Upright
Sent: Wednesday, June 15, 2011 3:42 PM
To: common-user@hadoop.apache.org
Subject: large memory tasks

Hello, I'm quite new to Hadoop, so I'd like to get an understanding of

Let's say I have a task that requires 16 GB of memory in order to execute. Say, hypothetically, it's some sort of big lookup table that needs that kind of memory.

I could have 8 cores run the task in parallel (multithreaded), and all 8 cores can share that 16 GB lookup table.

On another machine, I could have 4 cores run the same task, and they still share that same 16 GB lookup table.

Now, with my understanding of Hadoop, each task has its own memory. So if I have 4 tasks that run on one machine and 8 tasks on another, then the 4 tasks need a 64 GB machine and the 8 tasks need a 128 GB machine. But really, let's say I only have two machines, one with 4 cores and one with 8, each machine only having 24 GB.

How can the work be evenly distributed among these machines? Am I missing something? What other ways can this be configured such that this works?
Thanks, Ian

1 user in discussion: Shi Yu (1 post)