We are rolling up unique counts per event ID: the event ID serves as the key, and we want to count the number of unique event values for each one. Since the number of event IDs is reasonably small (well under 100) and the universe of values is large (potentially millions), we wind up pushing too much into memory.
Ted commented about special-purpose HashMaps. Unfortunately, our event values can be up to 128 bits long, so I don't think a special-purpose HashMap would work.
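For what it's worth, here is a minimal sketch (purely hypothetical, not something we've built) of packing a 128-bit value into two longs so values could still sit in an ordinary java.util.HashSet at roughly 16 bytes of payload per entry:

import java.util.HashSet;
import java.util.Set;

// Hypothetical compact key: one 128-bit event value as two longs.
public final class Value128 {
  private final long hi;
  private final long lo;

  public Value128(long hi, long lo) {
    this.hi = hi;
    this.lo = lo;
  }

  @Override public boolean equals(Object o) {
    if (!(o instanceof Value128)) return false;
    Value128 v = (Value128) o;
    return hi == v.hi && lo == v.lo;
  }

  @Override public int hashCode() {
    long mixed = hi ^ lo;                   // fold all 128 bits into 64
    return (int) (mixed ^ (mixed >>> 32));  // then into 32
  }

  public static void main(String[] args) {
    Set<Value128> unique = new HashSet<Value128>();
    unique.add(new Value128(0x0123456789abcdefL, 0x0fedcba987654321L));
    unique.add(new Value128(0x0123456789abcdefL, 0x0fedcba987654321L));
    System.out.println(unique.size());  // prints 1: duplicates collapse
  }
}

Of course every entry still lives on the heap, so this only shrinks the per-value footprint; it doesn't address the spilling problem itself.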
Thanks,
C G
Runping Qi wrote:
It would be nice if you could contribute a file-backed HashMap, or a file-backed implementation of the unique count aggregator.
Short of that, if you just need to count the unique values for each
event ID, you can do so by using the aggregate classes with each
event-ID/event-value pair as a key and simply counting the number of
occurrences of each composite key.
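A minimal sketch of that two-stage composite-key approach (class names are made up; uses the old org.apache.hadoop.mapred API and assumes tab-separated eventId/eventValue input lines):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UniqueCountSketch {
  private static final IntWritable ONE = new IntWritable(1);

  // Job 1 mapper: treat the whole "eventId<TAB>eventValue" line as the key.
  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      out.collect(line, ONE);
    }
  }

  // Job 1 reducer: duplicates of a composite key collapse to one record.
  public static class DedupReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text pair, Iterator<IntWritable> vals,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      out.collect(pair, ONE);
    }
  }

  // Job 2 mapper: strip the value, emit (eventId, 1) per distinct pair.
  public static class EventIdMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      out.collect(new Text(line.toString().split("\t")[0]), ONE);
    }
  }

  // Job 2 reducer: sum the ones to get the unique count per event ID.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text eventId, Iterator<IntWritable> vals,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      int sum = 0;
      while (vals.hasNext()) { sum += vals.next().get(); }
      out.collect(eventId, new IntWritable(sum));
    }
  }
}

No reducer ever holds more than a running sum, so nothing large accumulates in memory.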
Runping
-----Original Message-----
From: C G
Sent: Wednesday, December 19, 2007 11:59 AM
To: [email protected]
Subject: HashMap which can spill to disk for Hadoop?
Hi All:
The aggregation classes in Hadoop use a HashMap to hold unique values in
memory when computing unique counts, etc. I ran into a situation on a
32-node grid (4G memory/node) where a single node runs out of memory
within the reduce phase trying to manage a very large HashMap. This was
disappointing because the dataset is only 44M rows (4G) of data. This is
a scenario where I am counting unique values associated with various
events, where the total number of events is very small and the number of
unique values is very high. Since the event IDs serve as keys and the
number of distinct event IDs is small, there is consequently a small
number of reducers running, and each reducer is expected to manage a
very large HashMap of unique values.
It looks like I need to build my own unique aggregator, so I am looking
for an implementation of HashMap which can spill to disk as needed. I've
considered using BDB as a backing store, and I've been looking into
Derby's BackingStoreHashtable as well.
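Along those lines, here is a rough sketch of the BDB idea (assuming Berkeley DB Java Edition, whose com.sleepycat.collections.StoredMap exposes a database as a java.util.Map, so entries live on disk behind an in-memory cache; the path and names below are made up):

import java.io.File;
import com.sleepycat.bind.tuple.TupleBinding;
import com.sleepycat.collections.StoredMap;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class DiskBackedMapSketch {
  public static void main(String[] args) throws Exception {
    // Open (or create) a BDB JE environment in a scratch directory.
    File home = new File("/tmp/spill-env");
    home.mkdirs();  // JE requires the environment directory to exist
    EnvironmentConfig envConfig = new EnvironmentConfig();
    envConfig.setAllowCreate(true);
    Environment env = new Environment(home, envConfig);

    DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setAllowCreate(true);
    Database db = env.openDatabase(null, "uniqueValues", dbConfig);

    // A Map view of the database: puts and gets go through the BDB
    // cache and spill to disk instead of growing the Java heap.
    StoredMap map = new StoredMap(db,
        TupleBinding.getPrimitiveBinding(String.class),
        TupleBinding.getPrimitiveBinding(Integer.class),
        true /* writeAllowed */);

    map.put("some-event-value", Integer.valueOf(1));
    System.out.println(map.get("some-event-value"));

    db.close();
    env.close();
  }
}

Whether the JE cache behaves well under a reducer's access pattern is an open question, of course.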
For the present time I can restructure my data in an attempt to get more
reducers to run, but I can see that even that will run out of memory in
the very near future.
Any thoughts, comments, or flames?
Thanks,
C G