Hi, all



I am a newbie to Hadoop and have only started playing with it in recent
days. I am trying to write a MapReduce program that parses a large
dataset (about 20 GB), extracts object IDs, and stores them in an HBase
table. The problem is that one keyword is associated with several
million object IDs. Here is my first reduce program.





<program1>

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MyReducer extends TableReducer<Writable, Writable, Writable> {

    @Override
    public void reduce(Writable key, Iterable<Writable> objectids, Context context)
            throws IOException, InterruptedException {

        // Collect every object id for this key into a Set, then write the
        // whole Set as one cell value under the "oid" family.
        Set<String> objectIDs = new HashSet<String>();
        Put put = new Put(((ImmutableBytesWritable) key).get());
        byte[] family = Bytes.toBytes("oid");

        for (Writable objid : objectids) {
            objectIDs.add(((Text) objid).toString());
        }

        put.add(family, null, Bytes.toBytes(objectIDs.toString()));
        context.write((ImmutableBytesWritable) key, put);
    }
}



In this program, the reduce failed with a Java heap "out of memory"
error. A rough count shows that the several million object IDs consume
about 900 MB of heap when they are all loaded into a Set at once. So I
implemented the reduce in another way:



<program2>

// (same imports as <program1>)

public class IndexReducer extends TableReducer<Writable, Writable, Writable> {

    @Override
    public void reduce(Writable key, Iterable<Writable> values, Context context)
            throws IOException, InterruptedException {

        // Store each object id as its own column qualifier (and value) in a
        // single Put for the key's row, instead of buffering a Set.
        Put put = new Put(((ImmutableBytesWritable) key).get());
        byte[] family = Bytes.toBytes("oid");

        for (Writable objid : values) {
            byte[] oid = Bytes.toBytes(((Text) objid).toString());
            put.add(family, oid, oid);
        }

        context.write((ImmutableBytesWritable) key, put);
    }
}



This time, the reduce still failed, first with a "reduce task timed out"
error. I doubled the reduce timeout, and then "out of memory" happened
again; the error log shows that put.add() throws the "out of memory"
error.





By the way, there are 18 datanodes in total in the Hadoop/HBase
environment, and the number of reduce tasks is 50.



So, my question is: how do I handle a large volume of reduce input
values in a MapReduce program? Increase memory? I don't think that is a
reasonable option. Increase the number of reduce tasks? ...



Sigh, I really have no clue which way to go. What's your suggestion?





Best Regards,
HB


  • Aaron Kimball at Oct 5, 2009 at 7:21 pm
    HB,
    From a scalability perspective, you will always have a finite limit of RAM
    available. Without knowing much about HBase, I can't tell whether this is
    the only way you can accomplish your goal or not. But a basic maxim to
    follow is, "Using O(N) space in your reducer is guaranteed to overflow for
    some dataset size." In your code, you buffer up values in your inner loop
    and write them all together at the very end. You want to process each record
    in your inner iterator loop in isolation; the context.write() call should
    occur in there, and then you should build a new Set. You may need to
    redesign other aspects of your system to expect data to be on adjacent HBase
    rows rather than in a single set.
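
    For example, something along these lines (untested, and only a sketch:
    the class name, the "|" separator in the row key, and the reuse of your
    "oid" family are placeholders for whatever row design you end up with):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class PerRecordReducer extends TableReducer<Writable, Writable, Writable> {

        @Override
        public void reduce(Writable key, Iterable<Writable> objectids, Context context)
                throws IOException, InterruptedException {

            byte[] keyword = ((ImmutableBytesWritable) key).get();
            byte[] family = Bytes.toBytes("oid");

            for (Writable objid : objectids) {
                byte[] oid = Bytes.toBytes(((Text) objid).toString());

                // One row per (keyword, object id) pair: keyword + "|" + id,
                // so all ids for a keyword sit on adjacent rows and can be
                // read back with a prefix scan.
                byte[] rowKey = Bytes.add(keyword, Bytes.toBytes("|"), oid);
                Put put = new Put(rowKey);
                put.add(family, null, oid);

                // Nothing is buffered across the loop, so memory use stays
                // constant no matter how many values one key has.
                context.write(new ImmutableBytesWritable(rowKey), put);
            }
        }
    }

    Reading the ids for a keyword back then becomes a prefix scan over those
    rows instead of a read of one huge row.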

    I hope this helps,
    - Aaron
    On Tue, Sep 29, 2009 at 4:43 AM, wrote:

    [quoted original message trimmed]

Discussion Overview
group: mapreduce-user @ hadoop
posted: Sep 29, '09 at 8:44a
active: Oct 5, '09 at 7:21p
posts: 2
users: 2 (Yin_hongbin: 1 post, Aaron Kimball: 1 post)
website: hadoop.apache.org...
irc: #hadoop
