Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
I am running a Hadoop program to perform MapReduce work on the files inside a folder.

My program basically does Map and Reduce work: each line of every file is a pair of strings, and the result is each string associated with its occurrence count across all files.

The program works fine until the number of files grows to about 80,000; then a 'cannot allocate memory' error occurs for some reason.

Each file contains around 50 lines, and the total size of all files is no more than 1.5 GB. There are 3 datanodes performing the calculation, each of which has more than 10 GB of hard disk space left.

I am wondering whether this is normal for Hadoop because the data is too large, or whether it might be my program's problem.

It really shouldn't be, since Hadoop was developed for processing large data sets.

Any idea is much appreciated.


  • Amogh Vasekar at Oct 19, 2009 at 1:02 pm
    Hi,
    It would be more helpful if you provided the exact error here.
    Also, Hadoop uses the local FS to store intermediate data, along with HDFS for the final output.
    If your job is memory intensive, try limiting the number of tasks you run in parallel on a machine.
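
    A rough illustration of that suggestion, assuming the stock 0.20-era property names (the numbers below are only examples; the right values depend on how much RAM each node actually has). These go in mapred-site.xml on the worker nodes:

        <!-- Illustrative values only; tune to the node's RAM. -->
        <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>2</value>   <!-- map task slots per TaskTracker -->
        </property>
        <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>1</value>   <!-- reduce task slots per TaskTracker -->
        </property>
        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx200m</value>   <!-- heap given to each task JVM -->
        </property>

    Lowering the slot counts or the child heap bounds how much memory the TaskTrackers try to use at once.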

    Amogh


  • Ashutosh Chauhan at Oct 19, 2009 at 3:31 pm
    You might be hitting the "small files" problem. This has been
    discussed multiple times on the list; grepping through the archives will help.
    Also see http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

    Ashutosh

  • Kunsheng Chen at Oct 20, 2009 at 1:21 am
    I guess this is exactly what the problem is!

    Is there any way I could do this "grepping of archives" inside the MR program? Or is there some Hadoop command that could combine all the small files into one big one?



    Thanks,

    -Kun


  • Dmitriy Ryaboy at Oct 20, 2009 at 2:02 am
    For searching (grepping) mailing list archives, I like MarkMail:
    http://hadoop.markmail.org/ (try searching for "small files").

    For concatenating files -- cat works, if you don't care about
    provenance; as an alternative, you can also write a simple MR program
    that creates a SequenceFile by reading in all the little files and
    producing (filePath, fileContents) records.
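
    A rough sketch of that consolidation idea -- done here as a plain HDFS client program rather than a full MR job, assuming the 0.20-era API (the class name and argument handling are made up for illustration):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        // Packs every file in a directory into one SequenceFile of
        // (file path, file contents) records.
        public class SmallFilePacker {
          public static void main(String[] args) throws IOException {
            Path inputDir = new Path(args[0]);    // directory of small files
            Path packedFile = new Path(args[1]);  // output SequenceFile

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, packedFile, Text.class, BytesWritable.class);
            try {
              for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDir()) {
                  continue;                       // skip subdirectories
                }
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                  in.readFully(0, contents);      // files are tiny, read each in one go
                } finally {
                  in.close();
                }
                // key = original path, value = raw file bytes
                writer.append(new Text(status.getPath().toString()),
                              new BytesWritable(contents));
              }
            } finally {
              writer.close();
            }
          }
        }

    The packed file can then be read with SequenceFileInputFormat (one record per original file), so the job sees a handful of large inputs instead of 80,000 tiny ones.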

    The Cloudera post Ashutosh referred you to has a brief overview of all
    the "standard" ideas.

    -Dmitriy