FAQ
Hello All,

I am creating two files in the constructor of my reducer and storing the file handles as member variables. I write some data to these files on each call to the reduce method. For some reason the files are created, but only the larger file has data (the larger file is 400 MB). The smaller file is created but no data is written to it (its size should be around 4 MB). I checked the Hadoop configuration and the buffer size parameter is 4096.

I padded the smaller file with about 100 MB of data, and then the file ends up with around 64 MB of data. I am clueless about what is happening and whether there is a solution to this. Please share if you have faced a similar problem or if you know the solution.

Thanks,
Parth Brahmbhatt


  • Ted Xu at Dec 5, 2009 at 1:52 am
    Hello Parth,

    Once I met a similar problem while using a user-defined RecordWriter in a
    reducer. That time I had forgotten to close the RecordWriter in the close
    phase of the reducer.
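
    If the handles are never closed, that would also explain the numbers
    here: for an unclosed HDFS file you are only guaranteed the data up to
    the last complete block (64 MB by default), so a 400 MB file keeps most
    of its data, a 4 MB file stays empty, and padding to ~100 MB leaves
    exactly one 64 MB block behind. A minimal sketch of doing the close,
    assuming the new o.a.h.mapreduce Reducer API and hypothetical output
    paths (not your actual code):

        import java.io.IOException;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        public class TwoFileReducer extends Reducer<Text, Text, Text, Text> {
          private FSDataOutputStream bigOut, smallOut;

          @Override
          protected void setup(Context context) throws IOException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            bigOut = fs.create(new Path("big.out"));     // hypothetical paths
            smallOut = fs.create(new Path("small.out"));
          }

          @Override
          protected void reduce(Text key, Iterable<Text> values, Context context)
              throws IOException {
            for (Text v : values) {
              // write to whichever stream is appropriate for this record
              smallOut.writeBytes(key + "\t" + v + "\n");
            }
          }

          @Override
          protected void cleanup(Context context) throws IOException {
            // Without these closes, anything still sitting in the stream's
            // buffer (everything past the last complete HDFS block) is lost.
            smallOut.close();
            bigOut.close();
          }
        }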

    Best Regards,

    Ted Xu
  • Gang Luo at Dec 5, 2009 at 4:55 am
    Hi all,
    I've got a tricky problem. I feed in a small file and do some filtering work on each line in the map function. I check whether the line satisfies the constraint; if it does, I output it, otherwise I return without doing any of the work below.

    Since the map function is called on each line, I think the logic is correct. But it doesn't work like this. If there are 5 lines for a map task and only the 2nd line satisfies the constraint, then the output will be lines 2, 3, 4, and 5. If the 3rd line satisfies it, then the output will be lines 3, 4, and 5. It seems that once a map task meets the first satisfying line, the filter stops working for the following lines.

    It is an interesting problem. I am checking it now, but I also hope someone could give me some ideas on this. Thanks.


    -Gang


  • Sonal Goyal at Dec 6, 2009 at 3:03 pm
    Hi,

    Maybe you could post your code/logic for doing this. One way would be to set
    a flag once your criterion is met and emit keys based on the flag.
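
    A rough sketch of the flag idea, assuming the new o.a.h.mapreduce API and
    a hypothetical matches() test standing in for your constraint:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class FlagFilterMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
          private boolean matched = false; // one flag per map task, not per record

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            if (!matched && matches(value)) {
              matched = true;
            }
            if (matched) {
              context.write(key, value); // emit this line and every later one
            }
          }

          private boolean matches(Text line) {
            return line.toString().contains("MATCH"); // hypothetical constraint
          }
        }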

    Thanks and Regards,
    Sonal


  • Edmund Kohlwey at Dec 6, 2009 at 3:53 pm
    Let me see if I understand:
    The mapper is reading lines in a text file. You want to see if a single
    line meets a given criterion, and emit all the lines whose index is
    greater than or equal to the single matching line's index. I'll assume
    that if more than one line meets the criteria, you have a different
    condition which you will handle appropriately.

    First, some discussion of your input: is this a single file that should
    be considered as a whole? In that case, you probably only want one
    mapper, which, depending on your reduce task, may totally invalidate the
    use case for MapReduce. You may just want to read the file directly from
    HDFS and write to HDFS in whatever application is using the data.

    Anyway, here's how I'd do it. In setup, open a temporary file (it can
    be directly on the node, or on HDFS, although directly on the node is
    preferable). Use map to perform your test, and keep a counter of how
    many lines match. After the first line matches, begin saving lines. If a
    second line matches, log the error condition or whatever. In cleanup, if
    only one line matched, open your temp file and begin emitting the lines
    you saved from earlier.
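
    A sketch of that approach, assuming the new o.a.h.mapreduce API, a local
    temp file, and a hypothetical matches() test:

        import java.io.BufferedReader;
        import java.io.File;
        import java.io.FileReader;
        import java.io.FileWriter;
        import java.io.IOException;
        import java.io.PrintWriter;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class SingleMatchMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
          private int matchCount = 0;
          private File tmp;
          private PrintWriter tmpOut;

          @Override
          protected void setup(Context context) throws IOException {
            tmp = File.createTempFile("saved-lines", ".txt"); // local to the node
            tmpOut = new PrintWriter(new FileWriter(tmp));
          }

          @Override
          protected void map(LongWritable key, Text value, Context context) {
            if (matches(value)) {
              matchCount++;
              if (matchCount > 1) {
                System.err.println("error: more than one line matched"); // log it
              }
            }
            if (matchCount > 0) {
              tmpOut.println(value); // save the matching line and every later one
            }
          }

          @Override
          protected void cleanup(Context context)
              throws IOException, InterruptedException {
            tmpOut.close();
            if (matchCount == 1) { // exactly one match: emit what we saved
              BufferedReader in = new BufferedReader(new FileReader(tmp));
              for (String line; (line = in.readLine()) != null; ) {
                context.write(NullWritable.get(), new Text(line));
              }
              in.close();
            }
            tmp.delete();
          }

          private boolean matches(Text line) {
            return line.toString().contains("MATCH"); // hypothetical test
          }
        }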

    There are a few considerations in your implementation:
    1. File size. If the temporary file exceeds the available space on a
    mapper, you can make a temp file in HDFS but this is far from ideal.
    2. As noted above, if there's a single mapper and no need to sort or
    reduce the output, you probably want to just implement this as a program
    that happens to be using HDFS as a data store, and not bother with
    MapReduce at all.




  • Gang Luo at Dec 6, 2009 at 11:56 pm
    Thanks for the response.

    It seems there was something wrong in my logic; I've mostly solved it now. What I am still unsure of is how to return or exit in a MapReduce program. If I want to skip one line (because it doesn't satisfy some constraint, for example), using return to quit the map function is enough. But what if I want to quit a whole map task (due to some error I detect, for example, a file I want to read doesn't exist)? If I use System.exit(), Hadoop will just try to run the task again. Similarly, if I catch an exception and want to quit the current task, what should I do?

    -Gang


  • Edmund Kohlwey at Dec 7, 2009 at 1:09 am
    As far as I know (someone please correct me if I'm wrong), MapReduce
    doesn't provide a facility to signal that processing should stop. You
    will simply have to add a field to your mapper class that you set to
    signal an error condition; then, in map, check whether the error
    condition has been set and return on each call if it has.
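
    A sketch of the field-and-flag idea against the new o.a.h.mapreduce API
    (the context.write call stands in for the real per-record work):

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class ErrorFlagMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
          private boolean failed = false; // set once; checked on every call

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            if (failed) {
              return; // an earlier record hit the error; skip the rest
            }
            try {
              context.write(key, value); // real per-record work goes here
            } catch (IOException e) {
              failed = true; // remember the error so later calls become no-ops
            }
          }
        }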
  • Amogh Vasekar at Dec 7, 2009 at 5:45 am
    Hi,
    If the file doesn't exist, Java will simply error out and the task will fail.
    For partial skips, the o.a.h.mapreduce.Mapper class provides a run() method, which checks whether the end of the split has been reached and, if not, calls map() on your <k,v> pair. You may override this method to include a flag check as well; if that check fails, the remainder of the split can be skipped.
    Hope this helps.

    Amogh
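
    For reference, the default Mapper.run() in the new API is just a loop
    over the split; a sketch of overriding it with a stop flag (stopNow and
    the empty-line test are hypothetical):

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class SkippableMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
          private boolean stopNow = false; // hypothetical fatal-error flag

          @Override
          public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            // Same loop as the default Mapper.run(), plus the flag check:
            while (!stopNow && context.nextKeyValue()) {
              map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
            cleanup(context);
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            if (value.getLength() == 0) { // stand-in for a real fatal condition
              stopNow = true;             // skip the remainder of the split
              return;
            }
            context.write(key, value);
          }
        }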


  • Gang Luo at Dec 7, 2009 at 6:47 pm
    Thanks. It helps.


    -Gang



