Hello Hadoop Users,

I would like to know if anyone has ever tried splitting an input
sequence file by key instead of by size. I know this is unusual for
the MapReduce paradigm, but I am in a situation where I need to
perform some large tasks on each key-value pair in a load-balanced
fashion.

To describe in more detail:
I have one sequence file of 2000 key-value pairs. I want to distribute
each key-value pair to a map task, where it will perform a series of
MapReduce jobs. This means that each map task launches a series of
jobs. Once the jobs in each task are complete, I want to reduce all of
the output into one sequence file.

I am stuck because I am limited by the number of splits the sequence
file is divided into. Hadoop only splits my sequence file into 80 map
tasks, while I could run around 250 map tasks on my cluster. This
means that I am not fully utilizing my cluster, and my job will not
scale.

Can anyone shed some light on this problem? I have looked at the
InputFormats, but I am not sure whether this is where I should keep
looking.

Best Regards
Vincent


  • Joey Echeverria at May 23, 2011 at 12:12 pm
    Look at getSplits() of SequenceFileInputFormat.

    -Joey
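    A minimal sketch of that suggestion, assuming the newer mapreduce API
    (the class name and the number of pieces per split are chosen here for
    illustration): override getSplits() so each default split is cut into
    several smaller FileSplits, giving the job more map tasks. The
    SequenceFile reader seeks to the next sync marker after a split's start
    offset, so the byte boundaries need not fall exactly on record
    boundaries.

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;
      import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

      // Hypothetical subclass, not part of Hadoop; illustration only.
      public class FinerSequenceFileInputFormat<K, V>
          extends SequenceFileInputFormat<K, V> {

        // Assumption: cut each default split into this many pieces.
        private static final int PIECES_PER_SPLIT = 4;

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
          List<InputSplit> finer = new ArrayList<InputSplit>();
          for (InputSplit split : super.getSplits(job)) {
            FileSplit fileSplit = (FileSplit) split;
            long pieceLength =
                Math.max(1L, fileSplit.getLength() / PIECES_PER_SPLIT);
            for (long offset = 0; offset < fileSplit.getLength();
                 offset += pieceLength) {
              long length = Math.min(pieceLength, fileSplit.getLength() - offset);
              // Sub-pieces reuse the parent split's hosts, so the reported
              // locality is only approximate.
              finer.add(new FileSplit(fileSplit.getPath(),
                  fileSplit.getStart() + offset, length, fileSplit.getLocations()));
            }
          }
          return finer;
        }
      }

    The job would then select it with
    job.setInputFormatClass(FinerSequenceFileInputFormat.class).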
  • Jason at May 23, 2011 at 4:20 pm
    Look at NLineInputFormat

    Sent from my iPhone
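    As Harsh notes below, NLineInputFormat reads text lines rather than
    SequenceFile records, but if the 2000 records were exported to a text
    file, a sketch of the job configuration might look like this (the
    driver class is hypothetical, Job.getInstance assumes a newer Hadoop
    release, and the lines-per-split figure is just an example):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

      // Hypothetical driver, illustration only.
      public class PerRecordDriver {
        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "per-record maps");
          job.setInputFormatClass(NLineInputFormat.class);
          // Assumption: 8 lines per split gives roughly 2000 / 8 = 250 map tasks.
          NLineInputFormat.setNumLinesPerSplit(job, 8);
          // ... set the mapper, reducer, and input/output paths here ...
        }
      }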
  • Harsh J at May 23, 2011 at 4:51 pm
    Vincent,

    You _might_ lose data locality by splitting beyond the block boundaries,
    and although the tasks would be better 'parallelized', they may end up
    performing worse. A good way to increase the number of tasks instead is
    to lower the block size, which gives you more splits at the cost of a
    little extra NameNode space. After all, block sizes are per-file
    properties.

    But if you really want to go the per-record way, an NLine-like
    implementation for SequenceFiles, along the lines of what Joey and Jason
    have pointed out, would be the best approach. (NLineInputFormat doesn't
    cover SequenceFiles directly; it is implemented with a LineReader.)


    --
    Harsh J
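    A minimal sketch of the block-size route, assuming the records can be
    rewritten into a new file (the 8 MB block size and the Text/BytesWritable
    key and value classes are assumptions; use the file's real types):
    setting dfs.block.size on the Configuration before the FileSystem and
    writer are created gives just that file smaller blocks, so the default
    split computation produces more map tasks.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      // Hypothetical helper, illustration only.
      public class RewriteWithSmallBlocks {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Assumption: 8 MB blocks for the rewritten file only;
          // the cluster-wide default is untouched.
          conf.setLong("dfs.block.size", 8L * 1024 * 1024);
          FileSystem fs = FileSystem.get(conf);
          SequenceFile.Reader reader =
              new SequenceFile.Reader(fs, new Path(args[0]), conf);
          SequenceFile.Writer writer = SequenceFile.createWriter(
              fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
          Text key = new Text();
          BytesWritable value = new BytesWritable();
          while (reader.next(key, value)) {
            writer.append(key, value);
          }
          reader.close();
          writer.close();
        }
      }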
  • Vincent Xue at May 23, 2011 at 5:02 pm
    Thanks for the suggestions!
  • Moustafa Gaber at May 24, 2011 at 4:06 am
    I don't think you need to split your input file so that each map task is
    assigned one key. Your goal is load balancing. Each of your map tasks
    will initiate a new MR sub-job, and that sub-job is assigned its own
    master/workers, which means the sub-job's map tasks may be scheduled
    onto machines other than the ones your parent map tasks are running on.
    Therefore, you can still achieve load balancing without splitting your
    input file per key.


    --
    Best Regards,
    Mostafa Ead
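    A hedged sketch of that pattern (the mapper class and its output values
    are illustrative, Job.getInstance assumes a newer Hadoop release, and the
    sub-job's own mapper, reducer and paths still have to be configured):
    each parent map task receives one record, submits its own MR job, and
    waits for it while reporting progress so the parent task is not killed
    by the task timeout.

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;

      // Hypothetical mapper, illustration only.
      public class SubJobMapper extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
          Configuration conf = new Configuration(context.getConfiguration());
          Job subJob = Job.getInstance(conf, "sub-job-" + key);
          // ... configure the sub-job's mapper, reducer, input and output here ...
          try {
            subJob.submit();
          } catch (ClassNotFoundException e) {
            throw new IOException(e);
          }
          while (!subJob.isComplete()) {
            context.progress();   // keep the parent map task from timing out
            Thread.sleep(5000);
          }
          context.write(key,
              new Text(subJob.isSuccessful() ? "succeeded" : "failed"));
        }
      }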
