FAQ
Suppose a SequenceFile (containing keys and values that are
BytesWritable) is used as input. Will it be divided into InputSplits?
If so, what are the criteria used for splitting?

I'm interested in this because I need to control the number of map tasks
used, which (if I understand it correctly) is equal to the number of
InputSplits.

thanks,

bw


  • Aaron Kimball at Apr 20, 2009 at 5:19 am
    Yes, there can be more than one InputSplit per SequenceFile. The file will
    be split more or less along HDFS block boundaries (64 MB by default). (The
    actual "edges" of the splits will be adjusted to hit the next block of
    key-value pairs, so they might be a few kilobytes off.)

    The SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks())
    as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
    is always 100% user-controlled.) If you need exact control over the number
    of map tasks, you'll need to subclass it and modify this behavior. That
    having been said -- are you sure you actually need to precisely control this
    value? Or is it enough to know how many splits were created?

    - Aaron
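
    For illustration, a minimal sketch of the subclassing approach described
    above, assuming the Hadoop 0.19/0.20-era org.apache.hadoop.mapred API. It
    carves each input file into a configurable number of byte ranges (the job
    property "example.splits.per.file" is made up for this sketch) and relies
    on the SequenceFile record reader re-aligning each range to the next sync
    point, so the boundaries stay record-safe. It discards locality hints, so
    treat it as a starting point rather than a drop-in class.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class FixedCountSequenceFileInputFormat
        extends SequenceFileInputFormat<BytesWritable, BytesWritable> {

      @Override
      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // "example.splits.per.file" is a made-up property name for this sketch.
        int splitsPerFile = job.getInt("example.splits.per.file", 4);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
          Path path = file.getPath();
          long length = file.getLen();
          long chunk = Math.max(1, (length + splitsPerFile - 1) / splitsPerFile);
          for (long start = 0; start < length; start += chunk) {
            long size = Math.min(chunk, length - start);
            // Arbitrary byte ranges are fine for SequenceFiles: the record
            // reader syncs each split to the next key-value block at read time.
            // Passing no hosts here gives up data locality.
            splits.add(new FileSplit(path, start, size, new String[0]));
          }
        }
        return splits.toArray(new InputSplit[splits.size()]);
      }
    }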
  • Jim Twensky at Apr 20, 2009 at 7:19 am
    In addition to what Aaron mentioned, you can configure the minimum split
    size (mapred.min.split.size) in hadoop-site.xml to get smaller or larger
    input splits, depending on your application.

    -Jim
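
    For reference, a minimal sketch of setting the same knob per-job from code
    (the class name and the 128 MB figure are arbitrary). Raising
    mapred.min.split.size puts a floor under the split size, so it yields
    larger splits; splits smaller than a block generally come from raising the
    map-task hint Aaron described. The equivalent hadoop-site.xml entry is a
    <property> element with the same name and a value in bytes.

    import org.apache.hadoop.mapred.JobConf;

    public class MinSplitSizeExample {
      public static void configure(JobConf conf) {
        // Same effect as setting mapred.min.split.size in hadoop-site.xml;
        // the value is in bytes (here, a 128 MB floor per split).
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
      }
    }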
  • Barnet Wagman at Apr 20, 2009 at 2:26 pm
    Thanks Aaron, that really helps. I probably do need to control the
    number of splits. My input 'data' consists of Java objects, and their
    size (in bytes) doesn't necessarily reflect the amount of time needed
    for each map operation. I need to ensure that I have enough map tasks
    so that all CPUs are utilized and the job gets done in a reasonable
    amount of time. (Currently I'm creating multiple input files and making
    them unsplittable, but subclassing SequenceFileInputFormat to explicitly
    control the number of splits sounds like a better approach.)

    Barnet
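
    For reference, a minimal sketch of the "unsplittable files" workaround
    mentioned above, again assuming the old org.apache.hadoop.mapred API: each
    input file becomes exactly one split, so the number of map tasks equals
    the number of input files.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class WholeFileSequenceFileInputFormat
        extends SequenceFileInputFormat<BytesWritable, BytesWritable> {

      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // never split: one map task per input file
      }
    }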

  • Aaron Kimball at Apr 23, 2009 at 8:56 am
    Explicitly controlling your splits will be very challenging. Taking the case
    where you have expensive (X) and cheap (C) objects to process, you may have
    a file where the records are lined up X C X C X C X X X X X C C C. In this
    case, you'll need to scan through the whole file and build splits such that
    the lengthy run of expensive objects is broken up into separate splits, but
    the run of cheap objects is consolidated. I doubt you can do this without
    scanning through the data (which is often what constitutes the bulk of the
    time in a MapReduce program).

    But how much data are you using? I would imagine that if you're operating at
    the scale where Hadoop makes sense, then the high- and low-cost objects will
    -- on average -- balance out and tasks will be roughly evenly proportioned.

    In general, I would just dodge the problem by making sure your splits are
    relatively small compared to the size of your input data. If you have 5
    million objects to process, then make each split cover roughly, say, 20,000
    of them. Then even if some splits take a long time to process and others
    finish quickly, one CPU can work through a dozen cheap splits in the time
    one unlucky JVM takes to process a single very expensive split. Now you
    haven't had to manually balance anything, and you still get to keep all
    your CPUs full.

    - Aaron
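
    A minimal sketch of the "many small splits" approach, assuming you know (or
    can estimate) the total number of records up front; the class name and the
    20,000 figure are placeholders, and the map-task count remains a hint rather
    than a guarantee, as noted earlier in the thread.

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizingExample {
      private static final long RECORDS_PER_SPLIT = 20000L; // example target

      public static void configure(JobConf conf, long totalRecords) {
        // Raise the map-task hint so each split covers roughly
        // RECORDS_PER_SPLIT objects; FileInputFormat does the byte-level carving.
        int hint = (int) Math.max(1, totalRecords / RECORDS_PER_SPLIT);
        conf.setNumMapTasks(hint);
      }
    }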


  • Shevek at Apr 23, 2009 at 10:43 am

    On Thu, 2009-04-23 at 17:56 +0900, Aaron Kimball wrote:
    > Explicitly controlling your splits will be very challenging. Taking the case
    > where you have expensive (X) and cheap (C) objects to process, you may have
    > a file where the records are lined up X C X C X C X X X X X C C C. In this
    > case, you'll need to scan through the whole file and build splits such that
    > the lengthy run of expensive objects is broken up into separate splits, but
    > the run of cheap objects is consolidated. I doubt you can do this without
    > scanning through the data (which is often what constitutes the bulk of the
    > time in a MapReduce program).

    I would also like the ability to stream the data and shuffle it into
    buckets; when any bucket reaches a fixed cost (currently assessed as
    byte size), it would be shipped as a task.

    In practice, in the Hadoop architecture, this causes an extra level of
    I/O, since all the data must be read into the shuffler and re-sorted.
    Also, it breaks the ability to run map tasks on systems hosting the
    data. However, it is a subject about which I am doing some thinking.
    > But how much data are you using? I would imagine that if you're operating at
    > the scale where Hadoop makes sense, then the high- and low-cost objects will
    > -- on average -- balance out and tasks will be roughly evenly proportioned.

    True, dat.

    But it's still worth thinking about stream splitting, since the
    theoretical complexity overhead is an increased constant on a linear
    term.

    Will get more into architecture first.

    S.
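
    To make the bucketing idea concrete, a small sketch in plain Java (not tied
    to any Hadoop API): records accumulate in a bucket until a fixed cost
    threshold is reached, and the full bucket is then handed off as one unit of
    work.

    import java.util.ArrayList;
    import java.util.List;

    public class CostBucketer {
      private final long costThreshold;        // e.g. bytes per task
      private final List<byte[]> bucket = new ArrayList<byte[]>();
      private long bucketCost = 0;

      public CostBucketer(long costThreshold) {
        this.costThreshold = costThreshold;
      }

      /** Adds one record; returns a full bucket once the threshold is hit, else null. */
      public List<byte[]> add(byte[] record) {
        bucket.add(record);
        bucketCost += record.length;            // cost assessed as byte size
        if (bucketCost < costThreshold) {
          return null;
        }
        List<byte[]> shipped = new ArrayList<byte[]>(bucket);
        bucket.clear();
        bucketCost = 0;
        return shipped;
      }
    }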
  • Barnet Wagman at Apr 23, 2009 at 3:08 pm

    Aaron Kimball wrote:
    > Explicitly controlling your splits will be very challenging. Taking the case
    > where you have expensive (X) and cheap (C) objects to process, you may have
    > a file where the records are lined up X C X C X C X X X X X C C C. In this
    > case, you'll need to scan through the whole file and build splits such that
    > the lengthy run of expensive objects is broken up into separate splits, but
    > the run of cheap objects is consolidated.

    ^ I'm not concerned about the variation in processing time of objects;
    there isn't enough variation to worry about. I'm primarily concerned
    with having enough map tasks to utilize all nodes (and cores).

    > In general, I would just dodge the problem by making sure your splits are
    > relatively small compared to the size of your input data.

    ^ This sounds like the right solution. I'll still need to extend
    SequenceFileInputFormat, but it should be relatively simple to put a
    fixed number of objects into each split.

    thanks
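
    A rough sketch of that extension, assuming the old org.apache.hadoop.mapred
    API and that a one-off scan of each SequenceFile at job-submission time is
    acceptable; the 20,000-record figure is a placeholder. Split boundaries are
    still re-aligned to the next sync point at read time, so the per-split
    record count is approximate rather than exact.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class FixedRecordsSequenceFileInputFormat
        extends SequenceFileInputFormat<BytesWritable, BytesWritable> {

      private static final int RECORDS_PER_SPLIT = 20000; // placeholder value

      @Override
      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
          Path path = file.getPath();
          FileSystem fs = path.getFileSystem(job);
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, job);
          try {
            BytesWritable key = new BytesWritable();
            BytesWritable value = new BytesWritable();
            long splitStart = 0;
            int count = 0;
            // Cut a split every RECORDS_PER_SPLIT records, using the reader's
            // current file position as the boundary.
            while (reader.next(key, value)) {
              if (++count == RECORDS_PER_SPLIT) {
                long pos = reader.getPosition();
                splits.add(new FileSplit(path, splitStart, pos - splitStart,
                                         new String[0]));
                splitStart = pos;
                count = 0;
              }
            }
            if (file.getLen() > splitStart) { // remainder of the file
              splits.add(new FileSplit(path, splitStart,
                                       file.getLen() - splitStart, new String[0]));
            }
          } finally {
            reader.close();
          }
        }
        return splits.toArray(new InputSplit[splits.size()]);
      }
    }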
