allow minimum split size configurable
-------------------------------------

Key: HADOOP-93
URL: http://issues.apache.org/jira/browse/HADOOP-93
Project: Hadoop
Type: Bug
Reporter: Hairong Kuang


The current default split size is the size of a block (32M), and SequenceFile sets it to SequenceFile.SYNC_INTERVAL (2K). We currently have a Map/Reduce application working on crawled documents. Its input data consists of 356 sequence files, each around 30G in size. A jobtracker takes forever to launch the job because it needs to generate 356*30G/2K map tasks!

The proposed solution is to make the minimum split size configurable so that the programmer can control the number of tasks generated.
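For scale, the task counts at issue can be checked with a little arithmetic. This is a standalone sketch; the figures (356 files, ~30 GB each, 2 KB sync interval, 32 MB block) come from the report above.

```java
// Standalone check of the split counts described in the report.
public class SplitCount {
    static long splits(long totalBytes, long splitSize) {
        // Each file is carved into ceil(totalBytes / splitSize) map tasks.
        return (totalBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long files = 356;
        long fileBytes = 30L * 1024 * 1024 * 1024; // ~30 GB per file
        long syncInterval = 2 * 1024;              // SequenceFile.SYNC_INTERVAL (2K)
        long blockSize = 32L * 1024 * 1024;        // 32 MB dfs block

        // 2 KB splits: roughly 5.6 billion map tasks
        System.out.println(files * splits(fileBytes, syncInterval)); // 5599395840
        // 32 MB (block-sized) splits: still over 340,000 map tasks
        System.out.println(files * splits(fileBytes, blockSize));    // 341760
    }
}
```

Either way the count is far too large to launch quickly, which is what motivates a configurable floor on the split size.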

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira

  • Hairong Kuang (JIRA) at Mar 17, 2006 at 8:14 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]

    Hairong Kuang updated HADOOP-93:
    --------------------------------

    Attachment: hadoop-93.fix
  • Doug Cutting (JIRA) at Mar 17, 2006 at 9:17 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=comments#action_12370887 ]

    Doug Cutting commented on HADOOP-93:
    ------------------------------------

    With such big input files the default logic should split things into dfs block-sized splits. Smaller splits should only be used if this would result in fewer than mapred.map.tasks splits. What value do you have for mapred.map.tasks in your mapred-default.xml? Let's make sure that is working before we add a new min.split.size feature. I don't oppose the feature, but it should be generating 356*30G/32M splits, not 356*30G/2K splits as you claim. That's still a lot of splits. If it is too many then we should add the feature you're adding.
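The default behavior Doug describes can be paraphrased in a short sketch. Note this is a reconstruction of the described logic, not the actual Hadoop 0.1 code; the method name and shape are invented.

```java
// Paraphrase of the default split-size logic described above: use
// block-sized splits for big inputs, and only shrink below the block
// size when that would produce fewer than the requested number of maps.
public class DefaultSplitLogic {
    static long chooseSplitSize(long totalBytes, long blockSize,
                                long requestedMaps, long minSplitSize) {
        long blockSizedSplits = totalBytes / blockSize;
        if (blockSizedSplits >= requestedMaps) {
            return blockSize; // big inputs: one split per dfs block
        }
        // small inputs: shrink splits toward totalBytes/requestedMaps,
        // but never below the format's minimum split size
        return Math.max(minSplitSize, totalBytes / requestedMaps);
    }

    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024;
        long min = 2 * 1024; // SequenceFile.SYNC_INTERVAL
        // ~10 TB input, 500 requested maps: block-sized splits win
        System.out.println(chooseSplitSize(10L * 1024 * 1024 * 1024 * 1024,
                                           blockSize, 500, min)); // 33554432
        // 10 MB input, 100 requested maps: splits shrink to ~100 KB
        System.out.println(chooseSplitSize(10L * 1024 * 1024,
                                           blockSize, 100, min)); // 104857
    }
}
```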

    Note that, as a workaround, it is also easy to implement this w/o patching by defining an InputFormat that subclasses InputFormatBase and specifies a different minSplitSize. But making that a long is a good idea.

    So, in summary, can you please confirm that the actual number of splits that you object to is 356*30G/32M splits, not 356*30G/2K? Thanks.
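The workaround Doug mentions, subclassing the input format to pin a larger minimum split size, looks roughly like the sketch below. The classes here are simplified stand-ins for the real Hadoop ones (InputFormatBase et al.), just to show the override pattern.

```java
// Sketch of the no-patch workaround: subclass the input format and
// raise its minimum split size. Stand-in classes, not the Hadoop API.
public class MinSplitOverride {
    // Stand-in for InputFormatBase: owns the minimum-split-size knob.
    static class InputFormatBase {
        protected long minSplitSize = 2 * 1024; // e.g. SequenceFile.SYNC_INTERVAL

        long numSplits(long fileBytes, long blockSize) {
            long splitSize = Math.max(minSplitSize, blockSize);
            return (fileBytes + splitSize - 1) / splitSize;
        }
    }

    // The workaround: a subclass that pins a larger minimum.
    static class BigSplitInputFormat extends InputFormatBase {
        BigSplitInputFormat() {
            minSplitSize = 1024L * 1024 * 1024; // force ~1 GB splits
        }
    }

    public static void main(String[] args) {
        long fileBytes = 30L * 1024 * 1024 * 1024; // one ~30 GB file
        long blockSize = 32L * 1024 * 1024;        // 32 MB dfs block
        System.out.println(new InputFormatBase().numSplits(fileBytes, blockSize));     // 960
        System.out.println(new BigSplitInputFormat().numSplits(fileBytes, blockSize)); // 30
    }
}
```

Making the field a long, as suggested, matters once minimum split sizes exceed 2 GB.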
  • Owen O'Malley (JIRA) at Mar 17, 2006 at 10:05 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=comments#action_12370894 ]

    Owen O'Malley commented on HADOOP-93:
    -------------------------------------
    From what I've seen, it is always 32M fragments, but that is still 300k input splits/maps, which is a lot. We'd like to be able to drop that by an order of magnitude. (I think in this case that the input splitter never finished, so we don't know.)
  • Owen O'Malley (JIRA) at Mar 17, 2006 at 10:35 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]

    Owen O'Malley updated HADOOP-93:
    --------------------------------

    Component: mapred
    Fix Version: 0.1
    Version: 0.1
  • Hairong Kuang (JIRA) at Mar 17, 2006 at 10:44 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=comments#action_12370899 ]

    Hairong Kuang commented on HADOOP-93:
    -------------------------------------

    Doug, you are right. The number of splits we got was 356*30G/32M, but still too many.
  • Hairong Kuang (JIRA) at Mar 17, 2006 at 11:29 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]

    Hairong Kuang updated HADOOP-93:
    --------------------------------

    Attachment: hadoop_87.fix

    Updated patch
  • Owen O'Malley (JIRA) at Mar 17, 2006 at 11:35 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]

    Owen O'Malley updated HADOOP-93:
    --------------------------------

    Attachment: (was: hadoop_87.fix)
  • Eric Baldeschwieler at Mar 21, 2006 at 5:34 am
    It doesn't seem like we are going to get exactly what we want by
    simply going with bigger splits. What we really want is to read all
    of the information locally if possible. We lose control of that by
    increasing the split beyond the block size. This is relevant because
    we'll have more tasks like this in the future. The simple
    aggregations and samples that keep coming up in user requests will
    all look like this.

    It seems to me there are two ways to deal with this:

    1) Make 300k task map jobs work efficiently. How possible /
    impossible is this?

    2) Make jobs which consume a set of blocks which are all local to a
    node. This seems possible, but will require a fair rethink on APIs /
    abstractions.

    ---

    Which way should we push things? Giving up node / switch locality
    on reads seems like the wrong decision, yet that is where we just
    headed by allowing split sizes > the block size.

    PS Increasing the block size to 64M or 128M would clearly help some,
    but it does not handle the overall issue. Although maybe 256M might
    prove an interesting size...
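    Option (2) above can be sketched by grouping a file's block replicas by
    the host that stores them, so one map task reads several blocks without
    leaving the node. All names and shapes here are invented for
    illustration; this is not a Hadoop API.

    ```java
    import java.util.*;

    // Illustrative sketch of option (2): bucket blocks by host so each
    // bucket can become a single multi-block, node-local split.
    public class LocalBlockGrouping {
        record Block(int id, String host) {} // one replica location per block, simplified

        static Map<String, List<Integer>> groupByHost(List<Block> blocks) {
            Map<String, List<Integer>> byHost = new TreeMap<>();
            for (Block b : blocks) {
                byHost.computeIfAbsent(b.host(), h -> new ArrayList<>()).add(b.id());
            }
            return byHost;
        }

        public static void main(String[] args) {
            List<Block> blocks = List.of(
                new Block(0, "node1"), new Block(1, "node2"),
                new Block(2, "node1"), new Block(3, "node1"),
                new Block(4, "node2"));
            // Five blocks collapse into two node-local splits instead of five tasks.
            System.out.println(groupByHost(blocks)); // {node1=[0, 2, 3], node2=[1, 4]}
        }
    }
    ```

    The real version would need to account for replica choice and stragglers,
    which is why it requires the API rethink mentioned above.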
  • Doug Cutting (JIRA) at Mar 21, 2006 at 5:39 pm
    [ http://issues.apache.org/jira/browse/HADOOP-93?page=all ]

    Doug Cutting resolved HADOOP-93:
    --------------------------------

    Resolution: Fixed
    Assign To: Doug Cutting

    Okay, I have applied this.

    For the record, patches are easier to apply if they are made from the root of the project. Also, new config properties should generally be added to hadoop-default.xml. Finally, the cast added in SequenceFileInputFormat was not required.
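    Per the note above, a new config property would normally also get an
    entry in hadoop-default.xml. The fragment below is a hypothetical
    sketch: the property name (mapred.min.split.size) and default value
    are assumptions, not confirmed by this thread; the exact name should
    be taken from the applied patch.

    ```xml
    <!-- Hypothetical hadoop-default.xml entry; name and default are
         assumptions, not confirmed by this thread. -->
    <property>
      <name>mapred.min.split.size</name>
      <value>0</value>
      <description>The minimum size chunk that map input should be
      split into. A larger value reduces the number of map tasks at
      the cost of read locality.</description>
    </property>
    ```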

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 17, '06 at 8:01p
active: Mar 21, '06 at 5:39p
posts: 10
website: hadoop.apache.org...
irc: #hadoop
