FAQ
The input for an M/R job consists of many files, each smaller than a
block, so the number of map tasks equals the number of files.

I would like to control the number of maps so that one map task handles
multiple files (for example, grouping files together up to a block
size).

I don't want to use a separate M/R job to concatenate the files first,
as that is expensive (extra IO ops: read/write, then read/write again).

I don't want a COPY program either, as that is still expensive (extra IO
ops: read/write).

I know the files are not that big, but this is the common case in my
system, and either approach would increase the number of IO operations
significantly.

I'd rather have a custom InputSplit that takes multiple files up to a
given size; then I don't incur any extra IO ops.

Looking at the InputSplit interface, it does not seem designed to
support such a thing (consolidating multiple files into a single split).

Am I missing something in the APIs? Or is there another suggestion for
how to achieve the desired behavior?

Thxs.

Alejandro


  • Enis Soztutar at Oct 15, 2007 at 3:46 pm
    I'm not sure if it helps, but there are MultiFileSplit and
    MultiFileInputFormat, which are optimized for cases where numFiles >
    numMapTasks. Let me know if you have any further questions.
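The grouping strategy behind MultiFileInputFormat can be sketched in plain Java: cut a new group whenever the running total of file sizes passes totalLength / numSplits. This is an illustration of the idea under that assumption, not the actual Hadoop source, and the class and method names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the grouping idea: divide the files into roughly
// numSplits groups, starting a new group whenever the running total
// passes total / numSplits. Not the real Hadoop implementation.
public class MultiFileGrouping {

    public static List<List<Integer>> group(long[] lengths, int numSplits) {
        long total = 0;
        for (long l : lengths) total += l;
        long goal = Math.max(1, total / numSplits); // target bytes per split
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long acc = 0;
        for (int i = 0; i < lengths.length; i++) {
            current.add(i);
            acc += lengths[i];
            // Close the group once it reaches the target, but never
            // produce more than numSplits groups.
            if (acc >= goal && splits.size() < numSplits - 1) {
                splits.add(current);
                current = new ArrayList<>();
                acc = 0;
            }
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // Six equal-size files into 3 splits: two files per split.
        long[] lengths = {10, 10, 10, 10, 10, 10};
        System.out.println(group(lengths, 3)); // prints [[0, 1], [2, 3], [4, 5]]
    }
}
```

The key difference from the asker's request is the knob: MultiFileInputFormat takes a desired number of splits, whereas the asker wants a maximum size per split.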

  • Alejandro Abdelnur at Oct 24, 2007 at 2:40 pm
    Enis,

    I was trying to understand how MultiFileInputFormat works, but I could not.

    My use case is:

    * several small (a few megs) SequenceFiles as input files.

    I need to make sure I don't end up with a Map task per input file.

    Ideally, I would like each split to cover a set of input files whose
    combined size is X (the total size of all the files in the set).

    Ideas are welcome.

    A

Discussion Overview
group: common-user @
categories: hadoop
posted: Oct 15, '07 at 6:23a
active: Oct 24, '07 at 2:40p
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop
