Hi,

In some scenarios you have gzipped files as input for your MapReduce
job (Apache logfiles are a common example).
Some of those files are several hundred megabytes and will therefore
be split by HDFS into several blocks.

When looking at a real 116MiB file on HDFS, I see this (4 nodes, replication = 2):

Total number of blocks: 2
25063947863662497: 10.10.138.62:50010 10.10.138.61:50010
1014249434553595747: 10.10.138.64:50010 10.10.138.63:50010

As you can see, the file has been distributed over all 4 nodes.

When actually reading those files, they are unsplittable due to the
nature of the gzip codec.
So a job will (in the above example) ALWAYS need to pull "the other
half" of the file over the network. If the file is bigger and the
cluster is bigger, the percentage of the file that goes over the
network will probably increase.

Now, if I could tell HDFS that a ".gz" file should always be "100%
local" to the node that will be doing the processing, this would
reduce the network I/O during the job dramatically, especially if you
want to run several jobs against the same input.

So my question is: is there a way to force/tell HDFS to make sure that
a datanode that has blocks of this file always has ALL blocks of
this file?
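
For reference, a block listing like the one above can be obtained with
fsck (the path here is just an example):

    hadoop fsck /user/nbasjes -files -blocks -locations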

--
Best regards,

Niels Basjes


  • Harsh J at Apr 27, 2011 at 9:07 am
    Hey Niels,

    The block size is a per-file property. Would putting/creating these
    gzip files on the DFS with a very high block size (such that such
    files are not split across blocks) be a valid solution to your
    problem here?
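
    A minimal sketch of doing that through the Java API, assuming the
    FileSystem.create overload that takes the block size as its last
    argument (paths, replication and sizes here are only examples):

        import java.io.FileInputStream;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class SingleBlockPut {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Local gzip file and HDFS target path (example values).
            FileInputStream in = new FileInputStream("/home/nbasjes/access.log.gz");
            Path dst = new Path("/user/nbasjes/access.log.gz");
            long blockSize = 1024L * 1024L * 1024L; // 1 GiB, assumed larger than the file
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(dst, true, 4096, (short) 2, blockSize);
            IOUtils.copyBytes(in, out, 4096, true); // copy and close both streams
          }
        }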
    On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes wrote:
    [quoted text snipped]


    --
    Harsh J
  • Niels Basjes at Apr 27, 2011 at 9:49 am
    Hi,

    I did the following with a 1.6GB file:
    hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
    and I got:

    Total number of blocks: 1
    4189183682512190568: 10.10.138.61:50010 10.10.138.62:50010

    Yes, that does the trick. Thank you.
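
    The same per-file flag should also work for loading a whole batch of
    gzipped logs in one go (paths here are just an example):

    hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/logs/*.gz /user/nbasjes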

    Niels

    2011/4/27 Harsh J <harsh@cloudera.com>:
    [quoted text snipped]


    --
    With kind regards,

    Niels Basjes
  • Steve Loughran at Apr 27, 2011 at 11:10 am

    On 27/04/11 10:48, Niels Basjes wrote:
    I did the following with a 1.6GB file:
    hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes
    [rest of quoted text snipped]
    Don't set a block size >2GB; not all of the code paths that use
    signed 32-bit integers have been eliminated yet.
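
    To illustrate the limit (a minimal sketch, assuming a code path that
    still narrows the block size to a signed 32-bit int):

        public class BlockSizeOverflow {
          public static void main(String[] args) {
            long blockSize = 2147483648L; // 2 GiB, one past Integer.MAX_VALUE (2147483647)
            // Narrowing to a signed 32-bit integer wraps to a negative value:
            System.out.println((int) blockSize); // prints -2147483648
          }
        }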
