Question about disk space allocation in hadoop
Hi all,

As we all know, machines in a Hadoop cluster may be both datanode and
tasktracker, so one machine may store both MR job intermediate data
and HDFS data. My question is: if we have more than one disk per node,
say 4 disks, and would like both job intermediate data and HDFS data
stored across all disks to reduce the I/O load on each single disk, can
we draw a line between the space used by the local FS and by HDFS? For
example, can we restrict the intermediate temp data to no more than 25%
of the space on each disk? Thanks in advance.

Best Regards,
Carp


  • Yu Li at Jun 30, 2010 at 3:11 am
    Hi all,

    Does anybody have experience with this? Any comments or suggestions
    would be highly appreciated. Thanks.

    Best Regards,
    Carp

  • Steve Loughran at Jun 30, 2010 at 10:11 am

    There is a configuration parameter to limit the space used by either
    HDFS or temp storage, but I forget its name; you'll have to look
    through the docs.

    -steve
  • Vitaliy Semochkin at Jun 30, 2010 at 10:19 am
    Set dfs.datanode.du.reserved to the number of bytes you want to reserve
    for non-HDFS usage.
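
A minimal sketch of how the value for this setting might be worked out, assuming hypothetical 1 TB disks and the 25% figure from the original question. The helper names and the disk size are illustrative and not from the thread. To my understanding the value is applied to each volume separately, and it only keeps HDFS from filling the reserved space; it does not by itself cap how much the intermediate data can use.

```python
def reserved_bytes(disk_capacity_bytes: int, non_hdfs_percent: int) -> int:
    """Bytes to keep free for non-HDFS use on one disk (integer arithmetic)."""
    if not 0 <= non_hdfs_percent <= 100:
        raise ValueError("non_hdfs_percent must be between 0 and 100")
    return disk_capacity_bytes * non_hdfs_percent // 100


def hdfs_site_property(value_bytes: int) -> str:
    """Render the <property> element to paste into hdfs-site.xml."""
    return (
        "<property>\n"
        "  <name>dfs.datanode.du.reserved</name>\n"
        f"  <value>{value_bytes}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Assumption: each disk holds 1 TB (10**12 bytes); adjust to your hardware.
    disk_capacity = 10**12
    print(hdfs_site_property(reserved_bytes(disk_capacity, 25)))
```

The printed element goes inside the <configuration> section of hdfs-site.xml on each datanode.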

    PS
    For search convenience, IMHO it is better to post such questions to
    hdfs-user@hadoop.apache.org ;-)


    Regards,
    Vitaliy S
  • Yu Li at Jun 30, 2010 at 1:34 pm
    Hi Steve and Vitaliy,

    Thanks a lot for your answers, and thanks to Vitaliy for the suggestion.
    I'll send such questions to the relevant mailing list :)

    Best Regards,
    Carp
  • Chris Smith at Jun 30, 2010 at 2:56 pm
    Some thoughts on how to restrict the temporary data, but I have only
    tried (a) in anger:

    a) Partition your disks into HDFS and intermediate temp partitions
    of the relevant size. This gives a fixed separation but is
    difficult/impossible to modify on a busy cluster, especially as there
    may be no way of unloading/recovering the data stored in HDFS if you
    make a mistake resizing partitions.

    b) Implement disk quotas and set relevant hard and soft limits on
    the relevant root directories for intermediate space. This gives you
    the flexibility to change the limits when required, but as the limits
    are per user/group, some thought may be required as to which user/group
    the limits apply to. There may also be a performance impact.

    You could combine this with setting the “dfs.datanode.du.reserved” value
    in $HADOOP_HOME/conf/hdfs-site.xml to limit HDFS disk usage.

    c) Implement intermediate data space as a loopback file, see:
    http://wiki.cita.utoronto.ca/mediawiki/index.php/Fake_Fast_Local_Disk
    This example implements a temporary loopback filesystem on an iSCSI
    mounted Lustre filesystem, but the principles are the same. There are
    some performance benchmarks linked to in section 3. The intermediate
    temp data space is limited by the size of the loopback file created.
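
Option (c) could be sketched roughly as follows. This is only an illustration: the script builds the shell commands for a fixed-size loopback filesystem on each disk and prints them, without running anything. The paths, the image size, and the choice of ext3 are assumptions, not details from the thread. It also assumes that mapred.local.dir is the property naming the intermediate data directories, as in Hadoop releases of that era; check the docs for your version.

```python
import shlex
from typing import List


def loopback_commands(image_path: str, mount_point: str, size_mb: int) -> List[str]:
    """Return shell commands that create and mount a fixed-size loopback
    filesystem. The commands are only built here, never executed."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    steps = [
        # Allocate a file of exactly size_mb megabytes; this is the hard cap.
        ["dd", "if=/dev/zero", f"of={image_path}", "bs=1M", f"count={size_mb}"],
        # -F lets mkfs work on a regular file instead of a block device.
        ["mkfs.ext3", "-q", "-F", image_path],
        ["mkdir", "-p", mount_point],
        ["mount", "-o", "loop", image_path, mount_point],
    ]
    return [" ".join(shlex.quote(arg) for arg in step) for step in steps]


if __name__ == "__main__":
    # Hypothetical layout: four disks mounted at /data1 .. /data4.
    disks = [f"/data{i}" for i in range(1, 5)]
    mount_points = []
    for disk in disks:
        mount_point = f"{disk}/mapred-local"
        mount_points.append(mount_point)
        for cmd in loopback_commands(f"{disk}/mapred-tmp.img", mount_point, 250 * 1024):
            print(cmd)
    # Value to use for mapred.local.dir (assumed property name).
    print(",".join(mount_points))
```

The printed commands would need to be run as root on each node. The size of the image file is what bounds the intermediate data, so changing the limit later means recreating or resizing the image.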

    Chris

  • Yu Li at Jul 1, 2010 at 7:52 am
    Hi Chris,

    Thanks a lot for sharing your knowledge. I'll investigate further and
    try these on my cluster; hopefully one of them will turn out to be a
    good solution :)

    Best Regards,
    Carp


Discussion Overview
group: common-user @
category: hadoop
posted: Jun 29, '10 at 4:33a
active: Jul 1, '10 at 7:52a
posts: 7
users: 4
website: hadoop.apache.org...
irc: #hadoop