Hi,

I am trying to use 4 SATA disks per node in my Hadoop cluster. This is
a JBOD configuration; no RAID is involved. There is a single xfs
partition per disk, mounted as /local, /local2, /local3, and /local4,
each with sufficient privileges for running Hadoop jobs. HDFS is set
up across the 4 disks for single-user usage (user2) with the following
comma-separated list in hadoop.tmp.dir:

<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/dfs/data</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/local/user2/hdfs/hadoop-${user.name},/local2/user2/hdfs/hadoop-${user.name},/local3/user2/hdfs/hadoop-${user.name,/local4/user2/hdfs/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>

What I see is that most or all data is stored on disks /local and
/local4 across nodes. The directories on the other two disks, /local2
and /local3, are not used. I have verified that those disks can be
written to and have free space.

Isn't HDFS supposed to use all disks in a round-robin way, provided
there is free space on all of them? Do I need to change another config
parameter for HDFS to spread I/O across all provided mount points?

- Vasilis

  • Allen Wittenauer at Feb 9, 2010 at 7:37 pm

    On 2/9/10 8:49 AM, "Vasilis Liaskovitis" wrote:
    <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
    </property>

    <property>
    <name>hadoop.tmp.dir</name>

    <value>/local/user2/hdfs/hadoop-${user.name},/local2/user2/hdfs/hadoop-${user.name},/local3/user2/hdfs/hadoop-${user.name,/local4/user2/hdfs/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
    </property>

    What I see is that most or all data is stored on disks /local and
    /local4 across nodes. The directories on the other two disks, /local2
    and /local3, are not used. I have verified that those disks can be
    written to and have free space.

    Isn't HDFS supposed to use all disks in a round-robin way, provided
    there is free space on all of them? Do I need to change another config
    parameter for HDFS to spread I/O across all provided mount points?

    You've fallen into a trap that the defaults lay for you. You're not the
    only one, and I think I'm going to file a JIRA to fix this.

    What you really want is:

    dfs.data.dir pointed to /local/user2/hdfs/dfs-data,
    /local2/user2/hdfs/dfs-data, etc

    hadoop.tmp.dir pointed to /local/user2/tmp/hadoop-${user.name},
    /local2/user2/tmp/hadoop-${user.name}, etc


    The hadoop.tmp.dir expansion is meant for a really quick QA and not for Real
    Work (TM).
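
    For concreteness, a sketch of the configuration Allen describes,
    expanding the "etc" in his reply to all four mounts (the dfs-data and
    tmp path names follow his example; the full per-disk lists are an
    extrapolation, not something stated verbatim in this thread):

    <property>
    <name>dfs.data.dir</name>
    <value>/local/user2/hdfs/dfs-data,/local2/user2/hdfs/dfs-data,/local3/user2/hdfs/dfs-data,/local4/user2/hdfs/dfs-data</value>
    </property>

    <property>
    <name>hadoop.tmp.dir</name>
    <value>/local/user2/tmp/hadoop-${user.name},/local2/user2/tmp/hadoop-${user.name},/local3/user2/tmp/hadoop-${user.name},/local4/user2/tmp/hadoop-${user.name}</value>
    </property>
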
  • Todd Lipcon at Feb 9, 2010 at 7:41 pm
    Hi Vasilis,

    Two things:

    1) You're missing a matching } in your hadoop.tmp.dir setting
    2) When you use ${hadoop.tmp.dir}/dfs/data, Hadoop does literal string
    interpolation. Thus, it's not adding dfs/data to each of the
    hadoop.tmp.dir directories, but rather just to the last one.
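
    Concretely, assuming the daemons run as user2, the value above expands
    to one literal string, roughly

    /local/user2/hdfs/hadoop-user2,/local2/user2/hdfs/hadoop-user2,...,/local4/user2/hdfs/hadoop-user2/dfs/data

    so /dfs/data ends up appended only after the final entry.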

    I'd recommend setting dfs.data.dir explicitly to the full
    comma-separated list and ignoring hadoop.tmp.dir.
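
    A minimal sketch of that explicit setting, reusing the mount points
    from the original post (the dfs/data subdirectory name is an
    assumption):

    <property>
    <name>dfs.data.dir</name>
    <value>/local/user2/hdfs/dfs/data,/local2/user2/hdfs/dfs/data,/local3/user2/hdfs/dfs/data,/local4/user2/hdfs/dfs/data</value>
    </property>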

    Thanks
    -Todd
    On Tue, Feb 9, 2010 at 8:49 AM, Vasilis Liaskovitis wrote:
    Hi,

    I am trying to use 4 SATA disks per node in my Hadoop cluster. This is
    a JBOD configuration; no RAID is involved. There is a single xfs
    partition per disk, mounted as /local, /local2, /local3, and /local4,
    each with sufficient privileges for running Hadoop jobs. HDFS is set
    up across the 4 disks for single-user usage (user2) with the following
    comma-separated list in hadoop.tmp.dir:

    <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
    </property>

    <property>
    <name>hadoop.tmp.dir</name>
    <value>/local/user2/hdfs/hadoop-${user.name},/local2/user2/hdfs/hadoop-${user.name},/local3/user2/hdfs/hadoop-${user.name,/local4/user2/hdfs/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
    </property>

    What I see is that most or all data is stored on disks /local and
    /local4 across nodes. The directories on the other two disks, /local2
    and /local3, are not used. I have verified that those disks can be
    written to and have free space.

    Isn't HDFS supposed to use all disks in a round-robin way, provided
    there is free space on all of them? Do I need to change another config
    parameter for HDFS to spread I/O across all provided mount points?

    - Vasilis
