Hi,

We had a cluster of 9 machines with one name node and 8 data nodes (2 had
220GB hard disk space, the rest had 450GB).
Most of the space on the first two machines with 250GB disk space was consumed.
Now we have added two new machines, each with 450GB hard disk space, as data nodes.

Is there any way to redistribute files on HDFS so that there will be
considerable free space left on the first two machines, without
downloading the files to one local machine and then uploading them back to
HDFS?

~
Prashant,
SIEL,
IIIT-Hyderabad.


  • Ravi Phulari at Aug 7, 2009 at 5:49 pm
    Use Rebalancer

    http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html#Rebalancer
    -
    Ravi
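
Editor's sketch of what the balancer's threshold means, as described in the linked user guide: a cluster counts as balanced when every data node's utilization (used / capacity) is within the threshold, 10% by default, of the cluster-wide utilization. The node names and per-node figures below are hypothetical, since the thread only reports cluster totals, and the function illustrates the criterion rather than the balancer's actual code.

```python
from typing import Dict, List, Tuple


def utilization(used: float, capacity: float) -> float:
    """Return used space as a percentage of capacity."""
    return 100.0 * used / capacity


def out_of_balance(nodes: Dict[str, Tuple[float, float]],
                   threshold: float = 10.0) -> List[str]:
    """List nodes whose utilization differs from the cluster-wide
    utilization by more than `threshold` percentage points."""
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    cluster = utilization(total_used, total_capacity)
    return sorted(
        name for name, (used, cap) in nodes.items()
        if abs(utilization(used, cap) - cluster) > threshold
    )


# Hypothetical (used GB, capacity GB) per node; not taken from the thread.
example = {
    "small-1": (200.0, 220.0),
    "small-2": (205.0, 220.0),
    "big-1": (150.0, 450.0),
    "new-1": (0.0, 450.0),
}

print(out_of_balance(example))
```

With these made-up figures the two nearly full small nodes and the empty new node fall outside the default threshold, while the half-used large node does not. The command-line form is `hadoop balancer -threshold <percent>`; a smaller threshold gives a tighter balance but more data to move.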
  • Ted Dunning at Aug 7, 2009 at 5:59 pm
    Make sure you rebalance soon after adding the new node. Otherwise, you will
    have an age bias in file distribution. This can, in some applications, lead
    to some strange effects. For example, if you have log files that you delete
    when they get too old, disk space will be freed non-uniformly. This
    shouldn't affect performance much, but it can lead to a need to rebalance
    again (and again) later. Normal file churn combined with occasional
    rebalancing should eventually fix this, but it is nicer not to need it.

    --
    Ted Dunning, CTO
    DeepDyve
  • Prashant ullegaddi at Aug 8, 2009 at 5:11 am
    Thank you Ravi and Ted.

    I ran hadoop balancer without default threshold. It's been running for the
    last 8 hours!
    How long should it take, given the following DFS stats?

    3140 files and directories, 10295 blocks = 13435 total.
    Heap Size is 17.88 MB / 963 MB (1%)
    Capacity: 3.93 TB
    DFS Remaining: 2.11 TB
    DFS Used: 1.31 TB
    DFS Used%: 33.44 %
    Live Nodes <http://megh01:50070/dfshealth.jsp#LiveNodes>: 10
    Dead Nodes <http://megh01:50070/dfshealth.jsp#DeadNodes>: 0

    If I interrupt it now, what will happen? I have to run a job now. I think
    balancing and running a job should not happen together, as one will slow
    down the other.

    Thanks,
    Prashant.
  • Prashant ullegaddi at Aug 8, 2009 at 5:14 am
    Sorry for the mistake in the previous mail. I meant that I ran the balancer
    with the default threshold.

  • Ted Dunning at Aug 8, 2009 at 5:42 am
    I seem to remember that you essentially doubled your storage before
    starting balancing.

    This means that about 1 TB will need to be copied. By default the balancer
    only moves 1 MB/s (per node, I believe). This means that it will take a LONG
    time to balance your cluster. You can increase this speed limit, but there
    usually isn't much need to do so. Running the balancer while using your
    cluster is generally not a big deal, since the balancer consumes so little
    bandwidth.


    --
    Ted Dunning, CTO
    DeepDyve
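
Editor's sketch of the arithmetic behind the "LONG time" estimate above, assuming the roughly 1 TB to copy and the 1 MB/s per-node limit quoted in the reply. How many nodes actually transfer in parallel is an assumption here (the `concurrent_nodes` parameter is hypothetical), so the result is an order-of-magnitude figure, not a prediction of the balancer's real schedule.

```python
def estimate_balance_hours(bytes_to_move: int,
                           bandwidth_per_node: int = 1024 * 1024,
                           concurrent_nodes: int = 1) -> float:
    """Estimate hours needed to move `bytes_to_move` when each of
    `concurrent_nodes` nodes transfers at `bandwidth_per_node` bytes/s."""
    if bandwidth_per_node <= 0 or concurrent_nodes <= 0:
        raise ValueError("bandwidth and node count must be positive")
    seconds = bytes_to_move / (bandwidth_per_node * concurrent_nodes)
    return seconds / 3600.0


ONE_TB = 1024 ** 4  # the ~1 TB figure from the reply, taken as 2**40 bytes

# Assumed degrees of parallelism: one node, the two new nodes, all ten nodes.
for nodes in (1, 2, 10):
    hours = estimate_balance_hours(ONE_TB, concurrent_nodes=nodes)
    print(f"{nodes} node(s): about {hours:.0f} hours")
```

At 1 MB/s through a single node this comes to roughly 291 hours, and still more than a day if all ten nodes transferred in parallel, which is consistent with a run that has not finished after 8 hours. In Hadoop 0.20 the limit is, to the best of my knowledge, set by the `dfs.balance.bandwidthPerSec` property (bytes per second, default 1048576).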

Discussion Overview
group: common-user @
categories: hadoop
posted: Aug 7, '09 at 5:38p
active: Aug 8, '09 at 5:42a
posts: 6
users: 3
website: hadoop.apache.org...
irc: #hadoop
