FAQ

[HDFS-user] Moving disks from one datanode to another

Erik Forsberg
Dec 7, 2011 at 10:43 am
Hi!

I'm facing the problem where datanodes are marked as down due to them
being too slow in doing block reports, which in turn is due to too many
blocks per node (i.e. https://issues.apache.org/jira/browse/HADOOP-4584),
but I can't easily upgrade to 0.21.

So I came up with a possible workaround - run multiple datanode
instances on each physical node, each handling a subset of the disks on
that node. Not sure it will work, but could be worth a try.

So I configured a second datanode on one of my nodes, configured to run
on a different set of ports, and configured the two datanode instances
to use half of the disks each.
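
For illustration, a second instance would need overrides along these lines
in its hdfs-site.xml (0.20-era property names; the ports match the
50011/50081/50021 values in the log below, but the data dir paths here are
just placeholders):

<property>
  <name>dfs.data.dir</name>
  <value>/data/disk4/hdfs,/data/disk5/hdfs,/data/disk6/hdfs</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50011</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50081</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:50021</value>
</property>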

However, when starting up this configuration, I get the below exception
(UnregisteredDatanodeException) in the namenode log, and the datanode
then shuts down after reporting the same.

How can I work around this?

Removing the VERSION file in the data dir does not help; the datanode just
exits with an exception about the data dir being in an inconsistent state.

Can I simply edit the VERSION file in the data dirs that are on the new
instance, replacing e.g. the port number that's there with the new,
correct port number? Or will that confuse the datanode or namenode?
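
(For reference, on 0.20 the VERSION file sits under each data dir's
current/ and looks roughly like the following; the storageID format is the
one from the log below, the other values here are placeholders. The port
is embedded in the storageID, which is what the edit would touch.

#Tue Dec 06 12:00:00 CET 2011
namespaceID=123456789
storageID=DS-71308762-10.20.11.66-50010-1269957604444
cTime=0
storageType=DATA_NODE
layoutVersion=-18
)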

Or should I start the datanode with an empty data dir, let it register
with the namenode, immediately shut it down, and then use the VERSION file
from the empty data dir as the new VERSION file for all the data dirs that
already contain data?

I'm guessing what I'm trying to do would be equivalent to moving disks
from one host to another, something I can imagine would happen in some
system administration situations. So what would be the procedure for that?

Any help would be appreciated.

Thanks,
\EF

Full exception in namenode log:


2011-12-07 09:45:16,699 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9000, call blockReceived(DatanodeRegistration(10.20.40.14:50011, storageID=DS-71308762-10.20.11.66-50010-1269957604444, infoPort=50081, ipcPort=50021), [Lorg.apache.hadoop.hdfs.protocol.Block;@3aa57508, [Ljava.lang.String;@44a67e4c) from 10.20.40.14:58464: error: org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node 10.20.40.14:50011 is attempting to report storage ID DS-71308762-10.20.11.66-50010-1269957604444. Node 10.20.40.14:50010 is expected to serve this storage.
org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node 10.20.40.14:50011 is attempting to report storage ID DS-71308762-10.20.11.66-50010-1269957604444. Node 10.20.40.14:50010 is expected to serve this storage.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDatanode(FSNamesystem.java:3972)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.blockReceived(FSNamesystem.java:3388)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.blockReceived(NameNode.java:776)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

4 responses

  • Erik Forsberg at Dec 7, 2011 at 1:14 pm

    On 2011-12-07 11:43, Erik Forsberg wrote:
    Hi!

    I'm facing the problem where datanodes are marked as down due to them
    being too slow in doing block reports, which in turn is due to too many
    blocks per node (i.e. https://issues.apache.org/jira/browse/HADOOP-4584),
    but I can't easily upgrade to 0.21.

    So I came up with a possible workaround - run multiple datanode
    instances on each physical node, each handling a subset of the disks on
    that node. Not sure it will work, but could be worth a try.

    So I configured a second datanode on one of my nodes, configured to run
    on a different set of ports, and configured the two datanode instances
    to use half of the disks each.

    However, when starting up this configuration, I get the below exception
    (UnregisteredDatanodeException) in the namenode log, and the datanode
    then shuts down after reporting the same.
    I found a way:

    1) Configure second datanode with a set of fresh empty directories.
    2) Start second datanode, let it register with namenode.
    3) Shut down first and second datanode, then move blk* and subdir dirs
    from data dirs of first node to data dirs of second datanode.
    4) Start first and second datanode.
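
    In shell terms, step 3 is roughly the following for each disk that moves
    over (paths are illustrative; on 0.20 the block files and subdir*
    directories sit under current/ inside each data dir):

    # both datanode instances stopped at this point
    OLD=/data/disk4/hdfs        # data dir previously served by the first datanode
    NEW=/data/disk4/hdfs-dn2    # fresh data dir of the second datanode on the same disk
    mv $OLD/current/blk_*   $NEW/current/
    mv $OLD/current/subdir* $NEW/current/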

    This seems to work as intended. However, after some thinking I came to
    worry about the replication. HDFS will now consider the two datanode
    instances on the same host as two different hosts, which may cause
    replication to put two copies of the same file on the same host.

    It's probably not going to happen very often given that there's some
    randomness involved. And in my case there's always a third copy on
    another rack.

    Still, it's less than optimal. Are there any ways to fool HDFS into
    always placing all copies on different physical hosts in this rather
    messed up configuration?

    Thanks,
    \EF
  • Jeffrey Buell at Dec 7, 2011 at 6:22 pm

    I found a way:

    1) Configure second datanode with a set of fresh empty directories.
    2) Start second datanode, let it register with namenode.
    3) Shut down first and second datanode, then move blk* and subdir dirs
    from data dirs of first node to data dirs of second datanode.
    4) Start first and second datanode.

    This seems to work as intended. However, after some thinking I came to
    worry about the replication. HDFS will now consider the two datanode
    instances on the same host as two different hosts, which may cause
    replication to put two copies of the same file on the same host.

    It's probably not going to happen very often given that there's some
    randomness involved. And in my case there's always a third copy on
    another rack.

    Still, it's less than optimal. Are there any ways to fool HDFS into
    always placing all copies on different physical hosts in this rather
    messed up configuration?

    Thanks,
    \EF

    This is the same issue as for running multiple virtual machines on each physical host. I've found (on 0.20.2) that this gives consistently better performance than a single VM or a native OS instance (http://www.vmware.com/resources/techresources/10222), at least for I/O-intensive apps. I'm still investigating why, but one possibility is that a datanode can't efficiently handle too many disks (I have either 10 or 12 per physical host). So I'm very interested in seeing if multiple datanodes have a similar performance effect to multiple VMs (each with one DN).

    Back to replication: Hadoop doesn't know that the machines it's running on might share a physical host, so there is a possibility that 2 copies end up on the same host. What I'm doing now is define each host as a rack, so the second copy is guaranteed to go to a different host. I have a single physical rack. I'm tempted to call physical racks "super racks" to distinguish them from logical racks. A better scheme may be to divide the physical rack into 2 logical racks, so that most of the time the third copy goes on a different host than the second. I think that is the best that can be done today. Ideally we want to modify the block placement algorithm to recognize another level in the topology hierarchy for the multiple VM/DN case. A simpler solution would be to add an option where the third copy is placed in a third rack when available (and extended to n replicas on n racks instead of random placement for n>3). This would work for the single physical rack case with each host defined as a rack for the topology. Placing replicas on separate racks may be desirable for some conventional configurations also (e.g., ones with good inter-rack bandwidth).
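
    As a sketch, "define each host as a rack" just needs a trivial script pointed to by topology.script.file.name; the one below is purely illustrative (the IPs and host names are made up) and maps every datanode IP to a "rack" named after its physical host, so the second replica is forced onto a different host:

    #!/bin/bash
    # Hypothetical topology script: one "rack" per physical host.
    # Hadoop passes one or more IPs/hostnames as arguments and expects
    # one rack path per line in return.
    while [ $# -gt 0 ]; do
      case "$1" in
        10.0.0.11|10.0.0.12) echo /host-A ;;   # VMs / DN instances on physical host A
        10.0.0.13|10.0.0.14) echo /host-B ;;   # VMs / DN instances on physical host B
        *)                   echo /default-rack ;;
      esac
      shift
    done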

    Jeff
  • Charles Earl at Dec 7, 2011 at 6:46 pm
    Jeff,
    Interested in how you approach the virtualization on hadoop issue.
    In particular, I would like to have a VM launched as an environment which could in essence mount the local data node's disk (or replica).
    For my application, the users in essence want the map task running in a given virtualized environment, but have the task run against HDFS store.
    Conceptually, it would seem that you would want each VM to have a separate physically mounted disk?
    When I've used virtual disks, this has shown 30% worse performance on write-oriented map tasks than a physical disk mount. This was with KVM with virtio, in a simple test with randomwriter.
    I wonder if you had any suggestions in that regard.
    I'm actually just now compiling & testing a VM-based isolation module for Mesos (http://www.mesosproject.org/) in the hopes that this will address the need.
    The machine-as-rack paradigm seems quite interesting.
    Charles
    On Dec 7, 2011, at 1:21 PM, Jeffrey Buell wrote:

    I found a way:

    1) Configure second datanode with a set of fresh empty directories.
    2) Start second datanode, let it register with namenode.
    3) Shut down first and second datanode, then move blk* and subdir dirs
    from data dirs of first node to data dirs of second datanode.
    4) Start first and second datanode.

    This seems to work as intended. However, after some thinking I came to
    worry about the replication. HDFS will now consider the two datanode
    instances on the same host as two different hosts, which may cause
    replication to put two copies of the same file on the same host.

    It's probably not going to happen very often given that there's some
    randomness involved. And in my case there's always a third copy on
    another rack.

    Still, it's less than optimal. Are there any ways to fool HDFS into
    always placing all copies on different physical hosts in this rather
    messed up configuration?

    Thanks,
    \EF

    This is the same issue as for running multiple virtual machines on each physical host. I've found (on 0.20.2) that this gives consistently better performance than a single VM or a native OS instance (http://www.vmware.com/resources/techresources/10222), at least for I/O-intensive apps. I'm still investigating why, but one possibility is that a datanode can't efficiently handle too many disks (I have either 10 or 12 per physical host). So I'm very interested in seeing if multiple datanodes has a similar performance effect as multiple VMs (each with one DN).

    Back to replication: Hadoop doesn't know that the machines it's running on might share a physical host, so there is a possibility that 2 copies end up on the same host. What I'm doing now is define each host as a rack, so the second copy is guaranteed to go to a different host. I have a single physical rack. I'm tempted to call physical racks "super racks" to distinguish them from logical racks. A better scheme may be to divide the physical rack into 2 logical racks, so that most of the time the third copy goes on a different host than the second. I think that is the best that can be done today. Ideally we want to modify the block placement algorithm to recognize another level in the topology hierarchy for the multiple VM/DN case. A simpler solution would be to add an option where the third copy is placed in a third rack when available (and extended to n replicas on n racks instead of random placement for n>3). This would work for the single physical rack case with each host defined as a rack for the topology. Placing replicas on separate racks may be desirable for some conventional configurations also (e.g., ones with good inter-rack bandwidth).

    Jeff
  • Jeffrey Buell at Dec 7, 2011 at 8:21 pm
    Each physical host has 12 disks and I run 4 VMs with 3 disks dedicated to each. I happen to use Physical RDM (disks are passed through to the VMs), but this was done more for convenience (I can easily switch to native instances using the same storage). Using virtual disks on each physical disk should have negligible overhead. The important part is to effectively partition the physical resources (this includes processors and memory as well as disks) among the VMs. So if you happen to have 2 replicas on different VMs on the same host, you still have protection against any one disk or VM failing. I think this is similar to what you're thinking with multiple DNs.

    Jeff
    -----Original Message-----
    From: Charles Earl
    Sent: Wednesday, December 07, 2011 10:46 AM
    To: hdfs-user@hadoop.apache.org
    Subject: Re: Moving disks from one datanode to another

    Jeff,
    Interested in how you approach the virtualization on hadoop issue.
    In particular, I would like to have a VM launched as an environment
    which could in essence mount the local data node's disk (or replica).
    For my application, the users in essence want the map task running in a
    given virtualized environment, but have the task run against HDFS
    store.
    Conceptually, it would seem that you would want each VM to have
    separate physically mounted disk?
    When I've used virtual disk this has shown 30% worse performance on
    write-oriented map than physical disk mount. This was with kvm with
    virtio, simple test with randomwriter.
    I wonder if you had any suggestions in that regard.
    I'm actually just now compiling & testing a vm based isolation module
    for the mesos (http://www.mesosproject.org/) in the hopes that this
    will address the need.
    The machine-as-rack paradigm seems quite interesting.
    Charles
    On Dec 7, 2011, at 1:21 PM, Jeffrey Buell wrote:

    I found a way:

    1) Configure second datanode with a set of fresh empty directories.
    2) Start second datanode, let it register with namenode.
    3) Shut down first and second datanode, then move blk* and subdir
    dirs
    from data dirs of first node to data dirs of second datanode.
    4) Start first and second datanode.

    This seems to work as intended. However, after some thinking I came
    to
    worry about the replication. HDFS will now consider the two datanode
    instances on the same host as two different hosts, which may cause
    replication to put two copies of the same file on the same host.

    It's probably not going to happen very often given that there's some
    randomness involved. And in my case there's always a third copy on
    another rack.

    Still, it's less than optimal. Are there any ways to fool HDFS into
    always placing all copies on different physical hosts in this rather
    messed up configuration?

    Thanks,
    \EF

    This is the same issue as for running multiple virtual machines on
    each physical host. I've found (on 0.20.2) that this gives
    consistently better performance than a single VM or a native OS
    instance (http://www.vmware.com/resources/techresources/10222), at
    least for I/O-intensive apps. I'm still investigating why, but one
    possibility is that a datanode can't efficiently handle too many disks
    (I have either 10 or 12 per physical host). So I'm very interested in
    seeing if multiple datanodes has a similar performance effect as
    multiple VMs (each with one DN).
    Back to replication: Hadoop doesn't know that the machines it's
    running on might share a physical host, so there is a possibility that
    2 copies end up on the same host. What I'm doing now is define each
    host as a rack, so the second copy is guaranteed to go to a different
    host. I have a single physical rack. I'm tempted to call physical
    racks "super racks" to distinguish them from logical racks. A better
    scheme may be to divide the physical rack into 2 logical racks, so that
    most of the time the third copy goes on a different host than the
    second. I think that is the best that can be done today. Ideally we
    want to modify the block placement algorithm to recognize another level
    in the topology hierarchy for the multiple VM/DN case. A simpler
    solution would be to add an option where the third copy is placed in a
    third rack when available (and extended to n replicas on n racks
    instead of random placement for n>3). This would work for the single
    physical rack case with each host defined as a rack for the topology.
    Placing replicas on separate racks may be desirable for some
    conventional configurations also (e.g., ones with good inter-rack
    bandwidth).
    Jeff
