Hadoop cluster network requirement
I was asked by our IT folks whether we can put the Hadoop NameNode's storage on a shared disk storage unit. Does anyone have experience with how much I/O throughput is required on the NameNode? What are the latency/data throughput requirements between the master and data nodes - can this tolerate network routing?

Has anyone published any throughput requirements or recommendations for the best network setup?

Thanks!
Jonathan


________________________________
This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.


  • Jagaran das at Jul 31, 2011 at 9:51 pm
What is the difference between the DFSClient protocol and the FileSystem class in Hadoop DFS (HDFS)? Both of these classes are used for connecting a remote client to the namenode in HDFS.

I wanted to know the advantages of one over the other, and which one is suitable for a remote-client connection.


    Regards,
    JD
  • Tsz Wo Sze at Aug 1, 2011 at 2:01 am
    Hi JD,

    FileSystem is a public API but DFSClient is an internal class.  For developing Hadoop applications, we should use FileSystem.
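    For example, a minimal sketch (untested; the namenode URI and path below are placeholders, not from this thread) of a remote client going through the public API:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ListRoot {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // FileSystem.get() returns the public facade; for hdfs:// URIs this
                // is a DistributedFileSystem, which wraps DFSClient internally, so
                // application code never touches DFSClient directly.
                FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
                for (FileStatus stat : fs.listStatus(new Path("/"))) {
                    System.out.println(stat.getPath());
                }
                fs.close();
            }
        }

    Coding against FileSystem also keeps a program portable across file system implementations (local, HDFS, S3, etc.).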

    Tsz-Wo



    ________________________________
    From: jagaran das <jagaran_das@yahoo.co.in>
    To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
    Sent: Sunday, July 31, 2011 2:50 PM
    Subject: DFSClient Protocol and FileSystem class



    What is the difference between the DFSClient protocol and the FileSystem class in Hadoop DFS (HDFS)? Both of these classes are used for connecting a remote client to the namenode in HDFS.

    I wanted to know the advantages of one over the other, and which one is suitable for a remote-client connection.


    Regards,
    JD
  • Jagaran das at Aug 2, 2011 at 3:29 am
    Hi,

    What is the max number of open connections to a namenode?

    I am using


    FSDataOutputStream out = dfs.create(src);
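    (For reference, a fuller, hedged sketch of that call; the URI and path are placeholders, and the comment reflects a general understanding of Hadoop's IPC layer rather than a documented limit:)

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class CreateFile {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem dfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
                // Namenode RPCs from one client are multiplexed over a shared IPC
                // connection, so open streams are not a one-to-one count of namenode
                // connections; the practical ceiling is namenode RPC handler capacity
                // (dfs.namenode.handler.count), not a fixed per-client cap.
                Path src = new Path("/tmp/out.dat");
                FSDataOutputStream out = dfs.create(src);
                try {
                    out.writeBytes("payload");
                } finally {
                    out.close(); // close promptly to release the file's lease
                }
                dfs.close();
            }
        }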

    Cheers,
    JD
  • Allen Wittenauer at Aug 1, 2011 at 2:16 am

    On Jul 31, 2011, at 12:08 PM, <jonathan.hwang@accenture.com> wrote:

    > I was asked by our IT folks whether we can put the Hadoop NameNode's storage on a shared disk storage unit.
    What do you mean by "shared disk storage unit"? There are lots of products out there that would claim this, so actual deployment semantics are important.
    > Does anyone have experience with how much I/O throughput is required on the NameNode?
    IO throughput is completely dependent upon how many changes are being applied to the file system and the frequency of edits log merging. In the majority of cases it is "not much". What tends to happen where the storage is shared (such as a NAS) is that the *other* traffic blocks the writes for too long because the unit is overloaded, and the NN declares it dead.
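    One common mitigation from this era (a sketch, assuming Hadoop 0.20/1.x property names; the paths are placeholders) is to have the NN write its metadata to both a local disk and the shared mount, so a stall on one copy does not take out the only metadata:

        <!-- hdfs-site.xml: dfs.name.dir takes a comma-separated list; the
             fsimage and edits log are written to every listed directory. -->
        <property>
          <name>dfs.name.dir</name>
          <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
        </property>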
    > What are the latency/data throughput requirements between the master and data nodes - can this tolerate network routing?
    If you mean "different data centers", then no. If you mean "same data center, but with routers in between", then probably yes, but you add several more failure points, so this isn't recommended.
    > Has anyone published any throughput requirements or recommendations for the best network setup?
    Not that I know of. It is very much dependent upon the actual workload being performed. But I wouldn't deploy anything slower than a 1:4 overcommit (uplink-to-host) on the DN side for anything real/significant.
    > This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.
    Lawyers are funny people. I wonder how much they got paid for this one.
  • Saqib Jang -- Margalla Communications at Aug 1, 2011 at 2:30 am
    Thanks, I'm independently doing some digging into Hadoop networking requirements and have a couple of quick follow-ups. Could I have some specific info on why different data centers cannot be supported for master node and data node comms? Also, what may be the benefits/use cases for such a scenario?

    Saqib

    -----Original Message-----
    From: jonathan.hwang@accenture.com
    Sent: Sunday, July 31, 2011 12:09 PM
    To: common-user@hadoop.apache.org
    Subject: Hadoop cluster network requirement

    I was asked by our IT folks whether we can put the Hadoop NameNode's storage on a shared disk storage unit. Does anyone have experience with how much I/O throughput is required on the NameNode? What are the latency/data throughput requirements between the master and data nodes - can this tolerate network routing?

    Has anyone published any throughput requirements or recommendations for the best network setup?

    Thanks!
    Jonathan


    ________________________________
    This message is for the designated recipient only and may contain
    privileged, proprietary, or otherwise private information. If you have
    received it in error, please notify the sender immediately and delete the
    original. Any other use of the email by you is prohibited.
  • Allen Wittenauer at Aug 1, 2011 at 3:29 am

    On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:

    > Thanks, I'm independently doing some digging into Hadoop networking
    > requirements and have a couple of quick follow-ups. Could I have some
    > specific info on why different data centers cannot be supported for
    > master node and data node comms? Also, what may be the benefits/use
    > cases for such a scenario?
    Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems:

    1) there is no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!)
    2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm
    3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and will cause delays for the DN heartbeats
    4) I don't even want to think about rebalancing.

    ... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it.

    If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
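    A hedged sketch of that parallel-clusters approach (cluster URIs and paths are placeholders; bulk copies would more typically use the distcp tool):

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;

        public class MirrorDataset {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Two completely independent clusters, each with its own namenode.
                FileSystem primary = FileSystem.get(URI.create("hdfs://nn-primary:8020/"), conf);
                FileSystem dr = FileSystem.get(URI.create("hdfs://nn-dr:8020/"), conf);
                // Copy a finished dataset to the DR cluster ('false' keeps the source).
                FileUtil.copy(primary, new Path("/data/daily"),
                              dr, new Path("/data/daily"), false, conf);
                primary.close();
                dr.close();
            }
        }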
  • Michael Segel at Aug 1, 2011 at 11:58 pm
    Yeah, what he said.
    It's never a good idea.
    Forget about losing a NN or a rack; just lose connectivity between data centers (it happens more than you think) and your entire cluster in both data centers goes down. Boom!

    It's a bad design.

    You're better off doing two different clusters.

    Is anyone really trying to sell this as a design? That's even more scary.

    Subject: Re: Hadoop cluster network requirement
    From: aw@apache.org
    Date: Sun, 31 Jul 2011 20:28:53 -0700
    To: common-user@hadoop.apache.org; saqibj@margallacomm.com

    On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:

    > Thanks, I'm independently doing some digging into Hadoop networking
    > requirements and have a couple of quick follow-ups. Could I have some
    > specific info on why different data centers cannot be supported for
    > master node and data node comms? Also, what may be the benefits/use
    > cases for such a scenario?
    Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems:

    1) there is no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!)
    2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm
    3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and will cause delays for the DN heartbeats
    4) I don't even want to think about rebalancing.

    ... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it.

    If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
  • Mohit Anchlia at Aug 2, 2011 at 12:29 am
    Even assuming everything is up, this solution still will not scale, given the latency, TCP/IP buffers, sliding window, etc. See BDP (bandwidth-delay product).
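    For scale (illustrative numbers, not from this thread): the bandwidth-delay product is BDP = bandwidth x RTT, so a 1 Gb/s inter-datacenter link with 50 ms round-trip time needs 1 Gb/s x 0.05 s = 50 Mb, i.e. about 6.25 MB, in flight per connection just to keep the pipe full, well beyond the default TCP window sizes of the time.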

    Sent from my iPad
    On Aug 1, 2011, at 4:57 PM, Michael Segel wrote:


    Yeah, what he said.
    It's never a good idea.
    Forget about losing a NN or a rack; just lose connectivity between data centers (it happens more than you think) and your entire cluster in both data centers goes down. Boom!

    It's a bad design.

    You're better off doing two different clusters.

    Is anyone really trying to sell this as a design? That's even more scary.

    Subject: Re: Hadoop cluster network requirement
    From: aw@apache.org
    Date: Sun, 31 Jul 2011 20:28:53 -0700
    To: common-user@hadoop.apache.org; saqibj@margallacomm.com

    On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:

    > Thanks, I'm independently doing some digging into Hadoop networking
    > requirements and have a couple of quick follow-ups. Could I have some
    > specific info on why different data centers cannot be supported for
    > master node and data node comms? Also, what may be the benefits/use
    > cases for such a scenario?
    Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems:

    1) there is no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!)
    2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm
    3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and will cause delays for the DN heartbeats
    4) I don't even want to think about rebalancing.

    ... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it.

    If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
  • Allen Wittenauer at Aug 3, 2011 at 1:05 am

    On Aug 1, 2011, at 4:57 PM, Michael Segel wrote:

    Forget about losing a NN or a rack; just losing connectivity between data centers. (It happens more than you think.)
    Someone somewhere :) told me that when a certain local government was short on cash, network connectivity to their company's data center in that country would suddenly drop. As soon as funds were made available in the form of donations to local charities, the data center would suddenly, miraculously be back online.

    I'm sure it was just coincidence.

Discussion Overview
Group: common-user@hadoop.apache.org
Category: hadoop
Posted: Jul 31, 2011 at 7:09 PM
Active: Aug 3, 2011 at 1:05 AM
Posts: 10
Users: 7
Website: hadoop.apache.org...
IRC: #hadoop
