Folks,

I've been digging into the potential benefits of using 10 Gigabit Ethernet
(10GbE) NIC server connections for Hadoop, and wanted to run what I've come
up with through initial research by the list for 'sanity check' feedback.
I'd very much appreciate your input on the importance (or lack of it) of the
following potential benefits of 10GbE server connectivity, as well as other
thoughts regarding 10GbE and Hadoop. (My interest is specifically in the value
of 10GbE server connections and 10GbE switching infrastructure over scenarios
such as bonded 1GbE server connections with 10GbE switching.)

1. HDFS Data Loading. The higher throughput enabled by 10GbE server and
switching infrastructure allows faster processing and distribution of data.

2. Hadoop Cluster Scalability. High performance in initial data processing
and distribution directly impacts the degree of parallelism or scalability
supported by the cluster.

3. HDFS Replication. Higher-speed server connections allow faster file
replication.

4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and latency
directly impact the shuffle phase of a data set reduction, especially for
tasks that work at the document level (including large documents) with lots
of metadata generated by those documents, as well as for video analytics and
images.

5. Data Reporting. 10GbE server network performance can improve data
reporting performance, especially if the Hadoop cluster is running multiple
data reductions.

6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be
reorganized to use a cluster or network file system. This would allow Hadoop,
even with its Java implementation, to have higher-performance I/O and not
have to be so concerned with disk drive density in the same server.

7. Others?

thanks,
Saqib

Saqib Jang
Principal/Founder
Margalla Communications, Inc.
1339 Portola Road, Woodside, CA 94062
(650) 274 8745
www.margallacomm.com

  • Darren Govoni at Jun 28, 2011 at 5:21 pm
    Hadoop, like other parallel networked computation architectures, is
    predominantly I/O bound.
    This means any increase in network bandwidth is "A Good Thing" and can
    have drastic positive effects on performance. All your points stem from
    this simple realization.

    I'm confused by your #6, though: Hadoop already uses a distributed file
    system, HDFS.
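
As a rough back-of-the-envelope illustration of the bandwidth points above, the following sketch computes how long a bulk transfer takes at line rate on different links. The 1 TB payload and the assumption of 100% link efficiency are illustrative choices, not figures from this thread; real throughput is lower because of protocol overhead, disk speed, and HDFS replication traffic.

```python
def transfer_seconds(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to move data_bytes over a link of link_gbps gigabits/s.

    efficiency is the assumed usable fraction of the raw line rate (0 < e <= 1).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency)


ONE_TB = 1e12  # illustrative payload size, in bytes

# Link speeds to compare; the bonded figure is an aggregate best case.
for name, gbps in [("1GbE", 1), ("2 x 1GbE bonded (best case)", 2), ("10GbE", 10)]:
    minutes = transfer_seconds(ONE_TB, gbps) / 60
    print(f"{name}: {minutes:.1f} min per TB at line rate")
```

Bonded 1GbE links commonly balance traffic per flow, so a single stream may still see only one link's rate; the bonded figure here should be read as an upper bound.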

  • Saqib Jang -- Margalla Communications at Jun 28, 2011 at 5:27 pm
    Darren,
    Thanks. The last point was basically about 10GbE potentially allowing the
    use of a network file system (e.g. via NFS) as an alternative to HDFS; the
    question is whether there is any merit in this. Basically, I was exploring
    whether commercial clustered NAS products offer any high-availability or
    data management benefits for use with Hadoop.

    Saqib


  • Darren Govoni at Jun 28, 2011 at 5:41 pm
    I see. However, Hadoop is designed to operate best with HDFS because
    of its inherent striping and blocking strategy - which is tracked by Hadoop.
    Going outside of that mechanism will probably yield poor results and/or
    confuse Hadoop.

    Just my thoughts.

  • Matthew Foley at Jun 28, 2011 at 7:05 pm
    Hadoop common provides an abstract FileSystem class, and Hadoop applications
    should be designed to run on that. HDFS is just one implementation of a valid
    Hadoop filesystem, and ports to S3 and KFS as well as OS-supported LocalFileSystem
    are provided in Hadoop common. Use of NFS-mounted storage would fall under the
    LocalFileSystem model.

    However, one of the core values of Hadoop is the model of "bring the computation
    to the data". This does not seem viable with an NFS-based NAS-model storage
    subsystem. Thus, while it will "work" for small clusters and small jobs, it is unlikely
    to scale with high performance to thousands of nodes and petabytes of data in the
    way Hadoop can scale with HDFS or S3.

    --Matt
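
A simple way to see the locality argument is to count how many bytes of map input have to cross the network. This is only a sketch: the function name, the 10 TB input, and the 90% locality figure are assumptions chosen for illustration, not measurements from any cluster.

```python
def map_input_network_bytes(input_bytes: float, local_fraction: float) -> float:
    """Bytes of map input that must be read over the network.

    local_fraction is the share of input read from a disk in the same node
    as the map task (0.0 means every byte is remote, as with a NAS).
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return input_bytes * (1.0 - local_fraction)


INPUT_BYTES = 10e12  # hypothetical 10 TB job input

# Assumed: most map tasks are scheduled next to their data.
local_storage = map_input_network_bytes(INPUT_BYTES, local_fraction=0.9)

# NAS/NFS model: storage is always remote, so nothing is node-local.
nas_storage = map_input_network_bytes(INPUT_BYTES, local_fraction=0.0)

print(f"node-local storage: {local_storage / 1e12:.1f} TB over the network")
print(f"NAS storage:        {nas_storage / 1e12:.1f} TB over the network")
```

Under these assumptions the NAS model pushes the entire input through the network and the storage heads, which is why it tends to stop scaling as node count grows.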



  • Saqib Jang -- Margalla Communications at Jun 28, 2011 at 10:06 pm
    Matt,
    Thanks, this is helpful. I was wondering if you might have some thoughts
    on the list of other potential benefits of 10GbE NICs for Hadoop
    (listed in my original e-mail to the list)?

    regards,
    Saqib


  • Matei Zaharia at Jun 28, 2011 at 11:03 pm
    Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a lot of data between map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to decompression, running Python, or just running expensive user code) or disk-I/O-bound. You might be surprised at what your bottleneck is.

    Matei
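
Once peak utilization figures have been collected for a workload, a first-pass classification can be as simple as picking the most heavily used resource. The sketch below is a hypothetical helper, not part of Hadoop; the 80% saturation threshold and the sample figures are assumptions for illustration.

```python
from typing import Mapping


def likely_bottleneck(utilization: Mapping[str, float], threshold: float = 0.8) -> str:
    """Return the most utilized resource, or a note if none looks saturated.

    utilization maps a resource name to its peak utilization as a fraction
    of capacity (0.0 to 1.0), gathered by whatever monitoring is in place.
    """
    if not utilization:
        raise ValueError("utilization must not be empty")
    resource, value = max(utilization.items(), key=lambda item: item[1])
    return resource if value >= threshold else "none saturated"


# Made-up peak-load figures for one node.
peak = {"cpu": 0.95, "disk": 0.60, "network": 0.30}
print(likely_bottleneck(peak))
```

With figures like these, a faster NIC would not help; the same check with network utilization near 1.0 would point the other way.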

  • James Seigel at Jun 28, 2011 at 11:05 pm
    If you are very adhoc-y, more bandwidth the merry-er!

    James

    Sent from my mobile. Please excuse the typos.

  • Mathias Herberts at Jun 28, 2011 at 11:05 pm

    In my experience, jobs that shuffle lots of data are also very often
    slowed down by the sort phase; compressing the mappers' output is a first
    step to improve performance. Given the cost of a 10GbE infrastructure
    with no oversubscription, I'd monitor bandwidth usage very closely before
    investing in that kind of network gear.
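
    One way to act on the monitoring advice is to sample a NIC's byte counter
    twice and turn the difference into a utilization figure. The sketch below is
    only illustrative: the function name, the sample values and the 1 Gb/s link
    speed are assumptions, and counter wraparound is ignored.

```python
def link_utilization(bytes_start, bytes_end, interval_s, link_gbps):
    """Return the fraction of link capacity used between two counter samples."""
    if interval_s <= 0 or link_gbps <= 0:
        raise ValueError("interval_s and link_gbps must be positive")
    bits_sent = (bytes_end - bytes_start) * 8        # counters are in bytes
    capacity_bits = link_gbps * 1e9 * interval_s     # decimal gigabits per second
    return bits_sent / capacity_bits


# Hypothetical samples: 1e9 bytes moved in 10 seconds on a 1 Gb/s link.
usage = link_utilization(0, 1_000_000_000, 10, 1.0)
print(f"{usage:.0%} of link capacity")  # prints: 80% of link capacity
```

    Sustained values close to 100% during the shuffle would suggest the job is
    network-bound; low values point to CPU or disk as the likelier bottleneck.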
  • Russell Jurney at Jun 28, 2011 at 11:13 pm
    Price the cost of 1GbE->10GbE vs. more nodes, using data from monitoring
    your cluster during peak load. It should be clear which is a better value.
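
    The comparison can be reduced to dollars spent per hour of job runtime
    saved. The sketch below assumes you have measured, or estimated from
    peak-load monitoring, how long a representative job takes under each
    option. Every dollar amount and runtime in it is a made-up placeholder, not
    data from this thread.

```python
def cost_per_hour_saved(upgrade_cost, runtime_before_h, runtime_after_h):
    """Dollars spent per hour of job runtime saved by an upgrade."""
    saved = runtime_before_h - runtime_after_h
    if saved <= 0:
        return float("inf")  # the upgrade did not help this workload
    return upgrade_cost / saved


# Hypothetical figures for one representative job that takes 10 hours today.
network_upgrade = cost_per_hour_saved(60_000, 10.0, 8.0)   # 1GbE -> 10GbE
more_nodes = cost_per_hour_saved(40_000, 10.0, 7.5)        # extra worker nodes

better = "more nodes" if more_nodes < network_upgrade else "10GbE"
print(better)  # prints: more nodes
```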

    Russ
  • Matt Davies at Jun 29, 2011 at 4:28 am
    I would say this is quite a difficult choice. I've seen that our cluster
    could use more bandwidth, but it wasn't bandwidth to the nodes that made the
    big difference; it was getting better switches with better backplanes. The
    fabric made the difference.

    I've also seen some workloads where job design is critical - i.e. if you are
    spinning through the data in your mappers you could easily overwhelm the
    namenode and jobtracker with big enough clusters. It is probably quite
    early for you to know such things about your workload. If this becomes a
    problem you may need adjustments to your apps.

    Overall, I think good quality Top Of Rack switches with good uplinks to
    distribution switches can make your cluster fly. That is relatively cheap
    compared to 10G throughout, and I've seen that more CPUs work well for _my_
    workload (I always need more mappers and reducers, but it is quite rare that
    the network is saturated now).
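
    As a rough illustration of why the uplinks matter, the oversubscription
    ratio of a rack is the total node-facing bandwidth divided by the total
    uplink bandwidth. The rack size and link speeds below are hypothetical.

```python
def oversubscription_ratio(nodes_per_rack, node_gbps, uplinks, uplink_gbps):
    """Ratio of node-facing bandwidth to uplink bandwidth for one rack."""
    uplink_total = uplinks * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("the rack needs at least one uplink with positive speed")
    return (nodes_per_rack * node_gbps) / uplink_total


# Hypothetical rack: 40 nodes at 1 Gb/s behind two 10 Gb/s uplinks.
ratio = oversubscription_ratio(40, 1, 2, 10)
print(f"{ratio:.1f}:1")  # prints: 2.0:1
```

    With the same two uplinks, moving those 40 nodes to 10 Gb/s would raise the
    ratio to 20:1, so faster NICs alone would not help traffic between racks.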

    $0.02

    -Matt




  • Geoff Howard at Jul 1, 2011 at 11:14 am

    On Wed, Jun 29, 2011 at 12:27 AM, Matt Davies wrote:

    ... I've seen that our cluster
    could use more bandwidth, but it wasn't to the nodes that made the big
    difference, it was getting better switches that had better backplanes - the
    fabric made the difference.
    Any recommendations on specific 1Gb switches for top of rack that have
    better backplanes?

    Geoff
  • Matthew Foley at Jun 30, 2011 at 4:04 am
    I agree with Matei. Whether you will get good ROI on 10GigE depends very much on the types of jobs you run.
    --Matt

    On Jun 28, 2011, at 4:02 PM, Matei Zaharia wrote:

    Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a lot of data between map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to decompression, running python, or just running expensive user code) or disk-IO-bound. You might be surprised at what your bottleneck is.

    Matei
    On Jun 28, 2011, at 3:06 PM, Saqib Jang -- Margalla Communications wrote:

    Matt,
    Thanks, this is helpful, I was wondering if you may have some thoughts
    on the list of other potential benefits of 10GbE NICs for Hadoop
    (listed in my original e-mail to the list)?

    regards,
    Saqib

    -----Original Message-----
    From: Matthew Foley
    Sent: Tuesday, June 28, 2011 12:04 PM
    To: common-user@hadoop.apache.org
    Cc: Matthew Foley
    Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?

    Hadoop common provides an abstract FileSystem class, and Hadoop applications
    should be designed to run on that. HDFS is just one implementation of a
    valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
    LocalFileSystem are provided in Hadoop common. Use of NFS-mounted storage
    would fall under the LocalFileSystem model.

    However, one of the core values of Hadoop is the model of "bring the
    computation to the data". This does not seem viable with an NFS-based
    NAS-model storage subsystem. Thus, while it will "work" for small clusters
    and small jobs, it is unlikely to scale with high performance to thousands
    of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.

    --Matt


    On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:

    I see. However, Hadoop is designed to operate best with HDFS because of its
    inherent striping and blocking strategy - which is tracked by Hadoop.
    Going outside of that mechanism will probably yield poor results and/or
    confuse Hadoop.

    Just my thoughts.
    On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
    Darren,
    Thanks, the last pt was basically about 10GbE potentially allowing the
    use of a network file system e.g. via NFS as an alternative to HDFS,
    the question is there any merit in this. Basically, I was exploring if
    the commercial clustered NAS products offer any high-availability or
    data management benefits for use with Hadoop?

    Saqib

    -----Original Message-----
    From: Darren Govoni
    Sent: Tuesday, June 28, 2011 10:21 AM
    To: common-user@hadoop.apache.org
    Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?

    Hadoop, like other parallel networked computation architectures is I/O
    bound, predominantly.
    This means any increase in network bandwidth is "A Good Thing" and can
    have drastic positive effects on performance. All your points stem
    from this simple realization.

    Although I'm confused by your #6. Hadoop already uses a distributed
    file system. HDFS.
  • Jeff Schmitz at Jul 11, 2011 at 2:21 pm
    Also there is info on this at Cloudera here

    http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/



  • Bharath Mundlapudi at Jun 29, 2011 at 6:08 am
    One could argue that it's too early for 10Gb NICs in a Hadoop cluster. Certainly, having extra bandwidth is good, but at what price?


    Please note that all the points you mentioned can work with 1Gb NICs today, unless you can back the case for 10Gb with price/performance data. Typically, map output is compressed. If the system is hitting peak network utilization, one can select algorithms with a higher compression ratio at the cost of CPU. Most of these machines come with dual NICs, so one could use link bonding to push more bits.
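
    The trade-off between compression ratio and CPU time can be seen with any
    codec. The sketch below uses Python's zlib purely as a stand-in; it is not
    Hadoop's codec configuration, and the sample data is synthetic.

```python
import time
import zlib


def compare_levels(data, levels=(1, 6, 9)):
    """Return (level, compressed_size, seconds) for each zlib level."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append((level, len(compressed), elapsed))
    return results


# Synthetic, repetitive records standing in for map output.
sample = b"".join(b"user%05d\tclick\t%d\n" % (i % 500, i) for i in range(50_000))

for level, size, seconds in compare_levels(sample):
    print(f"level {level}: {size} bytes in {seconds * 1000:.1f} ms")
```

    Higher levels usually shrink the data further but take longer, so the
    right setting depends on whether the cluster has spare CPU or spare
    bandwidth.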


    One area that may benefit from 10Gb NICs is high-density systems - 24 cores and 3x12TB disks. This is the trend now and will continue. These systems can saturate 1Gb NICs.
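
    A back-of-the-envelope check of the saturation claim is to compare
    aggregate sequential disk throughput with the NIC line rate. The drive
    count and the 100 MB/s per-drive rate below are assumptions for
    illustration, not figures from this post.

```python
def aggregate_disk_gbps(num_disks, mb_per_s_per_disk):
    """Combined sequential disk throughput in Gb/s (decimal units)."""
    return num_disks * mb_per_s_per_disk * 8 / 1000


def nic_is_bottleneck(num_disks, mb_per_s_per_disk, nic_gbps):
    """True if the disks together can stream faster than the NIC can send."""
    return aggregate_disk_gbps(num_disks, mb_per_s_per_disk) > nic_gbps


# Hypothetical dense node: 12 drives, each streaming about 100 MB/s.
print(f"{aggregate_disk_gbps(12, 100):.1f} Gb/s")  # prints: 9.6 Gb/s
print(nic_is_bottleneck(12, 100, 1))               # prints: True
print(nic_is_bottleneck(12, 100, 10))              # prints: False
```

    A 1 Gb/s link carries at most 125 MB/s, so under these assumptions even two
    drives reading sequentially could outrun it.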


    -Bharath



  • Michel Segel at Jun 29, 2011 at 9:03 pm
    I'm not sure which point you are trying to make.
    To answer your question...

    With respect to price... 10GbE is cost effective.
    You have to consider that 1GbE is not only your port speed; there is also the speed of the uplink or trunk.

    So if you continue to build out, you run into bandwidth issues between racks. You then end up with 1GbE ports plus higher speed for the uplinks only, either through port bonding or through bigger-bandwidth links. These switches are more expensive than simple 1GbE switches, but less expensive than full 10GbE.

    Depending on vendor, number of ports, and discount, you can get the switch for approx. $10,000 and up. Think $550 to $600 a port for 10GbE.

    With Sandy Bridge, you will start to see 10GbE on the motherboards.

    If you're following the discussion on the performance gains, you'll start to see the network being the bottleneck.

    If you are planning to build a new cluster... you should plan on 10GbE.







    Sent from a remote device. Please excuse any typos...

    Mike Segel
  • Bharath Mundlapudi at Jun 30, 2011 at 6:50 pm
    However good a benchmark may be, there is a saying in benchmarking: 'Performance improvements depend on the type of workload'. What matters is your workload. Design the network for your workloads.

    Links from the racks to the uplink or trunk need 10GbE. But the question was: are we there yet for per-node 10GbE? I would plan for it only if your data shows network saturation.

    -Bharath



