Also, there is info on this from Cloudera here:
http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
-----Original Message-----
From: Saqib Jang -- Margalla Communications
Sent: Tuesday, June 28, 2011 5:06 PM
To: [email protected]
Subject: RE: Sanity check re: value of 10GbE NICs for Hadoop?
Matt,
Thanks, this is helpful. I was wondering if you have any thoughts
on the other potential benefits of 10GbE NICs for Hadoop
(listed in my original e-mail to the list)?
regards,
Saqib
-----Original Message-----
From: Matthew Foley
Sent: Tuesday, June 28, 2011 12:04 PM
To: [email protected]
Cc: Matthew Foley
Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
Hadoop common provides an abstract FileSystem class, and Hadoop applications
should be designed to run on that. HDFS is just one implementation of a
valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
LocalFileSystem are provided in Hadoop common. Use of NFS-mounted storage
would fall under the LocalFileSystem model.
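
To illustrate the abstraction described above, here is a minimal Python sketch (not Hadoop's actual Java API) of application code written against an abstract filesystem interface, with the concrete implementation picked by URI scheme. The names `AbstractFileSystem`, `LocalFS`, `InMemoryFS`, `get_filesystem`, and the `mem` scheme are hypothetical and exist only for this example.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class AbstractFileSystem(ABC):
    """Hypothetical stand-in for the role Hadoop's FileSystem class plays."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class LocalFS(AbstractFileSystem):
    """Uses the OS filesystem; an NFS mount would look like this to the app."""

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


class InMemoryFS(AbstractFileSystem):
    """Toy second implementation, standing in for a remote store."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = data


# Scheme -> implementation (schemes are assumptions for this sketch only).
_REGISTRY = {"file": LocalFS(), "mem": InMemoryFS()}


def get_filesystem(uri: str) -> AbstractFileSystem:
    """Pick an implementation from the URI scheme."""
    return _REGISTRY[urlparse(uri).scheme]


def copy(src_uri: str, dst_uri: str) -> None:
    """Application code: works against the interface, not a concrete store."""
    src, dst = get_filesystem(src_uri), get_filesystem(dst_uri)
    dst.write(urlparse(dst_uri).path, src.read(urlparse(src_uri).path))
```

The point of the design is that `copy` never needs to know which backing store it is talking to; swapping the store only changes the URI.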
However, one of the core values of Hadoop is the model of "bring the
computation to the data". This does not seem viable with an NFS-based
NAS-model storage subsystem. Thus, while it will "work" for small clusters
and small jobs, it is unlikely to scale with high performance to thousands
of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.
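
A rough way to see the scaling argument above: with data-local reads, aggregate read bandwidth grows with the number of nodes, while a shared NAS is capped by its own network links. The sketch below uses purely illustrative numbers (per-node disk throughput, NAS uplink count and speed); none of them are measurements from this thread.

```python
def local_aggregate_mb_s(nodes: int, disk_mb_s_per_node: float) -> float:
    """Data-local reads: each node reads its own disks, so bandwidth scales with nodes."""
    return nodes * disk_mb_s_per_node


def nas_aggregate_mb_s(nodes: int, nic_gbit_per_node: float,
                       nas_uplinks: int, uplink_gbit: float) -> float:
    """Shared NAS: limited by the smaller of total client NIC and NAS uplink capacity."""
    gbit_to_mb_s = 1000 / 8  # 1 Gbit/s = 125 MB/s, ignoring protocol overhead
    client_side = nodes * nic_gbit_per_node * gbit_to_mb_s
    server_side = nas_uplinks * uplink_gbit * gbit_to_mb_s
    return min(client_side, server_side)


# Assumed figures: 300 MB/s of local disk per node, 10GbE NICs on every node,
# and a NAS head with 4 x 10GbE uplinks.
for n in (4, 40, 400):
    print(n, local_aggregate_mb_s(n, 300), nas_aggregate_mb_s(n, 10, 4, 10))
```

Under these assumptions the NAS figure stops growing once the uplinks are saturated, while the local figure keeps rising with cluster size. This shows the shape of the argument only and is not a benchmark.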
--Matt
On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:
I see. However, Hadoop is designed to operate best with HDFS because of its
inherent striping and blocking strategy - which is tracked by Hadoop.
Going outside of that mechanism will probably yield poor results and/or
confuse Hadoop.
Just my thoughts.
On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
Darren,
Thanks. The last point was basically about 10GbE potentially allowing the
use of a network file system, e.g. via NFS, as an alternative to HDFS;
the question is whether there is any merit in this. Basically, I was
exploring whether the commercial clustered NAS products offer any
high-availability or data management benefits for use with Hadoop.
Saqib
-----Original Message-----
From: Darren Govoni
Sent: Tuesday, June 28, 2011 10:21 AM
To: [email protected]
Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
Hadoop, like other parallel networked computation architectures, is
predominantly I/O bound.
This means any increase in network bandwidth is "A Good Thing" and can
have drastic positive effects on performance. All your points stem
from this simple realization.
Although I'm confused by your #6. Hadoop already uses a distributed
file system: HDFS.
On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
Folks,
I've been digging into the potential benefits of using
10 Gigabit Ethernet (10GbE) NIC server connections for
Hadoop and wanted to run what I've come up with
through initial research by the list for 'sanity check'
feedback. I'd very much appreciate your input on
the importance (or lack of it) of the following potential benefits of
10GbE server connectivity as well as other thoughts regarding
10GbE and Hadoop (My interest is specifically in the value
of 10GbE server connections and 10GbE switching infrastructure,
over scenarios such as bonded 1GbE server connections with
10GbE switching).
1. HDFS Data Loading. The higher throughput enabled by 10GbE
server and switching infrastructure allows faster processing and
distribution of data.
2. Hadoop Cluster Scalability. High performance for initial data
processing and distribution directly impacts the degree of parallelism
or scalability supported by the cluster.
3. HDFS Replication. Higher speed server connections allow faster
file replication.
4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and
latency directly impact the shuffle phase of a data set reduction,
especially for tasks that operate at the document level (including
large documents) with lots of metadata generated by those documents,
as well as for video analytics and images.
5. Data Reporting. 10GbE server network performance can improve
data reporting performance, especially if the Hadoop cluster is
running multiple data reductions.
6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be
reorganized to use a cluster or network file system. This would allow
Hadoop, even with its Java implementation, to have higher-performance
I/O and not have to be so concerned with disk drive density in the
same server.
7. Others?
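
As a back-of-envelope check on items 1 and 3 in the list above, the sketch below estimates how long it takes to move a given amount of data over one server link at different speeds. The link options, the 1 TB payload, and the 70% efficiency factor are assumptions picked for illustration, and the bonded link is treated as ideal 2 Gbit/s aggregation, which a single flow typically will not get. Real throughput also depends on disks, switches, and protocol overhead.

```python
def transfer_seconds(data_gb: float, link_gbit: float, efficiency: float = 0.7) -> float:
    """Time to push data_gb gigabytes through a link of link_gbit Gbit/s.

    efficiency is an assumed fraction of line rate actually achieved.
    """
    usable_gbit = link_gbit * efficiency
    return data_gb * 8 / usable_gbit


# Hypothetical link options for a single server (nominal Gbit/s).
links = {"1GbE": 1, "2 x 1GbE bonded": 2, "10GbE": 10}

for name, gbit in links.items():
    secs = transfer_seconds(1000, gbit)  # 1 TB in decimal units
    print(f"{name}: {secs / 60:.0f} min")
```

The same function can be reused for replication traffic by plugging in the amount of data to be re-replicated instead of the initial load size.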
thanks,
Saqib
Saqib Jang
Principal/Founder
Margalla Communications, Inc.
1339 Portola Road, Woodside, CA 94062
(650) 274 8745
www.margallacomm.com