On 10/08/2011 08:58, jagaran das wrote:
> In my current project we are planning to stream data to the Namenode (20-node cluster).
> Data volume would be around 1 PB per day.
> But there are applications which can publish data at 1GBPS.
That's Gigabyte/s or Gigabit/s?
> Few queries:
> 1. Can a single Namenode handle such high-speed writes? Or does it become unresponsive when a GC cycle kicks in?

see below

> 2. Can we have multiple federated Namenodes sharing the same slaves, so that we can distribute the writes accordingly?

that won't solve your problem

> 3. Can multiple region servers of HBase help us?

no

> Please suggest how we can design the streaming part to handle such a scale of data.
Data is written to datanodes, not namenodes. The NN is used to set up
the write chain and then just tracks node health; the data does not go
through it.
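
To make that concrete, here is a minimal client-side write sketch against
the standard FileSystem API (the class name, path and buffer sizes are made
up for illustration). The only calls that touch the NN are the small
metadata RPCs; the payload streams straight to the datanode pipeline:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // bound to your NN's filesystem
        Path file = new Path("/ingest/sample.bin");  // made-up path

        // create() is a metadata RPC to the NN; it just adds the file entry
        FSDataOutputStream out = fs.create(file, true);

        byte[] buf = new byte[64 * 1024];
        for (int i = 0; i < 1024; i++) {
          // the bytes stream to a pipeline of datanodes; the NN is only asked
          // for a new block (another small RPC) each time the current one fills
          out.write(buf);
        }
        out.close();   // tells the NN the file is complete
        fs.close();
      }
    }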
This changes your problem to one of (some rough arithmetic follows the list):
-can the NN set up write chains at the speed you want, or do you need
to throttle back the file creation rate by writing bigger files
-can the NN handle the (file x block count) volumes you expect
-what is the network traffic of the data ingress
-what is the total bandwidth of the replication traffic combined with
the data ingress traffic?
-do you have enough disks for the data
-do your HDDs have enough bandwidth?
-do you want to do any work with the data, and what CPU/HDD/net load
does this generate?
-what impact will disk & datanode replication traffic have?
-how much of the backbone will you have to allocate to the rebalancer?
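
To put rough numbers on the bandwidth and disk questions (back-of-envelope
only, assuming 10^15 bytes per PB and the default 3x replication):

  1 PB/day = 10^15 B / 86,400 s ~= 11.6 GB/s (~93 Gbit/s) of sustained ingress
  x3 replication ~= 35 GB/s of aggregate write traffic, or ~1.7 GB/s of
    disk writes per node on a 20-node cluster
  3 PB/day of raw disk consumed, i.e. ~150 TB of new capacity per node per
    day on 20 nodes

Adjust for your real replication factor, but the shape of the problem
doesn't change much.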
At 1 PB/day, ignoring all network issues, you will reach the currently
documented HDFS limits within four weeks. What are you going to do then,
or will you have processed the data down by that point?
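
Again back-of-envelope, assuming the default 64 MB block size and files
written as full blocks: 1 PB/day is roughly 15 million new blocks a day, so
a month of ingest is on the order of 400-450 million block objects in the
NN heap and ~30 PB of pre-replication data. Smaller files push the object
count up further.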
I could imagine some experiments you could conduct against a namenode to
see what its limits are, but there are a lot of datacentre bandwidth and
computation details you have to worry about above and beyond datanode
performance issues.
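
As one example of such an experiment (a rough sketch only; the class name,
path and file count are made up), you could time how fast a single client
can create empty files, which hits the NN metadata path without any
datanode I/O:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NameNodeCreateRate {
      public static void main(String[] args) throws Exception {
        int files = args.length > 0 ? Integer.parseInt(args[0]) : 10000;
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/benchmarks/nn-create-test");  // made-up path
        fs.mkdirs(dir);
        long start = System.currentTimeMillis();
        for (int i = 0; i < files; i++) {
          // zero-length files: pure NN metadata work, no datanode traffic
          fs.create(new Path(dir, "file-" + i), true).close();
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("%d creates in %d ms (%.0f creates/sec)%n",
            files, elapsed, files * 1000.0 / elapsed);
        fs.delete(dir, true);  // clean up the test directory
      }
    }

Run it from several client machines at once to see when create latency
starts to climb; Hadoop's test JAR also ships an NNBench job that does this
kind of measurement at MapReduce scale. None of that tells you anything
about the datacentre bandwidth questions above, though.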
Like Michael says, 1 PB/day sounds like a homework project, especially
if you haven't used Hadoop at smaller scale. If it is homework, once
you've done the work (and submitted it), it'd be nice to see the final
paper.
If it is something you plan to take live, well, there are lots of issues
to address, of which the NN is just one, and one you can test in advance.
Ramping up the cluster with different loads will teach you more about the
bottlenecks. Otherwise: there are people who know how to run Hadoop at
scale who, in exchange for money, will help you.
-steve