Namenode Scalability
In my current project we are planning to stream data to the Namenode (20-node cluster).
Data volume would be around 1 PB per day.
But there are applications which can publish data at 1GBPS.

Few queries:

1. Can a single Namenode handle such high-speed writes, or does it become unresponsive when a GC cycle kicks in?
2. Can we have multiple federated Namenodes sharing the same slaves, so that we can distribute the writes accordingly?
3. Can multiple region servers of HBase help us?

Please suggest how we can design the streaming part to handle data at this scale.

Regards,
Jagaran Das


  • Mathias Herberts at Aug 10, 2011 at 8:45 am
    Just curious, what are the tech specs of your datanodes to accommodate 1 PB/day
    on 20 nodes?
  • Jagaran das at Aug 10, 2011 at 5:07 pm
    To be precise, the projected data is around 1 PB per day.
    But the publishing rate is also around 1GBPS.

    Please suggest.


  • Michel Segel at Aug 10, 2011 at 10:10 pm
    So many questions, why stop there?

    First question... What would cause the name node to have a GC issue?
    Second question... You're streaming 1PB a day. Is this a single stream of data?
    Are you writing this to one file before processing, or are you processing the data directly on the ingestion stream?

    Are you also filtering the data so that you are not saving all of the data?

    This sounds more like a homework assignment than a real-world problem.

    I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-)


    Sent from a remote device. Please excuse any typos...

    Mike Segel
  • Jagaran das at Aug 10, 2011 at 11:16 pm
    > What would cause the name node to have a GC issue?

    - I am opening at most 5000 connections and writing continuously through those 5000 connections to 5000 files at a time.
    - The volume of data that I would write through those 5000 connections cannot be controlled, as it depends on the upstream applications that publish the data.

    Now if the heap memory nears its full size (say M GB) and a major GC cycle kicks in, the NameNode could stop responding for some time.
    This "stop the world" pause should be directly proportional to the heap size.
    This may cause the data to back up in the streaming application's memory.
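    (As a rough yardstick, going by the commonly cited Hadoop figure rather than anything measured here: every file, directory and block is held as an object on the NameNode heap at roughly 150 bytes apiece, so full-GC pause times grow with the number of namespace objects rather than with the raw data volume.)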

    As for our architecture:

    It has a cluster of JMS queues, and we have a multithreaded application that picks messages from the queue and streams them to the NameNode of a 20-node cluster using the FileSystem API as exposed.
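
    For illustration, a minimal sketch of one such writer thread against the standard Hadoop FileSystem API; the class name and the queue-polling helper nextMessage() are hypothetical stand-ins for the JMS consumer, not from our code:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsStreamWriter implements Runnable {
            private final FileSystem fs;
            private final Path target;

            public HdfsStreamWriter(Configuration conf, String file) throws IOException {
                this.fs = FileSystem.get(conf);  // fs.defaultFS names the NameNode
                this.target = new Path(file);
            }

            public void run() {
                try (FSDataOutputStream out = fs.create(target)) {
                    byte[] msg;
                    while ((msg = nextMessage()) != null) { // poll the JMS queue
                        out.write(msg);  // payload flows to the datanodes; only
                    }                    // metadata operations hit the NameNode
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            // Hypothetical stand-in for the JMS consumer; returns null when done.
            private byte[] nextMessage() { return null; }
        }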

    BTW, in the real world, if you have a fast car you can race and win against a slow train; it all depends on what reference frame you are in :)

    Regards,
    Jagaran

  • Dieter Plaetinck at Aug 17, 2011 at 7:49 am
    Hi,

    On Wed, 10 Aug 2011 13:26:18 -0500, Michel Segel wrote:
    > This sounds more like a homework assignment than a real-world problem.
    Why? Just wondering.
    > I guess people don't race cars against trains or have two trains
    > traveling in different directions anymore... :-)
    Huh?
  • Steve Loughran at Aug 17, 2011 at 10:47 am

    On 17/08/11 08:48, Dieter Plaetinck wrote:
    > On Wed, 10 Aug 2011 13:26:18 -0500, Michel Segel wrote:
    >> This sounds more like a homework assignment than a real-world problem.
    > Why? just wondering.
    The question proposed a data rate comparable with Yahoo, Google and
    Facebook, yet it was ingress rather than egress, which was even more
    unusual. You'd have to be running a web-scale search engine to need that
    data rate, and if you were doing that you'd need to know a lot more about
    how Hadoop works (i.e. the limited role of the NN). You'd also have to
    have addressed the entire network infrastructure, the cost of the work on
    your external systems, DNS load, and power budget. Oh, and unless you
    were processing and discarding those PB/day at the rate of ingress, you'd
    need to add a new Hadoop cluster at a rate of one cluster per month,
    which is not only expensive; I don't think datacentre construction rates
    could handle it, even if your server vendor had set up a construction and
    test pipeline to ship down an assembled and tested containerised cluster
    every few weeks (which we can do, incidentally :)
    >> I guess people don't race cars against trains or have two trains
    >> traveling in different directions anymore... :-)
    > Huh?
    Different homework questions.
  • Steve Loughran at Aug 13, 2011 at 8:17 pm

    On 10/08/2011 08:58, jagaran das wrote:
    > In my current project we are planning to stream data to the Namenode (20-node cluster).
    > Data volume would be around 1 PB per day.
    > But there are applications which can publish data at 1GBPS.
    That's Gigabyte/s or Gigabit/s?
    > Few queries:
    >
    > 1. Can a single Namenode handle such high-speed writes, or does it become unresponsive when a GC cycle kicks in?
    See below.
    > 2. Can we have multiple federated Namenodes sharing the same slaves, so that we can distribute the writes accordingly?
    That won't solve your problem.
    > 3. Can multiple region servers of HBase help us?
    No.
    > Please suggest how we can design the streaming part to handle such scale of data.
    Data is written to datanodes, not namenodes. The NN is used to set up
    the write chain and then just tracks node health; the data does not go
    through it.

    This changes your problem to one of:
    - Can the NN set up write chains at the speed you want, or do you need
      to throttle back the file creation rate by writing bigger files? (See
      the sketch after this list.)
    - Can the NN handle the (file x block count) volumes you expect?
    - What is the network traffic of the data ingress?
    - What is the total bandwidth of the replication traffic combined with
      the data ingress traffic?
    - Do you have enough disks for the data?
    - Do your HDDs have enough bandwidth?
    - Do you want to do any work with the data, and what CPU/HDD/network
      load does that generate?
    - What impact will disk and datanode replication traffic have?
    - How much of the backbone will you have to allocate to the rebalancer?
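
    For illustration, a minimal sketch of the "write bigger files" approach
    against the standard FileSystem API; the /ingest path scheme and the 1 GB
    roll threshold are assumptions for the example, not from this thread:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Batches many small records into large files, so the NameNode sees
        // one create() per ~1 GB of data instead of one per incoming record.
        public class RollingHdfsWriter {
            private static final long ROLL_BYTES = 1L << 30; // roll at ~1 GB
            private final FileSystem fs;
            private FSDataOutputStream out;
            private long fileSeq = 0;

            public RollingHdfsWriter(Configuration conf) throws IOException {
                fs = FileSystem.get(conf);
                roll();
            }

            public synchronized void write(byte[] record) throws IOException {
                out.write(record);
                if (out.getPos() >= ROLL_BYTES) {
                    roll(); // NN metadata load drops accordingly
                }
            }

            private void roll() throws IOException {
                if (out != null) {
                    out.close();
                }
                out = fs.create(new Path("/ingest/part-" + fileSeq++));
            }
        }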

    At 1 PB/day, ignoring all network issues, you will reach the currently
    documented HDFS limits within four weeks. What are you going to do then,
    or will you have processed it down by that point?
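
    For rough numbers (back-of-envelope arithmetic, assuming a 128 MB block
    size, which the thread doesn't state): 1 PB/day is about 11.6 GB/s
    sustained, and roughly 7-8 million new blocks per day before replication,
    so over 200 million blocks within a month; that is the order of magnitude
    at which NameNode object counts hit the documented practical limits.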

    I could imagine some experiments you could conduct against a namenode to
    see what its limits are, but there are a lot of datacentre bandwidth and
    computation details you have to worry about, above and beyond datanode
    performance issues.

    Like Michael says, 1 PB/day sounds like a homework project, especially
    if you haven't used Hadoop at smaller scale. If it is homework, once
    you've done the work (and submitted it), it'd be nice to see the final
    paper.

    If it is something you plan to take live, well, there are lots of issues
    to address, of which the NN is just one, and one you can test in
    advance. Ramping up the cluster with different loads will teach you more
    about the bottlenecks. Otherwise: there are people who know how to run
    Hadoop at scale who, in exchange for money, will help you.

    -steve

Discussion Overview
group: common-user@hadoop.apache.org
categories: hadoop
posted: Aug 10, 2011 at 8:12 am
active: Aug 17, 2011 at 10:47 am
posts: 8
users: 5
website: hadoop.apache.org...
irc: #hadoop
