FAQ
Hi, my Hadoop friends. I have three questions about Hadoop:

1. Network speed between datanodes. With terabytes of data on a datanode, data has to be transferred from one datanode to another, and if the network is slow, I think Hadoop will be slow. I have heard of the gNet interconnect architecture in Greenplum; what about Hadoop? Is SAS storage plus gigabit Ethernet the best answer?

2. GUI tools. There is a Hive web tool in Hadoop, but it is too simple to use for our business work. If Hadoop + Hive is built into a data warehouse (DWH), how should users work with it: through command-line tools, or through a newly developed web GUI?

3. A 5-machine Hadoop cluster versus one SQL Server 2000 machine. The Hadoop cluster has five machines (Celeron 2.66 GHz, 1 GB of memory, Ethernet) running a namenode, a secondary namenode, and three datanodes; the SQL Server 2000 machine is also a Celeron 2.66 GHz with 1 GB of memory. I ran the same select operation over the same 100 MB of data: the 5-machine Hadoop cluster took 2 min 30 s and the single SQL Server 2000 machine took 2 min 25 s. The result is that the 5-machine Hadoop cluster is not better. Why? Can anyone give me some advice?

Thanks in advance.


  • Steve Loughran at Sep 6, 2010 at 10:15 am

    On 06/09/10 09:32, 褚 鵬兵 wrote:
    1. Network speed between datanodes: with terabytes of data, blocks transfer from one datanode to another, and if the network is slow Hadoop will be slow. Is SAS storage plus gigabit Ethernet the best answer?
    If your code has data locality, gigabit Ethernet is fine, and it saves the hassle of getting faster interconnects to work. Have you ever tried to debug InfiniBand cluster problems?
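    One rough way to check whether a job is actually getting locality is to compare launched map tasks with data-local map tasks, either in the JobTracker web UI ("Data-local map tasks" under Job Counters) or from the command line. A sketch only, assuming classic JobTracker-era MapReduce; the job ID is hypothetical and the counter group/name should be checked against your Hadoop version:

        hadoop job -counter job_201009060001_0001 'org.apache.hadoop.mapred.JobInProgress$Counter' DATA_LOCAL_MAPS
        hadoop job -counter job_201009060001_0001 'org.apache.hadoop.mapred.JobInProgress$Counter' RACK_LOCAL_MAPS

    If nearly all maps are data-local, the network is rarely the bottleneck and gigabit Ethernet is usually enough.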
    2. The Hive web tool is too simple for our business work. If Hadoop + Hive is built into a data warehouse, how should users work with it: command-line tools, or a newly developed web GUI?
    The community welcomes new contributions. I'd look at Cascading, Datameer's products, and other things. Hive is designed for people who know SQL, like PHP developers.
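    For users who already know SQL, the Hive command-line client may be enough on its own; a minimal sketch (the table and columns are hypothetical):

        hive -e "SELECT region, COUNT(*) FROM page_views WHERE dt = '2010-09-06' GROUP BY region"

    A custom web GUI would more likely be built on Hive's Thrift service (HiveServer) or its JDBC driver than on the CLI.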
    3. A 5-machine Hadoop cluster (Celeron 2.66 GHz, 1 GB of memory per machine, namenode + secondary namenode + 3 datanodes) took 2 min 30 s on a select over 100 MB of data, while a single SQL Server 2000 machine with the same hardware took 2 min 25 s. Why is the Hadoop cluster not better?
    Indexes give RDBMSs speed, but limit their scale. If your dataset fits onto a single MS SQL or MySQL instance and you can afford the index costs, stay with that data on a RAID array. Hadoop isn't trying to compete in that space, though things like CouchDB are trying to.

    However, before you dismiss Hadoop, get in touch with your SQL Server or Oracle account team, say "we are planning on working with 15 petabytes of storage with data coming in at 1-2 PB/month", and see what they say back and how big their quote is. Searching for "MapReduce: A Major Step Backwards" shows some of the debate going on.
  • 褚 鵬兵 at Sep 8, 2010 at 1:59 am
    Hi stevel,
    Thanks for your reply. I have not tried to debug InfiniBand; I only know of it.
    My Hadoop cluster currently runs HDFS + MapReduce, Hive, and a Derby server. I want to add HBase to the cluster. How can I do that? Can you help me?
    Thanks, Pengbing Chu
  • Gang Luo at Sep 8, 2010 at 2:28 am
    Hi all,
    I need to change the block size (from 128 MB to 64 MB) and have to shut down the
    cluster first. I was wondering what will happen to the existing files on HDFS
    (written with the 128 MB block size). Are they still there and usable? If so, what
    is the block size of those legacy files?

    Thanks,
    -Gang
  • Jeff Zhang at Sep 8, 2010 at 3:04 am
    Those legacy files won't change block size (the NameNode keeps the mapping
    between files and blocks); only newly added files will be written with the new
    64 MB block size.
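    A rough illustration of the per-file, write-time nature of the setting (the paths are hypothetical): dfs.block.size is only a client-side default applied when a file is created, so it can also be overridden for a single upload:

        hadoop fs -D dfs.block.size=67108864 -put bigfile.dat /user/gang/bigfile.dat

    Files already in HDFS keep whatever block size they were written with until they are rewritten.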


    --
    Best Regards

    Jeff Zhang
  • Alex Kozlov at Sep 8, 2010 at 5:32 pm
    The block size is a per-file property, so it will change only for the newly
    created files. If you want to change the block size for the 'legacy' files,
    you'll need to recreate them, for example with the distcp command (here using
    a new block size of 512 MB):

        hadoop distcp -D dfs.block.size=536870912 <path-to-old-file> <path-to-new-file>

    and then rm the old file.
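    To confirm that the copy picked up the new block size, one hedged check is HDFS fsck on the new file:

        hadoop fsck <path-to-new-file> -files -blocks

    which lists the file's blocks and their lengths; the old file keeps its original blocks until it is removed.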

    --
    Alex Kozlov
    Solutions Architect
    Cloudera, Inc
    twitter: alexvk2009

    Hadoop World 2010, October 12, New York City - Register now:
    http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • Gang Luo at Sep 8, 2010 at 6:40 pm
    That makes sense. Thanks Alex and Jeff.

    -Gang




  • Steve Loughran at Sep 8, 2010 at 11:08 am

    On 08/09/10 02:58, 褚 鵬兵 wrote:

    hi stevel: my Hadoop cluster currently runs HDFS + MapReduce, Hive, and a Derby server; I want to add HBase to the cluster. How can I do that?
    You'll have to look at the HBase web site and ask on their mailing lists.

    -steve
  • Chris Smith at Sep 8, 2010 at 6:50 pm
    Why use Hadoop in preference to a database?

    At the recent Hadoop User Group (UK) meeting, Andy Kemp from
    http://www.forward.co.uk/ presented their experience of moving from a
    MySQL database approach to Hadoop.
    From my notes of his talk, their system manages 120 million keywords
    and is updated at a rate of 20 GB/day.

    They originally used a sharded MySQL database but found it couldn't
    scale to handle the types of queries their users required, e.g. "Can
    you cluster 17(?) million keyword phrases into thematic groups?".
    Their calculations indicated that the database approach would take
    more than a year to handle such a query.

    Moving to a cluster of 100 Hadoop nodes on Amazon EC2 reduced this
    time to 7 hours. The issue then became the cost of storage and of
    moving the data to and from the cluster.

    They then moved to a private VM system with about 30 VMs (I assume
    the processing took the same time, as I didn't note this down).
    From there they moved to dedicated hardware, five dedicated Hadoop
    nodes, and achieved better performance than the 30 VMs.

    Andy's talk, "Hadoop in Context", should be available as a podcast at
    http://skillsmatter.com/podcast/cloud-grid/hadoop-in-context and would
    be well worth watching, but when I last looked it had not been
    uploaded yet.

    At the same event, Ian Broadhead from http://www.playfish.com/ gave a
    talk on managing the activity of over 1 million active Internet gamers
    producing over 50 GB of data a day. Their original MySQL system took
    up to 50 times longer to process their data load than an EC2 cluster
    of Hadoop nodes; he talked about a typical workload being reduced
    from 2-3 days (using MySQL) down to 6 hours (using Hadoop).
    Unfortunately I don't think Ian's talk will appear as a podcast.

    However, most presentations during the evening made the point that
    Hadoop didn't completely replace their databases; it just provided a
    convenient way to rapidly process large volumes of data, with the output
    from Hadoop processing typically being stored in databases to satisfy
    general everyday business queries.

    I think the common theme here was that all of these users had large
    datasets, on the order of hundreds of GB, with multiple views of that
    data, handling on the order of tens of millions of updates a day.

    I hope that helps.

    Chris
