FAQ
Hi all

I'm doing a test and need to create lots of files (100 million) in HDFS. I'm using a shell script to do this, and it's very, very slow. How can I create a lot of files in HDFS quickly?
Thanks
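
A likely reason a per-file shell loop is slow is that every "hadoop fs" invocation starts a fresh JVM. A minimal sketch of doing the creates from a single client process through the HDFS FileSystem API is below; the path prefix and file count are illustrative assumptions, and the name node limits discussed in the replies still apply.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: create many empty files from one JVM instead of forking
    // "hadoop fs" once per file. Path prefix and count are illustrative.
    public class BulkCreate {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // one HDFS client for all creates
            int n = Integer.parseInt(args[0]);          // number of files to create
            for (int i = 0; i < n; i++) {
                fs.createNewFile(new Path("/tmp/bulktest/file-" + i));  // empty file, no data blocks
            }
            fs.close();
        }
    }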


  • Ted Dunning at May 30, 2011 at 3:52 am
    First, it is virtually impossible to create 100 million files in HDFS
    because the name node can't hold that many.

    Secondly, file creation is bottlenecked by the name node, so files can't
    be created at more than about 1,000 per second (and achieving more than
    half that rate is somewhat difficult).

    Thirdly, you need to check your cluster size, because each data node can
    only store a limited number of blocks (exactly how many differs from
    version to version of Hadoop). For small clusters this is a tighter limit
    than the size limit of the name node.

    Why is it that you need to do this?

    Perhaps there is a work-around? Consider for instance HAR files:

    http://www.cloudera.com/blog/2009/02/the-small-files-problem/
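
    As a hedged illustration of the HAR route (the archive name and paths
    below are made up, and exact flags vary by Hadoop version): the small
    files are packed with the "hadoop archive" command and then read back
    through the har:// filesystem scheme, so the name node tracks one
    archive's index and part files instead of millions of originals.

        // Pack an existing directory of small files into one archive (shell):
        //   hadoop archive -archiveName files.har -p /user/test input /user/test/archived
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReadFromHar {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Files inside the archive are addressed with the har:// scheme,
                // here relative to the default filesystem.
                Path p = new Path("har:///user/test/archived/files.har/input/file-0000001");
                FileSystem fs = p.getFileSystem(conf);
                FSDataInputStream in = fs.open(p);
                System.out.println(in.read());   // read one byte just to show access works
                in.close();
            }
        }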


  • Konstantin Boudnik at May 30, 2011 at 4:54 am
    Your best bet would be to take a look at the synthetic load generator.

    10^8 files would be a problem in most cases because you'd need a really
    beefy NN for that (~48GB of JVM heap and all that). The biggest I've heard
    of holds something on the order of 1.15*10^8 objects (files & dirs) and
    serves the largest Hadoop cluster in the world, Yahoo!'s production setup.
    You might want to check YDN for more details about this case.

    Hope it helps,
    Cos
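
    The synthetic load generator ships with the Hadoop test jar, and its class
    names and options differ between versions, so here is only a rough sketch
    of the same idea: a small multithreaded creator driving the FileSystem API
    from one JVM (thread count and path prefix are illustrative assumptions,
    and the name node remains the throughput ceiling).

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ParallelCreate {
            public static void main(String[] args) throws Exception {
                final Configuration conf = new Configuration();
                final int threads = 16;              // illustrative; tune for your name node
                final int filesPerThread = 10000;    // illustrative
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                for (int t = 0; t < threads; t++) {
                    final int id = t;
                    pool.submit(new Runnable() {
                        public void run() {
                            try {
                                // FileSystem.get() returns a cached, shared HDFS client
                                FileSystem fs = FileSystem.get(conf);
                                for (int i = 0; i < filesPerThread; i++) {
                                    fs.createNewFile(new Path("/tmp/loadtest/t" + id + "/f" + i));
                                }
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                        }
                    });
                }
                pool.shutdown();   // workers finish their creates before the JVM exits
            }
        }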
  • Ian Holsman at May 30, 2011 at 3:51 pm
    I don't know what your use case is, but you may want to investigate things like HBase, Cassandra, or Voldemort if you need lots of small files.

    ---
    Ian Holsman - 703 879-3128

    I saw the angel in the marble and carved until I set him free -- Michelangelo
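
    As a hedged sketch of that suggestion (the table name, column family, and
    the 0.90-era HBase client API are assumptions): instead of one HDFS file
    per record, each small file becomes a row in an HBase table, so the HDFS
    namespace only has to hold HBase's own store files.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class StoreSmallFile {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "smallfiles");    // assumed table with family "f"
                byte[] content = Bytes.toBytes("tiny file contents");
                Put put = new Put(Bytes.toBytes("file-0000001")); // row key = former file name
                put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), content);
                table.put(put);
                table.close();
            }
        }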

Discussion Overview
group: hdfs-user
categories: hadoop
posted: May 30, '11 at 2:45a
active: May 30, '11 at 3:51p
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop
