Hi,

I am writing an application to copy all the files from a regular PC to a
SequenceFile. I can certainly do this by simply recursing through all the
directories on my PC, but I wonder if there is any way to parallelize it,
perhaps even as a MapReduce task. Tom White's book seems to imply that it
would have to be a custom application.

Thank you,
Mark

  • Philip (flip) Kromer at Feb 2, 2009 at 5:44 am
    Could you tar.bz2 them up (setting up the tar so that it made a few dozen
    files), toss them onto the HDFS, and use
    http://stuartsierra.com/2008/04/24/a-million-little-files
    to go into SequenceFile?

    This lets you preserve the originals and do the sequence file conversion
    across the cluster. It's only really helpful, of course, if you also want to
    prepare a .tar.bz2 so you can clear out the sprawl.
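
    For concreteness, a minimal sketch of the per-archive conversion step, in
    the spirit of the post linked above (not Stuart's actual code): it assumes
    Apache commons-compress on the classpath and stores each tar entry as a
    filename -> bytes record. Running one such conversion per archive, e.g. as
    a map task, is what spreads the work across the cluster.

        // Hypothetical sketch: unpack one .tar.bz2 from HDFS into a
        // block-compressed SequenceFile (filename -> file bytes).
        import java.io.IOException;
        import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
        import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
        import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class TarToSequenceFile {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path in = new Path(args[0]);   // the .tar.bz2 on HDFS
            Path out = new Path(args[1]);  // the SequenceFile to create
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
            TarArchiveInputStream tar = new TarArchiveInputStream(
                new BZip2CompressorInputStream(fs.open(in)));
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
              if (!entry.isFile()) continue;
              byte[] buf = new byte[(int) entry.getSize()];
              IOUtils.readFully(tar, buf, 0, buf.length);
              writer.append(new Text(entry.getName()), new BytesWritable(buf));
            }
            tar.close();
            writer.close();
          }
        }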

    flip

    --
    http://www.infochimps.org
    Connected Open Free Data
  • Mark Kerzner at Feb 2, 2009 at 3:24 pm
    Truly, I do not see any advantage to doing this, as opposed to writing
    (Java) code that copies the files to HDFS, because then tarring becomes my
    bottleneck. Unless, that is, I write code to measure the file sizes and
    prepare pointers for multiple tarring tasks. That becomes pretty complex,
    though, and I was hoping for something simple. I might as well accept that
    copying one hard drive to HDFS is not going to be parallelized.
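
    For what it's worth, the size-balancing step can stay fairly small. A
    hypothetical sketch (the greedy largest-first strategy and the names here
    are illustrative, not something proposed in the thread): walk the local
    tree once, then assign each file to the currently smallest of N buckets,
    one bucket per tarring (or copying) task.

        // Hypothetical: partition local files into N roughly equal-sized
        // groups so each tar/copy task gets about the same number of bytes.
        import java.io.File;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.Comparator;
        import java.util.List;

        public class SizeBalancer {
          public static List<List<File>> partition(File root, int n) {
            List<File> files = new ArrayList<File>();
            collect(root, files);
            // Largest-first greedy assignment keeps bucket totals close.
            Collections.sort(files, new Comparator<File>() {
              public int compare(File a, File b) {
                return Long.signum(b.length() - a.length());
              }
            });
            List<List<File>> buckets = new ArrayList<List<File>>();
            long[] totals = new long[n];
            for (int i = 0; i < n; i++) buckets.add(new ArrayList<File>());
            for (File f : files) {
              int min = 0;
              for (int i = 1; i < n; i++) if (totals[i] < totals[min]) min = i;
              buckets.get(min).add(f);
              totals[min] += f.length();
            }
            return buckets;
          }

          private static void collect(File dir, List<File> out) {
            File[] children = dir.listFiles();
            if (children == null) return;
            for (File f : children) {
              if (f.isDirectory()) collect(f, out);
              else out.add(f);
            }
          }
        }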
    Mark

  • Tom White at Feb 2, 2009 at 3:43 pm
    Is there any reason why it has to be a single SequenceFile? You could
    write a local program that writes several block-compressed SequenceFiles
    in parallel (to HDFS), each containing a portion of the files on your
    PC.
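
    A minimal sketch of that idea, assuming the file list has already been
    split into buckets (for example by the size-balancing code sketched
    above): one thread per bucket, each writing its own block-compressed
    SequenceFile straight into HDFS.

        // Minimal sketch: one writer thread per bucket, each producing its
        // own block-compressed SequenceFile directly in HDFS.
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class ParallelSeqWriter {
          public static void writeAll(final List<List<File>> buckets,
                                      final Path outDir) throws Exception {
            final Configuration conf = new Configuration();
            ExecutorService pool = Executors.newFixedThreadPool(buckets.size());
            for (int i = 0; i < buckets.size(); i++) {
              final int part = i;
              pool.submit(new Callable<Void>() {
                public Void call() throws IOException {
                  FileSystem fs = FileSystem.get(conf);
                  SequenceFile.Writer w = SequenceFile.createWriter(
                      fs, conf, new Path(outDir, "part-" + part),
                      Text.class, BytesWritable.class,
                      SequenceFile.CompressionType.BLOCK);
                  for (File f : buckets.get(part)) {
                    byte[] buf = new byte[(int) f.length()];
                    FileInputStream in = new FileInputStream(f);
                    try {
                      IOUtils.readFully(in, buf, 0, buf.length);
                    } finally {
                      in.close();
                    }
                    w.append(new Text(f.getPath()), new BytesWritable(buf));
                  }
                  w.close();
                  return null;
                }
              });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
          }
        }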

    Tom
  • Mark Kerzner at Feb 2, 2009 at 3:46 pm
    No, no reason for a single file - just a little simpler to think about. By
    the way, can multiple MapReduce workers read the same SequenceFile
    simultaneously?
  • Tom White at Feb 2, 2009 at 4:01 pm
    Yes. SequenceFile is splittable, which means it can be broken into
    chunks, called splits, each of which can be processed by a separate
    map task.
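
    To make that concrete, a minimal sketch using the mapred API of the day
    (the job itself, listing each stored file's size, is purely illustrative):
    SequenceFileInputFormat computes the splits, and each map task reads only
    its own split, so many workers consume one file concurrently.

        // Minimal sketch (old mapred API): a map-only job over the
        // SequenceFile; the framework creates one map task per split.
        import java.io.IOException;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.hadoop.mapred.SequenceFileInputFormat;

        public class ListFileSizes {
          public static class SizeMapper extends MapReduceBase
              implements Mapper<Text, BytesWritable, Text, Text> {
            public void map(Text name, BytesWritable contents,
                            OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
              // Each record is one original file: name -> bytes.
              out.collect(name,
                  new Text(Integer.toString(contents.getLength())));
            }
          }

          public static void main(String[] args) throws IOException {
            JobConf job = new JobConf(ListFileSizes.class);
            job.setInputFormat(SequenceFileInputFormat.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(SizeMapper.class);
            job.setNumReduceTasks(0); // map-only; one map per split
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            JobClient.runJob(job);
          }
        }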

    Tom

Discussion Overview
group: common-user
categories: hadoop
posted: Feb 2, 2009 at 5:23 AM
active: Feb 2, 2009 at 4:01 PM
posts: 6
users: 3
website: hadoop.apache.org
irc: #hadoop
