Hi

I want to unzip a file that lives on an external (external to
HDFS) filesystem into HDFS, so that the unzipped files end up in some
folder on HDFS. This needs to be as efficient as possible - so e.g.
if it is done in Java code it probably needs to involve java.nio.channels
or something else that works directly with I/O resources. Can anyone
point me to the best/easiest/most efficient way to do this? I would like
to at least be able to invoke/initiate the unzip process from Java code,
but I guess I can invoke anything from Java, so that is not much of a
requirement.

Regards, Per Steffensen


  • Stephan Gammeter at Aug 30, 2011 at 4:34 pm
    Hey Per,

    Your performance will most likely be limited by your connection to HDFS
    and by replication. If you are connected via a 1 Gbps LAN and have 3-fold
    replication, then you can write at most 1/3 Gbps to HDFS. (Note: if you
    write many, many small HDFS files then of course everything will be
    horribly slow anyway.) I had to do something like this once (writing the
    files in a tar archive to a sequence file) and Java was never the
    bottleneck. Or do you have a massively faster connection to HDFS?

    best,
    Stephan
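
    A minimal sketch of the packing approach mentioned above (bundling many
    small files into one SequenceFile instead of writing many tiny HDFS
    files), assuming Hadoop's SequenceFile.createWriter API and adapting the
    tar case to the zip file discussed in this thread; the class name and
    paths are placeholders only:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ZipToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path seqPath = new Path(args[1]);           // e.g. /user/steff/archive.seq (placeholder)

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, seqPath, Text.class, BytesWritable.class);
            try (ZipInputStream zin = new ZipInputStream(new FileInputStream(args[0]))) {
                ZipEntry entry;
                byte[] buf = new byte[64 * 1024];
                while ((entry = zin.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    // Buffer the whole entry in memory - only sensible for small files.
                    ByteArrayOutputStream baos = new ByteArrayOutputStream();
                    int n;
                    while ((n = zin.read(buf)) != -1) baos.write(buf, 0, n);
                    // Key = original file name, value = file contents.
                    writer.append(new Text(entry.getName()),
                                  new BytesWritable(baos.toByteArray()));
                    zin.closeEntry();
                }
            } finally {
                writer.close();
            }
        }
    }

    The keys hold the original entry names and the values the raw bytes, so
    the files can be pulled back out later with a SequenceFile.Reader.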
  • Per Steffensen at Aug 30, 2011 at 7:14 pm
    Hi

    Thanks for your reply.

    Well, I am not sure about the speed of the connection to HDFS. The job
    that needs to unzip from a "normal" file to HDFS will be running on one
    of the machines participating in the HDFS cluster, so I guess at least
    the access to the local part of HDFS will be fast. But that of course
    will not help much, because the data needs to be replicated to (2) other
    nodes. The connection among the HDFS nodes I expect to be faster than 1
    Gbps - 10 or 100. The zip file will actually live on a machine remote to
    all the HDFS nodes. Those machines will have a mount to a machine in the
    DMZ where the zip file lives, and will access the zip file over that
    mount (probably an sshfs mount). The connection between the HDFS nodes
    and the machine in the DMZ I also expect to be faster than 1 Gbps - 10
    or 100. But basically I really don't know yet about the speed of the
    different connections mentioned.

    I am looking for the fastest way to do it. Of course I can use ZipFile
    etc. from the JDK to unzip and write the unzipped data to HDFS files,
    but if there is a more "direct" I/O way I would prefer that. So basically
    this is a question about whether a "smarter method" exists or not.
    Whether such a "smarter method" would actually make the unzip process
    faster will of course depend on whether the non-"direct"-I/O
    Java-ZipFile way is a bigger bottleneck than the network bandwidth
    (among the HDFS nodes, or between the HDFS nodes and the DMZ).

    Any additional comments are very welcome.
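
    For reference, the non-"direct"-I/O route mentioned above (JDK
    ZipInputStream streamed into HDFS via the FileSystem API) could be
    sketched roughly as follows; the class name, paths and overwrite flag
    are illustrative assumptions only:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ZipToHdfs {
        public static void main(String[] args) throws Exception {
            String zipPath = args[0];     // e.g. /mnt/dmz/data.zip (the sshfs mount) - placeholder
            String targetDir = args[1];   // e.g. /user/steff/unzipped - placeholder

            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            try (ZipInputStream zin =
                     new ZipInputStream(new BufferedInputStream(new FileInputStream(zipPath)))) {
                ZipEntry entry;
                while ((entry = zin.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    Path target = new Path(targetDir, entry.getName());
                    FSDataOutputStream out = fs.create(target, true);   // true = overwrite
                    try {
                        // Stream the current entry straight into HDFS; do not let
                        // copyBytes close the zip stream, the next entries still need it.
                        IOUtils.copyBytes(zin, out, conf, false);
                    } finally {
                        out.close();
                    }
                    zin.closeEntry();
                }
            }
        }
    }

    The copy is purely streaming, so memory use stays constant no matter how
    large the individual entries are.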


Discussion Overview
group: hdfs-user
categories: hadoop
posted: Aug 30, '11 at 8:20a
active: Aug 30, '11 at 7:14p
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop
