Thanks for your reply.
Well, I am not sure about the speed of the connection to HDFS. The job
that needs to unzip from a "normal" file to HDFS will be running on one
of the machines participating in the HDFS, so I guess at least the
access to the local part of the HDFS will be fast. But this of course
will not help much because the data needs to be replicated to (2) other
nodes. The connection among the HDFS-nodes I expect to be faster than 1
Gbps (10 or 100 Gbps). The zip file will actually live on a machine remote to
all the HDFS-nodes. Those machines will have a mount to a machine in the DMZ
where the zip file will live, and access the zip file over that mount
(probably a sshfs-mount). The connection between HDFS-nodes and the
machine in the DMZ I also expect to be faster than 1 Gbps (10 or 100 Gbps). But
basically I really don't know yet about the speed of the different
connections.
I seek the fastest way to do it. Of course I can use ZipFile etc. from
the JDK to unzip and write the unzipped data to HDFS files, but if there
is a more "direct"-I/O way I would prefer to do that. So basically this
is a question of whether a "smarter method" exists. Whether this
"smarter method" would actually make the unzip process faster will of
course depend on whether the non-"direct"-I/O Java ZipFile approach is
a bigger bottleneck than the network bandwidth (among HDFS-nodes or
between HDFS-nodes and the DMZ).
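For reference, the plain JDK route mentioned above could look roughly
like the sketch below. It is only a sketch under assumptions: it uses
ZipInputStream (the streaming sibling of ZipFile) so the archive is read
sequentially, and the names ZipToSink and SinkOpener are made up for
illustration. The sink is kept abstract so the copy loop does not depend
on Hadoop classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the JDK or Hadoop.
final class ZipToSink {

    /** Hypothetical hook: opens the destination stream for one zip entry. */
    interface SinkOpener {
        OutputStream open(String entryName) throws IOException;
    }

    /**
     * Streams every file entry of the zip to the sink and returns the
     * number of files written. Directory entries are skipped.
     * Entry names are passed through unchecked, so the sink should
     * reject names such as "../x" before creating anything.
     */
    static int unzip(InputStream zipSource, SinkOpener sink, int bufferSize)
            throws IOException {
        byte[] buffer = new byte[bufferSize];
        int files = 0;
        try (ZipInputStream zin = new ZipInputStream(zipSource)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // getNextEntry() closes the previous entry itself
                }
                try (OutputStream out = sink.open(entry.getName())) {
                    int n;
                    // read() returns -1 at the end of the current entry
                    while ((n = zin.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
                files++;
            }
        }
        return files;
    }
}
```

With Hadoop on the classpath, the sink could presumably be something
like `name -> fs.create(new Path(targetDir, name))`, where fs is an
org.apache.hadoop.fs.FileSystem and targetDir is a placeholder for the
destination folder. I have not measured this, so it says nothing yet
about where the bottleneck is.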
Any additional comments are very welcome.
Stephan Gammeter wrote:
Your performance will most likely be limited by your connection to
HDFS and replication. If you are connected via 1Gbps lan and have
3-fold replication, then you can write at most 1 / 3 Gbps to HDFS.
(Note: If you write many many small HDFS files then of course
everything will be horribly slow anyways) I had to do something like
this once (write files in a tar archive to a sequence file) and java
was never the bottleneck. Or do you have a massively faster connection?
On 30.08.2011 10:19, Per Steffensen wrote:
I want to unzip a file that is living on an external (external from
HDFS) filesystem to HDFS, so that the unzipped files end up in some
folder on the HDFS. This needs to be as efficient as possible - so
e.g. if it is done in Java code it probably needs to involve
java.nio.channels stuff or something that works directly with I/O
resources. Can anyone point me to the best/easiest/most efficient way
to do this? I would like to at least be able to invoke/initiate the
unzip-process from java code, but I guess I can invoke anything from
java, so that is not much of a requirement.
Regards, Per Steffensen