May 12, 2011 at 11:09 am
Thanks for the reply.
Maybe I was not clear enough when explaining the use case... Sorry for
that.
1- we are not running map reduce jobs
2- the read from hdfs is sequential
3- the network is not heavily used
I want to read 1 file remotely from a distributed filesystem. I have 2
options:
1- reading it from HDFS
2- reading it from a usual distributed filesystem (which stores the
whole file on the same machine, rather than splitting it into blocks
and then distributing them as HDFS does)
Option 1 could be slower than option 2, since it introduces more
overhead (for each new HDFS block to read, a connection has to be
established with the datanode containing that block...)
Is that right?
Yes, it could get slower, because the operation would now involve a
disk read AND a network transfer (with the other small overheads it
carries).
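To put a rough number on the per-block connection overhead discussed
above, here is a minimal back-of-the-envelope sketch. It is a toy model,
not a measurement: the function name, the 64 MiB block size, the 5 ms
connection setup time and the 100 MiB/s bandwidth are all assumptions
chosen for illustration, and real reads are affected by many other
factors (disk speed, caching, contention).

```python
import math


def estimated_read_seconds(file_bytes, block_bytes, connect_s, bandwidth_bytes_per_s):
    """Toy estimate: one connection setup per block, plus pure transfer time."""
    n_blocks = max(1, math.ceil(file_bytes / block_bytes))
    return n_blocks * connect_s + file_bytes / bandwidth_bytes_per_s


MB = 1024 * 1024
file_bytes = 1024 * MB   # hypothetical 1 GiB file
bandwidth = 100 * MB     # hypothetical 100 MiB/s link
connect_s = 0.005        # hypothetical 5 ms to set up one connection

# HDFS-like case: assumed 64 MiB blocks, one connection per block.
hdfs_s = estimated_read_seconds(file_bytes, 64 * MB, connect_s, bandwidth)

# Unsplit case: the whole file is treated as a single "block", one connection.
single_s = estimated_read_seconds(file_bytes, file_bytes, connect_s, bandwidth)

print(f"split into blocks: {hdfs_s:.3f} s, unsplit: {single_s:.3f} s")
```

Under these assumed numbers the extra cost is 15 additional connection
setups, which is small next to the transfer time; with small blocks or
slow connection setup the gap would grow.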
2011/5/12 Hassen Riahi <email@example.com>:
Thank you Elton and Stanley for your reply.
Given that we are not running map reduce jobs (at least until now) +
assuming that the read is sequential + in case where the network is
heavily used, I would expect to see in general a degradation when
reading 1 file from HDFS (HDFS blocks will be read sequentially from
different datanodes) compared to reading it from a usual distributed
filesystem (which stores the file without splitting it). Is that right?
Reads in HDFS are sequential, i.e. one block is read after another. A
client connects to one datanode to read a block, then connects to
another (or the same) datanode to read the next block.
The reason for this sequential design, I guess, is to avoid a network
traffic explosion in a heavy map reduce job.
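The block-after-block behaviour described above can be sketched roughly
as follows. This is only an in-memory simulation and does not use the
real HDFS client API: the DATANODES dictionary, the BLOCK_LOCATIONS list
and the function name are hypothetical stand-ins for illustration.

```python
# Hypothetical stand-in for a cluster: datanode name -> {block id: bytes}.
DATANODES = {
    "dn1": {"blk_0": b"hello ", "blk_2": b"reader"},
    "dn2": {"blk_1": b"hdfs "},
}

# Hypothetical block list for one file, in file order:
# (block id, datanode holding a copy of that block).
BLOCK_LOCATIONS = [("blk_0", "dn1"), ("blk_1", "dn2"), ("blk_2", "dn1")]


def read_file_sequentially(block_locations, datanodes):
    """Read blocks one after another, 'connecting' to one datanode per block."""
    data = bytearray()
    connections = []
    for block_id, node in block_locations:
        # A new connection is recorded for every block, even when the
        # next block lives on a datanode that was already visited.
        connections.append(node)
        data += datanodes[node][block_id]
    return bytes(data), connections


content, connections = read_file_sequentially(BLOCK_LOCATIONS, DATANODES)
print(content)
print(connections)
```

The point of the sketch is that only one block is in flight at a time,
so the client never talks to more than one datanode at once.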
To my understanding, the reader reads file blocks in parallel.
From: Hassen Riahi
Sent: May 7, 2011 23:50
Subject: Read files from hdfs
Is the read operation of 1 file stored in HDFS done in parallel?
I mean, let's say that I have 1 file split into 2 blocks (HDFS blocks),
and each block is stored in 1 rack.
When reading this file, are both blocks read in parallel? Or is the
first block read and then, once done, the second block is read?
If the latter is right, the read of files in HDFS is then sequential.
Is that right, or am I missing something?
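For readers unfamiliar with the "1 file split into 2 blocks" setup in the
question, here is a small sketch of how a file's byte range maps onto
fixed-size blocks. The function name and the sizes (a 100 MiB file with
an assumed 64 MiB block size) are hypothetical and chosen only so that
the file ends up with exactly 2 blocks.

```python
def block_ranges(file_bytes, block_bytes):
    """Return (offset, length) for each block of a file, in file order."""
    ranges = []
    offset = 0
    while offset < file_bytes:
        # The last block may be shorter than the configured block size.
        length = min(block_bytes, file_bytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


MB = 1024 * 1024
ranges = block_ranges(100 * MB, 64 * MB)  # hypothetical sizes

for index, (offset, length) in enumerate(ranges):
    print(f"block {index}: offset={offset} length={length}")
```

A sequential reader would consume these ranges in order, finishing the
first before starting the second.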