Hi all,

Is the read operation of a single file stored in HDFS done in parallel?

Let's say I have one file split into 2 blocks (HDFS blocks), and
each block is stored in one rack.
When reading this file, are both blocks read in parallel? Or is the first
block read and, once that is done, the read of the second block begins?
If the latter is right, then file reads in HDFS are sequential.
Is that right, or am I missing something?

Thanks,
Hassen


  • Stanley Shi at May 7, 2011 at 4:02 pm
    To my understanding, the reader reads file blocks in parallel.

    -----Original Message-----
    From: Hassen Riahi
    Sent: May 7, 2011 23:50
    To: hdfs-user@hadoop.apache.org
    Subject: Read files from hdfs

  • Elton sky at May 8, 2011 at 7:39 am
    Hassen,

    Reads in HDFS are sequential, i.e. the client reads one block after another.
    Each time, the client connects to one data node to read a block, then
    connects to another (or the same) data node to read the next block.
    The reason for this sequential design, I guess, is to avoid a network
    traffic explosion during a heavy MapReduce job.

    -Elton
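
A minimal sketch of the difference between the two read patterns discussed above, using a local temporary file as a stand-in for an HDFS file. Nothing here talks to HDFS: `BLOCK_SIZE`, `block_ranges`, `read_block`, `read_sequential`, `read_parallel` and `demo` are hypothetical names chosen for illustration, and the 4-byte block size is deliberately tiny. The sequential function mirrors the block-after-block behaviour described in the reply; the parallel one shows what an application would have to do itself (concurrent positional reads, reassembled in order) if it wanted parallelism.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # bytes; tiny on purpose (assumption, real HDFS blocks are far larger)


def block_ranges(file_size, block_size):
    """Return (offset, length) for each block of a file of the given size."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def read_block(path, offset, length):
    """Positional read of one block; each call opens its own file handle."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_sequential(path, block_size=BLOCK_SIZE):
    """Read block 0, then block 1, and so on, one at a time."""
    size = os.path.getsize(path)
    return b"".join(read_block(path, off, n)
                    for off, n in block_ranges(size, block_size))


def read_parallel(path, block_size=BLOCK_SIZE, workers=4):
    """Fetch all blocks concurrently, then reassemble them in block order."""
    size = os.path.getsize(path)
    ranges = block_ranges(size, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the blocks stay ordered.
        parts = pool.map(lambda r: read_block(path, *r), ranges)
        return b"".join(parts)


def demo():
    """Write a small temporary file and read it back both ways."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
        path = tmp.name
    try:
        return read_sequential(path), read_parallel(path)
    finally:
        os.remove(path)
```

Both functions return the same bytes; only the order in which the block reads are issued differs. Whether the parallel variant is actually faster depends on where the blocks live and on what the network and disks can sustain, which this local sketch does not model.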

  • Hassen Riahi at May 11, 2011 at 9:58 pm
    Thank you, Elton and Stanley, for your replies.

    Given that we are not running MapReduce jobs (at least so far),
    assuming that reads are sequential, and assuming the network is
    not heavily used, I would expect in general to see a performance
    degradation when reading one file from HDFS (its blocks will be read
    sequentially from different datanodes) compared to reading it from a
    conventional filesystem (which stores a file without splitting it).
    Is that right?

    Thanks,
    Hassen

  • Harsh J at May 12, 2011 at 5:36 am
    Yes, it could get slower, because the operation would now involve a disk
    read AND a network transfer (along with the other small overheads that
    carries).



    --
    Harsh J
  • Hassen Riahi at May 12, 2011 at 11:09 am
    Thanks for the reply.

    Maybe I was not clear enough when explaining the use case. Sorry for
    that.

    Assuming:

    1- we are not running map reduce jobs
    2- the read from hdfs is sequential
    3- the network is not heavily used

    I want to read one file remotely from a distributed filesystem. I have 2
    alternatives:

    1- reading it from HDFS
    2- reading it from a conventional distributed filesystem (which stores
    the whole file on a single machine, rather than splitting it into blocks
    and distributing them as HDFS does)

    Alternative 1 could be slower than 2, since 1 introduces more overhead
    than 2 (for each new HDFS block to read, a connection must be
    established with the datanode containing that block...)

    Is it right?

    Hassen
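
One way to reason about the per-block overhead raised above is a back-of-the-envelope model: one connection setup per block, plus the time to stream the bytes. The sketch below is only that, a rough model under stated assumptions. The function names are hypothetical, and the figures in the usage note (block size, throughput, setup cost) are illustrative guesses, not measurements from any cluster.

```python
def block_count(file_bytes, block_bytes):
    """Number of blocks needed to hold the file (ceiling division)."""
    return -(-file_bytes // block_bytes)


def read_time_seconds(file_bytes, block_bytes, throughput_bps, setup_s):
    """Estimated read time: one connection setup per block plus streaming time.

    Assumes the blocks are read strictly one after another and that
    throughput is the same for every block.
    """
    return (block_count(file_bytes, block_bytes) * setup_s
            + file_bytes / throughput_bps)


def setup_share(file_bytes, block_bytes, throughput_bps, setup_s):
    """Fraction of the estimated read time spent on connection setup."""
    total = read_time_seconds(file_bytes, block_bytes, throughput_bps, setup_s)
    return block_count(file_bytes, block_bytes) * setup_s / total
```

For example, `setup_share(128 * 2**20, 64 * 2**20, 100 * 2**20, 0.01)` models a 128 MB file in 64 MB blocks, streamed at an assumed 100 MB/s with an assumed 10 ms setup per block. Under those assumptions the setup cost is a small fraction of the total, so the per-block connection would matter mostly for small blocks or very slow connection setup. Real behaviour should be measured rather than inferred from this model.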

Discussion Overview
group: hdfs-user @
categories: hadoop
posted: May 7, '11 at 3:51p
active: May 12, '11 at 11:09a
posts: 6
users: 4
website: hadoop.apache.org...
irc: #hadoop
