I have many large files, ranging from 2 GB to 800 GB, and I use hadoop fs -cat
a lot to pipe them to various programs.

I was wondering if it's possible to prefetch the data for clients with more
bandwidth. Most of my clients have 10 GbE interfaces while the datanodes are on 1 GbE.

I was thinking: prefetch x blocks (even though it will cost extra memory)
while reading block y. After block y is read, read the prefetched blocks
and then throw them away.

It should be used like this:


export PREFETCH_BLOCKS=2   # default would be 1
hadoop fs -pcat hdfs://namenode/verylargefile | program

Any thoughts?

--
--- Get your facts first, then you can distort them as you please.--
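
The read-ahead idea above can be approximated against the stock FileSystem API.
The following is a minimal, illustrative sketch rather than an existing -pcat
command: the PrefetchCat class name and the single background reader thread are
assumptions, and real error handling is omitted. A bounded queue holds at most
PREFETCH_BLOCKS block-sized buffers, so the reader can stay ahead of the program
consuming stdout without unbounded memory use.

import java.io.OutputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrefetchCat {
  public static void main(String[] args) throws Exception {
    String env = System.getenv("PREFETCH_BLOCKS");
    int prefetchBlocks = Math.max(1, env == null ? 1 : Integer.parseInt(env));
    Path path = new Path(args[0]);

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    FileStatus status = fs.getFileStatus(path);
    long blockSize = status.getBlockSize();
    long fileLen = status.getLen();

    // Bounded queue: the reader thread runs at most prefetchBlocks blocks
    // ahead of stdout, so memory use is roughly prefetchBlocks * blockSize.
    BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(prefetchBlocks);

    Thread reader = new Thread(() -> {
      try (FSDataInputStream in = fs.open(path)) {
        for (long pos = 0; pos < fileLen; pos += blockSize) {
          int len = (int) Math.min(blockSize, fileLen - pos);
          byte[] buf = new byte[len];
          in.readFully(pos, buf, 0, len);   // positioned read of one block
          queue.put(buf);                   // blocks while the queue is full
        }
        queue.put(new byte[0]);             // zero-length buffer marks end of file
      } catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
      }
    });
    reader.start();

    // Consumer: write blocks to stdout in order, then drop each buffer.
    OutputStream out = System.out;
    for (byte[] buf = queue.take(); buf.length > 0; buf = queue.take()) {
      out.write(buf);
    }
    out.flush();
    reader.join();
  }
}

It could be run with something like
PREFETCH_BLOCKS=2 hadoop jar prefetch-cat.jar PrefetchCat hdfs://namenode/verylargefile | program
(the jar name is illustrative).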


  • Steve Loughran at Jul 6, 2011 at 11:36 am

    On 06/07/11 11:08, Rita wrote:
    [...]

    Look at Russ Perry's work on doing very fast fetches from an HDFS filestore
    http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

    Here the DFS client got some extra data on where every copy of every
    block was, and the client decided which machine to fetch it from. This
    made the best use of the entire cluster, by keeping each datanode busy.


    -steve
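
The block map Steve describes is visible through the standard client API:
FileSystem.getFileBlockLocations returns, for every block of a file, its offset,
its length, and the host of each replica. A small sketch that just prints that
map (the ListBlockMap class name is illustrative):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockMap {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
    FileStatus status = fs.getFileStatus(path);

    // One BlockLocation per block, covering the whole file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d replicas=%s%n",
          b.getOffset(), b.getLength(), Arrays.toString(b.getHosts()));
    }
  }
}

What the unmodified client does not expose is a way to force a given read onto a
chosen replica; that is the part the patch mentioned in the next message added.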
  • Rita at Jul 7, 2011 at 7:23 am
    Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
    see any example code for the implementation.


    On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran wrote:
    [...]


    --
    --- Get your facts first, then you can distort them as you please.--
  • Steve Loughran at Jul 7, 2011 at 9:36 am

    On 07/07/11 08:22, Rita wrote:
    Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
    see any example code for the implementation.
    No. I think I have access to Russ's source somewhere, but there'd be
    paperwork in getting it released. Russ said it wasn't too hard to do: he
    just had to patch the DFS client to offer up the entire list of block
    locations to the client, and let the client program make the decision.
    If you discussed this on the hdfs-dev list (via a JIRA), you may be able
    to get a patch for this accepted, though you'd have to do the code and
    tests yourself.

    -steve
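
A rough sketch of the client-side decision described above. The assumption here is
that a simple round-robin over each block's replica hosts stands in for "the
decision": the planner assigns every block of a file to one of its replicas so that
no single datanode serves them all. Actually pinning each read to its chosen host
is the part that needed the DFSClient patch; the class and method names below are
illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockScheduler {

  // Plan which replica host should serve each block of the file.
  public static Map<String, List<BlockLocation>> planReads(FileSystem fs, Path path)
      throws Exception {
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    Map<String, List<BlockLocation>> perHost = new HashMap<>();
    for (int i = 0; i < blocks.length; i++) {
      String[] hosts = blocks[i].getHosts();
      if (hosts.length == 0) {
        continue;  // block with no reachable replica
      }
      // Round-robin over the replicas so no single datanode is asked to
      // serve every block; the point is to keep all of them busy.
      String chosen = hosts[i % hosts.length];
      perHost.computeIfAbsent(chosen, h -> new ArrayList<>()).add(blocks[i]);
    }
    return perHost;
  }

  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
    planReads(fs, path).forEach((host, blks) ->
        System.out.printf("%s -> %d blocks%n", host, blks.size()));
  }
}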
  • Rita at Jul 7, 2011 at 12:36 pm
    Thanks again Steve.

    I will try to implement it with Thrift.

    On Thu, Jul 7, 2011 at 5:35 AM, Steve Loughran wrote:
    [...]

    --
    --- Get your facts first, then you can distort them as you please.--

Discussion Overview
group: common-user @ hadoop.apache.org
category: hadoop
posted: Jul 6, 2011 at 10:09 AM
active: Jul 7, 2011 at 12:36 PM
posts: 5
users: 2 (Rita: 3 posts, Steve Loughran: 2 posts)
website: hadoop.apache.org...
irc: #hadoop