Hi,

Is it possible to cache data selectively on slave machines?

Let's say I have data partitioned as D1, D2, and so on. D1 is required by
reducer R1, D2 by R2, and so on. I know this beforehand because
HashPartitioner.getPartition was used to partition the data.
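For reference, Hadoop's default HashPartitioner derives the reducer index from the key's hash code. A minimal self-contained sketch of that formula (the class name and sample key below are illustrative, not from the thread or from Hadoop itself):

```java
// Sketch of the default HashPartitioner formula:
// partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
// PartitionDemo is an illustrative stand-in, not a Hadoop class.
public class PartitionDemo {
    static int getPartition(String key, int numReduceTasks) {
        // Mask the sign bit so negative hash codes still yield a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 4 reducers, every record with the same key lands on the
        // same reducer, which is why the D_i -> R_i mapping is predictable.
        System.out.println(PartitionDemo.getPartition("a", 4));
    }
}
```

Because the mapping is a pure function of the key and the reducer count, the same pre-partitioning applied to the side data predicts which reducer will need which D.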

If I put D1, D2, ... in the distributed cache, the data is copied to all
machines. Is it possible to cache data selectively on specific machines?

Thanks,
Taran


  • Lohit at Nov 11, 2008 at 8:35 pm
    DistributedCache copies the cached data to every node. If you know the mapping of R* to D*, how about having each reducer read the D it expects directly from DFS? The distributed cache only helps when the same data is used by multiple tasks on the same node, so that you avoid hitting DFS multiple times. If each 'D' is read by exactly one 'R', you are not buying much with DistributedCache. Also keep in mind that if your read takes a long time, your reducers might time out by failing to report status.
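One way to follow this suggestion is to pre-split the side data into one file per partition and have each reducer derive the path of its own file from its partition number. A hedged sketch, assuming a /data/side/D-NNNNN layout (the path and class name are assumptions for illustration; in a real job the partition number would come from the task configuration, e.g. the mapred.task.partition property):

```java
// Hedged sketch: derive a per-reducer side-data path so each reducer
// reads only its own D_i from DFS, instead of caching every D_i on
// every node. The /data/side/D-NNNNN layout is an assumed convention.
public class SideDataPath {
    static String pathForPartition(int partition) {
        // Zero-padded to 5 digits, mirroring Hadoop's part-NNNNN naming.
        return String.format("/data/side/D-%05d", partition);
    }

    public static void main(String[] args) {
        // Reducer R3 would open only /data/side/D-00003 from DFS.
        System.out.println(SideDataPath.pathForPartition(3));
    }
}
```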

    Thanks,
    Lohit



    ----- Original Message ----
    From: Tarandeep Singh <tarandeep@gmail.com>
    To: core-user@hadoop.apache.org
    Sent: Tuesday, November 11, 2008 10:56:41 AM
    Subject: Caching data selectively on slaves

  • Tarandeep Singh at Nov 11, 2008 at 9:15 pm
    Hi Lohit,

    I thought of keeping the data on DFS and reading it from there, but storing
    the data on DFS will turn out to be expensive:

    1) The data is replicated across the cluster.
    2) When reading, reducer Ri may not have data Di on the same machine,
    so a remote DFS read will occur.

    That was the reason I wondered whether I could selectively cache the data on
    the respective machines.
    And thanks for the tip that I should keep my read time to a minimum or the
    reducers might time out. I will keep this in mind.
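The timeout concern can also be handled by reporting progress while the side data loads. A self-contained sketch in which a small interface stands in for Hadoop's Progressable/Reporter (all names here are illustrative, not the thread's code):

```java
// Sketch: process side data in fixed-size chunks, signalling progress
// after each chunk so the framework does not kill the task for failing
// to report status during a long load.
public class ProgressfulLoad {
    // Stand-in for Hadoop's Progressable interface.
    interface Progressable { void progress(); }

    static long loadInChunks(byte[] data, Progressable reporter) {
        final int chunk = 64 * 1024;
        long total = 0;
        for (int off = 0; off < data.length; off += chunk) {
            total += Math.min(chunk, data.length - off);
            reporter.progress(); // in a real reducer: reporter.progress()
        }
        return total;
    }

    public static void main(String[] args) {
        long[] calls = {0};
        long total = loadInChunks(new byte[200_000], () -> calls[0]++);
        System.out.println(total + " bytes in " + calls[0] + " chunks");
    }
}
```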

    -Taran
    On Tue, Nov 11, 2008 at 12:33 PM, lohit wrote:


Discussion Overview
group: common-user
categories: hadoop
posted: Nov 11, '08 at 6:57p
active: Nov 11, '08 at 9:15p
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion

Tarandeep Singh: 2 posts Lohit: 1 post
