Nov 11, 2008 at 8:35 pm
DistributedCache copies the cached data to all nodes. If you know the mapping of R* to D*, how about having each reducer read the D it expects directly from DFS? DistributedCache only helps when the same data is used by multiple tasks on the same node, because then those tasks do not each have to read it from DFS. If you know that each 'D' is read by exactly one 'R', you are not buying much with DistributedCache. Keep in mind, though, that if the read takes a long time, your reducers might time out for failing to report status.
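A rough sketch of what I mean, in plain Java so it runs without a cluster. The partition formula is the one HashPartitioner.getPartition uses. Everything else is an assumption for illustration: the class name, the `<baseDir>/D<i>` naming scheme, and the helper methods are hypothetical. In a real job the InputStream would come from FileSystem.open(Path) and the callback would be Reporter.progress(); here a byte array and a Runnable stand in for them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class PartitionRead {

    // Same formula as Hadoop's HashPartitioner.getPartition:
    // clear the sign bit, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical naming scheme: partition i is stored at <baseDir>/D<i>.
    static String sideDataPath(String baseDir, int partition) {
        return baseDir + "/D" + partition;
    }

    // Reads the stream to the end and returns the byte count.
    // Calls progress.run() each time at least reportEveryBytes bytes
    // have been read since the previous call, so a long read keeps
    // signalling that the task is alive.
    static long readWithProgress(InputStream in, long reportEveryBytes,
                                 Runnable progress) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        long sinceReport = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            sinceReport += n;
            if (sinceReport >= reportEveryBytes) {
                progress.run(); // stand-in for Reporter.progress()
                sinceReport = 0;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Every key a reducer sees hashes to that reducer's partition,
        // so any one of its keys identifies which D to read.
        int partition = partitionFor("some-key", 4);
        System.out.println("would read " + sideDataPath("/data/side", partition));

        // Stand-in for a DFS stream: 10000 zero bytes held in memory.
        InputStream in = new ByteArrayInputStream(new byte[10000]);
        long bytes = readWithProgress(in, 4096,
                () -> System.out.println("progress reported"));
        System.out.println("read " + bytes + " bytes");
    }
}
```

Each reducer then opens only its own D, so nothing is copied to nodes that do not need it, and the periodic callback addresses the timeout concern.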
----- Original Message ----
From: Tarandeep Singh <[email protected]>
To: [email protected]
Sent: Tuesday, November 11, 2008 10:56:41 AM
Subject: Caching data selectively on slaves
Is it possible to cache data selectively on slave machines?
Let's say I have data partitioned as D1, D2, and so on. D1 is required by
Reducer R1, D2 by R2, and so on. I know this beforehand because
HashPartitioner.getPartition was used to partition the data.
If I put D1, D2, ... in the distributed cache, then the data is copied to all
machines. Is it possible to cache data selectively on machines?