I'd like to see whether caching Map outputs is worth the time. The idea is that
for certain jobs, many of the Map tasks will do exactly what they did the last
time they ran: for instance, a monthly report over a vast dataset in which very
little changes month to month. So each time the job runs, some percentage of
the Map tasks redo the same work they did last month.

What if the Mapper first checked HBase, or HDFS, for a cached Map result for
its input? That is, the Map input would act as a "key" to the "value" that is
the output the Map produced last month.

Would it be faster to look up these cached outputs than to re-run the Map?
That's the question I'm trying to answer.

Here are my questions:
1. Do you have any good HBase tutorials? I haven't used it.
2. Should I use HBase for this?
3. What would the Mapper code look like if it first checked the cache and, on
a hit, skipped processing the input and sent the cached result straight to
output?
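To make the idea concrete, here is a minimal sketch of a cache-first map step. All names are illustrative, and a plain in-memory HashMap stands in for HBase or an HDFS-backed store; in a real Hadoop Mapper the lookup would happen inside map() and the cache client would be opened in setup().

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: before "mapping" an input record, look it up in a cache keyed
// by the raw input. On a hit, emit the cached output directly; on a miss,
// compute, emit, and store the result for next month's run.
public class CachingMapSketch {
    private final Map<String, Integer> cache = new HashMap<>();
    public int cacheHits = 0;

    // The expensive per-record work we want to avoid repeating.
    private int expensiveMap(String record) {
        return record.length(); // placeholder computation
    }

    // "map" one record, consulting the cache first.
    public int map(String record) {
        Integer cached = cache.get(record);
        if (cached != null) {      // cache hit: skip recomputation
            cacheHits++;
            return cached;
        }
        int result = expensiveMap(record);
        cache.put(record, result); // store for the next run
        return result;
    }
}
```

Whether this wins in practice depends on whether a remote cache lookup is actually cheaper than recomputing the record, which is the open question above.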

Any input is greatly appreciated.
--
View this message in context: http://old.nabble.com/Map-Result-Caching-tp31415113p31415113.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


  • Robert Evans at Apr 18, 2011 at 1:43 pm
    DoomUs,

    To me this seems like something that belongs at the application level rather than at the Hadoop level. If there really is very little delta between runs, the application could save the output of a map-only job, and on the next run take the union of that saved output and the output of processing a delta file, rather than trying to detect and cache results automatically behind the scenes. Or, if the application is just doing aggregation, it only has to save a little extra information with the aggregated data so it can process the delta on its own and then combine it with the previous result. I have seen delta processing work quite well in production.
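    The save-and-combine delta flow described above can be sketched in plain Java. The names are illustrative and plain collections stand in for job outputs; in a real pipeline each step would be a MapReduce job and the combine would happen in a reduce.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of delta processing: keep last month's aggregated result,
// aggregate only the new (delta) records, then combine the two.
// This skips re-mapping unchanged data entirely, instead of caching
// per-record Map outputs behind the scenes.
public class DeltaMergeSketch {
    // Aggregate a batch of records into counts
    // (stands in for a map-only counting job).
    public static Map<String, Integer> aggregate(String[] records) {
        Map<String, Integer> counts = new HashMap<>();
        for (String r : records) {
            counts.merge(r, 1, Integer::sum);
        }
        return counts;
    }

    // Combine the previous result with the delta's result
    // (the "union" step).
    public static Map<String, Integer> combine(Map<String, Integer> previous,
                                               Map<String, Integer> delta) {
        Map<String, Integer> merged = new HashMap<>(previous);
        delta.forEach((k, v) -> merged.merge(k, v, Integer::sum));
        return merged;
    }
}
```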

    --Bobby Evans

    On 4/17/11 5:28 PM, "DoomUs" wrote:




Discussion Overview
group: common-user (hadoop)
posted: Apr 16, 2011 at 10:06 PM
active: Apr 18, 2011 at 1:43 PM
posts: 2 (Robert Evans: 1, DoomUs: 1)
website: hadoop.apache.org
irc: #hadoop
