On Fri, Oct 9, 2009 at 11:02 PM, Ricky Ho wrote:
... To my understanding, within a Map/Reduce cycle, the input data set is
"frozen" (no change is allowed) while the output data set is "created from
scratch" (it doesn't exist beforehand). Therefore, the map/reduce model is
inherently "batch-oriented". Am I right?
Current implementations are definitely batch oriented. Keep reading,
though.
I am wondering whether Hadoop is usable for processing many data streams in
parallel.
Absolutely.
For example, think about an e-commerce site that captures users' product
searches in many log files and wants to run some analytics on those logs in
real time.
Or consider Yahoo running their ad inventories in real-time.
One naïve way is to chunk the log and perform Map/Reduce in small batches.
Since the input data file must be frozen, we need to switch subsequent
writes to a new log file.
Which is not a big deal. Moreover, these small chunks can be merged every
so often.
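A minimal sketch of that rotation idea is below; the file naming and the
10-minute bucket size are my own assumptions, not anything from the original
question:

    // Hypothetical sketch: write search events to a file named after the
    // 10-minute bucket they fall into.  Once the clock moves past a bucket
    // boundary, the previous file is never written again, so it is
    // effectively frozen and safe to feed into a Map/Reduce job.
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class BucketedLogWriter {
      private static final long BUCKET_MS = 10 * 60 * 1000L; // assumed 10-minute chunks
      private long currentBucket = -1;
      private PrintWriter out;

      public synchronized void log(String line) throws IOException {
        long bucket = System.currentTimeMillis() / BUCKET_MS;
        if (bucket != currentBucket) {      // bucket boundary crossed: roll the file
          if (out != null) out.close();     // previous chunk is now frozen
          out = new PrintWriter(new FileWriter("search-" + bucket + ".log", true));
          currentBucket = bucket;
        }
        out.println(line);
      }
    }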
However, the chunking approach is not good because the cutoff point is
quite arbitrary. Imagine I want to calculate the popularity of a product
based on the frequency of searches within the last 2 hours (a sliding time
window). I don't think Hadoop can do this computation.
Subject to a moderate delay of 5-20 minutes, this is no problem at all for
Hadoop. This is especially true if you are doing straightforward
aggregations that are associative and commutative.
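For instance, counting searches per product is a sum, which is associative
and commutative, so the same reducer can also be registered as the combiner.
Here is a minimal sketch against the Hadoop 0.20 mapreduce API; the
tab-separated log format and the field position of the product id are
assumptions, not something from the thread:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SearchCounts {
      // Emit (product, 1) for every search log line.
      public static class SearchMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final static LongWritable ONE = new LongWritable(1);
        private final Text product = new Text();

        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\t");
          product.set(fields[1]);            // assumed: product id in field 1
          context.write(product, ONE);
        }
      }

      // Because addition is associative and commutative, this reducer can
      // also be set as the combiner to shrink map output before the shuffle.
      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        protected void reduce(Text product, Iterable<LongWritable> counts,
                              Context context)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable c : counts) sum += c.get();
          context.write(product, new LongWritable(sum));
        }
      }
    }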
Of course, if we don't mind a distorted picture, we can use a jumping
window (1-3 PM, 3-5 PM, ...) instead of a sliding window, and then maybe it's
OK. But this is still not good, because we have to wait for two hours before
getting the new batch of results. (e.g., at 4:59 PM, we only have the results
from the 1-3 PM batch.)
Or just process each 10 minute period into aggregate form. Then add up the
latest 12 aggregates. Every day, merge all the small files for the day and
every month merge all the daily files.
There are very few businesses where a 10 minute delay is a big problem.
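A rough sketch of the "add up the latest 12 aggregates" step in plain Java
follows; the aggregate file naming and the product<TAB>count line format are
assumptions for illustration:

    // Combine twelve 10-minute per-product count files into a rolling
    // 2-hour total (12 buckets x 10 minutes = 2 hours).
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class SlidingWindowMerge {
      public static Map<String, Long> lastTwoHours(long newestBucket)
          throws IOException {
        Map<String, Long> totals = new HashMap<String, Long>();
        for (int i = 0; i < 12; i++) {
          // assumed file name: one pre-aggregated count file per 10-minute bucket
          String file = "counts-" + (newestBucket - i) + ".txt";
          BufferedReader in = new BufferedReader(new FileReader(file));
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");   // assumed: product<TAB>count
            long prev = totals.containsKey(parts[0]) ? totals.get(parts[0]) : 0L;
            totals.put(parts[0], prev + Long.parseLong(parts[1]));
          }
          in.close();
        }
        return totals;
      }
    }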
It doesn't seem like Hadoop is good at handling this kind of processing:
parallel processing of multiple real-time data streams. Anyone disagree?
It isn't entirely natural, but it isn't a problem.
I'm wondering if a "mapper-only" model would work better. In this case,
there is no reducer (i.e., no grouping). Each map task keeps a history (i.e.,
a sliding window) of the data it has seen and then writes the result to the
output file.
This doesn't scale at all well.
Take a look at the Chukwa project for a well worked example of how to
process logs in near real-time with Hadoop.