Saumitra,
Two questions come to mind that could help you narrow down a solution:
1) How quickly do the downstream processes need the transformed data?
Reason: If you can delay processing long enough to batch the data into a blob that is a multiple of your block size, you will be playing to the strong suit of vanilla MR.
2) What else will be running on the cluster?
Reason: If the cluster is set up primarily for this use case, then how often the job runs and what resources it consumes only need to be optimized if it can't keep up with the incoming data. If not, you could always set up a separate pool for it in the FairScheduler and allow it a certain amount of overhead on the cluster while these events are being generated.
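For reference, carving out a pool for this workload is just an entry in the FairScheduler allocation file. A minimal sketch (the pool name "eventstream" and the numbers are placeholders you would tune for your cluster):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Pool dedicated to the streaming-ingest jobs -->
  <pool name="eventstream">
    <minMaps>4</minMaps>           <!-- guaranteed map slots -->
    <minReduces>0</minReduces>     <!-- map-only workload -->
    <maxRunningJobs>2</maxRunningJobs>
    <weight>2.0</weight>           <!-- extra share when slots are free -->
  </pool>
</allocations>
```

Jobs submit into the pool by setting the pool name property in their job configuration, so the rest of the cluster's workload is unaffected.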
Outside of the fact that you would end up with a lot of small files on the cluster (which can be resolved by running a nightly job to blob them together and then delete the originals), I would not hesitate to at least try this method. It would help to know the size and type of the incoming data, as well as what kind of operation you want to perform, before making a more concrete suggestion. Log data is a prime example of this type of workflow, and there are many suggestions out there as well as projects that attempt to address it (e.g., Chukwa).
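The nightly blobbing job is conceptually simple: concatenate the day's small files into one large file, then delete the originals. Here is a minimal local-filesystem sketch of that idea in Python; in production you would run this against HDFS instead (e.g., via `hadoop fs -getmerge` or the HDFS API), and the function name and block size here are illustrative assumptions, not anything from Hadoop itself:

```python
import os

# HDFS default block size of the era; blobs near a multiple of this
# keep map tasks working on full blocks.
BLOCK_SIZE = 64 * 1024 * 1024

def blob_small_files(src_dir, blob_path, delete_originals=True):
    """Concatenate every file in src_dir into one blob, then
    delete the originals so the namenode stops tracking them."""
    names = sorted(os.listdir(src_dir))
    with open(blob_path, "wb") as blob:
        for name in names:
            path = os.path.join(src_dir, name)
            with open(path, "rb") as f:
                blob.write(f.read())
            if delete_originals:
                os.remove(path)
    return len(names)
```

The same loop structure carries over to HDFS: list the directory, stream each file into the blob, and remove originals only after the blob is written, so a mid-run failure never loses data.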
HTH,
Matt
-----Original Message-----
From: saumitra.shahapure@gmail.com On Behalf Of Saumitra Shahapure
Sent: Friday, June 24, 2011 12:12 PM
To: common-user@hadoop.apache.org
Subject: Queue support from HDFS
Hi,
Is a queue-like structure supported in HDFS, where a stream of data is
processed as it is generated?
Specifically, I will have a stream of data coming in, and a data-independent
operation needs to be applied to it (so only a Map function; the reducer is
the identity).
I wish to distribute the data among nodes using HDFS and start processing it as
it arrives, preferably in a single MR job.
I agree that it can be done by starting a new MR job for each batch of data,
but is frequently starting many MR jobs for small data chunks a good idea?
(Consider that a new batch arrives every few seconds and processing one batch
takes a few minutes.)
Thanks,
--
Saumitra S. Shahapure