Hi,

Is a queue-like structure supported on HDFS, where a stream of data is
processed as it is generated?
Specifically, I will have a stream of data coming in, and a
data-independent operation needs to be applied to it (so only a map
function; the reducer is the identity).
I wish to distribute the data among nodes using HDFS and start processing
it as it arrives, preferably in a single MR job.

I realize this can be done by starting a new MR job for each batch of
data, but is frequently starting many MR jobs on small data chunks a good
idea? (Consider that a new batch arrives every few seconds and processing
one batch takes a few minutes.)
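
A rough sketch of the per-batch, map-only job I have in mind (Hadoop 0.20
mapreduce API; the class names and paths are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BatchTransform {
    // The data-independent, per-record operation; identity for now.
    public static class TransformMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, value); // replace with the real transform
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "batch-transform");
        job.setJarByClass(BatchTransform.class);
        job.setMapperClass(TransformMapper.class);
        job.setNumReduceTasks(0); // map-only: no shuffle, no reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}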

Thanks,
--
Saumitra S. Shahapure


  • Jakob Homan at Jun 24, 2011 at 7:32 pm
    Not directly, but you may wish to take a look at the Kafka project
    (http://sna-projects.com/kafka/), which we use as a queue and then
    bring the data periodically into HDFS via an MR job. See this
    presentation: http://www.slideshare.net/ydn/hug-january-2011-kafka-presentation
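
    A minimal sketch of that pattern, with Kafka as the queue being drained
    periodically into HDFS. It uses the standalone Kafka consumer API rather
    than the hadoop-consumer MR job from the linked project, so the broker,
    topic, and path names below are assumptions:

    import java.net.URI;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class KafkaToHdfsDrain {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092"); // assumption
            props.put("group.id", "hdfs-drain");
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("events")); // assumption

            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"),
                new Configuration());
            // One file per drain; a real job would roll files and commit atomically.
            Path out = new Path("/ingest/events-" + System.currentTimeMillis());
            try (FSDataOutputStream os = fs.create(out)) {
                ConsumerRecords<String, String> batch =
                    consumer.poll(Duration.ofSeconds(10));
                for (ConsumerRecord<String, String> rec : batch) {
                    os.writeBytes(rec.value() + "\n");
                }
            }
            consumer.commitSync(); // mark the drained offsets as consumed
            consumer.close();
        }
    }
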
    -Jakob



    On Fri, Jun 24, 2011 at 10:12 AM, Saumitra Shahapure wrote:
    [quoted text trimmed]
  • Saumitra at Jun 25, 2011 at 8:02 pm
    Thanks for the reply, Jakob.

    As far as I understand, Kafka's Hadoop consumer is an MR job whose
    mappers read from a shared Kafka queue and dump the data to HDFS, but
    the mappers are not created dynamically when the queue starts to burst.

    Is there a way to have new mappers created when the job's input queue
    grows, or when the input HDFS source gets updated?

    On Saturday 25 June 2011 01:01 AM, Jakob Homan wrote:
    [quoted text trimmed]

    --
    Saumitra Shahapure
  • Bharath Mundlapudi at Jun 26, 2011 at 10:02 pm
    One solution I am thinking of: say you have Kafka or some JMS implementation that your job client subscribes to, and the client submits jobs dynamically based on the queue's input size. You may need to run a final job to combine the outputs; a sketch of the submitter loop is below.
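
    A minimal sketch of that submitter loop; queueDepth() and drainToHdfs()
    are hypothetical stand-ins for whatever queue implementation is used:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DynamicSubmitter {
        static final long THRESHOLD = 64L * 1024 * 1024; // e.g. half a 128 MB block

        public static void main(String[] args) throws Exception {
            while (true) {
                if (queueDepth() >= THRESHOLD) {       // bytes waiting in the queue
                    Path in = drainToHdfs();           // queue contents -> HDFS dir
                    Job job = new Job(new Configuration(),
                        "batch-" + System.currentTimeMillis());
                    job.setNumReduceTasks(0);          // identity map-only job
                    FileInputFormat.addInputPath(job, in);
                    FileOutputFormat.setOutputPath(job, new Path(in + ".out"));
                    job.submit();                      // non-blocking; jobs may overlap
                }
                Thread.sleep(5000);
            }
        }

        // Hypothetical helpers; both depend entirely on the queue you choose.
        static long queueDepth() { return 0; }
        static Path drainToHdfs() { return null; }
    }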

    -Bharath


    ________________________________
    From: Saumitra <saumitra.official@gmail.com>
    To: common-user@hadoop.apache.org
    Sent: Saturday, June 25, 2011 1:05 PM
    Subject: Re: Queue support from HDFS

    [quoted text trimmed]
  • GOEKE, MATTHEW (AG/1000) at Jun 27, 2011 at 11:13 am
    Saumitra,

    Two questions come to mind that could help you narrow down a solution:

    1) How quickly do the downstream processes need the transformed data?
    Reason: If you can delay processing long enough to batch the data into a blob that is a multiple of your block size, then you are obviously playing more to the strong suit of vanilla MR.

    2) What else will be running on the cluster?
    Reason: If the cluster is set up primarily for this use case, then how often the job runs and what resources it consumes only need to be optimized if it cannot keep up with the incoming batches. If not, you could always set up a separate pool for it in the fair scheduler and allow it a certain amount of overhead on the cluster while these events are being generated (see the sketch below).
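
    A sketch of routing these jobs to their own pool, assuming the Hadoop 1.x
    fair scheduler and its mapred.fairscheduler.pool job property (the pool
    name is an assumption; its share would be capped in the scheduler's
    allocations file):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PooledIngestJob {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Route the recurring ingest jobs to a dedicated fair-scheduler
            // pool so they take only a bounded share of the cluster.
            conf.set("mapred.fairscheduler.pool", "ingest");
            Job job = new Job(conf, "ingest-batch");
            job.setNumReduceTasks(0);
            return job;
        }
    }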

    Outside of the fact that you would have a lot of small files on the cluster (which can be resolved by running a nightly job to blob them and then delete the originals), I am not sure I would be too concerned about at least trying this method. It would be helpful to know the size and type of the incoming data, as well as what operation you are looking to apply, for a more concrete suggestion. Log data is a prime example of this type of workflow, and there are many suggestions out there as well as projects that attempt to address it (e.g. Chukwa).
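
    As a sketch of that nightly blobbing pass, FileUtil.copyMerge (present in
    Hadoop 1.x/2.x) concatenates a directory of small files into one file and
    can delete the sources; the paths here are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class NightlyCompactor {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Merge every small file under the day's ingest dir into one blob,
            // deleting the originals afterwards (deleteSource = true).
            boolean ok = FileUtil.copyMerge(
                fs, new Path("/ingest/2011-06-27"),       // source dir (placeholder)
                fs, new Path("/archive/2011-06-27.blob"), // merged output (placeholder)
                true, conf, null);
            System.exit(ok ? 0 : 1);
        }
    }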

    HTH,
    Matt

    -----Original Message-----
    From: saumitra.shahapure@gmail.com On Behalf Of Saumitra Shahapure
    Sent: Friday, June 24, 2011 12:12 PM
    To: common-user@hadoop.apache.org
    Subject: Queue support from HDFS

    [quoted text trimmed]
