Hi everyone,
I want to run a MR job continuously, because I have streaming data and I
want to analyze it all the time with my own algorithm. For example, suppose
you want to solve the word count problem. It's the simplest one :) If you
have multiple files and new files keep arriving, how do you handle it?
You could execute one MR job per file, but then you would have to do it
repeatedly. So what do you think?

Thanks
Best regards...

--
BURAK ISIKLI | http://burakisikli.wordpress.com


  • Athanasios Papaoikonomou at Dec 5, 2011 at 9:19 pm
    Hi Burak,

    Perhaps you could set up a cron job that executes your MR
    program periodically.
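    For example, a crontab entry along these lines (the jar path, class name,
    HDFS directories and the 15-minute schedule are only placeholders) would
    launch the job at a fixed interval:

        # run the WordCount job every 15 minutes; note that % must be escaped inside crontab
        */15 * * * * hadoop jar /opt/jobs/wordcount.jar WordCount /user/burak/input /user/burak/output_$(date +\%Y\%m\%d\%H\%M) >> /var/log/wordcount-cron.log 2>&1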

    Regards

  • Bejoy Ks at Dec 5, 2011 at 9:20 pm
    Burak
    If you have a continuous inflow of data, you can use Flume to aggregate
    small files into larger sequence files and push them onto HDFS once you
    have a substantial chunk of data (roughly an HDFS block size). Based on
    your SLAs, you then need to schedule your jobs using Oozie or a simple
    shell script. In very simple terms (a rough code sketch follows the list):
    - push the input data (e.g. from a Flume collector) into a staging HDFS dir
    - before triggering the job (hadoop jar), move the input from the staging
    dir to the main input dir
    - execute the job
    - archive the input and output into archive dirs (or any other dirs)
    - the output archive dir then serves as the source of the output data
    - delete the output dir and empty the input dir
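    A sketch of one pass of that cycle as a small Java driver using the HDFS
    FileSystem API (all directory names are only examples, and Map/Reduce are
    assumed to be the classes from the standard WordCount example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountCycle {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path staging = new Path("/data/wordcount/staging"); // filled by the Flume collector
            Path input   = new Path("/data/wordcount/input");
            Path output  = new Path("/data/wordcount/output");
            Path archive = new Path("/data/wordcount/archive/" + System.currentTimeMillis());

            // move the staged files into the main input dir before triggering the job
            fs.mkdirs(input);
            for (FileStatus file : fs.listStatus(staging)) {
                fs.rename(file.getPath(), new Path(input, file.getPath().getName()));
            }

            // execute the job (Map and Reduce are the classes from the WordCount example)
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCountCycle.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            job.waitForCompletion(true);

            // archive the input and output, leaving the dirs clear for the next run
            fs.mkdirs(archive);
            fs.rename(input, new Path(archive, "input"));
            fs.rename(output, new Path(archive, "output"));
        }
    }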

    Hope it helps!...

    Regards
    Bejoy.K.S
  • Burakkk at Dec 5, 2011 at 10:34 pm
    Athanasios Papaoikonomou, a cron job isn't useful for me, because I want
    to execute the MR job with the same algorithm, but different files arrive
    at different velocities.

    Both Storm and Facebook's Hadoop are designed for that, but I want to use
    the Apache distribution.

    Bejoy Ks, I have a continuous inflow of data, but I think I need a near
    real-time system.

    Mike Spreitzer, both the input and the output are continuous. The output
    isn't related to the input. All I want is for every incoming file to be
    processed by the same job and the same algorithm.
    For example, think about the word count problem. When you want to run
    WordCount, you implement this:
    http://wiki.apache.org/hadoop/WordCount

    But when the program reaches the line "job.waitForCompletion(true);", the
    job eventually ends. If you want to make it run continuously, what would
    you do in Hadoop without other tools?
    One more thing: assume that the input files are named filename_timestamp
    (e.g. filename_20111206_0030).

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
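
    Without extra tools, the only thing I can think of is a driver that never
    exits: it polls HDFS for new timestamped files, submits one job per batch,
    and moves processed files aside. A rough sketch (directory names and the
    poll interval are just placeholders; it assumes the same imports, Mapper
    and Reducer as the WordCount driver above):

    // Replaces the body of main() above; FileSystem and FileStatus come from
    // org.apache.hadoop.fs. All paths below are hypothetical.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path incoming  = new Path("/data/incoming");   // files named filename_YYYYMMDD_HHMM
    Path processed = new Path("/data/processed");
    fs.mkdirs(processed);

    while (true) {
        FileStatus[] newFiles = fs.listStatus(incoming);
        if (newFiles.length > 0) {
            Job job = new Job(conf, "wordcount");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            for (FileStatus f : newFiles) {
                FileInputFormat.addInputPath(job, f.getPath());
            }
            // one output dir per batch so runs never collide
            FileOutputFormat.setOutputPath(job,
                new Path("/data/output/" + System.currentTimeMillis()));
            job.waitForCompletion(true);

            // move processed files aside so the next poll does not pick them up again
            for (FileStatus f : newFiles) {
                fs.rename(f.getPath(), new Path(processed, f.getPath().getName()));
            }
        }
        Thread.sleep(60 * 1000);   // poll every minute (arbitrary)
    }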

    --
    BURAK ISIKLI | http://burakisikli.wordpress.com
  • Ravi teja ch n v at Dec 6, 2011 at 5:10 am
    Hi Burak,
    Bejoy Ks, I have a continuous inflow of data, but I think I need a near
    real-time system.

    Just to add to Bejoy's point:
    with Oozie, you can specify a data dependency for running your job.
    When a specific amount of data has arrived, you can configure Oozie to
    trigger your job. I think this will satisfy your requirement.
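    A coordinator definition along these lines (the frequency, paths, and
    dataset layout are only illustrative, and ${nameNode} is assumed to be
    defined in the job properties) would trigger the workflow each time the
    input dataset instance for the current period becomes available:

        <coordinator-app name="wordcount-coord" frequency="${coord:minutes(15)}"
                         start="2011-12-06T00:00Z" end="2012-12-06T00:00Z"
                         timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
          <datasets>
            <dataset name="input" frequency="${coord:minutes(15)}"
                     initial-instance="2011-12-06T00:00Z" timezone="UTC">
              <uri-template>${nameNode}/data/wordcount/staging/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
            </dataset>
          </datasets>
          <input-events>
            <data-in name="wcInput" dataset="input">
              <instance>${coord:current(0)}</instance>
            </data-in>
          </input-events>
          <action>
            <workflow>
              <app-path>${nameNode}/apps/wordcount-wf</app-path>
            </workflow>
          </action>
        </coordinator-app>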

    Regards,
    Ravi Teja

  • Mike Spreitzer at Dec 5, 2011 at 9:35 pm
    Burak,
    Before we can really answer your question, you need to give us some more
    information on the processing you want to do. Do you want output that is
    continuous or batched (if so, how)? How should the output at a given time
    be related to the input up to then and the previous outputs?

    Regards,
    Mike
