Locks in M/R framework
Hi,

I have an HDFS folder and an M/R job that periodically updates it by replacing the data with newly generated data.

I have a different M/R job that, periodically or ad hoc, processes the data in the folder.

The second job naturally fails sometimes, when the data is replaced by newly generated data after the job plan, including the input paths, has already been submitted.

Is there an elegant solution?

My current thought is to query the JobTracker for running jobs and go over all the input files in each job's XML, so that the swap blocks until the input path is no longer an input path of any currently executing job.
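
For illustration, a minimal sketch of that idea against the old mapred JobClient API might look like the following; the mapred.input.dir key and the job.xml lookup are assumptions that depend on the input format and Hadoop version, so treat it as a starting point rather than a drop-in check.

// Rough sketch: ask the JobTracker for not-yet-completed jobs, load each job's
// submitted configuration (job.xml), and check whether a given directory shows
// up among its input paths. Assumes jobs that set mapred.input.dir (the old-API
// FileInputFormat does this).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class InputPathInUse {

    /** True if any running (not yet completed) job lists 'dir' among its input paths. */
    public static boolean inUse(Configuration conf, Path dir) throws Exception {
        JobClient jobClient = new JobClient(new JobConf(conf));
        String wanted = dir.getFileSystem(conf).makeQualified(dir).toString();

        for (JobStatus status : jobClient.jobsToComplete()) {
            RunningJob job = jobClient.getJob(status.getJobID());
            if (job == null) {
                continue;
            }
            // getJobFile() points at the submitted job.xml in the staging area.
            Path jobXmlPath = new Path(job.getJobFile());
            Configuration jobXml = new Configuration(false);
            jobXml.addResource(jobXmlPath.getFileSystem(conf).open(jobXmlPath));
            for (String input : jobXml.get("mapred.input.dir", "").split(",")) {
                if (input.isEmpty()) {
                    continue;
                }
                Path qualified = new Path(input).getFileSystem(conf).makeQualified(new Path(input));
                if (qualified.toString().equals(wanted)
                        || qualified.toString().startsWith(wanted + "/")) {
                    return true;
                }
            }
        }
        return false;
    }
}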


  • Tim Robertson at Aug 13, 2012 at 10:56 am
    How about introducing a distributed coordination and locking mechanism?
    ZooKeeper would be a good candidate for that kind of thing.



  • Harsh J at Aug 13, 2012 at 12:03 pm
    David,

    While ZK can solve this, locking may only make you slower. Let's try to
    keep it simple?

    Have you considered keeping two directories? One where the older data
    is moved to (by the first job, instead of replacing files), for
    consumption by the second job, which triggers by watching this
    directory?

    That is:
    MR Job #1 (the producer) moves the existing data to /path/b/<timestamp>
    and writes the new data to /path/a.
    MR Job #2 (the consumer) uses the latest /path/b/<timestamp> (or the whole
    set of timestamps available under /path/b at that point) as its input, and
    deletes it afterwards. Hence #2 can monitor this directory to trigger
    itself (a rough sketch follows at the end of this message).

    --
    Harsh J
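
    As an illustration only, a minimal sketch of this hand-off using the HDFS
    FileSystem API (the /path/a and /path/b names come from the description
    above; error handling and housekeeping are omitted):

    // Producer: publish the current /path/a as /path/b/<timestamp>, then start fresh.
    // Consumer: pick the newest snapshot under /path/b, process it, delete it afterwards.
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SnapshotHandOff {
        private static final Path LIVE = new Path("/path/a");
        private static final Path SNAPSHOTS = new Path("/path/b");

        /** Producer side: move the current live data into a new timestamped snapshot. */
        public static Path publish(FileSystem fs) throws Exception {
            Path snapshot = new Path(SNAPSHOTS, String.valueOf(System.currentTimeMillis()));
            fs.mkdirs(SNAPSHOTS);
            if (!fs.rename(LIVE, snapshot)) {   // a rename is atomic within one HDFS namespace
                throw new IllegalStateException("Could not move " + LIVE + " to " + snapshot);
            }
            fs.mkdirs(LIVE);                    // fresh directory for the next generation
            return snapshot;
        }

        /** Consumer side: newest snapshot to use as job input (null if none exists yet). */
        public static Path latestSnapshot(FileSystem fs) throws Exception {
            if (!fs.exists(SNAPSHOTS)) {
                return null;
            }
            Path latest = null;
            long best = -1;
            for (FileStatus s : fs.listStatus(SNAPSHOTS)) {
                long ts = Long.parseLong(s.getPath().getName());
                if (ts > best) {
                    best = ts;
                    latest = s.getPath();
                }
            }
            return latest;
        }
    }
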
  • David Ginzburg at Aug 13, 2012 at 12:22 pm
    Hi,

    My problem is that some of the jobs that read the folder are not under my
    control, e.g. a client submitting a Hive job.

    I was thinking of something like an mv(source, target, long timeout), which
    would block until the folder is no longer in use or the timeout is reached.

    Is it possible that this problem is not a common one?
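
    A minimal sketch of that mv(source, target, timeout) idea; the "in use"
    check is left as a caller-supplied predicate (for example the JobTracker
    poll described in the first message, or a ZooKeeper lock), and note that a
    check-then-rename is still racy without a real lock:

    import java.util.function.BooleanSupplier;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TimedMove {
        /** Waits until sourceInUse reports false, then renames; false on timeout or failed rename. */
        public static boolean mv(FileSystem fs, Path source, Path target,
                                 long timeoutMillis, BooleanSupplier sourceInUse)
                throws Exception {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (sourceInUse.getAsBoolean()) {
                if (System.currentTimeMillis() > deadline) {
                    return false;            // timed out; leave the data where it is
                }
                Thread.sleep(5000);          // poll interval, tune as needed
            }
            // A new reader could still appear between the last check and the rename.
            return fs.rename(source, target);
        }
    }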
  • Tim Robertson at Aug 13, 2012 at 1:01 pm
    Hi David,

    You are probably aware, but you can specify the location of the data in
    Hive, so if you can keep it simple and manage the directories yourself, you
    could rewrite the Hive metastore at the same time (e.g. redefine the tables
    in Hive, or go to the underlying metastore DB and change the SDS.location
    entry, but beware of race conditions); a sketch of the first option follows
    below.

    If your scenario goes beyond simple you might run into issues (deadlocks,
    race conditions, herding, etc.). If that happens, I'd still recommend
    sectioning off that problem into the likes of ZK. It's not particularly
    difficult to use and, other than being another service to run, is probably
    as easy to code against as a directory-managing solution would be; you might
    try https://github.com/Netflix/curator, which comes recommended by a
    colleague of mine, although I have no experience with it.

    Don't get me wrong, I am all for simple and fewer moving parts if it works -
    just wanted to suggest something other systems commonly use to overcome
    this. Your mv(...) example is classic ZK stuff (a Curator-based sketch
    follows below).

    Cheers,
    Tim
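
    For the Hive-location option, a rough sketch of repointing a table at a new
    directory through Hive's JDBC driver rather than editing SDS.location by
    hand (the driver class, connection string and table name here are examples,
    not taken from this thread):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RepointHiveTable {
        public static void repoint(String table, String newLocation) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
            Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
            try {
                Statement stmt = conn.createStatement();
                // ALTER TABLE ... SET LOCATION updates the metastore (SDS.location) for you.
                stmt.execute("ALTER TABLE " + table + " SET LOCATION '" + newLocation + "'");
                stmt.close();
            } finally {
                conn.close();
            }
        }
    }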
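
    And a minimal sketch of the mv-under-a-ZooKeeper-lock idea using Curator's
    InterProcessMutex recipe (package names below are from Apache Curator; the
    original Netflix releases used com.netflix.curator). Every cooperating
    producer and consumer has to take the same lock, so this only helps for the
    jobs you control:

    import java.util.concurrent.TimeUnit;

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LockedSwap {
        /** Renames source to target while holding a ZooKeeper lock keyed on the source path. */
        public static boolean swap(String zkQuorum, Path source, Path target,
                                   long timeoutMillis) throws Exception {
            CuratorFramework zk = CuratorFrameworkFactory.newClient(
                    zkQuorum, new ExponentialBackoffRetry(1000, 3));
            zk.start();
            try {
                InterProcessMutex lock =
                        new InterProcessMutex(zk, "/locks" + source.toUri().getPath());
                if (!lock.acquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    return false;                       // could not get the lock in time
                }
                try {
                    return FileSystem.get(new Configuration()).rename(source, target);
                } finally {
                    lock.release();
                }
            } finally {
                zk.close();
            }
        }
    }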

