You are probably aware, but you can specify the location of the data in
Hive, so if you can keep it simple and manage directories you could rewrite
the Hive metastore at the same time (e.g. redefine the tables for Hive or
just go to the underlying Hive DB and change the SDS.location entry, but
beware of race conditions).
If your scenarios go beyond simple you might run into issues (deadlocks,
race conditions, herding etc). If that happens, I'd still recommend
sectioning off that problem into the likes of ZK. It's not particularly
difficult to use and other than another service running, is probably as
easy to code against as a directory managing solution would be; you might
which comes recommended from a
colleague of mine although I have no experience with it.
Don't get me wrong, I am all for simplicity and fewer moving parts if it works -
just wanted to suggest something other systems are commonly using to
overcome this. Your mv(...) example is classic ZK stuff.
On Mon, Aug 13, 2012 at 2:22 PM, David Ginzburg wrote:
My problem is that some of the jobs that read the folder are not under my
control, i.e. a client submits a Hive job.
I was thinking of something like an mv(source, target, long timeout) which
will block until the folder is no longer in use or the timeout is reached.
Is it possible that this problem is not a common one?
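To make the mv(source, target, long timeout) idea concrete, here is a minimal sketch in Python against the local filesystem as a stand-in for HDFS. The in_use callback is hypothetical: it stands for whatever check reports that a running job still reads the folder. The function name, the poll_interval parameter, and the decision to delete the old target before the rename are assumptions for illustration, not an existing Hadoop API.

```python
import os
import shutil
import time
from typing import Callable


def mv_with_timeout(source: str, target: str, timeout: float,
                    in_use: Callable[[str], bool],
                    poll_interval: float = 0.1) -> bool:
    """Replace `target` with `source` once `in_use(target)` is False.

    Returns True if the swap happened, False if `timeout` seconds
    passed while the target was still reported as in use.
    `in_use` is a hypothetical caller-supplied check.
    """
    deadline = time.monotonic() + timeout
    while in_use(target):
        if time.monotonic() >= deadline:
            return False  # timed out; nothing was moved
        time.sleep(poll_interval)
    # NOTE: a reader may still start between the check above and the
    # swap below. Polling alone does not close that window.
    if os.path.isdir(target):
        shutil.rmtree(target)  # drop the old data
    os.rename(source, target)  # install the new data
    return True
```

The gap between the check and the rename is the race condition mentioned elsewhere in this thread. Polling narrows it but does not remove it, which is the argument for a real lock service.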
Date: Mon, 13 Aug 2012 17:33:02 +0530
Subject: Re: Locks in M/R framework
While ZK can solve this, locking may only make you slower. Let's try to
keep it simple?
Have you considered keeping two directories? One where the older data
is moved to (by the first job, instead of replacing files), for
consumption by the second job, which triggers by watching this directory?
MR Job #1 (the producer) moves existing data to /path/b/timestamp,
and writes new data to /path/a.
MR Job #2 (the consumer) uses the latest /path/b/timestamp (or the whole
set of timestamps available under /path/b at that point) for its
input, and deletes it afterwards. Hence #2 can monitor this
directory to trigger itself.
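The handoff above can be sketched as follows, again in Python on the local filesystem as a stand-in for HDFS. All names here (producer_publish, consumer_take_latest, the process callback, and staging the new data in a separate directory before renaming it into place) are assumptions for illustration. A real job would do the equivalent renames through the HDFS client.

```python
import os
import shutil
import time
from typing import Callable, Optional


def producer_publish(live_dir: str, archive_root: str, new_data_dir: str,
                     timestamp: Optional[int] = None) -> Optional[str]:
    """Job #1: move current data to archive_root/<timestamp>, then
    install the freshly generated data at live_dir.

    Returns the snapshot path, or None if there was no old data.
    """
    ts = int(time.time() * 1000) if timestamp is None else timestamp
    os.makedirs(archive_root, exist_ok=True)
    snapshot = None
    if os.path.isdir(live_dir):
        snapshot = os.path.join(archive_root, str(ts))
        os.rename(live_dir, snapshot)  # old data is moved, not overwritten
    os.rename(new_data_dir, live_dir)
    return snapshot


def consumer_take_latest(archive_root: str,
                         process: Callable[[str], None]) -> Optional[str]:
    """Job #2: process the newest snapshot under archive_root, then delete it.

    Returns the snapshot path that was handled, or None if none exists.
    """
    if not os.path.isdir(archive_root):
        return None
    snapshots = sorted(os.listdir(archive_root), key=int)
    if not snapshots:
        return None
    latest = os.path.join(archive_root, snapshots[-1])
    process(latest)        # the consumer's input never changes under it
    shutil.rmtree(latest)  # clean up after consumption
    return latest
```

Because the consumer only ever reads a timestamped snapshot that the producer no longer touches, neither side needs a lock for the common case.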
On Mon, Aug 13, 2012 at 4:22 PM, David Ginzburg wrote:
I have an HDFS folder and an M/R job that periodically updates it by replacing
the data with newly generated data.
I have a different M/R job that periodically or ad hoc processes the data.
The second job, naturally, sometimes fails when the data is replaced by
newly generated data after the job plan, including the input paths, has
already been determined.
Is there an elegant solution?
My current thought is to query the jobtracker for running jobs and go over
all the input files in the job XML, to know whether the swap should block until
the input path is no longer among the input paths of any currently executing job.