The biggest hurdle in Hadoop adoption is that there is no easy way to set up
a pseudo cluster on a developer's machine. People are steering off course,
building additional simulation and validation tools. In practice, those
tools don't provide nearly enough insight into things that could go wrong in
a real cluster. For example, if a Pig job uses HBaseStorage to access data,
there is not a single hint that hbase-site.xml needs to be in a jar file on
the Pig classpath so that Pig can distribute the HBase environment to the MR
cluster for the job to work. Regardless of how good the simulation tools
are, they are limited to a siloed environment. What we can do to improve
integration is to provide a set of installable packages that integrate well
across the Hadoop ecosystem on the developer's machine.
This is similar to the situation for the vast majority of developers on the
LAMP stack: people don't start by compiling their own Apache server and
MySQL server to begin development and testing. They start by installing
binary packages and getting their work tested on real software.
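To make the HBaseStorage example above concrete, here is a rough sketch (in
Java, assuming the standard HBase client API; the class name and the printed
key are just for illustration) of why the classpath matters: the HBase
client resolves its settings from whatever hbase-site.xml it finds on the
classpath, so if Pig does not ship that file with the job, the task JVMs on
the MR cluster fall back to the defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseClasspathCheck {
  public static void main(String[] args) {
    // HBaseConfiguration.create() layers any hbase-site.xml found on the
    // classpath over the built-in defaults.
    Configuration conf = HBaseConfiguration.create();
    // Without hbase-site.xml on the classpath this falls back to the
    // default (typically localhost), which is why HBaseStorage tasks on
    // the MR cluster cannot find the real ZooKeeper quorum.
    System.out.println("hbase.zookeeper.quorum = "
        + conf.get("hbase.zookeeper.quorum", "localhost"));
  }
}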
Hence, Hadoop developers bear the responsibility of testing release
packages, while end-user developers should be responsible for certifying the
integrated system on their own clusters. There is already a list of tools
to validate a cluster, such as TeraSort or GridMix 1, 2, and 3.
I think the bigger concern is that the Hadoop ecosystem does not have a
standard method of linking dependencies. HBase depends on ZooKeeper, and Pig
depends on Hadoop and HBase. Then Pig decided to bundle the hadoop-core jar
inside its own jar file. Chukwa depends on Pig, HBase, Hadoop, and
ZooKeeper. The version incompatibility is probably what drives people nuts.
Hence, there is a new proposal on how to integrate across the Hadoop
ecosystem. I urge project owners to review the proposal and provide
feedback.
The proposal is located at:
https://issues.apache.org/jira/secure/attachment/12470823/deployment.pdf
The related JIRAs are:
https://issues.apache.org/jira/browse/HADOOP-6255
https://issues.apache.org/jira/browse/PIG-1857
There are plans to file more JIRAs for related projects. The integration
would also be a lot easier if all related projects used Maven for
dependency management.
Regards,
Eric
On 2/17/11 9:33 AM, "Konstantin Boudnik" wrote:
On Thu, Feb 17, 2011 at 05:45, Ian Holsman wrote:
I'm not sure it makes sense to put all the testing packages under a
different umbrella from the code they test.
While there might be commonalities in building a test harness, I would think
that each testing tool would need to have deep knowledge of the internals of
the tool it is testing. As such, it would need someone with the experience
to code it.
That's pretty much true indeed if you are talking about tests for a
project or for closely tied projects, such as Herriot in Hadoop.
Speaking of tools, there are some benefits though. Say, PigUnit and
MRUnit are both xUnit frameworks. The former allows you to run Pig
jobs in local and cluster mode. The latter is for validating MR jobs
without the need to fire up a cluster.
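To give a feel for the MRUnit side, here is a rough sketch of a mapper test
that runs without a cluster (the word-count mapper and all names below are
made up for illustration, not taken from any project):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;

public class WordCountMapperTest {

  // Hypothetical mapper under test.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        context.write(new Text(token), ONE);
      }
    }
  }

  // In a real project this would be a JUnit test method.
  public void testMapper() throws java.io.IOException {
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new TokenMapper())
        .withInput(new LongWritable(0), new Text("hadoop pig"))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .withOutput(new Text("pig"), new IntWritable(1))
        .runTest();
  }
}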
I don't see what advantage combining PigUnit and, say, MRUnit would bring,
for example.
Don't you think a Pig user would benefit if Pig scripts could be tested
against MRUnit, which gives you a flavor of a cluster environment without
one? Now, do you think it is likely that someone will go to great lengths
to make such an effort and build such a bridge right now?
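And the PigUnit side of the same idea looks roughly like this (the script
path, the alias, and the expected tuples are made up for illustration):

import org.apache.pig.pigunit.PigTest;

public class WordCountScriptTest {
  // Runs the script with Pig's local execution engine; no cluster needed.
  public void testScript() throws Exception {
    PigTest test = new PigTest("src/test/pig/wordcount.pig");
    // Assert the tuples produced by the "counts" alias.
    test.assertOutput("counts", new String[] { "(hadoop,1)", "(pig,2)" });
  }
}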
Cos
On Feb 16, 2011, at 2:50 PM, Konstantin Boudnik wrote:
Steve.
If the project under discussion provides a common harness where such a test
artifact (think of a Maven artifact, for example) will click in and be
executed automatically, with all needed tools and dependencies resolved for
you - would that be appealing for the end users' cause?
As Joep said, this "...will reduce the effort to take any (set of) changes
from development into production." Take it one step further: when your
cluster is 'assembled' you need to validate it (on top of a concrete OS,
etc.); is it desirable to follow an N-step process to bring about whatever
testing workload you need, or would you prefer to simply do something like:
wget http://workloads.internal.mydomain.com/stackValidations/v12.4.pom \
  && mvn verify
and check the results later on?
These are going to be the same tools that developers use for their tasks,
although the worksets will be different. So what?
Cos
On Wed, Feb 16, 2011 at 11:37 AM, Steve Loughran wrote:
On 15/02/11 21:58, Konstantin Boudnik wrote:
While the MRUnit discussion draws to its natural conclusion, I would like
to bring up another point which might be well aligned with that
discussion. Patrick Hunt brought up this idea earlier today, and I
believe it should be elaborated further.
A number of testing projects, both for Hadoop and Hadoop-related
components, have been brought to life over the last year or two. Among
those are MRUnit, PigUnit, YCSB, Herriot, and perhaps a few more. They all
focus on more or less the same problem, e.g. validation of Hadoop or
on-top-of-Hadoop components, or application-level testing for Hadoop.
However, the fact that they are all spread across a wide variety of
projects seems to confuse/mislead Hadoop users.
How about incubating a bigger Hadoop (Pig, Oozie, HBase) testing
project which will take care of the development and support of common
(where possible) tools, frameworks, and the like? Please feel free to
share your thoughts :)
--
I think it would be good, though specific projects will need/have their
own testing needs - I'd expect the focus for testing redistributables to
be more on helping Hadoop users test their stuff against subsets of data,
rather than on the hadoop-*-dev problem of "stressing the Hadoop stack once
your latest patch is applied".
That said, the whole problem of qualifying an OS, Java release, and
cluster is something we'd expect most end-user teams to have to do
- right now TeraSort is the main stress test.