Hey all,

We have many projects that use Hive. Now that Hive is in Maven, I really
wanted to be able to embed Hive in my application for unit testing. I know
some work has been done to unit test UDFs with mocks etc., but that is not
the kind of testing I am looking for. I have workflows that need to chain
together tasks, and some of those tasks run inside Hive while others run
outside of it. Having dealt with the Hive build and test infrastructure a
bit, I understand why it is the way it is. However, I find it very complex,
and in some cases I don't really see much need for it.

It took me some time, but I was able to create a Maven project that
includes all the Hive dependencies and their transitive dependencies like
Thrift, Derby, etc. Getting the classpath right is only half the battle,
though.
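
For reference, the core of the POM looked something like the sketch below.
The hive-exec coordinates match the local snapshot referenced later in this
mail; the Hadoop artifact versions are an assumption from my setup, so
adjust to taste (hadoop-test is the artifact that provides HadoopTestCase):

<dependencies>
  <!-- hive-exec pulls much of the stack (hive-common, hive-metastore,
       hive-serde) transitively; Thrift and Derby may still need explicit
       entries depending on how the transitive graph resolves. -->
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>0.7.1-SNAPSHOT</version>
  </dependency>
  <!-- assumed Hadoop version; hadoop-test supplies HadoopTestCase -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-test</artifactId>
    <version>0.20.2</version>
    <scope>test</scope>
  </dependency>
</dependencies>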


import java.io.IOException;

import junit.framework.Assert;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.processors.CommandProcessor;
import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory;
import org.apache.hadoop.hive.ql.session.SessionState;
import org.apache.hadoop.mapred.HadoopTestCase;
import org.apache.hadoop.mapred.JobConf;

public class SimpleTest extends HadoopTestCase {

  private static final Path ROOT_DIR = new Path("testing");
  private SessionState ss;

  public SimpleTest() throws IOException {
    // Local mapreduce and local filesystem, one tracker of each.
    super(HadoopTestCase.LOCAL_MR, HadoopTestCase.LOCAL_FS, 1, 1);
  }

  private Path getDir(Path dir) {
    if (isLocalFS()) {
      String localPathRoot = System
          .getProperty("test.build.data", "/tmp").replace(' ', '+');
      dir = new Path(localPathRoot, dir);
    }
    return dir;
  }

  public void setUp() throws Exception {
    super.setUp();
    Path rootDir = getDir(ROOT_DIR);
    Configuration conf = createJobConf();
    FileSystem fs = FileSystem.get(conf);
    fs.delete(rootDir, true);
    // Wipe the Derby metastore and the warehouse so each run starts clean.
    Path metastorePath = new Path("/tmp/metastore_db");
    fs.delete(metastorePath, true);
    Path warehouse = new Path("/tmp/warehouse");
    fs.delete(warehouse, true);
    fs.mkdirs(warehouse);

    SessionState.initHiveLog4j();
    ss = new SessionState(new HiveConf(SimpleTest.class));
    SessionState.start(ss);
  }

  public void testA() {
    JobConf c = createJobConf();
    Assert.assertEquals(0, doHiveCommand("create table a (id int)", c));
    Assert.assertEquals(0, doHiveCommand("create table b (id int)", c));
    // Creating a table that already exists should fail; 9 is the error
    // code Hive returns here.
    Assert.assertEquals(9, doHiveCommand("create table a (id int)", c));
    Assert.assertEquals(0, doHiveCommand("select count(1) from a", c));
  }

  public int doHiveCommand(String cmd, Configuration h2conf) {
    int ret = 0;
    String cmdTrimmed = cmd.trim();
    String[] tokens = cmdTrimmed.split("\\s+");
    String cmdArgs = cmdTrimmed.substring(tokens[0].length()).trim();

    // Look up the processor for the first token: SQL statements map to
    // Driver, lighter commands like "set" or "add" map to other processors.
    HiveConf c = (HiveConf) ss.getConf();
    CommandProcessor proc = CommandProcessorFactory.get(tokens[0], c);

    if (proc instanceof Driver) {
      // The Driver wants the whole statement.
      ret = proc.run(cmd).getResponseCode();
    } else {
      // Other processors want only the arguments after the command word.
      ret = proc.run(cmdArgs).getResponseCode();
    }
    return ret;
  }
}

This code does not run. Two really annoying things stop it from running.

1) You have to set the HADOOP_HOME environment variable to point at a
bin/hadoop.
I see why this is needed, but it is kind of bogus: we are already inside a
Hadoop environment, since we inherited from a Hadoop test case.
ExecDriver is in the Hadoop environment as well, yet it insists on forking
a process.
This is really annoying because environment variables are read-only in
Java, so this had to be set in the IDE. (A test-only reflection workaround
is sketched after the configuration below.)

2) Before the fork, Hive does some logic to figure out the path to the
hive-exec jar, which does not seem to work right. You should be able to set
HiveConf.ConfVars.HIVEJAR.varname, but it seems that variable has no
effect, since HiveConf.getJar() somehow caches that information on startup.
Again, you have to bake a static path into your configuration file. (Or
maybe you don't, but I tried setting this in configuration objects all over
the place with no effect.)

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

  <property>
    <name>hive.jar.path</name>
    <value>/home/edward/.m2/repository/org/apache/hive/hive-exec/0.7.1-SNAPSHOT/hive-exec-0.7.1-SNAPSHOT.jar</value>
    <description>location of the hive-exec jar used when forking a job</description>
  </property>

</configuration>
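
Getting back to (1): since HADOOP_HOME has to be visible before Hive
computes the path to bin/hadoop, the only pure-Java workaround I know of is
the classic reflection hack against the map behind System.getenv(). It
depends on Sun JVM internals (the "m" field on the unmodifiable wrapper),
so treat it as a test-only kludge rather than a real fix; EnvHack is just a
made-up holder class:

import java.lang.reflect.Field;
import java.util.Map;

// Test-only kludge: force an entry into the JVM's cached copy of the
// process environment so later System.getenv("HADOOP_HOME") lookups see
// it. Relies on Sun/Oracle JVM internals; may break on other runtimes.
public class EnvHack {
  @SuppressWarnings("unchecked")
  public static void setEnv(String name, String value) throws Exception {
    Map<String, String> env = System.getenv();
    Field m = env.getClass().getDeclaredField("m");
    m.setAccessible(true);
    ((Map<String, String>) m.get(env)).put(name, value);
  }
}

// e.g. as the first thing in setUp():
// EnvHack.setEnv("HADOOP_HOME", "/opt/hadoop-0.20.2");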

Anyway, with these two hardcoded settings you can bring up an in-process
Hive, or an in-process hive_service instance.
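
By an in-process hive_service I mean constructing the Thrift handler
directly instead of talking to it over a socket. A minimal sketch, assuming
the 0.7-era org.apache.hadoop.hive.service API (the class name
EmbeddedHiveService is made up):

import java.util.List;

import org.apache.hadoop.hive.service.HiveInterface;
import org.apache.hadoop.hive.service.HiveServer;

public class EmbeddedHiveService {
  public static void main(String[] args) throws Exception {
    // HiveServerHandler is the Thrift service implementation; constructing
    // it directly gives an embedded Hive client with no socket involved.
    HiveInterface client = new HiveServer.HiveServerHandler();
    client.execute("create table if not exists a (id int)");
    client.execute("select count(1) from a");
    List<String> rows = client.fetchAll();
    System.out.println(rows);
  }
}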

So, what does everyone think can be done to get Hive over the hump and
remove the Hadoop forking and the jar hardcoding? Or is the problem just
that I set up the Hive environment improperly in my test case, in terms of
object initialization?

Edward
