Linking against Hive in Hadoop development tree
Hi all,

For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition
to uploading data into HDFS and using MapReduce to load/transform the data,
I'd like to integrate more closely with Hive. Specifically, to run the
CREATE TABLE statements needed to automatically inject table definitions into
Hive's metastore for the data files that sqoop loads into HDFS. Doing this
requires linking against Hive in some way (either directly by using one of
their API libraries, or "loosely" by piping commands into a Hive instance).
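(For concreteness, a minimal sketch of the kind of DDL script this would
generate; the table name, columns, delimiter, and HDFS path below are
invented for illustration, not sqoop's actual output:)

    // Hypothetical sketch only: table name, columns, field delimiter, and
    // HDFS path are invented for illustration.
    public class HiveDdlSketch {
      public static String buildScript() {
        String table = "employees";                // hypothetical table
        String hdfsDir = "/user/aaron/employees";  // hypothetical import dir
        return "CREATE TABLE " + table + " (id INT, name STRING) "
             + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\001' "
             + "STORED AS TEXTFILE;\n"
             + "LOAD DATA INPATH '" + hdfsDir + "' INTO TABLE " + table + ";\n";
      }
    }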

In either case, there's a dependency there. I was hoping someone on this
list with more Ivy experience than I have knows the best way to make this
happen. Hive isn't in the maven2 repository that Hadoop pulls most of its
dependencies from, so it might be necessary for sqoop to have access to a
full build of Hive. It doesn't seem like a good idea to check that binary
distribution into Hadoop svn, but I'm not sure what the most expedient
alternative is. Is it acceptable to just require that developers who wish
to compile/test/run sqoop have a separate standalone Hive deployment and a
proper HIVE_HOME variable? This would keep our source repo "clean." The
downside is that it makes it difficult to test Hive-specific integration
functionality with Hudson, and it requires extra leg-work from developers.
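(If we went the standalone-deployment route, the build could at least fail
fast when Hive is missing. A minimal sketch, assuming only that HIVE_HOME
points at an unpacked Hive distribution containing bin/hive; the class name
is hypothetical:)

    import java.io.File;

    // Validates a developer's local Hive deployment before running
    // Hive-specific build targets or tests.
    public class HiveHomeCheck {
      public static File locateHiveBinary() {
        String hiveHome = System.getenv("HIVE_HOME");
        if (hiveHome == null || hiveHome.length() == 0) {
          throw new IllegalStateException("HIVE_HOME is not set");
        }
        File hiveBin = new File(new File(hiveHome, "bin"), "hive");
        if (!hiveBin.exists()) {
          throw new IllegalStateException("No hive launcher under " + hiveHome);
        }
        return hiveBin;
      }
    }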

Thanks,
- Aaron Kimball

  • Edward Capriolo at May 15, 2009 at 10:01 pm

    Aaron,

    I have a similar situation: I am using the GPL geo-ip library as a Hive
    UDF, and due to Apache/GPL licensing issues the code would not be
    compatible.

    Currently my build process references all of the Hive lib/*.jar files.
    It does not really need all of them, but since I'm not exactly sure
    which ones I need, I reference them all.

    I was thinking one option is to use Git; that way I can integrate my
    patch into my own fork of Hive.

    I see your problem, though: you have a few Hive entry points:
    1) JDBC
    2) Hive Thrift Server
    3) scripting
    4) Java API

    The JDBC and Thrift interfaces should be the lightest, in that a few
    jar files would make up the entry point rather than the entire Hive
    distribution.
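    (For what it's worth, a hedged sketch of the JDBC route; the driver
    class and URL follow the standalone Hive JDBC driver of this era and may
    differ between versions, and the host, port, and statement are invented:)

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        // Connects to a running Hive server over JDBC; only the Hive JDBC
        // jars (plus their dependencies) need to be on the classpath.
        public class HiveJdbcSketch {
          public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            stmt.execute("CREATE TABLE example (id INT, name STRING)"); // illustrative
            stmt.close();
            con.close();
          }
        }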

    Although, now that Hive has had two releases, maybe Hive should be in
    Maven. With that, Hive could be an optional or a mandatory Ant target
    for sqoop.
  • Owen O'Malley at May 15, 2009 at 10:07 pm

    On May 15, 2009, at 2:05 PM, Aaron Kimball wrote:

    In either case, there's a dependency there.
    You need to split it so that there are no cycles in the dependency
    tree. In the short term it looks like:

    avro:
    core: avro
    hdfs: core
    mapred: hdfs, core
    hive: mapred, core
    pig: mapred, core

    Adding a dependence from core to hive would be bad. To integrate with
    Hive, you need to add a contrib module to Hive that adds it.

    -- Owen
  • Aaron Kimball at May 15, 2009 at 10:26 pm
    Yikes. So part of sqoop would wind up in one source repository, and part in
    another? This makes my head hurt a bit.

    I'm also not convinced how that helps. So if I write (e.g.,)
    o.a.h.sqoop.HiveImporter and check that into a contrib module in the Hive
    project, then the main sqoop program (o.a.h.sqoop.Sqoop) still needs to
    compile against/load at runtime o.a.h.s.HiveImporter. So the net result is
    the same: building/running a cohesive program requires fetching resources
    from the hive repo and compiling them in.

    For the moment, though, I'm finding that the Hive JDBC interface is
    misbehaving more than I care to wrangle with. My current solution is to
    generate script files and run them with "hive -f <tmpfilename>", which
    doesn't require any compile-time linkage. So maybe this is a non-issue
    for the moment.
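    (A minimal sketch of that workaround, assuming only that a hive launcher
    is on the PATH; the class and method names are invented:)

        import java.io.BufferedReader;
        import java.io.File;
        import java.io.FileWriter;
        import java.io.InputStreamReader;

        // Writes the generated HiveQL to a temp file and shells out to
        // "hive -f", so there is no compile-time dependency on Hive.
        public class HiveScriptRunner {
          public static int runScript(String hiveQl) throws Exception {
            File script = File.createTempFile("hive-import", ".q");
            script.deleteOnExit();
            FileWriter out = new FileWriter(script);
            out.write(hiveQl);
            out.close();
            Process p = new ProcessBuilder("hive", "-f", script.getAbsolutePath())
                .redirectErrorStream(true).start();
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(p.getInputStream()));
            for (String line; (line = reader.readLine()) != null; ) {
              System.out.println(line);   // relay Hive's output
            }
            return p.waitFor();           // 0 on success
          }
        }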

    - Aaron
  • Owen O'Malley at May 20, 2009 at 12:39 pm

    On May 15, 2009, at 3:25 PM, Aaron Kimball wrote:

    Yikes. So part of sqoop would wind up in one source repository, and
    part in another? This makes my head hurt a bit.
    I'd say rather that Sqoop is in Mapred and the adapter to Hive is in Hive.
    I'm also not convinced how that helps.
    Clearly, what you need to arrange is to not have a compile-time
    dependence on Hive. We don't want cycles in the dependence tree, so you
    need to figure out how to make the adapter for Hive a plugin rather
    than a part of the Sqoop core.

    -- Owen
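    (A sketch of the split Owen describes, assuming a hypothetical interface
    owned by Sqoop and an adapter class living in Hive contrib; the names
    and the reflective lookup are illustrative, not an agreed design:)

        // Lives with Sqoop core; no Hive types appear here.
        public interface HiveImport {
          void createTable(String tableName, String hdfsPath) throws Exception;
        }

        // Sqoop core loads the adapter reflectively, so it compiles and runs
        // without Hive on the classpath. The implementation class, which
        // would live in Hive contrib and link against Hive, is named only as
        // a string.
        class HiveImportLoader {
          static HiveImport load() throws Exception {
            Class<?> impl =
                Class.forName("org.apache.hadoop.hive.contrib.sqoop.HiveImportImpl");
            return (HiveImport) impl.newInstance();
          }
        }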
  • Ashish Thusoo at May 20, 2009 at 5:29 pm
    You could either do what Owen suggested and put the plugin in hive
    contrib, or you could just put the whole thing in hive contrib, as then
    you would have access to all the lower-level APIs (core, hdfs, hive,
    etc.). Owen's approach makes a lot of sense if you think the hive
    dependency is a loose one and you would have plugins for other systems
    to achieve your goal. However, if this is a hard dependency, then
    putting the whole thing in hive contrib makes more sense. Either
    approach is fine, depending upon your goals.

    Ashish

  • Aaron Kimball at May 20, 2009 at 5:44 pm
    I've worked around needing any compile-time dependencies for now. :) No
    longer an issue.

    - Aaron

  • Tom White at May 20, 2009 at 10:07 am

    On Fri, May 15, 2009 at 11:06 PM, Owen O'Malley wrote:
    On May 15, 2009, at 2:05 PM, Aaron Kimball wrote:

    In either case, there's a dependency there.
    You need to split it so that there are no cycles in the dependency tree. In the short term it looks like:

    avro:
    core: avro
    hdfs: core
    mapred: hdfs, core
    Why does mapred depend on hdfs? MapReduce should only depend on the
    FileSystem interface, shouldn't it?

    Tom
  • Owen O'Malley at May 20, 2009 at 12:34 pm

    On May 20, 2009, at 3:07 AM, Tom White wrote:

    Why does mapred depend on hdfs? MapReduce should only depend on the
    FileSystem interface, shouldn't it?
    Yes, I should have been consistent. In terms of compile-time
    dependences, mapred only depends on core.

    -- Owen
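    (To illustrate Tom's point, code written against the FileSystem
    abstraction, as in the sketch below, compiles against core only; whether
    a path resolves to HDFS, the local filesystem, or S3 is decided at
    runtime by the configuration. The class and path are illustrative:)

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Depends only on the FileSystem interface in core, not on hdfs.
        public class FsInterfaceExample {
          public static boolean outputExists(Configuration conf, String dir)
              throws Exception {
            Path p = new Path(dir);
            FileSystem fs = p.getFileSystem(conf);  // resolved from conf at runtime
            return fs.exists(p);
          }
        }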

Discussion Overview
group: common-user
categories: hadoop
posted: May 15, '09 at 9:05p
active: May 20, '09 at 5:44p
posts: 9
users: 5
website: hadoop.apache.org...
irc: #hadoop
