Hi,
For the first time I used Hive to load a couple of word-count data input
files into tables, with and without OVERWRITE. Both times the input file
in HDFS got deleted. Is that expected behavior? I couldn't find a
definitive answer on the Hive wiki.

hive> LOAD DATA INPATH '/user/vmplanet/output/part-00000' OVERWRITE
INTO TABLE t_word_count;

Env.: Hadoop 0.20.1 and the latest Hive on Ubuntu 9.10, running in VMware.

Thanks,
Shiva


  • Bill Graham at Jan 22, 2010 at 6:52 pm
    Hive doesn't delete the files upon load; it moves them to a location under
    the Hive warehouse directory. Try looking under
    /user/hive/warehouse/t_word_count.
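    To see exactly where the load moved the file, a minimal sketch (the path
    printed depends on your hive.metastore.warehouse.dir setting):

    -- The "Detailed Table Information" in the output includes the table's
    -- location on HDFS, e.g. .../user/hive/warehouse/t_word_count.
    DESCRIBE EXTENDED t_word_count;

    -- The moved file is still there and queryable through the table.
    SELECT * FROM t_word_count LIMIT 5;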
  • Zheng Shao at Jan 22, 2010 at 7:06 pm
    If you want the files to stay where they are, you can try "CREATE EXTERNAL
    TABLE" with a LOCATION clause (instead of CREATE TABLE + LOAD).

    Zheng
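    A minimal sketch of this approach, assuming the word-count output is
    tab-delimited with two columns (the table name, column names, types, and
    delimiter here are assumptions, not from the thread). LOCATION takes a
    directory, so it points at the whole job output directory rather than a
    single part file:

    -- External table: Hive reads the files in place and does not move them;
    -- a later DROP TABLE removes only the metadata, not the data.
    CREATE EXTERNAL TABLE t_word_count_ext (word STRING, cnt INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/user/vmplanet/output';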
  • Shiva at Jan 22, 2010 at 8:10 pm
    I can try that. Here is what I am trying to do.

    Load some fact data from a file (say weblogs moved to HDFS after some
    cleanup and transformation) and then do summarization at a daily or weekly
    level. In that case, I would like to create one fact table that gets
    loaded with daily data, and bring in dimensional data from MySQL to
    perform the summarization.

    I appreciate any input on this technique, its performance, and how I can
    get dimensional data into Hive (from MySQL -> file -> HDFS -> Hive).
    Thanks,
    Shiva
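    One possible shape for this workflow, sketched in HiveQL; all table names,
    columns, and paths below are made up for illustration, and the dimension
    file is assumed to be a tab-delimited dump exported from MySQL and copied
    to HDFS:

    -- Daily-partitioned fact table; each day's cleaned weblog extract is
    -- loaded into its own partition.
    CREATE TABLE fact_weblog (user_id INT, url STRING, bytes BIGINT)
    PARTITIONED BY (ds STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/user/vmplanet/weblogs/2010-01-22'
    OVERWRITE INTO TABLE fact_weblog PARTITION (ds='2010-01-22');

    -- Dimension table over the file exported from MySQL; external, so the
    -- exported file stays where it was put on HDFS.
    CREATE EXTERNAL TABLE dim_user (user_id INT, country STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/vmplanet/dim/user';

    -- Daily summarization joining facts against the dimension.
    SELECT d.country, COUNT(1) AS hits, SUM(f.bytes) AS total_bytes
    FROM fact_weblog f JOIN dim_user d ON (f.user_id = d.user_id)
    WHERE f.ds = '2010-01-22'
    GROUP BY d.country;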
  • Eric Sammer at Jan 22, 2010 at 9:53 pm

    Shiva:

    This is very common. I use Hive to do something very similar.

    Cloudera has a tool called sqoop that will "export" MySQL tables to
    files on HDFS that Hive can understand. Once there, you can easily join
    the data in your Hive queries.

    http://www.cloudera.com/hadoop-sqoop

    Sqoop is smarter than just doing an export to a local file system and
    then copying to HDFS and should save you a fair amount of time and
    effort. Check out the link.

    Hope this helps.
    --
    Eric Sammer
    eric@lifeless.net
    http://esammer.blogspot.com
  • Shiva at Jan 25, 2010 at 4:48 pm
    Eric,
    Thanks for the details. I took a quick look at the link and it seems like
    a tool that would help me here. Do I need to download the whole Cloudera
    Distribution for Hadoop <http://www.cloudera.com/hadoop> just to get sqoop?
    I already have Hadoop, Hive and Pig set up.
    I appreciate your input,
    Shiva
  • Eric Sammer at Jan 25, 2010 at 5:05 pm

    Shiva:

    As far as I know, sqoop is only available bundled with Cloudera's
    distribution (CDH). That said, you may be able to download the tarball
    version of CDH and just pull sqoop out of it. I've never tried this,
    though, and I don't know if Sqoop depends on anything specific in CDH.
    Someone from Cloudera might be able to fill in the blanks here.

    For what it's worth, CDH2 is a great distribution with some nice
    packaging and configuration layout on top of the ASF distro along with
    some helpful patches. It's worth checking out.

    The tarball version of CDH is available at:
    http://archive.cloudera.com/cdh/testing/hadoop-0.20.1+152.tar.gz

    Best of luck!
    --
    Eric Sammer
    eric@lifeless.net
    http://esammer.blogspot.com
  • Shiva at Jan 26, 2010 at 5:18 pm
    Eric,
    I gave it a shot by just pulling sqoop out of Cloudera's distribution and
    installing it in my Hadoop environment. It worked for a few queries, like
    listing tables and databases, but not for importing data; there are
    dependencies on other Java files and patches, and Hadoop would need to be
    recompiled with them. So for now I installed CDH, and sqoop is working
    well in initial tests. I will continue testing this week. I heard from
    Aaron (Cloudera) that sqoop will be part of the 0.21 release.
    Thanks,
    Shiva
  • Shiva at Jan 22, 2010 at 7:59 pm
    Yep, I found them there. Why would Hive move them?
    Cheers,
    Shiva
  • Edward Capriolo at Jan 22, 2010 at 8:03 pm
    There are only two real options:

    1) move
    2) copy

    Hive does the move; it's fast. If you want a copy, copy the file before the load.

    If you do not want to move or copy, external tables are your last option.

    Edward
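    A sketch of the copy-before-load route; the staging path below is made up
    for illustration, and the copy itself is done outside HiveQL (for example
    with "hadoop fs -cp" from a shell):

    -- Copy the job output to a staging path first, e.g.:
    --   hadoop fs -cp /user/vmplanet/output/part-00000 /user/vmplanet/staging/
    -- then load from the copy, so the original part file stays in place.
    LOAD DATA INPATH '/user/vmplanet/staging/part-00000'
    OVERWRITE INTO TABLE t_word_count;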
