FAQ (Hive user mailing list, July 2010)
Hi,

I am new to Hive and Hadoop in general. I have a table in Oracle with
millions of rows that I'd like to export into HDFS so that I can run some
Hive queries against it. My first question: is it recommended to export the
entire table as a single file (roughly 5 GB), or as several smaller files
(say, ten files of 500 MB each)? Also, does it matter if I put the files
under different sub-directories before I do the data load in Hive, or does
everything have to be under the same folder?

Thanks,
T

p.s. I am sorry if this post is submitted twice.


  • Sarah Sproehnle at Jul 8, 2010 at 1:07 am
    Hi Todd,

    Are you planning to use Sqoop to do this import? If not, you should.
    :) It will do a parallel import, using MapReduce, to load the table
    into Hadoop. With the --hive-import option, it will also create the
    Hive table definition.

    Cheers,
    Sarah
    On Wed, Jul 7, 2010 at 5:51 PM, Todd Lee wrote: [snip]


    --
    Sarah Sproehnle
    Educational Services
    Cloudera, Inc
    http://www.cloudera.com/training
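
A minimal sketch of the kind of Sqoop import Sarah describes; the connection
string, username, table name, and mapper count below are illustrative
assumptions, not details from this thread:

    # Parallel import of an Oracle table into HDFS; --hive-import also
    # generates a matching Hive table definition. -P prompts for the
    # database password at the console.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username SCOTT -P \
      --table ORDERS \
      --num-mappers 8 \
      --hive-import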
  • Todd Lee at Jul 8, 2010 at 1:12 am
    Thanks, but is it going to create one big file in HDFS? I am currently
    considering writing my own Cascading job for this.

    thx,
    T
    On Wed, Jul 7, 2010 at 6:06 PM, Sarah Sproehnle wrote: [snip]
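
For what it's worth, Sqoop does not produce one big file: it writes one HDFS
output file per map task, so the file count follows the mapper count. A
hypothetical illustration (connection details and table name are assumptions):

    # Ten parallel mappers yield roughly ten output files under the
    # table's import directory.
    sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username SCOTT -P --table ORDERS -m 10
    # Expected layout (illustrative):
    #   ORDERS/part-m-00000
    #   ...
    #   ORDERS/part-m-00009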
  • Edward Capriolo at Jul 8, 2010 at 1:30 am

    On Wed, Jul 7, 2010 at 9:11 PM, Todd Lee wrote: [snip]

    Hadoop does not handle many small files well; look up the "Hadoop small
    file problem". Performance-wise you should try to have as few files as
    possible, but you should notice no difference in runtime between 1, 5,
    or even 500 files when your data is as big as 5 GB.
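
On the sub-directory part of the original question: a non-partitioned Hive
table reads the files directly under a single directory, so the export files
should sit together in one folder rather than in nested sub-directories. A
brief sketch, with hypothetical paths, schema, and delimiter:

    # Collect the export files in one HDFS directory...
    hadoop fs -mkdir /user/todd/orders_export
    hadoop fs -put orders_part_*.csv /user/todd/orders_export/
    # ...and point an external Hive table at that directory.
    hive -e "CREATE EXTERNAL TABLE orders (id BIGINT, amount DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             LOCATION '/user/todd/orders_export';"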
