Single Map task for Hive queries
Hive user mailing list, August 2011
Hello,

I have external tables in Hive, each stored in a single flat text file. When I
execute queries against them, all of my jobs run as a single map task, even on
very large tables.

What steps do I need to take to ensure that these queries are split up and
pushed out to multiple TTs (TaskTrackers)? Do I need to store the Hive tables
in a different internal file format? Make some configuration changes?

Thanks!
Jon

  • Loren Siebert at Aug 15, 2011 at 5:38 pm
    Is your external file compressed with GZip or BZip? Those file formats aren’t splittable, so they get assigned to one mapper.
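    One quick way to check, from the Hive CLI: list the files at the table's
    LOCATION and look for a compressed extension. A minimal sketch, using the
    /data/foo location given in the next message:

    -- List the files backing the external table. A single foo.gz (or
    -- similar) here would explain the single mapper, since a gzip stream
    -- cannot be split across map tasks.
    dfs -ls /data/foo;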
  • Jon Bender at Aug 15, 2011 at 5:47 pm
    It's actually just an uncompressed UTF-8 text file.

    This was essentially the CREATE TABLE statement (column list omitted):
    CREATE EXTERNAL TABLE foo
    ROW FORMAT DELIMITED
    STORED AS TEXTFILE
    LOCATION '/data/foo';

    Using Hive 0.7.
  • Ayon Sinha at Aug 15, 2011 at 5:58 pm
    Can you try to recreate the external table with FIELDS TERMINATED BY and LINES TERMINATED BY clauses?

    -Ayon
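    For concreteness, a recreate along those lines might look like the sketch
    below. The two columns are hypothetical (the original column list was never
    posted) and the delimiters are example values; dropping an external table
    leaves the underlying data in place.

    -- Hypothetical recreate with explicit delimiters; substitute the real schema.
    DROP TABLE foo;
    CREATE EXTERNAL TABLE foo (id STRING, msg STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/data/foo';

    As Steven notes later in the thread, though, these clauses affect how rows
    are parsed, not how files are split among mappers.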
  • Loren Siebert at Aug 15, 2011 at 6:00 pm
    You should not have to do anything special to Hive to make it use all of your TTs. The actual MR job should be governed by your mapred-site.xml file.

    When you run sample MR jobs (like the Pi example) and look at the job tracker, are you seeing all your TTs getting used?
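    For the sanity check Loren describes, something like the following works
    from the Hive CLI's shell escape (the examples-jar path varies by
    distribution, so the one shown is an assumption):

    -- Run the bundled Pi estimator with 10 maps and watch the job tracker
    -- to confirm the tasks spread across all TTs. Jar path is assumed.
    !hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 1000;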
  • Jon Bender at Aug 15, 2011 at 6:08 pm
    Yeah, MapReduce itself is set up to use all of my task trackers--only one map
    task gets created on the external table queries.

    I tried querying another external table (composed of some 20 files) and it
    created 20 map tasks in turn during the query. I will try the LINES
    TERMINATED BY clause next to try and parallelize within a single file.
  • Steven Wong at Aug 17, 2011 at 1:28 am
    The TERMINATED clauses don't affect how files are split among mappers. Is your hive.input.format set to org...CombineHiveInputFormat? If so, is your mapred.max.split.size set low enough? If not, there is another config to control, but I don't remember the name offhand. They are all Hadoop configs.


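    Pulling those pointers together, the settings can be tried per-session from
    the Hive CLI. A sketch, with an example split-size value; the class name is
    the full spelling behind the abbreviated org...CombineHiveInputFormat above:

    -- CombineHiveInputFormat builds splits subject to the max split size,
    -- so capping it carves a large single file into several splits.
    SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- 256 MB per split (example value); lower it to get more mappers.
    SET mapred.max.split.size=268435456;

    With a 256 MB cap, for example, a 2.5 GB file would yield roughly ten map
    tasks instead of one.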
