FAQ
Hello,

I would like to do reporting with Hive on something like tracking data.
The raw data, about 2 GB or more a day, is what I want to query with Hive. This already works for me, no problem.
I also want to cascade the reporting data down to something like client/date, i.e. a Hive table partitioned by (client STRING, date STRING).
That means I have multiple aggregation levels. I would like to do all levels in Hive for a consistent reporting source.
And here is the thing: might it become a problem when it comes to many small files?
The aggregation level client/date, for example, might produce files of about 1 MB each, roughly 1000 of them a day.
Is this a problem? I have read about the "too many open files" problem with Hadoop. Might this lead to bad Hive/MapReduce performance?
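For illustration, the layout I have in mind is roughly like this (table and column names are just placeholders, not my real schema):

```sql
-- Hypothetical aggregate table, partitioned as described above.
-- Each client/date combination becomes its own partition directory,
-- so one small file per client per day adds up quickly.
CREATE TABLE report_client_date (
  hits     BIGINT,
  visitors BIGINT
)
PARTITIONED BY (client STRING, `date` STRING)
STORED AS SEQUENCEFILE;
```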
Maybe someone has some clues for that...

Thanks in advance
labtrax
--
GMX DSL Doppel-Flat ab 19,99 Euro/mtl.! Jetzt mit
gratis Handy-Flat! http://portal.gmx.net/de/go/dsl


  • Edward Capriolo at Jan 31, 2011 at 4:42 pm

    You probably do not want to partition on something that has high
    cardinality, such as client_id. You do not want many small partitions;
    they are bad for the NameNode and bad for MapReduce performance. So if
    you have 1000 client ids, that is 1000+ files per day, and that is
    trouble over a long period of time.

    One option is to bucket the table into, say, 64 buckets on client_id.
    Hive can use the buckets to prune the amount of data that gets
    table-scanned. It is a compromise between many small files and really
    large files.

    Generally you want big files so Hadoop can use brute-force table scans.
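    A bucketed layout along these lines might look like this (just a
    sketch; the table and source names are made up):

    ```sql
    -- Sketch: bucket by client_id instead of partitioning by it.
    -- 64 buckets means at most 64 files per date partition,
    -- regardless of how many clients there are.
    CREATE TABLE report_by_client (
      client_id STRING,
      hits      BIGINT
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (client_id) INTO 64 BUCKETS
    STORED AS SEQUENCEFILE;

    -- Populate with bucketing enforced:
    SET hive.enforce.bucketing = true;
    INSERT OVERWRITE TABLE report_by_client PARTITION (dt = '2011-01-31')
    SELECT client_id, count(1)
    FROM raw_tracking
    WHERE dt = '2011-01-31'
    GROUP BY client_id;
    ```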

    Edward
  • Ajo Fod at Jan 31, 2011 at 6:16 pm
    I've noticed that it takes a while for each map task to be set up in Hive,
    and the way I set up the job, there were as many map tasks as
    files/buckets.

    I read a recommendation somewhere to design jobs so that each map task
    runs for at least a minute.
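
    To keep the map-task count down when the input is many small files,
    settings along these lines can help (a sketch; availability and
    defaults depend on the Hive version):

    ```sql
    -- Combine several small input files into one map task:
    SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- Merge small output files at the end of the job:
    SET hive.merge.mapfiles = true;
    SET hive.merge.mapredfiles = true;
    SET hive.merge.size.per.task = 256000000;
    ```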

    Cheers,
    -Ajo.

Discussion Overview
group: user @
categories: hive, hadoop
posted: Jan 31, '11 at 4:08p
active: Jan 31, '11 at 6:16p
posts: 3
users: 3
website: hive.apache.org

3 users in discussion

Ajo Fod: 1 post; Edward Capriolo: 1 post; Hive1: 1 post


site design / logo © 2022 Grokbase