Grokbase Groups Hive user August 2010
Hi,

I have a problem and I hope someone has an idea on how to solve it.

My dataset consists of just very simple key-value pairs of strings
coming from PostgreSQL using Sqoop.

1) I need to count how often a key occurs -> Easy
2) I need to count how often a key-value pair occurs -> Easy

I need to output this data to PostgreSQL again, into two tables:

a) "keys" with the columns: id, key_name, count
b) "values" with the columns: id, key_id, value_name, count

Now the ids I'm referring to don't exist yet and I'm looking into
solutions to generate them. They have to be integers/longs but they
don't have to be in any order/pattern. I'm not concerned about
performance either as this query will be run monthly at most.

Do you have any idea how I could introduce this new column into the
output of query 1)? I could then easily introduce it into 2) with a
join. I thought about using a custom reducer script, but apart from the
fact that I've never written one, it would require that there is only
one reducer so that I can simulate an auto-incrementer. My current best
idea is to write a regular MR job that processes the Hive output, but
I'd love to do everything in Hive if possible.
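
For illustration, the auto-incrementer idea and the follow-up join could be sketched roughly as below. This is only a minimal sketch in plain Python, not anything taken from Hive: the function names assign_ids and build_values, the tuple layouts, and the starting id of 1 are all assumptions made for the example, and it presumes that a single process sees every row.

```python
from itertools import count


def assign_ids(key_counts, start=1):
    """Turn (key_name, cnt) rows into (id, key_name, cnt) rows.

    Assumption: one process sees every row, so a simple counter
    is enough to hand out unique ids.
    """
    ids = count(start)
    for key_name, cnt in key_counts:
        yield next(ids), key_name, cnt


def build_values(keys_with_ids, pair_counts, start=1):
    """Turn (key_name, value_name, cnt) rows into
    (id, key_id, value_name, cnt) rows by looking up each key's id.
    """
    # Map every key name to the id it was given in the "keys" output.
    key_id = {key_name: kid for kid, key_name, _ in keys_with_ids}
    ids = count(start)
    for key_name, value_name, cnt in pair_counts:
        yield next(ids), key_id[key_name], value_name, cnt


# Hypothetical sample data standing in for the output of queries 1) and 2).
keys = list(assign_ids([("color", 3), ("size", 2)]))
values = list(build_values(keys, [("color", "red", 2),
                                  ("color", "blue", 1),
                                  ("size", "L", 2)]))
```

The ids come out in input order here, which is more than is required, since they only need to be unique integers.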

I might very well be approaching this problem completely wrong, so
don't hesitate to propose a better solution or bash me for my poor
understanding of Hive :)

Thanks for any input and help.

Cheers,
Lars


  • Tim Robertson at Aug 10, 2010 at 6:29 am
    Hi Lars,

    I had a similar need and came across
    https://issues.apache.org/jira/browse/HIVE-1304 but haven't got round
    to trying it yet.

    Cheers,
    Tim

  • Lars Francke at Aug 10, 2010 at 9:31 am
    Hi Tim,

    > I had a similar need and came across
    > https://issues.apache.org/jira/browse/HIVE-1304 but haven't got round
    > to trying it yet.

    Well, that looks exactly like what I'm looking for. The missing link
    for me was this line:

    set mapred.reduce.tasks=1;

    I've used it before but I don't know why I didn't think of it now.
    I've even mentioned it in my initial mail :(
    Thanks! I'll go ahead and try it right now.

    Cheers,
    Lars
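
As a rough illustration of why the single-reducer setting matters for a counter-based id, the toy simulation below gives every simulated reducer its own counter starting at 1. It is a sketch only: the round-robin distribution of keys and the names run_reducers and has_duplicate_ids are assumptions for the example and do not reflect how Hive actually partitions rows.

```python
from itertools import count


def run_reducers(keys, num_reducers):
    """Simulate reducers that each number their own keys from 1.

    Assumption: keys are dealt out round-robin purely for illustration.
    """
    counters = [count(1) for _ in range(num_reducers)]
    return [(next(counters[i % num_reducers]), key)
            for i, key in enumerate(keys)]


def has_duplicate_ids(rows):
    """Return True if any id was handed out more than once."""
    ids = [row_id for row_id, _ in rows]
    return len(ids) != len(set(ids))


sample_keys = ["a", "b", "c", "d"]
two_reducers = run_reducers(sample_keys, 2)  # independent counters overlap
one_reducer = run_reducers(sample_keys, 1)   # a single counter stays unique
```

With two simulated reducers the independent counters hand out the same ids twice, while a single reducer yields one unbroken sequence.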

Discussion Overview
group: user @
categories: hive, hadoop
posted: Aug 9, '10 at 10:42p
active: Aug 10, '10 at 9:31a
posts: 3
users: 2
website: hive.apache.org
