FAQ
Hi all,

I am porting custom MR code to Hive and have written working UDFs
where I need them. Is there a work around to having to do this in
Hive:

select * from
(
select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as
f2_y, count (1) as count
from table
group by name_id, x, y, f2_x, f2_y

UNION ALL

select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as
f2_y, count (1) as count
from table
group by name_id, x, y, f2_x, f2_y

--- etc etc increasing in zoom
)

The issue being that this does many passes over the table, whereas
previously in my Map() I would just emit many times from the same
input record and then let it all group in the shuffle and sort.
I actually emit 184 times for an input record (23 zoom levels of
google maps, and 8 ways to derive the name_id) for a single record
which means 184 union statements - Is it possible in hive to force it
to emit many times from the source record in the stage-1 map?

(ahem) Does anyone know if Pig can do this if not in Hive?

I hope I have explained this well enough to make sense.

Thanks in advance,
Tim

Search Discussions

  • Sonal Goyal at Nov 8, 2010 at 4:40 am
    Hey Tim,

    You have an interesting problem. Have you tried creating a UDTF for your
    case, so that you can possibly emit more than one record for each row of
    your input?

    http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

    Thanks and Regards,
    Sonal

    Sonal Goyal | Founder and CEO | Nube Technologies LLP
    http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal




    On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson wrote:

    Hi all,

    I am porting custom MR code to Hive and have written working UDFs
    where I need them. Is there a work around to having to do this in
    Hive:

    select * from
    (
    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
    y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    UNION ALL

    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
    y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    --- etc etc increasing in zoom
    )

    The issue being that this does many passes over the table, whereas
    previously in my Map() I would just emit many times from the same
    input record and then let it all group in the shuffle and sort.
    I actually emit 184 times for an input record (23 zoom levels of
    google maps, and 8 ways to derive the name_id) for a single record
    which means 184 union statements - Is it possible in hive to force it
    to emit many times from the source record in the stage-1 map?

    (ahem) Does anyone know if Pig can do this if not in Hive?

    I hope I have explained this well enough to make sense.

    Thanks in advance,
    Tim
  • Namit Jain at Nov 8, 2010 at 5:55 am
    Other option would be to create a wrapper script (not use either UDF or UDTF)

    That script, in any language, can emit any number of output rows per input row.


    Look at:
    http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform

    for details


    ________________________________
    From: Sonal Goyal [sonalgoyal4@gmail.com]
    Sent: Sunday, November 07, 2010 8:40 PM
    To: user@hive.apache.org
    Subject: Re: Unions causing many scans of input - workaround?

    Hey Tim,

    You have an interesting problem. Have you tried creating a UDTF for your case, so that you can possibly emit more than one record for each row of your input?

    http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

    Thanks and Regards,
    Sonal

    Sonal Goyal | Founder and CEO | Nube Technologies LLP
    http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal





    On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson wrote:
    Hi all,

    I am porting custom MR code to Hive and have written working UDFs
    where I need them. Is there a work around to having to do this in
    Hive:

    select * from
    (
    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
    y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    UNION ALL

    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
    y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    --- etc etc increasing in zoom
    )

    The issue being that this does many passes over the table, whereas
    previously in my Map() I would just emit many times from the same
    input record and then let it all group in the shuffle and sort.
    I actually emit 184 times for an input record (23 zoom levels of
    google maps, and 8 ways to derive the name_id) for a single record
    which means 184 union statements - Is it possible in hive to force it
    to emit many times from the source record in the stage-1 map?

    (ahem) Does anyone know if Pig can do this if not in Hive?

    I hope I have explained this well enough to make sense.

    Thanks in advance,
    Tim
  • Tim Robertson at Nov 8, 2010 at 6:35 am
    Thank you both,

    A quick glance looks like that is what I am looking for. When I get
    it working, I'll post the solution.

    Cheers,
    Tim
    On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain wrote:
    Other option would be to create a wrapper script (not use either UDF or
    UDTF)
    That script, in any language, can emit any number of output rows per input
    row.

    Look at:
    http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
    for details

    ________________________________
    From: Sonal Goyal [sonalgoyal4@gmail.com]
    Sent: Sunday, November 07, 2010 8:40 PM
    To: user@hive.apache.org
    Subject: Re: Unions causing many scans of input - workaround?

    Hey Tim,

    You have an interesting problem. Have you tried creating a UDTF for your
    case, so that you can possibly emit more than one record for each row of
    your input?

    http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

    Thanks and Regards,
    Sonal

    Sonal Goyal | Founder and CEO | Nube Technologies LLP
    http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal




    On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson wrote:

    Hi all,

    I am porting custom MR code to Hive and have written working UDFs
    where I need them.  Is there a work around to having to do this in
    Hive:

    select * from
    (
    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
    y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    UNION ALL

    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
    y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    --- etc etc increasing in zoom
    )

    The issue being that this does many passes over the table, whereas
    previously in my Map() I would just emit many times from the same
    input record and then let it all group in the shuffle and sort.
    I actually emit 184 times for an input record (23 zoom levels of
    google maps, and 8 ways to derive the name_id) for a single record
    which means 184 union statements - Is it possible in hive to force it
    to emit many times from the source record in the stage-1 map?

    (ahem) Does anyone know if Pig can do this if not in Hive?

    I hope I have explained this well enough to make sense.

    Thanks in advance,
    Tim
  • Tim Robertson at Nov 8, 2010 at 8:17 am
    I am writing a GenericUDTF now, but notice on
    http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

    the method docs show:
    /**
    * Called to notify the UDTF that there are no more rows to process. Note that
    * forward() should not be called in this function. Only clean up code should
    * be run.
    */
    public abstract void close() throws HiveException;

    but the example does exactly that:
    @Override
    public void close() throws HiveException {
    forwardObj[0] = count;
    forward(forwardObj);
    forward(forwardObj);
    }

    I'll assume the example is correct and continue, but it might be worth
    fixing that page.

    Cheers,
    Tim




    On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson wrote:
    Thank you both,

    A quick glance looks like that is what I am looking for.  When I get
    it working, I'll post the solution.

    Cheers,
    Tim
    On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain wrote:
    Other option would be to create a wrapper script (not use either UDF or
    UDTF)
    That script, in any language, can emit any number of output rows per input
    row.

    Look at:
    http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
    for details

    ________________________________
    From: Sonal Goyal [sonalgoyal4@gmail.com]
    Sent: Sunday, November 07, 2010 8:40 PM
    To: user@hive.apache.org
    Subject: Re: Unions causing many scans of input - workaround?

    Hey Tim,

    You have an interesting problem. Have you tried creating a UDTF for your
    case, so that you can possibly emit more than one record for each row of
    your input?

    http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

    Thanks and Regards,
    Sonal

    Sonal Goyal | Founder and CEO | Nube Technologies LLP
    http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal





    On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson <timrobertson100@gmail.com>
    wrote:
    Hi all,

    I am porting custom MR code to Hive and have written working UDFs
    where I need them.  Is there a work around to having to do this in
    Hive:

    select * from
    (
    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
    y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    UNION ALL

    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
    y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    --- etc etc increasing in zoom
    )

    The issue being that this does many passes over the table, whereas
    previously in my Map() I would just emit many times from the same
    input record and then let it all group in the shuffle and sort.
    I actually emit 184 times for an input record (23 zoom levels of
    google maps, and 8 ways to derive the name_id) for a single record
    which means 184 union statements - Is it possible in hive to force it
    to emit many times from the source record in the stage-1 map?

    (ahem) Does anyone know if Pig can do this if not in Hive?

    I hope I have explained this well enough to make sense.

    Thanks in advance,
    Tim
  • Tim Robertson at Nov 8, 2010 at 4:22 pm
    For anyone stumbling across this thread the (pretty much) working UDTF
    can be seen:
    http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-spatial/src/main/java/org/gbif/occurrence/spatial/udf/TaxonDensityUDTF.java

    allowing for the following query:

    create table taxon_density as
    select taxonId,tileX,tileY,zoom,clusterX,clusterY,sum(count) from
    occurrence_record lateral view taxonDensityUDTF(kingdom_concept_id,
    phylum_concept_id, class_concept_id, order_concept_id,
    family_concept_id, genus_concept_id,
    species_concept_id,nub_concept_id, latitude, longitude, 10) e
    as taxonId,tileX,tileY,zoom,clusterX,clusterY,count
    group by taxonId,tileX,tileY,zoom,clusterX,clusterY;

    Thanks again for the pointers Sonal and Namit, and also on the other thread,

    Tim



    On Mon, Nov 8, 2010 at 9:17 AM, Tim Robertson wrote:
    I am writing a GenericUDTF now, but notice on
    http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

    the method docs show:
    /**
    * Called to notify the UDTF that there are no more rows to process. Note that
    * forward() should not be called in this function. Only clean up code should
    * be run.
    */
    public abstract void close() throws HiveException;

    but the example does exactly that:
    @Override
    public void close() throws HiveException {
    forwardObj[0] = count;
    forward(forwardObj);
    forward(forwardObj);
    }

    I'll assume the example is correct and continue, but it might be worth
    fixing that page.

    Cheers,
    Tim




    On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson wrote:
    Thank you both,

    A quick glance looks like that is what I am looking for.  When I get
    it working, I'll post the solution.

    Cheers,
    Tim
    On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain wrote:
    Other option would be to create a wrapper script (not use either UDF or
    UDTF)
    That script, in any language, can emit any number of output rows per input
    row.

    Look at:
    http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
    for details

    ________________________________
    From: Sonal Goyal [sonalgoyal4@gmail.com]
    Sent: Sunday, November 07, 2010 8:40 PM
    To: user@hive.apache.org
    Subject: Re: Unions causing many scans of input - workaround?

    Hey Tim,

    You have an interesting problem. Have you tried creating a UDTF for your
    case, so that you can possibly emit more than one record for each row of
    your input?

    http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

    Thanks and Regards,
    Sonal

    Sonal Goyal | Founder and CEO | Nube Technologies LLP
    http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal





    On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson <timrobertson100@gmail.com>
    wrote:
    Hi all,

    I am porting custom MR code to Hive and have written working UDFs
    where I need them.  Is there a work around to having to do this in
    Hive:

    select * from
    (
    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
    y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    UNION ALL

    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
    y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as
    f2_y, count (1) as count
    from table
    group by name_id, x, y, f2_x, f2_y

    --- etc etc increasing in zoom
    )

    The issue being that this does many passes over the table, whereas
    previously in my Map() I would just emit many times from the same
    input record and then let it all group in the shuffle and sort.
    I actually emit 184 times for an input record (23 zoom levels of
    google maps, and 8 ways to derive the name_id) for a single record
    which means 184 union statements - Is it possible in hive to force it
    to emit many times from the source record in the stage-1 map?

    (ahem) Does anyone know if Pig can do this if not in Hive?

    I hope I have explained this well enough to make sense.

    Thanks in advance,
    Tim

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedNov 7, '10 at 9:01p
activeNov 8, '10 at 4:22p
posts6
users3
websitehive.apache.org

People

Translate

site design / logo © 2022 Grokbase