Grokbase Groups Pig user July 2010
Hello -

I'd like to use Pig to process log files containing BigDecimals. I'm
loading my data as JSON via a custom LoadFunc. One approach seems to be
to represent the BigDecimal fields as DataType.BYTEARRAY and then write
an algebraic EvalFunc:

Example:

1. Data

class Log {
String id;
long timestamp;
BigDecimal costA;
BigDecimal costB;
}

2. Convert the Log class to JSON:

org.codehaus.jackson.map.ObjectMapper mapper = new org.codehaus.jackson.map.ObjectMapper();
String encoded = mapper.writeValueAsString(logEntry);

3. Generated log files look like this:

{"id":"someid", "timestamp":"sometimestamp", "costA":1.00, "costB":1.23456}
{"id":"someid", "timestamp":"sometimestamp", "costA":2.00, "costB":2.23456}
{"id":"someid", "timestamp":"sometimestamp", "costA":3.00, "costB":3.23456}
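(A side note on why BigDecimal rather than double matters for these cost
fields: a JSON numeric literal like 1.00 loses its scale when parsed as a
double, while BigDecimal preserves it exactly and round-trips through a
String losslessly. A minimal plain-Java illustration, independent of Pig
and Jackson:)

```java
import java.math.BigDecimal;

public class ScaleDemo {
    public static void main(String[] args) {
        // Parsing the literal "1.00" as a double drops the trailing zero.
        System.out.println(Double.parseDouble("1.00"));            // 1.0

        // BigDecimal keeps the exact scale and round-trips through a String.
        BigDecimal cost = new BigDecimal("1.00");
        System.out.println(cost);                                  // 1.00
        System.out.println(cost.scale());                          // 2

        // equals() distinguishes scale; use compareTo() for numeric equality.
        System.out.println(cost.equals(new BigDecimal("1.0")));    // false
        System.out.println(cost.compareTo(new BigDecimal("1.0"))); // 0
    }
}
```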

4. In a custom Pig LoadFunc, decode the JSON:

Log logEntry = mapper.readValue(encoded, Log.class);

5. Convert the hydrated logEntry to a Pig Tuple:

Tuple tuple = TupleFactory.getInstance().newTuple(NUMBER_OF_FIELDS);
tuple.set(0, logEntry.getID());
tuple.set(1, logEntry.getTimestamp());
tuple.set(2, logEntry.getCostA());
tuple.set(3, logEntry.getCostB());

Except that you cannot set a BigDecimal into the DefaultTuple, since
BigDecimal is not one of the data types Pig knows how to serialize.

So, what is the recommended way to proceed from here?

* Do I write my own Tuple impl?
* Do I shove the BigDecimal into the DefaultTuple as a byte array and
use an EvalFunc to read the byte array field? This func could then
create a BigDecimal and perform the BigDecimal.add().
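(The second option boils down to a simple round-trip, sketched here in plain
Java with the Pig plumbing — DataByteArray, EvalFunc — omitted; the byte
encoding is assumed to be BigDecimal.toString() in UTF-8:)

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;

public class ByteArrayRoundTrip {
    // What the LoadFunc would store into the BYTEARRAY field.
    static byte[] encode(BigDecimal d) {
        return d.toString().getBytes(StandardCharsets.UTF_8);
    }

    // What the EvalFunc would do before any arithmetic.
    static BigDecimal decode(byte[] raw) {
        return new BigDecimal(new String(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] costA = encode(new BigDecimal("1.00"));
        byte[] costB = encode(new BigDecimal("2.23456"));
        System.out.println(decode(costA).add(decode(costB)));  // 3.23456
    }
}
```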

-Todd


  • Dmitriy Ryaboy at Jul 9, 2010 at 12:17 am
    the latter.
    On Thu, Jul 8, 2010 at 5:09 PM, ToddG wrote:

[quoted original message trimmed]

  • ToddG at Jul 20, 2010 at 9:19 pm
Follow Up: Thanks Dmitriy, that worked out really well. I just followed the
lead of builtin/IntAvg.java. In my case, I wound up storing intermediate
BigDecimal values as chararrays... expensive to create all those objects,
but conceptually simple.
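(For anyone following the same path: the combine step that an IntAvg-style
Algebraic implementation reduces to can be sketched in plain Java, with the
Pig EvalFunc/Algebraic plumbing omitted; the chararray encoding is assumed
to be BigDecimal.toString():)

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

public class BigDecimalSumSketch {
    // Initial, Intermed, and Final all perform the same fold: parse each
    // chararray, accumulate with BigDecimal.add(), and re-serialize so the
    // partial sum can itself travel through Pig as a chararray.
    static String combine(List<String> partials) {
        BigDecimal sum = BigDecimal.ZERO;
        for (String s : partials) {
            sum = sum.add(new BigDecimal(s));
        }
        return sum.toString();
    }

    public static void main(String[] args) {
        // Map-side partial sums...
        String partial1 = combine(Arrays.asList("1.00", "2.00"));
        String partial2 = combine(Arrays.asList("3.00"));
        // ...combined again reduce-side.
        System.out.println(combine(Arrays.asList(partial1, partial2)));  // 6.00
    }
}
```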

    -Todd
    On 7/8/10 5:15 PM, Dmitriy Ryaboy wrote:
    the latter.

On Thu, Jul 8, 2010 at 5:09 PM, ToddG wrote:

[quoted original message trimmed]


  • Rohan Rai at Aug 16, 2010 at 9:06 am
    Is Elephant-bird compatible with Pig 0.7 ??

    Regards
    Rohan

    The information contained in this communication is intended solely for the use of the individual or entity to whom it is addressed and others authorized to receive it. It may contain confidential or legally privileged information. If you are not the intended recipient you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this information is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by responding to this email and then delete it from your system. The firm is neither liable for the proper and complete transmission of the information contained in this communication nor for any delay in its receipt.
  • Dmitriy Ryaboy at Aug 16, 2010 at 10:06 am
    Rohan,
At the moment ElephantBird only works with Pig 0.6.
    We intend to upgrade it soonish. It should be fairly straightforward as we
    already generate all the required InputFormats and Readers.
    Patches welcome :).

    -Dmitriy
    On Mon, Aug 16, 2010 at 2:03 AM, Rohan Rai wrote:

[quoted original message trimmed]

Discussion Overview
group: user
categories: pig, hadoop
posted: Jul 9, '10 at 12:10a
active: Aug 16, '10 at 10:06a
posts: 5
users: 3
website: pig.apache.org
