Grokbase Groups Pig user January 2010
Hey pig gurus -

I'm having an issue with cast-to-tuple errors, such as:

ERROR 2999: Unexpected internal error.
org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator cannot be cast to
org.apache.pig.data.Tuple

Any help understanding where I've gone wrong would be appreciated!


DETAILS:

Given some Apache logs I'd like to see the percentage of responses by
response code by minute. Basically, I'd like to generate the following:

"""
day_hour_min  response_code  response_code_count  total_responses  response_code_pct
200101011458  200            9                    10               0.9
200101011458  503            1                    10               0.1
"""


I'm using the following steps; here is what describe says for each. Note that `counted' looks correct.

"""
data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}

grouped_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}

grouped_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}

counted_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),count: long}
counted_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),count: long}

counted: {counted_by_minute_by_response_code::group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),counted_by_minute_by_response_code::count: long,counted_by_minute::group: (date: chararray,hour: chararray,minute: chararray),counted_by_minute::count: long}
"""


Everything works up until my join, where illustrate gives the above
exception. Strangely, I can store the output but it only contains the date,
hour, and minute fields -- missing the counts. For example:

"""
20100110 1 9 20100110 1 9
"""

counted = join
    counted_by_minute_by_response_code by (group.date, group.hour, group.minute),
    counted_by_minute by (group.date, group.hour, group.minute)
    parallel 1;


I've tried writing this a few ways now and always have an issue when
referencing members of the group tuple. For example, I concatenated
date+hour+minute into a single timebucket field and got one step further,
but then ran into what I believe is the same issue when doing the following:

counted_pct = foreach counted generate
    counted_by_minute_by_response_code::group.timebucket as timebucket,
    counted_by_minute_by_response_code::group.response_code as response_code,
    counted_by_minute_by_response_code::count as response_code_count,
    counted_by_minute::count as response_code_count_total,
    (float)counted_by_minute_by_response_code::count /
        (float)counted_by_minute::count as response_code_pct;

Here I got "java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.pig.data.Tuple" when referencing timebucket or response_code.
Removing those two items allowed the script to complete (although with not
very useful output).

Any thoughts on what the problem might be?

Thanks!
Travis


  • Rekha Joshi at Jan 18, 2010 at 6:01 am
    Ideally, it would help if you could provide your Pig version and the script.

    However, I suspect that in your case dump/store after the join would work fine, as would explain/describe, and that the issue is only in illustrate.
    There are some issues logged for illustrate behavior, e.g. PIG-534; the ClassCastException you get with the other format can be seen under PIG-449.

    Cheers,
    /R

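For readers who hit the same errors when referencing members of the group tuple: one workaround that is often tried is to flatten the group key into ordinary top-level fields before the join, so that neither the join keys nor the final projection have to reach inside a tuple. The sketch below is only an illustration and is not confirmed anywhere in this thread as the fix. It assumes `data` is already loaded with the schema shown by describe above, and the aliases `cnt` and `total` are made up for the example.

```pig
-- Sketch only: assumes `data` has the schema
-- {date: chararray, hour: chararray, minute: chararray, response_code: chararray}.

-- Total responses per minute; flatten(group) turns the key tuple into plain fields.
grouped_by_minute = group data by (date, hour, minute);
counted_by_minute = foreach grouped_by_minute generate
    flatten(group) as (date, hour, minute),
    COUNT(data) as total;   -- `total` is a hypothetical alias

-- Responses per minute per response code, flattened the same way.
grouped_by_minute_by_response_code = group data by (date, hour, minute, response_code);
counted_by_minute_by_response_code = foreach grouped_by_minute_by_response_code generate
    flatten(group) as (date, hour, minute, response_code),
    COUNT(data) as cnt;     -- `cnt` is a hypothetical alias

-- Join on plain fields instead of group.date, group.hour, group.minute.
counted = join
    counted_by_minute_by_response_code by (date, hour, minute),
    counted_by_minute by (date, hour, minute);

-- Project the result and compute the per-code share of each minute's responses.
counted_pct = foreach counted generate
    counted_by_minute_by_response_code::date as date,
    counted_by_minute_by_response_code::hour as hour,
    counted_by_minute_by_response_code::minute as minute,
    counted_by_minute_by_response_code::response_code as response_code,
    counted_by_minute_by_response_code::cnt as response_code_count,
    counted_by_minute::total as total_responses,
    (float)counted_by_minute_by_response_code::cnt /
        (float)counted_by_minute::total as response_code_pct;
```

With the sample numbers from the original post (9 responses with code 200 and 1 with code 503 in the same minute), this should give 0.9 and 0.1 in response_code_pct. Whether it also avoids the illustrate exception depends on the Pig version, which the thread does not state.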

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jan 16, '10 at 9:51p
active: Jan 18, '10 at 6:01a
posts: 2
users: 2
website: pig.apache.org

2 users in discussion

Rekha Joshi: 1 post Travis Crawford: 1 post