Column pruner causes wrong results when using both Custom Store UDF and PigStorage
----------------------------------------------------------------------------------

Key: PIG-1537
URL: https://issues.apache.org/jira/browse/PIG-1537
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I have a script of the following pattern, which uses two StoreFuncs:

{code}
-- Custom loader and store UDFs
register loader.jar;
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
    a#'id' matches '1.*' OR
    a#'id' matches '2.*' OR
    a#'id' matches '3.*' OR
    a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
    a#'id' matches '65.*' OR
    a#'id' matches '466.*' OR
    a#'id' matches '043.*' OR
    a#'id' matches '044.*' OR
    a#'id' matches '0650.*' OR
    a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0, ss_sc_filtered_1;

-- Project the sort keys out of map a; they are dropped again after the ORDER
ss_sc_all_proj = FOREACH ss_sc_all GENERATE
    a#'query' AS query,
    a#'testid' AS testid,
    a#'timestamp' AS timestamp,
    a,
    b,
    c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query, testid, timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c;

-- First store: custom StoreFunc
STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' USING Storage();

ss_sc_all_map_count = GROUP ss_sc_all_map ALL;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' AS record_count, COUNT($1);

-- Second store: built-in PigStorage, tab-delimited
STORE count INTO '$OUTPUTDIR/count/20100707' USING PigStorage('\u0009');
{code}

I run this script using:

a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias "count" produces the same number of records in both runs, but the output of "ss_sc_all_map" has a different size under each of the two options above.

Is this due to the fact that two StoreFuncs are used?

Viraj
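
One way to quantify the size difference between the two runs is sketched below. It assumes each run's `ss_sc_all_map` output directory has been copied to local disk (for example with `hadoop fs -get`) and that results are written as `part-*` files; the function names and any directory names are hypothetical, not part of the original report. Matching byte totals would not prove the contents are identical, so this is only a first check.

```python
import os


def output_stats(output_dir):
    """Return (part_file_count, total_bytes) for the part-* files in output_dir."""
    count = 0
    total = 0
    for name in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, name)
        # Only count result files; skip job logs and marker files.
        if name.startswith("part-") and os.path.isfile(path):
            count += 1
            total += os.path.getsize(path)
    return count, total


def compare_runs(dir_a, dir_b):
    """Summarize how two local output directories differ in file count and size."""
    files_a, bytes_a = output_stats(dir_a)
    files_b, bytes_b = output_stats(dir_b)
    return {
        "files": (files_a, files_b),
        "bytes": (bytes_a, bytes_b),
        "same_size": bytes_a == bytes_b,
    }
```

Calling `compare_runs("run_a/data/20100707", "run_b/data/20100707")` on the two local copies (paths are placeholders) returns the part-file counts and byte totals side by side.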

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Viraj Bhat (JIRA) at Aug 5, 2010 at 1:01 am
    [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Viraj Bhat updated PIG-1537:
    ----------------------------

  • Olga Natkovich (JIRA) at Aug 5, 2010 at 4:57 pm
    [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1537:
    --------------------------------

    Assignee: Daniel Dai
    Fix Version/s: 0.8.0

    Daniel, can we test whether this is a problem with 0.8?

    Viraj, is this data-specific, and if so, can you provide data to reproduce it? Also, do you know which run produces the correct results?
  • Viraj Bhat (JIRA) at Aug 5, 2010 at 10:54 pm
    [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895858#action_12895858 ]

    Viraj Bhat commented on PIG-1537:
    ---------------------------------

    Hi Olga, I have given Daniel the specific script, with the UDFs, to test. Thanks, Daniel, for your help.
    The run that does not use the column pruner optimization (that is, the one that disables it with -t) gives the correct results.
    Viraj
  • Olga Natkovich (JIRA) at Sep 1, 2010 at 1:31 am
    [ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich resolved PIG-1537.
    ---------------------------------

    Resolution: Fixed
