Pig Latin Questions
Pig user mailing list, May 2010
Hi,
I am new to Hadoop and the Pig Latin language.
I am trying to convert the Hive QL below to Pig Latin. Any suggestions, please?

INSERT OVERWRITE TABLE A
SELECT id, org_type, dept_type, cnt, cnt_distinct
FROM (SELECT id, 'S' org_type, dept_type, COUNT(1) cnt, COUNT(DISTINCT dept_id) cnt_distinct
FROM B
WHERE visible_flag = 1
GROUP BY id, dept_type

Questions:
1. Is there an option to overwrite the table, or what does Pig Latin offer instead?
2. In the inner query, "'S' org_type" creates a new column and inserts 'S' as its value. What does Pig Latin offer for this?
3. Related to Q2, with "COUNT(1) cnt" I am counting, for each id, how many rows it has per dept_type and putting that count in a new column. How can I do this in Pig?

Thanks for your help.

Regards
MD



  • Thejas Nair at May 5, 2010 at 6:08 pm
    Hi Syed,

    1. Released versions of Pig don't support the concept of a table; there will
    be one in the Owl-specific loaders once they are available. Pig Latin output
    goes into files (if the store command is used) or to STDOUT (if dump is used).
    The behavior when the output file already exists is determined by the
    StoreFunc; PigStorage will give an error if the file already exists.
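
    (A minimal sketch of the two output paths described above; the alias, input
    path, and output path here are made up for illustration:)

    recs = load 'some_input' as (f1:chararray, f2:int);
    dump recs;                  -- prints the tuples to STDOUT
    store recs into 'out_dir';  -- writes part files under 'out_dir'; with PigStorage
                                -- this fails if 'out_dir' already exists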


    Re 2 & 3: here is the translation to Pig Latin:

    L = load 'B' as (id, dept_type, dept_id, visible_flag, org_type);
    FIL = filter L by visible_flag == 1;
    G = group FIL by (id, dept_type);
    FE = foreach G {
        DEPT_IDS = FIL.dept_id;
        DIST_DEPT_IDS = distinct DEPT_IDS;
        generate group.id, 'S' as org_type, group.dept_type,
                 COUNT_STAR(FIL) as cnt, COUNT(DIST_DEPT_IDS) as cnt_distinct;
    }

    describe FE;
    FE: {cnt_distinct: long,cnt: long,id: bytearray,dept_type: bytearray,org_type: chararray}

    store FE into 'A';
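
    (For illustration only: assuming 'B' is a tab-delimited file, as PigStorage
    expects by default, with the columns (id, dept_type, dept_id, visible_flag,
    org_type), made-up input like

    1   ENG   10   1   X
    1   ENG   10   1   X
    1   ENG   11   1   X
    1   HR    20   0   X
    2   ENG   10   1   X

    would produce output along the lines of

    (1,S,ENG,3,2)
    (2,S,ENG,1,1)

    since the HR row is dropped by the filter, and id 1 has three visible ENG rows
    with two distinct dept_ids.)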

  • Dmitriy Ryaboy at May 5, 2010 at 6:49 pm
    Under the hood, Hive tables are just files too.
    I am not sure what the INSERT OVERWRITE semantics are in edge cases
    (like if your query fails), but you may be able to simulate it using the
    'fs -mv' and 'fs -rmf' commands that Pig provides for operating on the
    Hadoop file system.
    Note that, for safety, Pig will refuse to run if you are trying to
    write into a directory that already exists, so you *must* use a move
    or a remove if you might already have data in the target location.

    All of that goes out the window for both Hive and Pig if you are using
    custom SerDes/StoreFuncs, which can do more or less whatever they
    want.

    -D
  • Edward Capriolo at May 5, 2010 at 7:14 pm

    The semantics of INSERT OVERWRITE are simple. The output of your query is
    written to a temp folder, and in the final step it is moved to its final
    destination, so you should never end up with partial files in the final
    directory.
  • Dmitriy Ryaboy at May 5, 2010 at 7:25 pm
    In that case:

    store FE into 'tmpdir';    -- write the new results to a temporary location
    exec;                      -- run the pending store before the fs commands below
    fs -rmf 'destdir'          -- remove the old output, if any
    fs -mv 'tmpdir' 'destdir'  -- move the new results into place

    -D
  • Syed Wasti at May 6, 2010 at 4:50 pm
    Thank you all for your expert suggestions. This was of great help.

    Regards
    Syed Wasti


Discussion Overview
group: user
categories: pig, hadoop
posted: May 4, 2010 at 11:37 pm
active: May 6, 2010 at 4:50 pm
posts: 6
users: 4
website: pig.apache.org
