Grokbase Groups Pig user July 2011
Hi all,

I'm trying to do a map-side-only merge join [1] in Pig using Zebra's
TableLoader. (My data allows a merge join.) However, I am unable to use
TableLoader. Even a simple script that loads a table and just stores it back
doesn't work:

----
A = load 'my_input' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
store A into 'my_output';
----


'my_input' is an input directory containing a single file with just one column:
---
1
2
3
---

The error I get is -

"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal
error. Failed to find deleted column groupsjava.io.IOException: BT Schema
file doesn't exist: *file:/......./my_input/.btschema*"


I have tried specifying the schema using the 'AS' clause and the DESCRIBE
statement as well, but I get the same error. Is the .btschema file
required? Is there any documentation available on its format? (I tried
comma-separated column names with and without type info.)


I am also willing to work with any other loader that satisfies the merge
join constraints. Thanks in anticipation.


Regards,
Ankur


[1] http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Merge+Joins

  • Ashutosh Chauhan at Jul 20, 2011 at 5:22 pm
    Hey Ankur,

    Zebra's TableLoader only works with data written out using Zebra's
    TableStorer. So you need to write the data with Zebra first, and then
    load it using TableLoader and do the merge join.

    Ashutosh
  • Ankur Jain at Jul 20, 2011 at 7:14 pm
    Thanks Ashutosh! Right, I realized that too yesterday. So, is there any
    other loader that implements the CollectableLoadFunc interface required
    by the merge join?


    Thanks,
    Ankur

  • Tomas Svarovsky at Jul 20, 2011 at 7:54 pm
    Not sure if this helps, but the docs say that the default
    PigStorage does implement that. I guess your data needs to be
    already sorted if you do not want to go through the reduce phase
    during the join.

    T
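The point about pre-sorted input can be illustrated with a minimal sketch of an inner merge join. This is plain Python showing the general technique, not Pig's implementation; the row shape and the names `is_sorted_by_key` and `inner_merge_join` are assumptions made for illustration. Because both inputs are already ordered on the join key, a single forward pass over each is enough, which is why no shuffle or reduce phase is needed.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, str]  # (join key, payload); a hypothetical row shape


def is_sorted_by_key(rows: List[Row]) -> bool:
    """Return True if rows are in non-decreasing key order."""
    return all(rows[i][0] <= rows[i + 1][0] for i in range(len(rows) - 1))


def inner_merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[int, str, str]]:
    """Join two key-sorted inputs in one forward pass over each."""
    if not (is_sorted_by_key(left) and is_sorted_by_key(right)):
        raise ValueError("both inputs must already be sorted on the join key")
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # left key has no match yet; advance left
        elif lkey > rkey:
            j += 1  # right key has no match yet; advance right
        else:
            # Locate the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    yield (lkey, left[i][1], right[k][1])
                i += 1
            j = j_end
```

Unsorted input is rejected up front rather than silently producing a wrong result, which mirrors the precondition discussed in this thread.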
  • Ashutosh Chauhan at Jul 20, 2011 at 8:16 pm
    It depends on whether you want to do an inner or an outer (also called
    co-group) merge join. If you are doing an inner merge join on two tables,
    PigStorage satisfies all the criteria and can be used. If you want to
    do an outer merge join (or an inner merge join on more than two tables),
    then you need CollectableLoadFunc, which PigStorage doesn't implement;
    only Zebra's TableLoader does.

    Hope it helps,
    Ashutosh
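The relationship between a co-group and an outer merge join can be sketched as follows. This is a conceptual Python illustration under stated assumptions, not Pig's or Zebra's code: the function names `merge_cogroup` and `full_outer_merge_join` and the `(key, value)` row shape are hypothetical. The sketch assumes both inputs are already sorted on the key, so all rows for one key are contiguous and can be collected in a single pass.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # (join key, payload); a hypothetical row shape


def merge_cogroup(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (key, left_values, right_values) from two key-sorted inputs."""
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    lcur = next(lgroups, None)
    rcur = next(rgroups, None)
    while lcur is not None or rcur is not None:
        if rcur is None or (lcur is not None and lcur[0] < rcur[0]):
            # Key exists only on the left side.
            yield (lcur[0], [v for _, v in lcur[1]], [])
            lcur = next(lgroups, None)
        elif lcur is None or rcur[0] < lcur[0]:
            # Key exists only on the right side.
            yield (rcur[0], [], [v for _, v in rcur[1]])
            rcur = next(rgroups, None)
        else:
            # Key exists on both sides.
            yield (lcur[0], [v for _, v in lcur[1]], [v for _, v in rcur[1]])
            lcur = next(lgroups, None)
            rcur = next(rgroups, None)


def full_outer_merge_join(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Tuple[int, Optional[str], Optional[str]]]:
    """Expand each co-group into joined rows, padding a missing side with None."""
    for key, lvals, rvals in merge_cogroup(left, right):
        for lv in (lvals or [None]):
            for rv in (rvals or [None]):
                yield (key, lv, rv)
```

Unlike the inner case, keys present on only one side must still be emitted, so every key group has to be seen whole on each side before a row is produced.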
  • Ankur Jain at Jul 20, 2011 at 9:16 pm
    Yeah, I need a (full) outer join, which has this constraint on the loader.

    Thanks.

  • Ashutosh Chauhan at Jul 20, 2011 at 9:22 pm
    If you control the generation of data which needs to be joined, then
    you can store it with Zebra and then do the joins. If not, then you
    either need to rewrite the data using Zebra or need to implement
    another loader which implements CollectableLoadFunc.

    Ashutosh
  • Ankur Jain at Jul 20, 2011 at 10:48 pm
    Thanks Ashutosh. Let me reconsider the various options available to me.

    -Ankur

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jul 19, '11 at 9:29p
active: Jul 20, '11 at 10:48p
posts: 8
users: 3
website: pig.apache.org
