Grokbase Groups Pig user October 2010
Hi all!

I am struggling to find a working solution for loading data from HBase directly. I
am using Cloudera CDH3b3, which comes with Pig 0.7. What would be the easiest
way to load data from HBase?
If it matters: we need the row keys to be included, too.

I have checked ElephantBird, but it seems to require Pig 0.6. I could
downgrade, but it seems... well... :)

On the other hand, loading from HBase with row keys was only added in Pig 0.8:
https://issues.apache.org/jira/browse/PIG-915
https://issues.apache.org/jira/browse/PIG-1205
But judging from the latter issue, Pig 0.8 requires HBase 0.20.6?

I can install the latest Pig from source if needed, but I'd rather leave Hadoop
and HBase at their current versions (0.20.2 and 0.89.20100924, respectively).

Should I write my own UDF? I'd appreciate some pointers.

Thanks,

Anze

  • Dmitriy Ryaboy at Oct 25, 2010 at 10:02 pm
Anze, we bumped the requirement to 0.20.6 in the ticket because HBase
0.20.2 had a bug in it. Ask the HBase folks, but I'd say you should
upgrade.
FWIW, we upgraded from 0.20.2 to 0.20.6 a few months back and it's been
working smoothly.

The Elephant-Bird HBase loader for Pig 0.6 does add row keys and most
of the other features we added to the built-in loader for Pig 0.8
(notably, it does not do storage). But I don't recommend downgrading
to Pig 0.6, as 0.7 and especially 0.8 are great improvements to the
software.

    -D

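
    To illustrate the Pig 0.8 built-in loader mentioned above, loading with row
    keys included looks roughly like this (a sketch based on the PIG-915/PIG-1205
    work; the table name, column list, and exact option spelling are assumptions
    to check against the build you end up with):

    ```pig
    -- Hypothetical table and columns; '-loadKey true' asks the loader to
    -- prepend the row key to each returned tuple.
    raw = LOAD 'hbase://mytable'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'info:name info:age', '-loadKey true')
          AS (rowkey:chararray, name:chararray, age:chararray);
    DUMP raw;
    ```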
  • Anze at Oct 25, 2010 at 10:39 pm
    Dmitriy, thanks for the answer!

The problem with upgrading to HBase 0.20.6 is that Cloudera doesn't ship it
yet, and we would like to keep our install at "official" versions, even if
beta. Of course, since this is a development / testing cluster, we could bend
the rules if really necessary...

I have written a small MR job (actually, just an "M" job :) that exports the
tables to files (allowing me to use Pig 0.7), but that is a bit cumbersome and
slow.

If I install the latest Pig (0.8), will it work at all with HBase 0.20.2?
In other words, are scan filters (which were fixed in 0.20.6) only used when
the user explicitly asks for them, or are they part of Pig's optimizations
when reading from HBase? Hope my question makes sense... :)

    Thanks again,

    Anze

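
    The export-then-load workaround described above can be read back in Pig 0.7
    along these lines (a sketch; the HDFS path and the tab-separated two-column
    layout are assumptions about what the map-only export job writes):

    ```pig
    -- Hypothetical output directory written by the map-only export job.
    raw = LOAD '/exports/mytable' USING PigStorage('\t')
          AS (rowkey:chararray, value:chararray);
    DUMP raw;
    ```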
  • Dmitriy Ryaboy at Oct 25, 2010 at 10:47 pm
    I think that you might be able to get away with 20.2 if you don't use
    the filtering options.

  • Anze at Oct 26, 2010 at 9:50 am
    Great! :)

    Thanks for helping me out.

    All the best,

    Anze
  • Anze at Oct 26, 2010 at 1:33 pm
    Hmmm, not quite there yet. :-/

    I installed:
    - HBase 0.20.6
    - Cloudera CDH3b3 Hadoop (0.20.2)
- Pig 0.8 (since the official download is empty (?), I fetched the Pig trunk
from SVN and built it)

Now it complains about "Failed to create DataStorage". Any ideas? Should I
upgrade Hadoop too?

    This is getting a bit complicated to install. :)

I would appreciate some pointers - Google revealed nothing useful.

    Thanks,

    Anze

  • Dmitriy Ryaboy at Oct 26, 2010 at 5:54 pm
Yeah, Pig 0.8 is not officially released yet; it will be cut at the end
of this month or the beginning of next.

"Failed to create DataStorage" sounds vaguely familiar... can you send
the full Pig session and the full error? I think it's not connecting
to HBase on the client side, or something along those lines. You have
all the conf files in PIG_CLASSPATH, right?

    -D
  • Anze at Oct 27, 2010 at 6:51 am
    ... You have all the conf files in PIG_CLASSPATH right?
    I think I do:
    ***
    PIG_HOME: /opt/pig/bin/..
    PIG_CONF_DIR: /opt/pig/bin/../conf
    dry run:
/usr/lib/jvm/java-6-sun/bin/java -Xmx1000m -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log -Dpig.home.dir=/opt/pig/bin/.. -Dpig.root.logger=INFO,console,DRFA -classpath /opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar org.apache.pig.Main
    ***

    Generated log file contains:
    ***
    Error before Pig is launched
    ----------------------------
    ERROR 2999: Unexpected internal error. Failed to create DataStorage

java.lang.RuntimeException: Failed to create DataStorage
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
at org.apache.pig.PigServer.<init>(PigServer.java:225)
at org.apache.pig.PigServer.<init>(PigServer.java:214)
at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
at org.apache.pig.Main.run(Main.java:450)
at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.IOException: Call to namenode.admundus.com/10.0.0.3:8020
failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
... 9 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
    ================================================================================

And Pig complains:
    ***
    log4j:WARN No appenders could be found for logger
    (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.
    2010-10-27 08:46:44,762 [main] INFO org.apache.pig.Main - Logging error
    messages to: /opt/pig/bin/pig_1288162004754.log
    2010-10-27 08:46:44,970 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
    hadoop file system at: hdfs://...:8020/
    2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR 2999:
    Unexpected internal error. Failed to create DataStorage
    Details at logfile: /opt/pig/bin/pig_1288162004754.log
    ***

    Any idea what is wrong? I have searched the net and most answers talk about
    incompatible versions of Hadoop and Pig (but the posts are old).

    Thanks,

    Anze

  • Dmitriy Ryaboy at Oct 27, 2010 at 7:49 am
The same way you have /etc/hadoop/conf on the classpath, you want to
put the HBase conf directory on the classpath.

    -D
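
    Concretely, that could look something like this before launching Pig (a
    sketch; /etc/hbase/conf is an assumed location for hbase-site.xml, so adjust
    it to your layout):

    ```shell
    # Prepend the HBase conf dir (assumed path) so the Pig client picks up
    # hbase-site.xml, the same way /etc/hadoop/conf is already on the classpath.
    export HBASE_CONF_DIR="${HBASE_CONF_DIR:-/etc/hbase/conf}"
    export PIG_CLASSPATH="${HBASE_CONF_DIR}:${PIG_CLASSPATH}"
    echo "PIG_CLASSPATH=${PIG_CLASSPATH}"
    ```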
  • Anze at Oct 27, 2010 at 8:31 am
    Thanks, I guess I would trip over that later on - but for this immediate
    problem it doesn't help (of course, because Pig fails at the start, when I'm
    not working with HBase yet).

    I have tracked the error message to HDataStorage.init() and added some
    debugging info:
    -----
    public void init() {
        // check if name node is set, if not we set local as fail back
        String nameNode = this.properties.getProperty(FILE_SYSTEM_LOCATION);
        System.out.println("NAMENODE: " + nameNode); // debug
        if (nameNode == null || nameNode.length() == 0) {
            nameNode = "local";
        }
        this.configuration = ConfigurationUtil.toConfiguration(this.properties);
        try {
            if (this.uri != null) {
                this.fs = FileSystem.get(this.uri, this.configuration);
            } else {
                this.fs = FileSystem.get(this.configuration);
            }
        } catch (IOException e) {
            e.printStackTrace(); // debug
            throw new RuntimeException("Failed to create DataStorage", e);
        }
        short defaultReplication = fs.getDefaultReplication();
        properties.setProperty(DEFAULT_REPLICATION_FACTOR_KEY,
            Short.valueOf(defaultReplication).toString());
    }
    -----

    The run now looks like this:
    -----
    root:/opt/pig# bin/pig
    PIG_HOME: /opt/pig/bin/..
    PIG_CONF_DIR: /opt/pig/bin/../conf
    2010-10-27 10:18:18,728 [main] INFO org.apache.pig.Main - Logging error
    messages to: /opt/pig/pig_1288167498720.log
    2010-10-27 10:18:18,940 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
    hadoop file system at: hdfs://<MY NAMENODE>:8020/
    NAMENODE: hdfs://<MY NAMENODE>:8020/
    java.io.IOException: Call to <MY NAMENODE>/10.0.0.3:8020 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:73)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.<init>(PigServer.java:225)
        at org.apache.pig.PigServer.<init>(PigServer.java:214)
        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:450)
        at org.apache.pig.Main.main(Main.java:107)
    Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
    2010-10-27 10:18:19,124 [main] ERROR org.apache.pig.Main - ERROR 2999:
    Unexpected internal error. Failed to create DataStorage
    Details at logfile: /opt/pig/pig_1288167498720.log
    -----

    I have replaced the name of my server with <MY NAMENODE> in the above listing.
    BTW, this works as it should:
    # hadoop fs -ls hdfs://<MY NAMENODE>:8020/

    I would appreciate some pointers, I have no idea what is causing this...

    Anze

    On Wednesday 27 October 2010, Dmitriy Ryaboy wrote:
    The same way you have /etc/hadoop/conf on the classpath, you want to
    put the hbase conf directory on the classpath.

    -D
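    [A quick sketch of the setup Dmitriy describes; the paths below are
    examples, adjust them to your install. Pig's bin/pig script appends
    PIG_CLASSPATH to the JVM classpath, so putting the HBase conf dir
    there, next to the Hadoop one, lets the loader find hbase-site.xml:]

    ```shell
    # Hypothetical conf locations -- adjust to your install.
    HADOOP_CONF_DIR=/etc/hadoop/conf
    HBASE_CONF_DIR=/etc/hbase/conf
    # bin/pig prepends PIG_CLASSPATH to the classpath it builds.
    export PIG_CLASSPATH="$HADOOP_CONF_DIR:$HBASE_CONF_DIR"
    echo "$PIG_CLASSPATH"   # prints /etc/hadoop/conf:/etc/hbase/conf
    ```
    
    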
    On Tue, Oct 26, 2010 at 11:50 PM, Anze wrote:
    ... You have all the conf files in PIG_CLASSPATH right?
    I think I do:
    ***
    PIG_HOME: /opt/pig/bin/..
    PIG_CONF_DIR: /opt/pig/bin/../conf
    dry run:
    /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
    -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log
    -Dpig.home.dir=/opt/pig/bin/.. -Dpig.root.logger=INFO,console,DRFA
    -classpath /opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
    org.apache.pig.Main
    ***

    Generated log file contains:
    ***
    Error before Pig is launched
    ----------------------------
    ERROR 2999: Unexpected internal error. Failed to create DataStorage

    java.lang.RuntimeException: Failed to create DataStorage
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.<init>(PigServer.java:225)
        at org.apache.pig.PigServer.<init>(PigServer.java:214)
        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:450)
        at org.apache.pig.Main.main(Main.java:107)
    Caused by: java.io.IOException: Call to namenode.admundus.com/10.0.0.3:8020 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
        ... 9 more
    Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
    ================================================================================

    And the Pig complains:
    ***
    log4j:WARN No appenders could be found for logger
    (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.
    2010-10-27 08:46:44,762 [main] INFO org.apache.pig.Main - Logging error
    messages to: /opt/pig/bin/pig_1288162004754.log
    2010-10-27 08:46:44,970 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
    Connecting to hadoop file system at: hdfs://...:8020/
    2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR 2999:
    Unexpected internal error. Failed to create DataStorage
    Details at logfile: /opt/pig/bin/pig_1288162004754.log
    ***

    Any idea what is wrong? I have searched the net and most answers talk
    about incompatible versions of Hadoop and Pig (but the posts are old).

    Thanks,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    Yeah, Pig 0.8 is not officially released yet; it will be cut at the
    end of the month or the beginning of next month.

    "Failed to create DataStorage" sounds vaguely familiar... can you send
    the full pig session and the full error? I think it's not connecting
    to hbase on the client-side, or something along those lines. You have
    all the conf files in PIG_CLASSPATH, right?

    -D
    On Tue, Oct 26, 2010 at 6:32 AM, Anze wrote:
    Hmmm, not quite there yet. :-/

    I installed:
    - HBase 0.20.6
    - Cloudera CDH3b3 Hadoop (0.20.2)
    - Pig 0.8 (since official download is empty (?) I fetched the Pig
    trunk from SVN and built it)

    Now it complains about "Failed to create DataStorage". Any ideas?
    Should I upgrade Hadoop too?

    This is getting a bit complicated to install. :)

    I would appreciate some pointers - google revealed nothing useful.

    Thanks,

    Anze
    On Tuesday 26 October 2010, Anze wrote:
    Great! :)

    Thanks for helping me out.

    All the best,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    I think that you might be able to get away with 20.2 if you don't
    use the filtering options.
    On Mon, Oct 25, 2010 at 3:39 PM, Anze wrote:
    Dmitriy, thanks for the answer!

    The problem with upgrading to HBase 0.20.6 is that cloudera
    doesn't ship it yet and we would like to keep our install at
    "official" versions, even if beta. Of course, since this is a
    development / testing cluster, we could bend the rules if really
    necessary...

    I have written a small MR job (actually, just "M" job :) that
    exports the tables to files (allowing me to use Pig 0.7), but
    that is a bit cumbersome and slow.

    If I install the latest Pig (0.8), will it work at all with HBase
    0.20.2? In other words, are scan filters (which were fixed in
    0.20.6) needed as part of user-defined parameters or as part of
    Pig optimizations in reading from HBase? Hope my question makes
    sense...

    :)

    Thanks again,

    Anze
  • Anze at Oct 28, 2010 at 8:43 am
    Does anyone know, should Pig (0.8 - svn trunk) work with Hadoop 0.20.2?

    I still can't start Pig...

    Thanks,

    Anze

  • Dmitriy Ryaboy at Oct 28, 2010 at 3:28 pm
    It works with 20.2, and the error trace you pasted appears to be
    completely independent of HBaseStorage.

    I see that you are using the snapshot jar -- try putting your hadoop
    jars and various dependencies on your classpath, and only using the
    -nohadoop jar that pig also builds.

    -D
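    [A sketch of this suggestion, with hypothetical paths and jar names --
    the exact "nohadoop" jar name depends on your build. The idea is to put
    the cluster's own Hadoop jar and conf on the classpath and use only the
    Pig jar built without bundled Hadoop, so the Hadoop RPC client version
    matches the NameNode's:]

    ```shell
    # Hypothetical locations -- adjust to your install.
    HADOOP_HOME=/usr/lib/hadoop
    PIG_JAR=/opt/pig/build/pig-0.8.0-SNAPSHOT-nohadoop.jar   # name may differ
    # Cluster's hadoop-core jar and conf first, then the nohadoop pig jar.
    export PIG_CLASSPATH="$PIG_JAR:$HADOOP_HOME/hadoop-core.jar:/etc/hadoop/conf"
    echo "$PIG_CLASSPATH"
    ```
    
    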
    On Thu, Oct 28, 2010 at 1:42 AM, Anze wrote:

    Does anyone know, should Pig (0.8 - svn trunk) work with Hadoop 0.20.2?

    I still can't start the Pig...

    Thanks,

    Anze

    On Wednesday 27 October 2010, Anze wrote:
    Thanks, I guess I would trip over that later on - but for this immediate
    problem it doesn't help (of course, because Pig fails at the start, when
    I'm not working with HBase yet).

    I have tracked the error message to HBaseStorage.init() and added some
    debugging info:
    -----
    public void init() {
    // check if name node is set, if not we set local as fail back
    String nameNode =
    this.properties.getProperty(FILE_SYSTEM_LOCATION);
    System.out.println("NAMENODE: " + nameNode); // debug
    if (nameNode == null || nameNode.length() == 0) {
    nameNode = "local";
    }
    this.configuration =
    ConfigurationUtil.toConfiguration(this.properties);
    try {
    if (this.uri != null) {
    this.fs = FileSystem.get(this.uri, this.configuration);
    } else {
    this.fs = FileSystem.get(this.configuration);
    }
    } catch (IOException e) {
    e.printStackTrace(); // debug
    throw new RuntimeException("Failed to create DataStorage", e);
    }
    short defaultReplication = fs.getDefaultReplication();
    properties.setProperty(DEFAULT_REPLICATION_FACTOR_KEY,

    Short.valueOf(defaultReplication).toString()); }
    -----

    The run now looks like this:
    -----
    root:/opt/pig# bin/pig
    PIG_HOME: /opt/pig/bin/..
    PIG_CONF_DIR: /opt/pig/bin/../conf
    2010-10-27 10:18:18,728 [main] INFO  org.apache.pig.Main - Logging error
    messages to: /opt/pig/pig_1288167498720.log
    2010-10-27 10:18:18,940 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
    to hadoop file system at: hdfs://<MY NAMENODE>:8020/
    NAMENODE: hdfs://<MY NAMENODE>:8020/
    java.io.IOException: Call to <MY NAMENODE>/10.0.0.3:8020 failed on local
    exception: java.io.EOFException
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at
    org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
    at
    org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSyst
    em.java:82) at
    org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at
    org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.ja
    va:73) at
    org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.
    java:58) at
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecut
    ionEngine.java:212) at
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecut
    ionEngine.java:132) at
    org.apache.pig.impl.PigContext.connect(PigContext.java:183) at
    org.apache.pig.PigServer.<init>(PigServer.java:225)
    at org.apache.pig.PigServer.<init>(PigServer.java:214)
    at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
    at org.apache.pig.Main.run(Main.java:450)
    at org.apache.pig.Main.main(Main.java:107)
    Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at
    org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
    2010-10-27 10:18:19,124 [main] ERROR org.apache.pig.Main - ERROR 2999:
    Unexpected internal error. Failed to create DataStorage
    Details at logfile: /opt/pig/pig_1288167498720.log
    -----

    I have replaced the name of my server with <MY NAMENODE> in the above
    listing. BTW, this works as it should:
    # hadoop fs -ls hdfs://<MY NAMENODE>:8020/

    I would appreciate some pointers, I have no idea what is causing this...

    Anze
    On Wednesday 27 October 2010, Dmitriy Ryaboy wrote:
    The same way you have /etc/hadoop/conf on the claspath, you want to
    put the hbase conf directory on the classpath.

    -D
    On Tue, Oct 26, 2010 at 11:50 PM, Anze wrote:
    ... You have all the conf files in PIG_CLASSPATH right?
    I think I do:
    ***
    PIG_HOME: /opt/pig/bin/..
    PIG_CONF_DIR: /opt/pig/bin/../conf
    dry run:
    /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
    -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log
    -Dpig.home.dir=/opt/pig/bin/.. -
    Dpig.root.logger=INFO,console,DRFA -classpath
    /opt/pig/bin/../conf:/usr/lib/jvm/java-6-
    sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/p
    ig /bin/../build/test/classes:/opt/pig/bin/../pig-
    *-core.jar:/opt/pig/bin/../build/pig-0.8.0-
    SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbas
    e- 0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-
    test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
    org.apache.pig.Main ***

    Generated log file contains:
    ***
    Error before Pig is launched
    ----------------------------
    ERROR 2999: Unexpected internal error. Failed to create DataStorage

    java.lang.RuntimeException: Failed to create DataStorage

    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
    at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
    at org.apache.pig.PigServer.<init>(PigServer.java:225)
    at org.apache.pig.PigServer.<init>(PigServer.java:214)
    at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
    at org.apache.pig.Main.run(Main.java:450)
    at org.apache.pig.Main.main(Main.java:107)

    Caused by: java.io.IOException: Call to
    namenode.admundus.com/10.0.0.3:8020 failed on local exception:
    java.io.EOFException

    at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
    ... 9 more
    Caused by: java.io.EOFException

    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

    ==============================================================================

    And Pig complains:
    ***
    log4j:WARN No appenders could be found for logger
    (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.
    2010-10-27 08:46:44,762 [main] INFO  org.apache.pig.Main - Logging
    error messages to: /opt/pig/bin/pig_1288162004754.log
    2010-10-27 08:46:44,970 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
    Connecting to hadoop file system at: hdfs://...:8020/
    2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR 2999:
    Unexpected internal error. Failed to create DataStorage
    Details at logfile: /opt/pig/bin/pig_1288162004754.log
    ***

    Any idea what is wrong? I have searched the net and most answers talk
    about incompatible versions of Hadoop and Pig (but the posts are old).

    Thanks,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    Yeah pig 8 is not officially released yet, it will be cut at the end
    of the month or beginning of next month.

    Failed to create DataStorage sounds vaguely familiar.. can you send
    the full pig session and the full error? I think it's not connecting
    to hbase on the client-side, or something along those lines. You have
    all the conf files in PIG_CLASSPATH right?

    -D
    On Tue, Oct 26, 2010 at 6:32 AM, Anze wrote:
    Hmmm, not quite there yet. :-/

    I installed:
    - HBase 0.20.6
    - Cloudera CDH3b3 Hadoop (0.20.2)
    - Pig 0.8 (since official download is empty (?) I fetched the Pig
    trunk from SVN and built it)

    Now it complains about "Failed to create DataStorage". Any ideas?
    Should I upgrade Hadoop too?

    This is getting a bit complicated to install. :)

    I would appreciate some pointers - Google revealed nothing useful.

    Thanks,

    Anze
    On Tuesday 26 October 2010, Anze wrote:
    Great! :)

    Thanks for helping me out.

    All the best,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    I think that you might be able to get away with 20.2 if you don't
    use the filtering options.
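    For context, the "filtering options" are the loader arguments that push a row-key range down into the HBase scan. A hypothetical sketch of the kind of load to avoid on HBase 0.20.2 (option names as in the Pig 0.8 loader; table and column names are made up, so verify against your build):

    ```shell
    # Sketch of a filtered HBase load; the -gt/-lt row-key range options
    # are the ones that exercise the scanner bug fixed in HBase 0.20.6.
    # Table 'webcrawl' and column 'cf:url' are made-up examples.
    cat > /tmp/filtered_load_example.pig <<'EOF'
    raw = LOAD 'hbase://webcrawl'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'cf:url', '-loadKey -gt row100 -lt row200')
          AS (rowkey:chararray, url:chararray);
    EOF
    ```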
    On Mon, Oct 25, 2010 at 3:39 PM, Anze wrote:
    Dmitriy, thanks for the answer!

    The problem with upgrading to HBase 0.20.6 is that cloudera
    doesn't ship it yet and we would like to keep our install at
    "official" versions, even if beta. Of course, since this is a
    development / testing cluster, we could bend the rules if
    really necessary...

    I have written a small MR job (actually, just "M" job :) that
    exports the tables to files (allowing me to use Pig 0.7), but
    that is a bit cumbersome and slow.

    If I install the latest Pig (0.8), will it work at all with
    HBase 0.20.2? In other words, are scan filters (which were
    fixed in 0.20.6) needed as part of user-defined parameters or
    as part of Pig optimizations in reading from HBase? Hope my
    question makes sense...

    :)

    Thanks again,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    Anze, the reason we bumped up to 20.6 in the ticket was
    because HBase 20.2 had a bug in it. Ask the HBase folks, but
    I'd say you should upgrade.
    FWIW we upgraded to 20.6 from 20.2 a few months back and it's
    been working smoothly.

    The Elephant-Bird hbase loader for pig 0.6 does add row keys
    and most of the other features we added to the built-in
    loader for pig 0.8 (notably, it does not do storage). But I
    don't recommend downgrading to pig 0.6, as 7 and especially 8
    are great improvements to the software.
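    For the archives, here is a rough sketch of what that built-in Pig 0.8 loader looks like with row keys enabled. The table `webcrawl`, column family `cf`, and field names below are made-up examples, and the `-loadKey` option is the one discussed in PIG-915/PIG-1205, so double-check the exact spelling against your build:

    ```shell
    # Write a hypothetical Pig 0.8 script that loads an HBase table,
    # asking HBaseStorage to prepend the row key as the first field.
    cat > /tmp/load_hbase_example.pig <<'EOF'
    raw = LOAD 'hbase://webcrawl'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'cf:url cf:status', '-loadKey')
          AS (rowkey:chararray, url:chararray, status:chararray);
    DUMP raw;
    EOF
    # Then run it against the cluster with: bin/pig /tmp/load_hbase_example.pig
    ```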

    -D
    On Mon, Oct 25, 2010 at 7:01 AM, Anze wrote:
    Hi all!

    I am struggling to find a working solution to load data from
    HBase directly. I am using Cloudera CDH3b3 which comes with
    Pig 0.7. What would be the easiest way to load data from
    HBase? If it matters: we need the rows to be included, too.

    I have checked ElephantBird, but it seems to require Pig
    0.6. I could downgrade, but it seems... well... :)

    On the other hand, loading from HBase with rows is only
    added in Pig 0.8:
    https://issues.apache.org/jira/browse/PIG-915
    https://issues.apache.org/jira/browse/PIG-1205
    But judging from the last issue Pig 0.8 requires HBase
    0.20.6?

    I can install latest Pig from source if needed, but I'd
    rather leave Hadoop and HBase at their versions (0.20.2 and
    0.89.20100924 respectively).

    Should I write my own UDF? I'd appreciate some pointers.

    Thanks,

    Anze
  • Anze Skerlavaj at Oct 29, 2010 at 3:46 pm
    Dmitriy, thanks for answering! I will try it and post here how it goes...
    Right now I'm in a middle of Pig 0.7 session (I gave up and exported data from
    HBase to HDFS). Next week... :)

    Anze

    On Thursday 28 October 2010, Dmitriy Ryaboy wrote:
    It works with 20.2, and the error trace you pasted appears to be
    completely independent of HBaseStorage..

    I see that you are using the snapshot jar -- try putting your hadoop
    jars and various dependencies on your classpath, and only using the
    -nohadoop jar that pig also builds.
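    A sketch of that launch under assumed paths (the jar names below, including the "withouthadoop" artifact, are illustrative; check what your ant build actually produced under build/):

    ```shell
    # Build a classpath from the cluster's own Hadoop jar and conf dir,
    # plus the Pig jar built WITHOUT bundled Hadoop classes.
    # All paths here are assumptions for illustration.
    HADOOP_JAR=/usr/lib/hadoop/hadoop-core-0.20.2.jar
    PIG_JAR=/opt/pig/build/pig-0.8.0-SNAPSHOT-withouthadoop.jar
    CLASSPATH="/etc/hadoop/conf:$HADOOP_JAR:$PIG_JAR"
    echo "$CLASSPATH"   # sanity check before launching
    # java -cp "$CLASSPATH" org.apache.pig.Main
    ```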

    -D
    On Thu, Oct 28, 2010 at 1:42 AM, Anze wrote:
    Does anyone know, should Pig (0.8 - svn trunk) work with Hadoop 0.20.2?

    I still can't start the Pig...

    Thanks,

    Anze
    On Wednesday 27 October 2010, Anze wrote:
    Thanks, I guess I would trip over that later on - but for this immediate
    problem it doesn't help (of course, because Pig fails at the start, when
    I'm not working with HBase yet).

    I have tracked the error message to HDataStorage.init() and added some
    debugging info:
    -----
    public void init() {
        // check if name node is set, if not we set local as fail back
        String nameNode = this.properties.getProperty(FILE_SYSTEM_LOCATION);
        System.out.println("NAMENODE: " + nameNode); // debug
        if (nameNode == null || nameNode.length() == 0) {
            nameNode = "local";
        }
        this.configuration = ConfigurationUtil.toConfiguration(this.properties);
        try {
            if (this.uri != null) {
                this.fs = FileSystem.get(this.uri, this.configuration);
            } else {
                this.fs = FileSystem.get(this.configuration);
            }
        } catch (IOException e) {
            e.printStackTrace(); // debug
            throw new RuntimeException("Failed to create DataStorage", e);
        }
        short defaultReplication = fs.getDefaultReplication();
        properties.setProperty(DEFAULT_REPLICATION_FACTOR_KEY,
                Short.valueOf(defaultReplication).toString());
    }
    -----

    The run now looks like this:
    -----
    root:/opt/pig# bin/pig
    PIG_HOME: /opt/pig/bin/..
    PIG_CONF_DIR: /opt/pig/bin/../conf
    2010-10-27 10:18:18,728 [main] INFO org.apache.pig.Main - Logging error
    messages to: /opt/pig/pig_1288167498720.log
    2010-10-27 10:18:18,940 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
    Connecting to hadoop file system at: hdfs://<MY NAMENODE>:8020/
    NAMENODE: hdfs://<MY NAMENODE>:8020/
    java.io.IOException: Call to <MY NAMENODE>/10.0.0.3:8020 failed on local
    exception: java.io.EOFException
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:73)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
    at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
    at org.apache.pig.PigServer.<init>(PigServer.java:225)
    at org.apache.pig.PigServer.<init>(PigServer.java:214)
    at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
    at org.apache.pig.Main.run(Main.java:450)
    at org.apache.pig.Main.main(Main.java:107)
    Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
    2010-10-27 10:18:19,124 [main] ERROR org.apache.pig.Main - ERROR 2999:
    Unexpected internal error. Failed to create DataStorage
    Details at logfile: /opt/pig/pig_1288167498720.log
    -----

    I have replaced the name of my server with <MY NAMENODE> in the above
    listing. BTW, this works as it should:
    # hadoop fs -ls hdfs://<MY NAMENODE>:8020/

    I would appreciate some pointers, I have no idea what is causing this...

    Anze
    On Wednesday 27 October 2010, Dmitriy Ryaboy wrote:
    The same way you have /etc/hadoop/conf on the classpath, you want to
    put the hbase conf directory on the classpath.
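    Concretely, something like this before launching bin/pig (/etc/hbase/conf is an assumed location; use whichever directory holds your hbase-site.xml):

    ```shell
    # Prepend the HBase conf directory (the one containing hbase-site.xml)
    # so Pig's launcher script picks it up on the classpath.
    export PIG_CLASSPATH="/etc/hbase/conf:$PIG_CLASSPATH"
    echo "$PIG_CLASSPATH"
    ```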

    -D
    On Tue, Oct 26, 2010 at 11:50 PM, Anze wrote:
    ... You have all the conf files in PIG_CLASSPATH right?
    I think I do:
    ***
    PIG_HOME: /opt/pig/bin/..
    PIG_CONF_DIR: /opt/pig/bin/../conf
    dry run:
    /usr/lib/jvm/java-6-sun/bin/java -Xmx1000m
    -Dpig.log.dir=/opt/pig/bin/../logs -Dpig.log.file=pig.log
    -Dpig.home.dir=/opt/pig/bin/.. -Dpig.root.logger=INFO,console,DRFA
    -classpath /opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/etc/hadoop/conf:/opt/pig/bin/../build/classes:/opt/pig/bin/../build/test/classes:/opt/pig/bin/../pig-*-core.jar:/opt/pig/bin/../build/pig-0.8.0-SNAPSHOT.jar:/opt/pig/bin/../lib/automaton.jar:/opt/pig/bin/../lib/hbase-0.20.6.jar:/opt/pig/bin/../lib/hbase-0.20.6-test.jar:/opt/pig/bin/../lib/zookeeper-hbase-1329.jar
    org.apache.pig.Main ***

    Generated log file contains:
    ***
    Error before Pig is launched
    ----------------------------
    ERROR 2999: Unexpected internal error. Failed to create DataStorage

    java.lang.RuntimeException: Failed to create DataStorage

    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:212)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:132)
    at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
    at org.apache.pig.PigServer.<init>(PigServer.java:225)
    at org.apache.pig.PigServer.<init>(PigServer.java:214)
    at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
    at org.apache.pig.Main.run(Main.java:450)
    at org.apache.pig.Main.main(Main.java:107)

    Caused by: java.io.IOException: Call to
    namenode.admundus.com/10.0.0.3:8020 failed on local exception:
    java.io.EOFException

    at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
    ... 9 more
    Caused by: java.io.EOFException

    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

    ==============================================================================

    And Pig complains:
    ***
    log4j:WARN No appenders could be found for logger
    (org.apache.hadoop.conf.Configuration).
    log4j:WARN Please initialize the log4j system properly.
    2010-10-27 08:46:44,762 [main] INFO org.apache.pig.Main - Logging
    error messages to: /opt/pig/bin/pig_1288162004754.log
    2010-10-27 08:46:44,970 [main] INFO
    org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
    Connecting to hadoop file system at: hdfs://...:8020/
    2010-10-27 08:46:45,158 [main] ERROR org.apache.pig.Main - ERROR
    2999: Unexpected internal error. Failed to create DataStorage
    Details at logfile: /opt/pig/bin/pig_1288162004754.log
    ***

    Any idea what is wrong? I have searched the net and most answers
    talk about incompatible versions of Hadoop and Pig (but the posts
    are old).

    Thanks,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    Yeah pig 8 is not officially released yet, it will be cut at the
    end of the month or beginning of next month.

    Failed to create DataStorage sounds vaguely familiar.. can you send
    the full pig session and the full error? I think it's not
    connecting to hbase on the client-side, or something along those
    lines. You have all the conf files in PIG_CLASSPATH right?

    -D
    On Tue, Oct 26, 2010 at 6:32 AM, Anze wrote:
    Hmmm, not quite there yet. :-/

    I installed:
    - HBase 0.20.6
    - Cloudera CDH3b3 Hadoop (0.20.2)
    - Pig 0.8 (since official download is empty (?) I fetched the Pig
    trunk from SVN and built it)

    Now it complains about "Failed to create DataStorage". Any ideas?
    Should I upgrade Hadoop too?

    This is getting a bit complicated to install. :)

    I would appreciate some pointers - Google revealed nothing useful.

    Thanks,

    Anze
    On Tuesday 26 October 2010, Anze wrote:
    Great! :)

    Thanks for helping me out.

    All the best,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    I think that you might be able to get away with 20.2 if you
    don't use the filtering options.
    On Mon, Oct 25, 2010 at 3:39 PM, Anze wrote:
    Dmitriy, thanks for the answer!

    The problem with upgrading to HBase 0.20.6 is that cloudera
    doesn't ship it yet and we would like to keep our install at
    "official" versions, even if beta. Of course, since this is
    a development / testing cluster, we could bend the rules if
    really necessary...

    I have written a small MR job (actually, just "M" job :)
    that exports the tables to files (allowing me to use Pig
    0.7), but that is a bit cumbersome and slow.

    If I install the latest Pig (0.8), will it work at all with
    HBase 0.20.2? In other words, are scan filters (which were
    fixed in 0.20.6) needed as part of user-defined parameters
    or as part of Pig optimizations in reading from HBase? Hope
    my question makes sense...

    :)

    Thanks again,

    Anze
    On Tuesday 26 October 2010, Dmitriy Ryaboy wrote:
    Anze, the reason we bumped up to 20.6 in the ticket was
    because HBase 20.2 had a bug in it. Ask the HBase folks,
    but I'd say you should upgrade.
    FWIW we upgraded to 20.6 from 20.2 a few months back and
    it's been working smoothly.

    The Elephant-Bird hbase loader for pig 0.6 does add row
    keys and most of the other features we added to the
    built-in loader for pig 0.8 (notably, it does not do
    storage). But I don't recommend downgrading to pig 0.6, as
    7 and especially 8 are great improvements to the software.

    -D

    On Mon, Oct 25, 2010 at 7:01 AM, Anze <anzenews@volja.net>
    wrote:
    Hi all!

    I am struggling to find a working solution to load data
    from HBase directly. I am using Cloudera CDH3b3 which
    comes with Pig 0.7. What would be the easiest way to
    load data from HBase? If it matters: we need the rows to
    be included, too.

    I have checked ElephantBird, but it seems to require Pig
    0.6. I could downgrade, but it seems... well... :)

    On the other hand, loading from HBase with rows is only
    added in Pig 0.8:
    https://issues.apache.org/jira/browse/PIG-915
    https://issues.apache.org/jira/browse/PIG-1205
    But judging from the last issue Pig 0.8 requires HBase
    0.20.6?

    I can install latest Pig from source if needed, but I'd
    rather leave Hadoop and HBase at their versions (0.20.2
    and 0.89.20100924 respectively).

    Should I write my own UDF? I'd appreciate some pointers.

    Thanks,

    Anze

Discussion Overview
group: user
categories: pig, hadoop
posted: Oct 25, '10 at 2:01p
active: Oct 29, '10 at 3:46p
posts: 13
users: 3
website: pig.apache.org
