Pig user mailing list, January 2011 (archived on Grokbase)
Hi everyone. In considering Pig for our HBase querying needs, I've run into a discrepancy between the size of Pig's result set and the size of the table being queried. I hope this is due to a misunderstanding of HBase and Pig on my part. However, the test case that generates the discrepancy is fairly simple.

The link below contains a Jython script which populates an HBase table with data in two column families. A corresponding Pig query retrieves data for one column and saves it to a CSV:

https://gist.github.com/766929
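
For readers without the gist handy, the query is of roughly this shape (a hedged sketch, not the gist's exact code; the table name 'test' matches the example command below, while the 'data' family and 'col_0' column names are illustrative):

-- load one column from one family, then write it out comma-separated
raw = LOAD 'hbase://test'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('data:col_0')
      AS (col_value:chararray);
STORE raw INTO 'test_output' USING PigStorage(',');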

The Jython script has the following usage:
jython hbase_test.py [table] [column count] [row count] [batch count]
This will populate a table named [table] with two column families. The first contains static data. The second contains the given number of columns, populated with data.
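
For reference, a minimal sketch of such a loader against the HBase 0.20 client API (not the actual gist script; the 'meta' family name comes from the notes below, while the 'data' family, row-key format, qualifier names, and the use of [batch count] as a flush interval are illustrative assumptions):

import sys
from random import randint
from org.apache.hadoop.hbase import HBaseConfiguration
from org.apache.hadoop.hbase.client import HTable, Put
from org.apache.hadoop.hbase.util import Bytes

table_name = sys.argv[1]
col_count, row_count, batch = int(sys.argv[2]), int(sys.argv[3]), int(sys.argv[4])

table = HTable(HBaseConfiguration(), table_name)
table.setAutoFlush(False)  # buffer puts client-side

for row in xrange(row_count):
    put = Put(Bytes.toBytes('row_%010d' % row))
    # first family: static data
    put.add(Bytes.toBytes('meta'), Bytes.toBytes('static'), Bytes.toBytes('static_value'))
    # second family: [column count] columns whose values carry a random int
    for col in xrange(col_count):
        put.add(Bytes.toBytes('data'), Bytes.toBytes('col_%d' % col),
                Bytes.toBytes('value_%d_%d' % (col, randint(0, 999999))))
    table.put(put)
    if row % batch == 0:
        table.flushCommits()  # assumed: flush once per batch
table.flushCommits()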

The Pig query will return an inaccurate number of results for certain table sizes and configurations, most notably with tables exceeding 1.8 million rows and with more than 2 columns in the queried column family, e.g.
jython hbase_test.py test 3 1800000 100000
For instance, if I execute the above command and the corresponding Pig query, the results number 905914. Note that if the table is re-populated and queried a second time, a different number results. If I run the query again without re-populating the table, I get the same number of results. The HBase shell returns an accurate row count.

Some notes on reproducing this issue (or not):

* If the Jython script doesn't populate the meta column family, the issue goes away with the same query.
* If the Jython script populates 2 columns instead of 3, the issue goes away with the same query.
* The size of the column key or its value may influence whether the issue occurs.
For instance, if I change the script to store 'value_%d' instead of 'value_%d_%d', retaining the random int, the issue goes away with the same query.

I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard using the stock Java that came with the OS. Attached is a log of the Pig console output. The error logs contain nothing of import.

Am I doing anything incorrectly? Is there a way I can work around this issue without compromising the column family being queried?

This appears to be a fairly simple case of Pig/HBase usage. Can anyone else reproduce the issue?

thanks,
Ian.

  • Dmitriy Ryaboy at Jan 5, 2011 at 10:24 pm
    That certainly sounds like a bug. I wonder if there is anything interesting
    in the HBase logs when you run the job that gets the wrong result?
  • Ian Stevens at Jan 6, 2011 at 5:49 pm

    On 2011-01-05, at 5:23 PM, Dmitriy Ryaboy wrote:

    That certainly sounds like a bug. I wonder if there is anything interesting
    in the HBase logs when you run the job that gets the wrong result?
    Hi Dmitriy. I've posted the corresponding master.log and zookeeper.log from about the time of the failed query. I restarted HBase before making the query, so there might be noise in the log associated with a restart.

    master.log: http://pastebin.com/VwiXZ9BB
    zookeeper.log: http://pastebin.com/CnFVyFT2

    I believe the logging level is set to DEBUG for both logs.

    Let me know if you need further logging.

    thanks,
    Ian.

  • Dmitriy Ryaboy at Jan 6, 2011 at 6:33 pm
    Do you happen to have the region server logs as well?
    The .out as well as .log

    D
  • Ian Stevens at Jan 6, 2011 at 6:54 pm
    The regionserver.out is empty. The regionserver.log contains only the following for the relevant time period:

    Thu Jan 6 12:19:57 EST 2011 Starting regionserver on istevens.syncapse.local
    ulimit -n 256
    2011-01-06 12:19:59,588 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Not starting a distinct region server because hbase.cluster.distributed is false

    Ian.
  • Dmitriy Ryaboy at Jan 9, 2011 at 12:05 am
    Ian, I looked through the code and I don't see how this could be happening...
    Just to make sure this isn't an HBase issue: can you run an equivalent
    Java MR program to count the rows? The shell count is sequential and doesn't
    use all the MapReduce machinery.

    The job you want to run is org.apache.hadoop.hbase.mapreduce.RowCounter in
    the hbase jar, I believe.
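
    From the command line that would be something like the following (a hedged
    example; the jar name and exact arguments vary by install, and running the
    job with no arguments prints its usage):

    hadoop jar $HBASE_HOME/hbase-0.20.6.jar rowcounter test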
  • Mr. Lukas at Jan 20, 2011 at 4:25 pm
    Hi Pig users,
    I'm also using Pig 0.8 together with HBase 0.20.6, and I think my problem is
    related to Ian's. When processing a table with millions of rows (stored in
    multiple regions), HBaseStorage won't scan the full table but reads only a
    few hundred records.

    The following minimal example reproduces my problem for such a table:

    REGISTER '/path/to/guava-r07.jar';
    SET DEFAULT_PARALLEL 30;
    items = LOAD 'hbase://some-table'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:column',
            '-caster HBaseBinaryConverter -caching 500 -loadKey')
        AS (key:bytearray, a_column:long);
    items = GROUP items ALL;
    item_count = FOREACH items GENERATE COUNT_STAR($1);
    DUMP item_count;

    Pig issues just one mapper, and I guess it scans just one region of the
    table. Or did I miss some fundamental configuration option?

    Best regards,
    Lukas
  • Dmitriy Ryaboy at Jan 21, 2011 at 12:42 am
    This is quite odd, because I do the same thing on a multi-million row table
    and get multiple regions... You do have multiple regions, right? What
    happens if you specify only the -loadKey parameter and none of the others?
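
    One quick way to check the region count is via the client API; a hedged
    Jython sketch against HBase 0.20 (the table name is assumed):

    from org.apache.hadoop.hbase import HBaseConfiguration
    from org.apache.hadoop.hbase.client import HTable

    # getRegionsInfo() maps each of the table's HRegionInfo entries
    # to the address of the server hosting that region
    table = HTable(HBaseConfiguration(), 'some-table')
    print 'regions:', len(table.getRegionsInfo())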
  • Mr. Lukas at Jan 24, 2011 at 9:08 am
    Hi Dmitriy,
    Sorry for the late reply, I was out of the office.
    Discarding the caster and caching options (i.e. using only the -loadKey
    option) does not change anything, except that some
    FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings are issued.
  • Mr. Lukas at Jan 25, 2011 at 1:53 pm
    Hello again,
    I just found something interesting in the logs:

    INFO org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat:
    setScan with ranges: 5192296858534827628530496329220096 -
    5192343374370748142029900260897474 ( 46515835920513499403931677378)

    But in my case, the range should rather run from 1020576114013268896970538800
    to 72576215356229636519498348368 (interpreting those numbers as the
    arbitrary-precision integer representation of the row keys).
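
    For reference, the conversion implied by that log line is roughly the
    following (a hedged illustration of printing row keys as
    arbitrary-precision integers, not the loader's actual code):

    def bytes_to_bigint(key_bytes):
        # interpret the raw key bytes as an unsigned big-endian integer;
        # '& 0xff' because byte values are signed in Java/Jython
        n = 0
        for b in key_bytes:
            n = (n << 8) | (b & 0xff)
        return n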

    Best regards,
    Lukas
  • Lukas at Jan 27, 2011 at 9:30 am
    I created a JIRA for this issue: https://issues.apache.org/jira/browse/PIG-1828

    Best,
    Lukas
