Hello sirs,

I'm having a bit of trouble with a Pig script that loads a data range from
HBase using org.apache.pig.backend.hadoop.hbase.HBaseStorage. Functionally
it works great and gives me the answer I need. Performance-wise it is
suffering from, I think, two things:

1) Only 4 mappers are launched initially to load the data from HBase
(this includes parsing and data munging of each record), and I notice the
task is very CPU intensive. What ends up happening is that the mappers
finish at very different times, roughly 1, 5, 15, and 45 minutes, so the
script is held up by the single mapper that has the most records to iterate
over. Is it possible to spawn more mappers for jobs reading from HBase?
What determines the number of mappers in this case?

2) Later in the job I reuse the same alias, which should already hold the
loaded HBase data, but Pig goes back to HBase and reads the data again
(again with only 4 mappers), so the latency from (1) is compounded. Is
there a way to explicitly tell Pig to keep certain temp files around so it
doesn't have to hammer the HBase cluster? (A rough sketch of what the
script looks like follows below.)
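
For illustration, here is a minimal sketch of the shape of the script; the
table name, column family, row-key range, and the ParseRecord UDF are just
stand-ins, not the real names:

    -- load a row-key range from a hypothetical 'events' table
    raw = LOAD 'hbase://events'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'cf:payload', '-loadKey true -gt 20110501 -lt 20110514')
          AS (rowkey:chararray, payload:chararray);

    -- CPU-heavy parsing/munging per record (ParseRecord is a hypothetical UDF)
    parsed = FOREACH raw GENERATE rowkey, ParseRecord(payload) AS record;

    -- two outputs that both consume 'parsed'; when they are not combined
    -- into one plan, each one triggers its own scan of the HBase range
    grouped = GROUP parsed BY rowkey;
    counts  = FOREACH grouped GENERATE group, COUNT(parsed);
    STORE counts INTO '/out/counts';
    STORE parsed INTO '/out/parsed';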

Note: the cluster has 5 nodes; 1 serves as master for everything, and the
other 4 run tasktrackers, regionservers, and datanodes. The data range
covers ~500k records at 1-2 KB per record.


Thanks!

--young

Note: the bottleneck might be the regionserver itself, since a single
regionserver might hold all the data in the specified range. A quick way to
confirm this would be to increase the number of map tasks and measure the
performance difference (or lack thereof). But because the task itself is
CPU intensive, it doesn't look I/O bound, so the regionserver shouldn't be
the bottleneck...


  • Dmitriy Ryaboy, May 15, 2011 at 3:19 am
    Young,
    The number of tasks is equal to the number of regions your query
    covers. It sounds like your regions are not well-balanced (at least
    for this application -- they may be equal in terms of bytes, but
    different in number of records or the amount of processing each record
    needs).

    As for recalculating the data, that's usually determined by Pig
    automatically. You can store your intermediate data to a location in
    HDFS, call exec to ensure it gets written, and load it back in; but
    really, Pig should be doing that for you automatically if you have
    multiquery turned on.
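
    Something like the following, with a made-up HDFS path and schema (the
    exact behavior of a bare exec can vary by Pig version, and multiquery is
    on by default unless the script is run with -M / -no_multiquery):

        -- persist the expensive intermediate result to HDFS once
        STORE parsed INTO '/tmp/parsed_events';

        -- force the statements above to run now
        -- (syntax/availability of a bare exec may vary by Pig version)
        exec;

        -- reload from HDFS so downstream statements do not rescan HBase
        parsed2 = LOAD '/tmp/parsed_events'
                  AS (rowkey:chararray, record:chararray);
        -- ...use parsed2 instead of parsed from here on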

    -D
