I'm having a bit of trouble with a Pig script that loads a range of data from
HBase using org.apache.pig.backend.hadoop.hbase.HBaseStorage. Functionally it
works great and gives me the answer I need, but performance-wise it is
suffering because of, I think, two things:
1) Only 4 mappers are launched initially to load the data from HBase (which
includes parsing, munging each record, etc.), and I notice the task is very
CPU intensive. What ends up happening is that the mappers finish at very
different times, roughly 1, 5, 15, and 45 minutes, so the script is held up
by the single mapper that has to iterate over the most records. Is it
possible to spawn more mappers for jobs running over HBase? What determines
the number of mappers in this case? (A rough sketch of the script's shape is
below, after (2).)
2) Later in the script I use the same alias, which should already hold the
loaded HBase data, but Pig decides to go back to HBase and grab the data
again (again with only 4 mappers), so the latency mentioned in (1) is
compounded. Is there a way to explicitly tell Pig to keep certain temp files
around so it doesn't have to hammer the HBase cluster a second time?
Note: the cluster has 5 nodes; 1 serves as the master for everything, and
the other 4 run the tasktrackers, regionservers, and datanodes. The scanned
range is ~500k records at 1-2 KB per record.
Note: the bottleneck might be the regionserver itself, since it may hold all
of the data in the requested range. A quick way to confirm this would be to
increase the number of map tasks and measure the performance difference (or
lack thereof). But because the task itself is CPU intensive, it doesn't seem
to be I/O bound, so the regionserver shouldn't be the bottleneck.
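
For that experiment, all I had in mind was hinting a higher map task count
and re-running, something like the line below in the script (or passing
-Dmapred.map.tasks=16 on the pig command line). I'm not sure the HBase input
format even honors this hint, which is partly what I'm asking in (1):

-- not sure this is honored for an HBase-backed LOAD, but this is the knob I meant
SET mapred.map.tasks 16;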