Grokbase Groups Pig user May 2013
Hi,

I have a very weird issue with my Pig script. Following is the content of
my script:

REGISTER /home/hadoopuser/Workspace/lib/piggybank.jar;
REGISTER /home/hadoopuser/Workspace/lib/datafu.jar;
REGISTER /opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/hbase/hbase-0.94.2-cdh4.2.1-security.jar;
REGISTER /opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/lib/zookeeper/zookeeper-3.4.5-cdh4.2.1.jar;

SET default_parallel 15;

records = LOAD 'hbase://dm-re' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('v:ctm v:src',
        '-caching 5000 -gt 1366098805& -lt 1366102543&')
    AS (time:chararray, company:chararray);

records_iso = FOREACH records GENERATE
    org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO(
        time, 'yyyy-MM-dd HH:mm:ss Z') AS iso_time;
records_group = GROUP records_iso ALL;
result = FOREACH records_group GENERATE MAX(records_iso.iso_time) AS maxtime;
DUMP result;

When I try to run this script on a cluster of 5 nodes with 20 map slots,
most of the map tasks fail with the following error after about 10 minutes
of initializing:

Task attempt <id> failed to report status for 600 seconds. Killing!

I tried decreasing the caching size to 100 or less (on the intuition that
fetching and processing a larger cache takes more time), but the issue
stays the same. However, if I restrict the rows loaded (using -lt and -gt)
so that the number of map tasks is <= 2, the job finishes successfully.
When the number of tasks is > 2, it is always the case that 2-4 tasks
complete and the rest all fail with the above error. I attach the task
tracker log for this attempt; I don't see any errors except for some
ZooKeeper connection warnings. I manually checked from that node, and
'hbase zkcli' connects without any issue, so I assume ZooKeeper is
configured properly.
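
For reference, the reduced-caching run only changed the -caching value (the
number of rows the HBase scanner fetches per round trip) in the load
statement, along these lines; the alias and the exact value shown here are
illustrative:

records_small_cache = LOAD 'hbase://dm-re' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('v:ctm v:src',
        '-caching 100 -gt 1366098805& -lt 1366102543&')
    AS (time:chararray, company:chararray);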

I don't really understand where to start debugging this problem. It would
be great if someone could provide assistance. Some configurations of the
cluster that I think may be relevant:
dfs.block.size = 1 GB
io.sort.mb = 1 GB
HRegion size = 1 GB
The size of the HBase table is close to 250 GB. I have observed 100% CPU
usage by the mapred user on a node while a task is executing. I am not
really sure what to optimize in this case for the job to complete. It would
be good if someone could throw some light in this direction.

PS: All the nodes in the cluster are EBS-backed Amazon EC2 instances.


--
Regards,
Praveen Bysani
http://www.praveenbysani.com


  • Cheolsoo Park at May 14, 2013 at 10:29 pm
    Hi,

    Sounds like your mappers are overloaded. Can you try the following?

    1. You can set mapred.max.split.size to a smaller value, so more mappers
    can be launched.

    or

    2. You can set mapred.task.timeout to a larger value. The default value is
    600 seconds.
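
    Both settings can go at the top of the Pig script itself. A minimal
    sketch with illustrative values, assuming an MRv1 cluster that honors
    these property names (note that mapred.task.timeout is specified in
    milliseconds):

    SET mapred.max.split.size 134217728;  -- 128 MB splits, so more mappers
    SET mapred.task.timeout 1800000;      -- 30 minutes; the default is 600000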

    Thanks,
    Cheolsoo


