So many unexpected "Lost task tracker" errors making the job to be killed
Hey there, I have a small cluster running on 0.20.2. Everything is
fine, but once in a while, when a job with a lot of map tasks is
running, I start getting the error:
Lost task tracker: tracker_cluster1:localhost.localdomain/
127.0.0.1:xxxxx
Before getting the error, the task attempt has been running for 7h
(when it normally takes 46 sec to complete). Sometimes another task
attempt is launched in parallel, takes 50 sec to complete, and so the
first one gets killed (the second one can even be launched on the same
task tracker and work). But in the end, I get so many "Lost task
tracker" errors that the job gets killed.
The job will end up with some of the task trackers blacklisted.
If I kill the "zombie tasks", remove the jobtracker and tasktracker pid
files, remove the userlogs, and stop/start mapred, everything works
fine again, but some days later the error happens again.
Any idea why this happens? Could it somehow be related to having too
many attempt folders in the userlogs (even though there is space left
on the device)?
Thanks in advance.

--
View this message in context: http://lucene.472066.n3.nabble.com/So-many-unexpected-Lost-task-tracker-errors-making-the-job-to-be-killed-Options-tp2917961p2917961.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
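
In 0.20.x the JobTracker marks a TaskTracker as lost when it receives no
heartbeat for mapred.tasktracker.expiry.interval (600000 ms, i.e. 10 minutes,
by default), and a hung attempt is normally reaped once it goes silent for
mapred.task.timeout; the parallel second attempt described above comes from
speculative execution. A minimal mapred-site.xml sketch of the relevant knobs
(the values shown are the stock defaults, for illustration only, not a fix):

<configuration>
  <!-- No heartbeat for this long and the JobTracker declares the
       TaskTracker lost (default 600000 ms = 10 min). -->
  <property>
    <name>mapred.tasktracker.expiry.interval</name>
    <value>600000</value>
  </property>

  <!-- A task attempt that neither reads input, writes output, nor
       reports status for this long is killed (default 600000 ms). -->
  <property>
    <name>mapred.task.timeout</name>
    <value>600000</value>
  </property>

  <!-- Speculative execution is what launches the duplicate attempts
       described above; it can be switched off while debugging. -->
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
</configuration>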


  • Shantian Purkad at May 9, 2011 at 5:01 pm
    I have been seeing this a lot on my cluster as well.

    This typically happens for me if there are many maps (more than 5000) in a job.

Here is my cluster summary:

    342316 files and directories, 94294 blocks = 436610 total. Heap Size is 258.12 MB / 528 MB (48%)

    Configured Capacity : 26.57 TB
    DFS Used : 19.52 TB
    Non DFS Used : 606.9 GB
    DFS Remaining : 6.46 TB
    DFS Used% : 73.46 %
    DFS Remaining% : 24.31 %
    Live Nodes : 6
    Dead Nodes : 0
    Decommissioning Nodes : 0
    Number of Under-Replicated Blocks : 57

    I am trying to load 1 TB of data from one table to another using Hive
    queries. It works well if I do it on smaller data sizes (around 500 GB at
    a time). It is a simple query: insert into the destination table (dynamic
    partition) select a, b, c from the source table.

    Any idea how I can get this working? (Would compressing the map output
    improve performance? See the config sketch after this post.) I have 8 maps
    and 6 reduces per node.


    Thanks and Regards,
    Shantian
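
For the dynamic-partition INSERT and the map-output compression mentioned
above, a minimal sketch of the usual Hive and MapReduce settings, assuming
stock Hive and Hadoop 0.20.x property names (values illustrative, not tuning
advice):

<!-- hive-site.xml (or per-session "set" commands): allow the
     dynamic-partition INSERT described above -->
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>

<!-- mapred-site.xml (or per-job): compress intermediate map output to
     shrink shuffle and spill I/O on jobs with thousands of maps -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

Compression mainly reduces shuffle traffic; it will not by itself stop
trackers from being declared lost, so the heartbeat-related settings sketched
earlier in the thread are still worth checking.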




