FAQ
Hello, all
I have hit the error "too many fetch failures" when I submit a big
job (e.g., more than 10,000 tasks). I know this error occurs when several
reducers are unable to fetch a given map output, but I'm sure the slaves
can contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I address that? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
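For reference, the two properties named above can be raised per-job with Hadoop's generic options, assuming the job's driver goes through ToolRunner/GenericOptionsParser. The jar, class, paths, and values below are illustrative placeholders, not recommendations:

```shell
# Hypothetical invocation; adjust names and values for your job.
# mapred.reduce.parallel.copies: copier threads per reduce (default 5, if memory serves)
# mapred.reduce.copy.backoff: backoff window before a fetch attempt is declared failed
hadoop jar my-job.jar MyJobClass \
  -D mapred.reduce.parallel.copies=10 \
  -D mapred.reduce.copy.backoff=600 \
  input/ output/
```

The same keys can instead be set cluster-wide in mapred-site.xml if you want them to apply to every job.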


  • Ted Dunning at Aug 19, 2009 at 7:44 am
    Which version of hadoop are you running?

    --
    Ted Dunning, CTO
    DeepDyve
  • Yang song at Aug 19, 2009 at 12:20 pm
    I'm sorry, the version is 0.19.1

  • Ted Dunning at Aug 19, 2009 at 6:17 pm
    I think I remember an issue in 0.19.1 where certain failures would
    cause this. Consider using an updated 0.19 release, or moving to 0.20.
  • Yang song at Aug 20, 2009 at 5:40 am
    Thank you, Ted. Upgrading the current cluster would be a huge amount of
    work, and we'd rather not do it. Could you tell me in detail how 0.19.1
    causes these failures? Thanks again.

  • Ted Dunning at Aug 20, 2009 at 6:26 am
    I think the problem I'm remembering was due to poor recovery from this
    condition. The underlying fault is likely poor connectivity between your
    machines. Test that all members of your cluster can reach all others on
    all ports used by Hadoop.

    See here for hints: http://markmail.org/message/lgafou6d434n2dvx
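A rough way to run that check, sketched under the assumption of 0.19-era default ports (adjust the host list and port numbers to your cluster; `nc` must be installed):

```shell
# Run from the master; conf/slaves is the usual 0.19-era node list.
# Ports are assumed defaults: 9000/9001 (namenode/jobtracker RPC),
# 50010 (datanode), 50060 (tasktracker HTTP, used for shuffle fetches).
for host in $(cat conf/slaves); do
  for port in 9000 9001 50010 50060; do
    if nc -z -w 3 "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
    fi
  done
done
```

Running the same loop from each slave, not just the master, matters here: it is the reducers on slave nodes that fetch map output over the tasktracker HTTP port.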
  • Jason Venner at Aug 20, 2009 at 6:59 am
    The number one cause of this is anything that makes the connection used
    to fetch a map output fail. I have seen:
    1) a firewall
    2) misconfigured IP addresses (i.e., the tasktracker attempting the
    fetch received an incorrect IP address when it looked up the name of the
    tasktracker holding the map segment)
    3) rarely, the HTTP server on the serving tasktracker being overloaded
    due to insufficient threads or listen backlog; this can happen when the
    number of fetches per reduce is large and the number of reduces or maps
    is very large

    There are probably other cases. This recently happened to me with 6000
    maps and 20 reducers on a 10-node cluster, which I believe was case 3
    above. Since I didn't actually need the reduce phase (I got my summary
    data via counters in the map phase), I never re-tuned the cluster.
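Cause (2) can be probed with something like the sketch below (hypothetical host list; assumes passwordless ssh from the master): every node should agree on what each tasktracker's hostname resolves to.

```shell
# Compare each node's idea of its own name and address; mismatches between
# /etc/hosts entries and DNS are a classic source of bad fetch addresses.
# (hostname -i is Linux-specific.)
for host in $(cat conf/slaves); do
  echo "== $host =="
  ssh "$host" 'hostname; hostname -i'
done
```

For case (3), the era-appropriate knob was the tasktracker's HTTP thread count (tasktracker.http.threads, default 40 if memory serves); treat the exact property name as an assumption for your version.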


    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.amazon.com/dp/1430219424?tag=jewlerymall
    www.prohadoopbook.com a community for Hadoop Professionals
  • Koji Noguchi at Aug 20, 2009 at 5:15 pm
    Probably unrelated to your problem, but in one extreme case I've seen,
    a user's job had large gzip inputs (non-splittable):
    20 mappers and 800 reducers, with each map outputting around 20G.
    Too many reducers were hitting a single node as soon as a mapper finished.

    I think we tried something like

    mapred.reduce.parallel.copies=1
    (to reduce the number of reducer copier threads)
    mapred.reduce.slowstart.completed.maps=1.0
    (so that each reducer would have all 20 finished mappers to pull from,
    instead of 800 reducers hitting one mapper node as soon as it finished)


    Koji
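Expressed as per-job generic options (assuming a ToolRunner-based driver; jar, class, and paths are hypothetical), the two settings Koji describes would look like:

```shell
# Note this goes the opposite direction from the original question:
# fewer copier threads, and reducers start only after all maps finish.
hadoop jar job.jar JobClass \
  -D mapred.reduce.parallel.copies=1 \
  -D mapred.reduce.slowstart.completed.maps=1.0 \
  input/ output/
```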
  • Arun C Murthy at Aug 19, 2009 at 4:32 pm
    I'd dig around a bit more to check whether it's caused by a specific
    set of nodes, i.e., are maps on specific tasktrackers failing in this
    manner?

    Arun
  • 谭东 at Aug 20, 2009 at 6:52 am
    The fewer reducers there are, the more data each reducer has to handle,
    the more network transfer each one performs, and the more likely any one
    reducer is to fail. So increase the number of reducers and try again.
    Discussion Overview
    group: common-user
    categories: hadoop
    posted: Aug 19, '09 at 5:23a
    active: Aug 20, '09 at 5:15p
    posts: 11
    users: 6
    website: hadoop.apache.org...
    irc: #hadoop
