Lack of data locality in Hadoop-0.20.2
Hi,

I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
data on a 20-node cluster. HDFS is configured with a 128MB block size (so
1600 maps are created) and a replication factor of 1. All 20 nodes are also
HDFS datanodes. The bandwidth between nodes is limited to 50Mbps (configured
using Linux "tc"). I see that around 90% of the map tasks are reading data
over the network, i.e. most of the map tasks are not being scheduled on the
nodes where their input data is located.
My understanding was that Hadoop tries to schedule as many data-local maps
as possible, but that does not seem to happen here. Is there a reason why,
and is there a way to configure Hadoop to ensure the maximum possible node
locality?
Any help regarding this is very much appreciated.

Thanks,
Virajith
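[Editorial note: the map count quoted above follows directly from the stated sizes; a quick sketch (plain arithmetic, nothing Hadoop-specific):]

```python
# Where the 1600 maps come from: one map task per HDFS block.
GB = 1024 ** 3
MB = 1024 ** 2

input_size = 200 * GB   # 200GB Sort input
block_size = 128 * MB   # dfs.block.size

num_maps = input_size // block_size
print(num_maps)  # 1600
```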

  • Harsh J at Jul 12, 2011 at 12:55 pm
    How much of bandwidth did you see being utilized? What was the count
    of number of tasks launched as data-local map tasks versus rack local
    ones?

    A little bit of edge record data is always read over network but that
    is highly insignificant compared to the amount of data read locally (a
    whole block size, if available).

  • Virajith Jalaparti at Jul 12, 2011 at 1:07 pm
How do I find the number of data-local map tasks that are launched? I
checked the log files but didn't see any information about this. All the
map tasks are rack-local since I am running the job on a single rack. From
the completion time per map (compared to the case where I have 1Gbps of
bandwidth between the nodes, i.e. where network bandwidth is not a
bottleneck), I saw that more than 90% of the maps are actually reading data
over the network.

I understand that some maps may be launched as non-data-local tasks, but I
am surprised that around 90% of the maps are running as non-data-local
tasks.

I have not measured how much bandwidth was being used, but I think the full
50Mbps is.

    Thanks,
    Virajith

  • Harsh J at Jul 12, 2011 at 1:44 pm
    Virajith,

You can see the number of data-local vs. rack-local map tasks in the job's own counters.

  • Virajith Jalaparti at Jul 12, 2011 at 2:00 pm
    Harsh,

I am assuming you mean the web interface of the jobtracker, right? What I
see there is appended at the end of this email. Is there supposed to be a
counter equal to the number of data-local maps? One obvious way to find
this would be to look at the location of the input split of each mapper and
check whether it matches the node where the map task ran.

    Do I need to enable some config parameter to actually see the counter which
    shows the number of data-local tasks?

    Thanks
    Virajith


    ==================================================================================
    Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
    map      100.00%      1600        0         0         1600       0        3 / 46
    reduce   100.00%      20          0         0         20         0        0 / 1

    Counter                     Map              Reduce           Total
    --- Job Counters ---
    Launched reduce tasks       0                0                21
    Rack-local map tasks        0                0                1,649
    Launched map tasks          0                0                1,649
    --- FileSystemCounters ---
    FILE_BYTES_READ             215,256,891,609  494,340,016,724  709,596,908,333
    HDFS_BYTES_READ             215,481,828,554  0                215,481,828,554
    FILE_BYTES_WRITTEN          430,057,823,630  494,340,016,724  924,397,840,354
    HDFS_BYTES_WRITTEN          0                215,457,161,571  215,457,161,571
    --- Map-Reduce Framework ---
    Reduce input groups         0                20,369,713       20,369,713
    Combine output records      0                0                0
    Map input records           20,443,005       0                20,443,005
    Reduce shuffle bytes        0                214,894,166,095  214,894,166,095
    Reduce output records       0                20,443,005       20,443,005
    Spilled Records             40,886,010       46,997,605       87,883,615
    Map output bytes            214,913,316,171  0                214,913,316,171
    Map input bytes             215,457,082,591  0                215,457,082,591
    Map output records          20,443,005       0                20,443,005
    Combine input records       0                0                0
    Reduce input records        0                20,443,005       20,443,005
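[Editorial note: a quick back-of-the-envelope from these counters supports Harsh's point about edge records. The numbers below are copied from the table; the interpretation is added here:]

```python
# Counters from the job output above.
hdfs_bytes_read = 215_481_828_554   # total read from HDFS by the maps
map_input_bytes = 215_457_082_591   # logical map input
launched_maps = 1_649

extra = hdfs_bytes_read - map_input_bytes
print(extra)                   # 24,745,963 bytes over the whole job
print(extra // launched_maps)  # ~15 KB per map, vs. a 128 MB block
```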

  • Sudharsan Sampath at Jul 12, 2011 at 1:06 pm
What's the map task capacity of each node?
  • Virajith Jalaparti at Jul 12, 2011 at 1:08 pm
Each node is configured to run 8 map tasks. The machines are 2.4 GHz 64-bit
quad-core Xeons.

    -Virajith
  • Arun C Murthy at Jul 12, 2011 at 2:31 pm
    Why are you running with replication factor of 1?

    Also, it depends on the scheduler you are using. The CapacityScheduler in 0.20.203 (not 0.20.2) has much better locality for jobs, similarly with FairScheduler.

    IAC, running on a single rack with replication of 1 implies rack-locality for all tasks which, in most cases, is good enough.

    Arun
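[Editorial note: for reference, the scheduler Arun mentions is selected in mapred-site.xml. A sketch; the property name and class names below are as in stock 0.20-era releases and should be verified against your distribution:]

```xml
<!-- Replace the default FIFO scheduler with the fair scheduler.
     For the capacity scheduler, use
     org.apache.hadoop.mapred.CapacityTaskScheduler instead. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```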
  • Virajith Jalaparti at Jul 12, 2011 at 2:37 pm
I am using a replication factor of 1 since I don't want to incur the
overhead of replication and I am not much worried about reliability.

I am just using the default Hadoop scheduler (FIFO, I think!). In the case
of a single rack, rack-locality doesn't really mean anything: obviously
everything will run in the same rack. I am concerned about data-local maps.
I assumed that Hadoop would do a much better job of ensuring data-local
maps, but that doesn't seem to be the case here.

    -Virajith
  • Virajith Jalaparti at Jul 12, 2011 at 5:03 pm
I am attaching the config files I was using for these runs. I am not sure
whether something in them is causing this lack of data locality.

    Thanks,
    Virajith
  • Aaron Baff at Jul 12, 2011 at 5:11 pm
Well, if you think about it, you'll have more/better locality if more nodes hold the same blocks. That gives the scheduler more leeway to find a node with a block that hasn't been processed yet. Have you tried replication of 2 or 3 and seen what that does?

    --Aaron
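[Editorial note: Aaron's argument can be put in rough numbers. This is a deliberate simplification added for illustration: it ignores rack-aware placement and the fact that the scheduler considers many pending blocks per heartbeat:]

```python
# With R replicas placed on R distinct nodes out of N, the chance that one
# particular free node holds a copy of a given block is R/N, so each extra
# replica directly multiplies the scheduler's data-local options.
N = 20  # nodes in the cluster described in this thread
for R in (1, 2, 3):
    print(f"replication {R}: P(block is local to a given node) = {R / N:.2f}")
```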

    --------------------------------------------------------
    From: Virajith Jalaparti
    Sent: Tuesday, July 12, 2011 7:37 AM
    To: mapreduce-user@hadoop.apache.org
    Subject: Re: Lack of data locality in Hadoop-0.20.2

  • Arun C Murthy at Jul 12, 2011 at 5:15 pm
    As Aaron mentioned the scheduler has very little leeway when you have a single replica.

OTOH, schedulers equate rack-locality to node-locality - this makes sense for a large-scale system since intra-rack bandwidth is good enough for most installs of Hadoop.

    Arun
  • Virajith Jalaparti at Jul 12, 2011 at 5:28 pm
I agree that the scheduler has less leeway when the replication factor is
1. However, I would still expect the number of data-local tasks to be more
than 10% even with a replication factor of 1. Presumably, the scheduler
would have a greater number of opportunities to schedule data-local tasks
than just 10%. (Please note that I am inferring that a map was non-local
from its observed completion time. I don't know why, but the logs of my
jobs don't show the DATA_LOCAL_MAPS counter.)

    I will try using higher replication factors and see how much improvement I
    can get.

    Thanks,
    Virajith
  • Allen Wittenauer at Jul 12, 2011 at 6:20 pm

    How did you load your data?

    Did you load it from outside the grid or from one of the datanodes? If you loaded from one of the datanodes, you'll basically have no real locality, especially with a rep factor of 1.
  • Virajith Jalaparti at Jul 12, 2011 at 6:35 pm

I create the data using the randomwriter example. I essentially run the
example at http://wiki.apache.org/hadoop/Sort with the necessary
parameters:

% bin/hadoop jar hadoop-*-examples.jar randomwriter rand
% bin/hadoop jar hadoop-*-examples.jar sort rand rand-sort

    -Virajith
  • Virajith Jalaparti at Jul 12, 2011 at 10:21 pm
Is the non-data-local nature of the maps possibly due to the amount of HDFS
data read by each map being greater than the HDFS block size? In the job I
was running, the HDFS block size (dfs.block.size) was 134217728, while
HDFS_BYTES_READ by the maps was 134678218 and FILE_BYTES_READ was
134698338.

So HDFS_BYTES_READ is greater than dfs.block.size. Does this imply that
most of the map tasks will be non-local? Further, would Hadoop ensure that
the map task is scheduled on the node which has the larger chunk of the
data to be read by the task?

    Thanks,
    Virajith
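[Editorial note: plugging in the numbers from this message, arithmetic only; the locality question itself is addressed in the reply below:]

```python
block_size = 134_217_728       # dfs.block.size
hdfs_bytes_read = 134_678_218  # per-map HDFS_BYTES_READ

extra = hdfs_bytes_read - block_size
print(extra)                               # 460,490 bytes
print(round(extra / block_size * 100, 2))  # ~0.34% beyond one block
```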


  • Aaron Baff at Jul 12, 2011 at 10:36 pm
The number of bytes read can exceed the block size somewhat because blocks rarely start/end on a record (e.g. line) boundary. So a map usually needs to read a bit before and/or after the actual block boundary in order to correctly read all of the records it is responsible for. If you look, it's not having to read all that much extra data.

    --Aaron
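[Editorial note: Aaron's description can be sketched as follows. This is a toy model of how a line-oriented record reader assigns boundary-straddling records, illustrative only and not Hadoop's actual LineRecordReader:]

```python
# Convention: each split skips the partial record at its start (the
# previous split owns it) and reads past its end to finish its last record.
data = b"alpha\nbravo\ncharlie\ndelta\n"

def records_for_split(data, start, length):
    end = start + length
    pos = start
    if start != 0:
        # skip the tail of a record owned by the previous split
        pos = data.index(b"\n", start) + 1
    records = []
    while pos < end:                  # may read past `end` to finish a record
        nl = data.index(b"\n", pos)
        records.append(data[pos:nl].decode())
        pos = nl + 1
    return records

# two 13-byte "blocks": every record is read exactly once
print(records_for_split(data, 0, 13))   # ['alpha', 'bravo', 'charlie']
print(records_for_split(data, 13, 13))  # ['delta']
```

Note that the first split reads a few bytes past its boundary to finish "charlie", which is exactly the small HDFS_BYTES_READ overshoot discussed above.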

  • Matei Zaharia at Jul 12, 2011 at 6:24 pm
    Hi Virajith,

    The default FIFO scheduler just isn't optimized for locality for small jobs. You should be able to get substantially more locality even with 1 replica if you use the fair scheduler, although the version of the scheduler in 0.20 doesn't contain the locality optimization. Try the Cloudera distribution to get a 0.20-compatible Hadoop that does contain it.

    I also think your value of 10% inferred from completion time might be a little off, because you have quite a few more data blocks than nodes, so it should be easy to make the first few waves of tasks data-local. Try a version of Hadoop that correctly measures this counter.
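    The scheduler switch Matei suggests can be sketched as a mapred-site.xml fragment. The property names below are from the 0.20-era fair scheduler contrib and CDH3; treat the exact keys and the delay value as assumptions to verify against your distribution:

    ```xml
    <!-- Enable the fair scheduler and give it a locality delay, so a free
         map slot briefly waits for a task whose data is local before
         accepting a non-local one. -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.locality.delay</name>
      <value>5000</value> <!-- milliseconds to hold a slot hoping for a local map -->
    </property>
    ```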

    Matei
    On Jul 12, 2011, at 1:27 PM, Virajith Jalaparti wrote:

    I agree that the scheduler has less leeway when the replication factor is 1. However, I would still expect the number of data-local tasks to be more than 10% even with a replication factor of 1. Presumably, the scheduler would have a greater number of opportunities to schedule data-local tasks than just 10%. (Please note that I am inferring that a map was non-local from its observed completion time. I don't know why, but the logs of my jobs don't show the DATA_LOCAL_MAPS counter.)

    I will try using higher replication factors and see how much improvement I can get.
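    A back-of-envelope estimate (my own, not from the thread) of what replication alone buys, assuming uniform block placement and a scheduler that assigns each map to an arbitrary free node with no locality preference: the chance that the chosen node holds one of the block's replicas is roughly 1 - (1 - 1/N)^r for N nodes and r replicas.

    ```python
    # Probability a map lands on a node holding a replica of its block,
    # for a 20-node cluster with no locality-aware scheduling.
    nodes = 20
    p_rep1 = 1 / nodes                      # replication 1
    p_rep3 = 1 - (1 - 1 / nodes) ** 3       # replication 3
    print(f"{p_rep1:.0%}")  # 5%
    print(f"{p_rep3:.0%}")  # 14%
    ```

    So ~10% data-local maps is about what locality-blind scheduling predicts with a single replica; a locality-aware scheduler should do far better than either figure.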

    Thanks,
    Virajith

    On Tue, Jul 12, 2011 at 6:15 PM, Arun C Murthy wrote:
    As Aaron mentioned the scheduler has very little leeway when you have a single replica.

    OTOH, schedulers equate rack-locality to node-locality - this makes sense for a large-scale system since intra-rack b/w is good enough for most installs of Hadoop.

    Arun
    On Jul 12, 2011, at 7:36 AM, Virajith Jalaparti wrote:

    I am using a replication factor of 1 since I don't want to incur the overhead of replication, and I am not much worried about reliability.

    I am just using the default Hadoop scheduler (FIFO, I think!). In the case of a single rack, rack-locality doesn't really have any meaning; obviously everything will run in the same rack. I am concerned about data-local maps. I assumed that Hadoop would do a much better job at ensuring data-local maps, but that doesn't seem to be the case here.

    -Virajith

    On Tue, Jul 12, 2011 at 3:30 PM, Arun C Murthy wrote:
    Why are you running with replication factor of 1?

    Also, it depends on the scheduler you are using. The CapacityScheduler in 0.20.203 (not 0.20.2) has much better locality for jobs, similarly with FairScheduler.

    IAC, running on a single rack with replication of 1 implies rack-locality for all tasks which, in most cases, is good enough.

    Arun
    On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:

    Hi,

    I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data on a 20-node cluster. HDFS is configured to use a 128MB block size (so 1600 maps are created) and a replication factor of 1. All 20 nodes are also HDFS datanodes. I configured a bandwidth of 50Mbps between each pair of nodes (using Linux "tc"). I see that around 90% of the map tasks are reading data over the network, i.e. most of the map tasks are not being scheduled on the nodes where the data they process is located.
    My understanding was that Hadoop tries to schedule as many data-local maps as possible, but in this situation that does not seem to happen. Any reason why this is happening? And is there a way to configure Hadoop to ensure the maximum possible node locality?
    Any help regarding this is very much appreciated.

    Thanks,
    Virajith
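    The job sizing described above checks out, and it also shows why the throttled links make non-local maps so visible (my own illustrative arithmetic, assuming binary units and ignoring protocol overhead):

    ```python
    # 200 GB of input at a 128 MB block size -> number of map tasks.
    total_gb, block_mb = 200, 128
    maps = total_gb * 1024 // block_mb
    print(maps)  # 1600

    # Time to pull one remote 128 MB block over a 50 Mbit/s tc-shaped link.
    secs = block_mb * 8 / 50   # MB -> Mbit, divided by the link rate
    print(f"{secs:.1f}s")      # ~20.5s per non-local block
    ```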
  • Virajith Jalaparti at Jul 13, 2011 at 12:20 pm
    Hi Matei,

    Using the fair scheduler of the cloudera distribution seems to have (mostly)
    solved the problem. Thanks a lot for the suggestion.

    -Virajith

Discussion Overview
group: mapreduce-user
categories: hadoop
posted: Jul 12, '11 at 12:45p
active: Jul 13, '11 at 12:20p
posts: 19
users: 7
website: hadoop.apache.org...
irc: #hadoop
