FAQ
I have 5 slave servers and I would like to be rack-aware, meaning each
server represents one rack, resulting in 5 racks. I have looked around
for examples online but could not find anything concrete. Can someone
please show me an example of how to set this up?

TIA

  • Jeff Hammerbacher at Mar 3, 2010 at 5:38 am
    Hey,

    Have you looked at
    http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html#Hadoop+Rack+Awareness?
    If there's something lacking in that documentation, perhaps you can submit a
    patch with a more useful explanation.

    Regards,
    Jeff
    On Tue, Mar 2, 2010 at 7:43 PM, Mag Gam wrote:

    I have a 5 slave servers and I would like to be rackaware meaning each
    server represents 1 rack resulting in 5 racks. I have looked around
    for examples online but could not find anything concrete. Can someone
    please show me an example on how to set this up?

    TIA
  • Mag Gam at Mar 3, 2010 at 12:12 pm
Hello Jeff:

Yes, I have looked at this page many times.

    I guess I am a bit confused with this line, "The default
    implementation of the same runs a script/command configured using
    topology.script.file.name. If topology.script.file.name is not set,
    the rack id /default-rack is returned for any passed IP address."

Do I have to write a script that will convert my IP address into a
unique rack ID? Once we have the unique ID, will HDFS write blocks
according to it?

An example would be very helpful. There is only one paragraph about this,
but it's far too important not to have an example or two.




  • Allen Wittenauer at Mar 3, 2010 at 4:58 pm

    On 3/3/10 4:11 AM, "Mag Gam" wrote:
    An example would be very helpful. There is only 1 paragraph about this
    but its far too important not to have an example or two.
    I covered this in my preso to apachecon last year:

    http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile&do=view&
    target=aw-apachecon-eu-2009.pdf

    aka

    http://bit.ly/d3UU4A

    You might find the example in/out helpful. No code, but it is (seriously)
    trivial to write.
  • Neil Bliss at Mar 3, 2010 at 7:09 pm

    So in the "rack awareness" output, are there meaningful distance metrics
    that should be provided, or is it simply a "local rack vs. non-local rack"
    determination?

    thanks,

    Neil
  • Mag Gam at Mar 4, 2010 at 1:02 am
Thanks Allen! Your presentation is very nice!

    "If you don't provide a script for rack awareness, it treats every
    node as if it was its own rack". I am using the default settings and
    the report still says only 1 rack.


Do you mind sharing a script with us showing how you determine a rack,
and a sample <configuration> </configuration> syntax?

    TIA




  • Michael Thomas at Mar 4, 2010 at 1:09 am
    Hi,

From what I've observed, if you don't provide a script for rack
awareness, Hadoop treats all nodes as if they belong to the same rack.

Here is the section in hadoop-site.xml where we define our rack
awareness script:

<property>
  <name>topology.script.file.name</name>
  <value>/usr/bin/ip-to-rack.sh</value>
</property>


    This is the ip-to-rack.sh script that we use on our cluster. It assumes
    that the reverse lookup of the IP returns a hostname of the form
    'compute-x-y', where 'x' is the rack id. This type of naming is common
    for Rocks-managed clusters:

#!/bin/sh

# The default rule assumes that the nodes are connected to the PDU and
# switch located in the same rack. Only the exceptions need to be
# explicitly listed here.
for ip in $@ ; do
  hostname=`nslookup $ip | grep "name =" | awk '{print $4}' | sed -e 's/\.local\.$//'`
  case $hostname in
    compute-14-3) rack="/Rack15" ;;
    *)
      rack=`echo $hostname | sed -e 's/^[a-z]*-\([0-9]*\)-[0-9]*.*/\/Rack\1/'`
      ;;
  esac
  echo $rack
done

    Hope this helps,

    --Mike

  • Allen Wittenauer at Mar 4, 2010 at 5:22 am

    On 3/3/10 5:01 PM, "Mag Gam" wrote:

    Thanks Alan! Your presentation is very nice!
    Thanks. :)
    "If you don't provide a script for rack awareness, it treats every
    node as if it was its own rack". I am using the default settings and
    the report still says only 1 rack.
    Let's take a different approach to convince you. :)

    Think about the question: Is there a difference between all nodes in one
    rack vs. every node acting as a lone rack?

    The answer is no, there isn't any difference. In both cases, all copies of
    the blocks can go to pretty much any node. When a MR job runs, every node
    would either be considered 'off rack' or 'rack-local'.

    So there is no difference.

    Do you mind sharing a script with us on how you determine a rack? and
    a sample <configuration> </configuration> syntax?
    Michael has already posted his, so I'll skip this one. :)
  • Steve Loughran at Mar 4, 2010 at 2:21 pm

    Think Mag probably wanted a shell script.

Mag, give your machines IPv4 addresses that map to the rack number: 10.1.1.*
for rack one, 10.1.2.* for rack 2, and so on. Then just filter the IP
address by its top bytes, returning "10.1.1" for everything in rack one,
"10.1.2" for rack 2, and so on; Hadoop will be happy.
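
Steve's subnet-based suggestion can be sketched as a small shell script. This is only a sketch under the assumption that datanodes live in 10.1.<rack>.* subnets; the function name ip_to_rack is made up for illustration:

```shell
#!/bin/sh
# Sketch of the subnet-based mapping Steve describes. Assumes (for
# illustration only) that datanode IPs follow the pattern 10.1.<rack>.<host>.
ip_to_rack() {
    for node in "$@" ; do
        case "$node" in
            10.*.*.*)
                # The third octet is taken as the rack number.
                racknum=`echo "$node" | cut -d. -f3`
                echo "/rack$racknum"
                ;;
            *)
                # Anything outside the expected subnets falls back to the
                # same default Hadoop uses when no script is configured.
                echo "/default-rack"
                ;;
        esac
    done
}

# The namenode invokes the configured script with one or more node
# addresses as arguments; print one rack per argument, in order.
ip_to_rack "$@"
```

The loop matters: the namenode may pass several addresses in a single invocation (Michael's ip-to-rack.sh loops over $@ for the same reason), so the output must be one rack per argument, in order.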
  • Stephen Watt at Mar 4, 2010 at 3:25 pm
    Hi Folks

    I notice in 0.20.2 we now have a new jar in the lib called
    mockito-all-1.8.0 (mockito.org states it is a Java Test mock-up
    framework). In the interest of teaching this man to fish, in cases such as
    this, how can I determine the reason for the addition of this jar to the
release? I've taken a look at the JIRA tickets in the bundled
CHANGES.txt, but it's not immediately obvious without opening up every
single patch within every single mentioned ticket. Is there another, more
efficient means?

    Kind regards
    Steve Watt
  • Todd Lipcon at Mar 4, 2010 at 4:02 pm
    Hi Stephen,

    You can consult the svn logs for ivy.xml. The dependency on mockito was
    backported with HDFS-927 to enable those tests.

    -Todd
  • Allen Wittenauer at Mar 4, 2010 at 5:42 pm

You'll eventually get used to the release notes being incomplete. :/

    That said, they are still worlds ahead where we were in the early days when
    we didn't have any at all. :)
  • Mag Gam at Mar 4, 2010 at 11:39 pm
    Thanks everyone for explaining this to me instead of giving me RTFM!

    I will play around with it and see how far I get.


  • Jeff Hammerbacher at Mar 5, 2010 at 2:30 am
    Hey Mag,

    Glad you have solved the problem. I've created a JIRA ticket to improve the
    existing documentation: https://issues.apache.org/jira/browse/HADOOP-6616.
    If you have some time, it would be useful to hear what could be added to the
    existing documentation that would have helped you figure this out sooner.

    Thanks,
    Jeff
  • Mag Gam at Mar 18, 2010 at 3:35 am
    Well, I didn't really solve the problem. Now I have even more questions.

    I came across this script,
    http://wiki.apache.org/hadoop/topology_rack_awareness_scripts

    but it makes no sense to me! Can someone please try to explain what
it's trying to do?


Mike Thomas:

Your script isn't working for me. I think there are some syntax
errors. Is this how it's supposed to look: http://pastebin.ca/1844287

    thanks


  • Christopher Tubbs at Mar 18, 2010 at 7:15 am
    Hadoop will identify data nodes in your cluster by name and execute
    your script with the data node as an argument. The expected output of
    your script is the name of the rack on which it is located.

    The script you referenced takes the node name as an argument ($1), and
    crawls through a separate file looking up that node in the left
    column, and printing the value in the second column if it finds it.

    If you were to use this script, you would just create the topology
    file that lists all your nodes by name/ip on the left and the rack
    they are in on the right.
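
A minimal sketch of the lookup-table approach Christopher describes. The data-file path /etc/hadoop/topology.data, its contents, and the function name lookup_rack are illustrative assumptions, not from the wiki script itself:

```shell
#!/bin/sh
# Sketch of a topology script driven by a separate mapping file.
# The file would contain one "node rack" pair per line, e.g.:
#   192.168.0.1        /rack1
#   server2.mydomain   /rack1
#   192.168.0.3        /rack2
TOPOLOGY_FILE=${TOPOLOGY_FILE:-/etc/hadoop/topology.data}

lookup_rack() {
    for node in "$@" ; do
        # Print column 2 of the first line whose column 1 matches the node.
        rack=`awk -v n="$node" '$1 == n { print $2; exit }' "$TOPOLOGY_FILE"`
        if [ -n "$rack" ] ; then
            echo "$rack"
        else
            # Unknown nodes fall back to a default rack.
            echo "/default-rack"
        fi
    done
}

lookup_rack "$@"
```

The appeal of this design is that adding a node means editing one data file rather than the script; the tradeoff is that every node must be listed, whereas a subnet-based script needs no per-node maintenance.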
  • Mag Gam at Mar 19, 2010 at 12:42 am
    Chris:

This clears up my questions a lot! Thank you.

So, if I have 4 data servers and I want 2 racks, I can do this:

    #!/bin/bash
    #rack1.sh
    echo rack1

#!/bin/bash
    #rack2.sh
    echo rack2


    So, I can do this for 2 servers


    <property>
    <name>topology.script.file.name</name>
    <value>rack1.sh</value>
    </property>

    And for the other 2 servers, I can do this:


    <property>
    <name>topology.script.file.name</name>
    <value>rack2.sh</value>
    </property>


    correct?

  • Michael Thomas at Mar 19, 2010 at 1:11 am

    Incorrect. You only specify a single topology.script.file.name on the
    namenode. This attribute is ignored on the datanodes.

    Also be aware that the namenode invokes the script with the _ip address_
    of each datanode, not the hostname of each datanode. If you need to
    convert the ip address to a hostname in order to determine which rack
    the datanode is on, then you have to put that conversion logic in your
    topology.script.file.name.

    --Mike
  • Christopher Tubbs at Mar 19, 2010 at 5:01 am
    You only specify the script on the namenode.
    So, you could do something like:

#!/bin/bash
# rack_decider.sh

if [ "$1" = "server1.mydomain" -o "$1" = "192.168.0.1" ] ; then
    echo rack1
elif [ "$1" = "server2.mydomain" -o "$1" = "192.168.0.2" ] ; then
    echo rack1
elif [ "$1" = "server3.mydomain" -o "$1" = "192.168.0.3" ] ; then
    echo rack2
elif [ "$1" = "server4.mydomain" -o "$1" = "192.168.0.4" ] ; then
    echo rack2
else
    echo unknown_rack
fi
# EOF

    Of course, this is by far the most basic script you could have (I'm
    not sure why it wasn't offered as an example instead of a more
    complicated one).
    On Thu, Mar 18, 2010 at 8:41 PM, Mag Gam wrote:
    Chris:

    This clears up my questions a lot! Thankyou.

    So, if I have 4 data servers and I want 2 racks. I can do this

    #!/bin/bash
    #rack1.sh
    echo rack1

    #bin/bash
    #rack2.sh
    echo rack2


    So, I can do this for 2 servers


    <property>
    <name>topology.script.file.name</name>
    <value>rack1.sh</value>
    </property>

    And for the other 2 servers, I can do this:


    <property>
    <name>topology.script.file.name</name>
    <value>rack2.sh</value>
    </property>


    correct?

    On Thu, Mar 18, 2010 at 3:15 AM, Christopher Tubbs wrote:
    Hadoop will identify data nodes in your cluster by name and execute
    your script with the data node as an argument. The expected output of
    your script is the name of the rack on which it is located.

    The script you referenced takes the node name as an argument ($1), and
    crawls through a separate file looking up that node in the left
    column, and printing the value in the second column if it finds it.

    If you were to use this script, you would just create the topology
    file that lists all your nodes by name/ip on the left and the rack
    they are in on the right.
    On Wed, Mar 17, 2010 at 11:34 PM, Mag Gam wrote:
    Well,  I didn't really solve the problem. Now I have even more questions.

    I came across this script,
    http://wiki.apache.org/hadoop/topology_rack_awareness_scripts

    but it makes no sense to me! Can someone please try to explain what
    its trying to do?


    MikeThomas:

    Your script isn't working for me. I think there are some syntax
    errors. Is this how its supposed to look: http://pastebin.ca/1844287

    thanks


    On Thu, Mar 4, 2010 at 10:30 PM, Jeff Hammerbacher wrote:
    Hey Mag,

    Glad you have solved the problem. I've created a JIRA ticket to improve the
    existing documentation: https://issues.apache.org/jira/browse/HADOOP-6616.
    If you have some time, it would be useful to hear what could be added to the
    existing documentation that would have helped you figure this out sooner.

    Thanks,
    Jeff
    On Thu, Mar 4, 2010 at 3:39 PM, Mag Gam wrote:

    Thanks everyone for explaining this to me instead of giving me RTFM!

    I will play around with it and see how far I get.


    On Thu, Mar 4, 2010 at 9:21 AM, Steve Loughran wrote:
    Allen Wittenauer wrote:
    On 3/3/10 5:01 PM, "Mag Gam" wrote:

    Thanks Alan! Your presentation is very nice!
    Thanks. :)
    "If you don't provide a script for rack awareness, it treats every
    node as if it was its own rack". I am using the default settings and
    the report still says only 1 rack.
    Let's take a different approach to convince you. :)

    Think about the question:  Is there a difference between all nodes in
    one
    rack vs. every node acting as a lone rack?

    The answer is no, there isn't any difference.  In both cases, all copies
    of
    the blocks can go to pretty much any node. When a MR job runs, every
    node
    would either be considered 'off rack' or 'rack-local'.

    So there is no difference.

    Do you mind sharing a script with us on how you determine a rack? and
    a sample <configuration> </configuration> syntax?
    Michael has already posted his, so I'll skip this one. :)
    Think Mag probably wanted a shell script.

    Mag, give your machines IPv4 addresses that map to rack number: 10.1.1.*
    for rack one, 10.1.2.* for rack 2, etc. Then just filter the IP address
    by the top bytes, returning "10.1.1" for everything in rack one,
    "10.1.2" for rack 2; Hadoop will be happy.
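    Steve's scheme can be sketched as a tiny topology script (a hypothetical
    example, not one posted in this thread; Hadoop invokes the script with one
    or more node addresses and expects one rack path per line):

```shell
#!/bin/bash
# Hypothetical topology script for Steve's addressing scheme: the rack
# id is just the node's IPv4 address with the last octet stripped.
# Assumes arguments are dotted-quad IPs like 10.1.2.15.
rack_for() {
  echo "/${1%.*}"          # 10.1.2.15 -> /10.1.2
}

# Hadoop may pass several nodes at once; answer each on its own line.
for node in "$@"; do
  rack_for "$node"
done
```

    Hostnames would need a DNS lookup step first; this sketch only handles
    raw IP addresses.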
  • Mag Gam at Mar 19, 2010 at 11:32 am
    Thanks everyone. I think everyone can agree that this part of the
    documentation is lacking for Hadoop.

    Can someone please provide me a use case, for example:

    #server 1
    Input > script.sh
    Output > rack01

    #server 2
    Input > script.sh
    Output > rack02


    Is this how it's supposed to work? I am bad with bash, so I am trying
    to understand the logic so I can implement it in another language such
    as Tcl.

    On Fri, Mar 19, 2010 at 1:00 AM, Christopher Tubbs wrote:
    You only specify the script on the namenode.
    So, you could do something like:

    #!/bin/bash
    # rack_decider.sh

    if [ "$1" = "server1.mydomain" -o "$1" = "192.168.0.1" ] ; then
        echo rack1
    elif [ "$1" = "server2.mydomain" -o "$1" = "192.168.0.2" ] ; then
        echo rack1
    elif [ "$1" = "server3.mydomain" -o "$1" = "192.168.0.3" ] ; then
        echo rack2
    elif [ "$1" = "server4.mydomain" -o "$1" = "192.168.0.4" ] ; then
        echo rack2
    else
        echo unknown_rack
    fi
    # EOF

    Of course, this is by far the most basic script you could have (I'm
    not sure why it wasn't offered as an example instead of a more
    complicated one).
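    The same mapping can also be written more compactly with a case
    statement (a sketch using the same placeholder hostnames and IPs as
    Christopher's example, not an official script):

```shell
#!/bin/bash
# rack_decider.sh, case-statement variant of the if/elif example above.
# Hostnames and IPs are the placeholder values from the thread.
rack_decider() {
  case "$1" in
    server1.mydomain|192.168.0.1|server2.mydomain|192.168.0.2) echo rack1 ;;
    server3.mydomain|192.168.0.3|server4.mydomain|192.168.0.4) echo rack2 ;;
    *) echo unknown_rack ;;
  esac
}

rack_decider "$1"
```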
    On Thu, Mar 18, 2010 at 8:41 PM, Mag Gam wrote:
    Chris:

    This clears up my questions a lot! Thank you.

    So, if I have 4 data servers and I want 2 racks, I can do this:

    #!/bin/bash
    #rack1.sh
    echo rack1

    #!/bin/bash
    #rack2.sh
    echo rack2


    So, I can do this for 2 servers


    <property>
    <name>topology.script.file.name</name>
    <value>rack1.sh</value>
    </property>

    And for the other 2 servers, I can do this:


    <property>
    <name>topology.script.file.name</name>
    <value>rack2.sh</value>
    </property>


    correct?

    On Thu, Mar 18, 2010 at 3:15 AM, Christopher Tubbs wrote:
    Hadoop will identify data nodes in your cluster by name and execute
    your script with the data node as an argument. The expected output of
    your script is the name of the rack on which it is located.

    The script you referenced takes the node name as an argument ($1), and
    crawls through a separate file looking up that node in the left
    column, and printing the value in the second column if it finds it.

    If you were to use this script, you would just create the topology
    file that lists all your nodes by name/ip on the left and the rack
    they are in on the right.
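    That lookup-table approach can be sketched like this (the file name
    `topology.data` and the two-column format are assumptions for
    illustration; the actual wiki script differs in detail):

```shell
#!/bin/bash
# Sketch of a table-driven topology script: look the node up in a
# two-column data file (node name/IP, rack), falling back to
# /default-rack when no mapping is found, which mirrors Hadoop's own
# default for unmapped nodes.
TOPOLOGY_FILE=${TOPOLOGY_FILE:-topology.data}

lookup_rack() {
  local rack
  rack=$(awk -v node="$1" '$1 == node { print $2; exit }' "$TOPOLOGY_FILE")
  echo "${rack:-/default-rack}"
}

for node in "$@"; do
  lookup_rack "$node"
done
```

    A matching `topology.data` would simply list one `node rack` pair per
    line.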
  • Christopher Tubbs at Mar 19, 2010 at 1:36 pm
    More like the following (shown with the bash prompt). You could type
    this for testing. However, ultimately, hadoop itself will actually be
    executing this script and reading its output.

    $ ./script.sh server1.mydomain
    rack1
    $ ./script.sh server2.mydomain
    rack1
    $ ./script.sh server3.mydomain
    rack2
    $ ./script.sh server4.mydomain
    rack2
  • Allen Wittenauer at Mar 19, 2010 at 3:13 pm

    On 3/19/10 4:32 AM, "Mag Gam" wrote:

    Thanks everyone. I think everyone can agree that this part of the
    documentation is lacking for hadoop.

    Can someone please provide me a use case, for example:

    #server 1
    Input > script.sh
    Output > rack01

    #server 2
    Input > script.sh
    Output > rack02
    I think you have it in your head that the NameNode asks the DataNode what
    rack it is. This is completely backwards. The DataNode has *no* concept of
    what a rack is. It is purely a storage process. There isn't much logic in
    it at all.

    The topology script is *ONLY* run by the NameNode and JobTracker processes.
    That's it. It is not run on the compute nodes. That setting is completely
    *ignored* by the DataNode and TaskTracker processes.

    So to rewrite your use case:

    # NameNode
    Input > server 1
    Output > rack01

    # NameNode
    Input > server 2
    Output > rack02
    Is this how it's supposed to work? I am bad with bash, so I am trying
    to understand the logic so I can implement it in another language such
    as Tcl.

    The program logic is:

    Input -> IP address or Hostname
    Output -> /racklocation

    That's it. There is nothing fancy going on here.
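    Allen's contract, as a minimal sketch (the hostname/IP patterns and
    rack names here are placeholder assumptions, not values from the
    thread):

```shell
#!/bin/bash
# Minimal topology script honouring the contract Allen describes:
#   input:  an IP address or hostname (one or more arguments)
#   output: one /rack path per input, /default-rack when unknown
node_to_rack() {
  case "$1" in
    10.1.1.*|node1*) echo /rack01 ;;
    10.1.2.*|node2*) echo /rack02 ;;
    *)               echo /default-rack ;;
  esac
}

for node in "$@"; do
  node_to_rack "$node"
done
```

    Only the NameNode (and JobTracker) ever runs this; the DataNodes never
    see it.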
  • Mag Gam at Mar 20, 2010 at 12:55 pm
    Thanks!

    I managed to write a script which will give me rack01 or rack02
    depending on the IP address.

    Now, how would I check how many racks there are in my cluster? That's
    not documented anywhere.

    My intention, after I get all this working, is to submit a bug report
    to the documentation team so they can fix this.



  • Christopher Tubbs at Mar 20, 2010 at 1:46 pm
    I think hadoop's fsck will report the number of racks.
  • Michael Thomas at Mar 20, 2010 at 6:05 pm

    On 03/20/2010 05:55 AM, Mag Gam wrote:
    Thanks!

    I managed to write a script which will give me rack01, rack02
    depending on the ip address.

    Now, how would I check how many racks there are in my cluster? That's
    not documented anywhere.
    'hadoop fsck /' will report the # of racks at the end:

    [...]
    Number of data-nodes: 163
    Number of racks: 12

    You can verify the IP-to-rack mappings with 'hadoop dfsadmin -report':
    [...]
    Name: 10.3.255.144:50010
    Rack: /Rack12
    [...]
    Name: 10.3.255.62:50010
    Rack: /Rack16

    --Mike
  • Michael Thomas at Mar 19, 2010 at 1:21 am

    On 03/17/2010 08:34 PM, Mag Gam wrote:
    Well, I didn't really solve the problem. Now I have even more questions.

    I came across this script,
    http://wiki.apache.org/hadoop/topology_rack_awareness_scripts

    but it makes no sense to me! Can someone please try to explain what
    its trying to do?


    MikeThomas:

    Your script isn't working for me. I think there are some syntax
    errors. Is this how its supposed to look: http://pastebin.ca/1844287
    Not quite. A couple of lines got incorrectly wrapped. It should look
    like this:

    http://pastebin.ca/1845286

    --Mike
  • Michael Thomas at Mar 19, 2010 at 1:27 am

    On 03/18/2010 06:21 PM, Michael Thomas wrote:
    On 03/17/2010 08:34 PM, Mag Gam wrote:
    Well, I didn't really solve the problem. Now I have even more questions.

    I came across this script,
    http://wiki.apache.org/hadoop/topology_rack_awareness_scripts

    but it makes no sense to me! Can someone please try to explain what
    its trying to do?


    MikeThomas:

    Your script isn't working for me. I think there are some syntax
    errors. Is this how its supposed to look: http://pastebin.ca/1844287
    Not quite. A couple of lines got incorrectly wrapped. It should look
    like this:

    http://pastebin.ca/1845286
    One more miswrapped line. I hate auto-wrapping...

    http://pastebin.ca/1845290

    --Mike

  • Allen Wittenauer at Mar 3, 2010 at 4:57 pm

    On 3/2/10 7:43 PM, "Mag Gam" wrote:

    I have 5 slave servers and I would like to be rack-aware, meaning each
    server represents 1 rack, resulting in 5 racks. I have looked around
    for examples online but could not find anything concrete. Can someone
    please show me an example of how to set this up?

    You're in luck.

    If you don't provide a script for rack awareness, it treats every node as if
    it was its own rack.

Discussion Overview
group: general
categories: hadoop
posted: Mar 3, '10 at 3:43a
active: Mar 20, '10 at 6:05p
posts: 28
users: 9
website: hadoop.apache.org
irc: #hadoop
