Hello,

We have been looking at IBM JDK JUnit failures on Hadoop 1.0.1
independently and have run into the same failures as reported in this JIRA.
I have a question based upon what I have observed below.

We started debugging the problems in the testcase
org.apache.hadoop.mapred.lib.TestCombineFileInputFormat.
The testcase fails because the number of splits returned from
CombineFileInputFormat.getSplits() is 1 when using the IBM JDK, whereas the
expected return value is 2.

So far, we have found that the reason for this difference in the number of
splits is that the elements in the rackToBlocks HashMap are created in the
reverse of the order in which the Sun JDK creates them.

The question I have at this point is: should there be a strict dependency
on the order in which the rackToBlocks HashMap gets populated to determine
the number of splits that should get created in a Hadoop cluster? Is this
working as designed?

Regards,
Kumar
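The JDK-dependent behavior described above comes from the fact that java.util.HashMap makes no guarantee about iteration order, so different JVM implementations may walk the entries differently. A minimal sketch (not Hadoop code; the map contents are made up) contrasting HashMap with the two standard maps that do guarantee an order:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderDemo {
    // HashMap's iteration order is unspecified and may differ between JDK
    // implementations (e.g. Sun vs IBM). LinkedHashMap preserves insertion
    // order, and TreeMap iterates in sorted key order, on every JVM.
    public static String iterate(Map<String, Integer> m) {
        StringBuilder sb = new StringBuilder();
        for (String k : m.keySet()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(k);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> insertion = new LinkedHashMap<>();
        insertion.put("rack3", 1);
        insertion.put("rack1", 2);
        insertion.put("rack2", 3);
        // Insertion order is preserved regardless of JVM vendor.
        System.out.println(iterate(insertion));                 // rack3 rack1 rack2
        // TreeMap yields sorted key order regardless of insertion order.
        System.out.println(iterate(new TreeMap<>(insertion)));  // rack1 rack2 rack3
    }
}
```

If code depends on a stable traversal order, swapping the HashMap for one of these is the usual fix.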


  • Robert Evans at Mar 22, 2012 at 4:55 pm
    If it really is the ordering of the hash map, then I would say no, it should not, and the code should be updated. If ordering matters, we need to use a map that guarantees a given order, and HashMap is not one of them.

    --Bobby Evans

  • Amir Sanjar at Mar 22, 2012 at 6:47 pm
    Thanks for the reply, Robert.
    However, I believe the main design issue is this: if there is a rack
    (listed in the rackToBlocks HashMap) that contains all the blocks
    (stored in the blockToNodes HashMap), then regardless of the order, the
    split operation terminates after that rack gets processed. That means the
    remaining racks (listed in the rackToBlocks HashMap) will not get
    processed. For more details, look at the file CombineFileInputFormat.java,
    method getMoreSplits(), while loop starting at line 344.

    Best Regards
    Amir Sanjar

    Linux System Management Architect and Lead
    IBM Senior Software Engineer
    Phone# 512-286-8393
    Fax# 512-838-8858
  • Devaraj Das at Mar 22, 2012 at 9:41 pm
    I haven't looked at the code much yet. But trying to understand your question: what issue are you trying to bring out? Is it overloading one task with too much input (though there is a min/max limit on that)?

  • Kumar Ravi at Mar 23, 2012 at 2:28 pm
    Hi Devaraj,

    The issue Amir brings up has to do with the testcase scenario.

    We are trying to determine whether this is a design issue with the
    getMoreSplits() method in the CombineFileInputFormat class or whether the
    testcase needs modification.
    As I mentioned in my earlier note, the observation we made while
    debugging this issue is that the order in which the rackToBlocks HashMap
    gets populated seems to matter. From the comments by Robert Evans and you,
    it appears that, by design, the order should not matter.

    Amir's point is that the reason order happens to play a role here is that
    as soon as all the blocks are accounted for, getMoreSplits() stops
    iterating through the racks. Depending upon which rack(s) each block is
    replicated on, and upon when each rack is processed in the loop within
    getMoreSplits(), one can end up with different split counts, and as a
    result fail the testcase in some situations.

    Specifically for this testcase, there are 3 simulated racks, each with one
    datanode. Datanode 1 has replicas of all the blocks of all 3 files (file1,
    file2, and file3), Datanode 2 has all the blocks of file2 and file3, and
    Datanode 3 has all the blocks of only file3. As soon as Rack 1 is
    processed, getMoreSplits() exits with a split count equal to the number of
    iterations it has spent in this loop. So in this scenario, if Rack 1 gets
    processed last, one will end up with a split count of 3; if Rack 1 gets
    processed at the beginning, the split count will be 1. The testcase
    expects a return value of 3 but can get 1 or 2 depending on when Rack 1
    gets processed.
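    This mechanism can be modeled with a simplified, standalone sketch (the class and method names here are illustrative, not the actual CombineFileInputFormat code): visit racks in the map's iteration order, emit one combined split per rack from its not-yet-assigned blocks, and stop once every block is covered. With the rack-to-blocks layout from the testcase described above, the count depends entirely on where the rack holding all blocks appears in the iteration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SplitOrderSketch {
    // Simplified model of the rack loop: one split per rack from its blocks
    // that are not yet assigned; terminate as soon as all blocks are covered.
    public static int countSplits(Map<String, Set<String>> rackToBlocks) {
        Set<String> remaining = new HashSet<>();
        for (Set<String> blocks : rackToBlocks.values()) {
            remaining.addAll(blocks);
        }
        int splits = 0;
        for (Set<String> blocks : rackToBlocks.values()) {
            if (remaining.isEmpty()) break;    // all blocks accounted for
            Set<String> mine = new HashSet<>(blocks);
            mine.retainAll(remaining);         // blocks not yet in any split
            if (mine.isEmpty()) continue;
            splits++;                          // one combined split per rack
            remaining.removeAll(mine);
        }
        return splits;
    }

    public static void main(String[] args) {
        // rack1 holds replicas of every block, mirroring the testcase layout.
        Map<String, Set<String>> racks = new LinkedHashMap<>();
        racks.put("rack1", new HashSet<>(Arrays.asList("b1", "b2", "b3")));
        racks.put("rack2", new HashSet<>(Arrays.asList("b2", "b3")));
        racks.put("rack3", new HashSet<>(Arrays.asList("b3")));
        System.out.println(countSplits(racks)); // rack1 first -> 1 split

        Map<String, Set<String>> reversed = new LinkedHashMap<>();
        reversed.put("rack3", new HashSet<>(Arrays.asList("b3")));
        reversed.put("rack2", new HashSet<>(Arrays.asList("b2", "b3")));
        reversed.put("rack1", new HashSet<>(Arrays.asList("b1", "b2", "b3")));
        System.out.println(countSplits(reversed)); // rack1 last -> 3 splits
    }
}
```

    Processing rack2 first yields the in-between count of 2, which matches the three possible outcomes described above.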

    Hope this clarifies things a bit.

    Regards,
    Kumar


    Kumar Ravi
    IBM Linux Technology Center
    Austin, TX

    Tel.: (512)286-8179




  • Devaraj Das at Mar 23, 2012 at 5:10 pm
    Thanks for the explanation, Kumar.

    This looks like a testcase problem. We can get into a discussion on whether we should tweak the split selection process to output more/less splits when given an input set, but for now we should fix the testcase. Makes sense?
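    One way a testcase fix could avoid the ordering dependency entirely (a hypothetical sketch, not the actual change made; the helper name is made up) is to assert a property that holds under any iteration order, e.g. that the returned splits partition the block set, instead of asserting an exact split count:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrderIndependentCheck {
    // Order-independent invariant: together the splits cover every block,
    // and no block appears in more than one split.
    public static boolean isPartition(List<Set<String>> splits,
                                      Set<String> allBlocks) {
        Set<String> seen = new HashSet<>();
        for (Set<String> split : splits) {
            for (String block : split) {
                if (!seen.add(block)) return false; // block in two splits
            }
        }
        return seen.equals(allBlocks);              // every block covered
    }
}
```

    Such an assertion passes for split counts of 1, 2, or 3 as long as the splits are correct, so it is stable across JVMs.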

  • Amir Sanjar at Mar 23, 2012 at 5:56 pm
    Devaraj,
    I respectfully disagree. Fixing the testcase for the IBM JVM will break
    the Sun JVM, unless we remove the assertion causing the problem; however,
    that might in turn mask other problems. Before making any changes, we need
    to understand the logic behind the split process. Is there any
    documentation? Who owns the code?


    Best Regards
    Amir Sanjar

    Linux System Management Architect and Lead
    IBM Senior Software Engineer
    Phone# 512-286-8393
    Fax# 512-838-8858






  • Devaraj Das at Mar 23, 2012 at 6:16 pm
    Amir, let's continue the discussion on the JIRA (now that we have got into disagreements :-) ).

    Thanks,
    Devaraj.

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 22, 2012 at 4:45 PM
active: Mar 23, 2012 at 6:16 PM
posts: 8
users: 5
website: hadoop.apache.org...
irc: #hadoop
