Thanks,
Devaraj.
On Mar 23, 2012, at 10:55 AM, Amir Sanjar wrote:
Devaraj,
I respectfully disagree. Fixing the testcase for IBM JVM will break SUN
JVM, unless we remove the assertion causing the problem. However, that might
in turn mask other problems. Before any changes we need to understand
the logic behind the split process. Is there any documentation? Who owned the
code?
Best Regards
Amir Sanjar
Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax# 512-838-8858
From: Devaraj Das <[email protected]>
To: [email protected]
Cc: Jeffrey J Heroux/Poughkeepsie/[email protected], John
Williams/Austin/[email protected]
Date: 03/23/2012 12:15 PM
Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
Thanks for the explanation, Kumar.
This looks like a testcase problem. We can get into a discussion on whether
we should tweak the split selection process to output more/less splits when
given an input set, but for now we should fix the testcase. Makes sense?
On Mar 23, 2012, at 7:26 AM, Kumar Ravi wrote:
Hi Devaraj,
The issue Amir brings up has to do with the Testcase scenario.
We are trying to determine if this is a design issue with the
getMoreSplits() method in the CombineFileInputFormat class or if the testcase
needs modification.
Like I mentioned in my earlier note, the observation we made while
debugging this issue is that the order in which the rackToBlocks HashMap
gets populated seems to matter. From the comments by Robert Evans and you,
it appears that by design the order should not matter.
Amir's point is - The reason order happens to play a role here is that as
soon as all the blocks are accounted for, getMoreSplits() stops iterating
through the racks, and depending upon which rack(s) each block is
replicated on, and depending upon when each rack is processed in the loop
within getMoreSplits(), one can end up with different split counts, and as
a result fail the testcase in some situations.
Specifically for this testcase, there are 3 simulated racks, each with
one datanode. Datanode 1 has replicas of all the blocks of all the 3 files
(file1, file2, and file3), while Datanode 2 has all the blocks of files
file2 and file3, and Datanode 3 has all the blocks of only file3. As soon
as Rack 1 is processed, getMoreSplits() exits with a split count equal to
the number of times it has gone through this loop. So in this scenario, if
Rack 1 gets processed last, one will end up with a split count of 3. If
Rack 1 gets processed first, the split count will be 1. The testcase is
expecting a return value of 3 but can get a 1 or 2 depending on when Rack 1
gets processed.
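The ordering effect described above can be reproduced with a small
standalone sketch. This is not the actual getMoreSplits() code; it is a
simplified model that assumes one block per file, one split per rack that
still has unassigned blocks, and a loop that stops once every block is
assigned. The class name RackOrderDemo and the helpers layout() and
countSplits() are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RackOrderDemo {

    // Builds the testcase layout (one block per file, an assumption made for
    // brevity) and returns it in the requested rack iteration order.
    static Map<String, List<String>> layout(String... rackOrder) {
        Map<String, List<String>> all = new HashMap<String, List<String>>();
        all.put("rack1", Arrays.asList("file1", "file2", "file3"));
        all.put("rack2", Arrays.asList("file2", "file3"));
        all.put("rack3", Arrays.asList("file3"));

        // LinkedHashMap iterates in insertion order, so the caller controls
        // the order in which racks are visited.
        Map<String, List<String>> ordered = new LinkedHashMap<String, List<String>>();
        for (String rack : rackOrder) {
            ordered.put(rack, all.get(rack));
        }
        return ordered;
    }

    // Simplified model of the loop: visit racks in map iteration order, emit
    // one split per rack that contributes at least one unassigned block, and
    // stop as soon as all blocks are accounted for.
    static int countSplits(Map<String, List<String>> rackToBlocks, int totalBlocks) {
        Set<String> assigned = new HashSet<String>();
        int splits = 0;
        for (Map.Entry<String, List<String>> entry : rackToBlocks.entrySet()) {
            if (assigned.size() == totalBlocks) {
                break; // every block already belongs to a split
            }
            boolean contributed = false;
            for (String block : entry.getValue()) {
                if (assigned.add(block)) {
                    contributed = true;
                }
            }
            if (contributed) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(layout("rack1", "rack2", "rack3"), 3)); // 1
        System.out.println(countSplits(layout("rack2", "rack1", "rack3"), 3)); // 2
        System.out.println(countSplits(layout("rack3", "rack2", "rack1"), 3)); // 3
    }
}
```

Under these assumptions the same data yields 1, 2, or 3 splits purely as a
function of the order in which the racks are visited, which matches the
range of results described for the testcase.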
Hope this clarifies things a bit.
Regards,
Kumar
Kumar Ravi
IBM Linux Technology Center
Austin, TX
Tel.: (512)286-8179
From: Devaraj Das <[email protected]>
To: [email protected]
Cc: Jeffrey J Heroux/Poughkeepsie/[email protected], John Williams/Austin/[email protected]
Date: 03/22/2012 04:41 PM
Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
On Mar 22, 2012, at 11:45 AM, Amir Sanjar wrote:
Thanks for the reply Robert,
However I believe the main design issue is:
If there is a rack (listed in rackToBlock hashMap) that contains all
blocks (stored in blockToNode hashMap), regardless of the order, the split
operation terminates after the rack gets processed. That means remaining
racks (listed in rackToBlock hashMap) will not get processed. For more
details look at file CombineFileInputFormat.java, method getMoreSplits(),
while loop starting at line 344.
I haven't looked at the code much yet. But trying to understand your
question - what issue are you trying to bring out? Is it overloading one
task with too much input (there is a min/max limit on that one though)?
Best Regards
Amir Sanjar
Linux System Management Architect and Lead
IBM Senior Software Engineer
Phone# 512-286-8393
Fax# 512-838-8858
From: Robert Evans <[email protected]>
To: "[email protected]" <[email protected]>
Date: 03/22/2012 11:57 AM
Subject: Re: Question about Hadoop-8192 and rackToBlocks ordering
If it really is the ordering of the hash map I would say no it should not,
and the code should be updated. If ordering matters we need to use a map
that guarantees a given order, and hash map is not one of them.
--Bobby Evans
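To illustrate the point about maps that guarantee an order, here is a
minimal sketch using only standard java.util classes. It shows that TreeMap
iterates in sorted key order and LinkedHashMap in insertion order, whereas
HashMap makes no such promise. The class name OrderedMapDemo, the helper
keysInIterationOrder(), and the rack names are hypothetical and are not
taken from the Hadoop code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderedMapDemo {

    // Inserts the keys in the given order and returns them in the order the
    // map iterates over them.
    static List<String> keysInIterationOrder(Map<String, Integer> map, String... insertionOrder) {
        for (String key : insertionOrder) {
            map.put(key, key.length());
        }
        return new ArrayList<String>(map.keySet());
    }

    public static void main(String[] args) {
        // TreeMap: sorted by key, independent of insertion order and of the JVM.
        System.out.println(keysInIterationOrder(
                new TreeMap<String, Integer>(), "/rack3", "/rack1", "/rack2"));
        // prints [/rack1, /rack2, /rack3]

        // LinkedHashMap: iteration follows insertion order.
        System.out.println(keysInIterationOrder(
                new LinkedHashMap<String, Integer>(), "/rack3", "/rack1", "/rack2"));
        // prints [/rack3, /rack1, /rack2]
    }
}
```

Either class would make iteration order reproducible across JVMs. Whether
the split logic itself should depend on that order is a separate question.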
On 3/22/12 7:24 AM, "Kumar Ravi" wrote:
Hello,
We have been looking at IBM JDK junit failures on Hadoop-1.0.1
independently and have run into the same failures as reported in this JIRA.
I have a question based upon what I have observed below.
We started debugging the problems in the testcase -
org.apache.hadoop.mapred.lib.TestCombineFileInputFormat
The testcase fails because the number of splits returned back from
CombineFileInputFormat.getSplits() is 1 when using IBM JDK whereas the
expected return value is 2.
So far, we have found the reason for this difference in number of splits is
because the order in which elements in the rackToBlocks hashmap get created
is in the reverse order that Sun JDK creates.
The question I have at this point is -- Should there be a strict dependency
in the order in which the rackToBlocks hashmap gets populated, to determine
the number of splits that should get created in a hadoop cluster? Is
this working as designed?
Regards,
Kumar