Maybe this message can solve your problem as well:
@Shi Yu:
Yes, there are built-in functions to get the input file Path in the Mapper
(you can use these for counters by putting the file name in the counter
name); however, there are some issues if you use MultipleInputs in your job.
Here's some sample code I wrote to work around the issue (execute it in a
Mapper):
// Needs: java.io.IOException, java.lang.reflect.Method,
// org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.FileSplit
Path filePath = null;
Object obj = reporter.getInputSplit();
if (!(obj instanceof FileSplit)) {
    // With MultipleInputs the split is a TaggedInputSplit, which wraps the
    // real FileSplit but does not expose it, so reach in with reflection.
    Class<?> clazz = obj.getClass();
    try {
        Method inputSplitMethod =
                clazz.getDeclaredMethod("getInputSplit", new Class[0]);
        inputSplitMethod.setAccessible(true);
        Object inputSplit = inputSplitMethod.invoke(obj, new Object[0]);
        if (inputSplit instanceof FileSplit) {
            filePath = ((FileSplit) inputSplit).getPath();
        }
    } catch (Exception e) {
        throw new IOException("Could not find input FileSplit in Mapper", e);
    }
} else {
    filePath = ((FileSplit) obj).getPath();
}
if (filePath == null) {
    throw new IOException("Could not find input FileSplit in Mapper");
}
if (LOG.isDebugEnabled()) LOG.debug("filePath: " + filePath);
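To feed that path into counters, as mentioned above, a one-line sketch with
the old-API Reporter (the "Input Files" group name is just illustrative):

    // Count one record per input file; shows up under the "Input Files" group.
    reporter.incrCounter("Input Files", filePath.getName(), 1L);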
Using Cloudera Hadoop 0.20.1+169.113
Subversion -r 6c765a47a9291470d3d8814c98155115d109d71
I also logged this with Cloudera; please vote for it if you want it fixed:
http://getsatisfaction.com/cloudera/topics/hadoop_getting_taggedinputsplit_instead_of_filesplit_with_multipleinputs
Cheers,
Matt
On 10/22/10 6:01 PM, "Shi Yu" wrote:
My belated thanks for the nice advice. I tried this, and it works. However,
to produce the line numbers I had to rescan the files, add the new line
numbers, and then resave them as new files. That took a long time because
the files are very big. Are there any built-in functions that could
automatically provide the current filename (if there are multiple files)
and the line numbers in Map/Reduce?
Shi
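For the filename half of that question, a minimal old-API sketch (mine, not
from the thread): the framework sets the "map.input.file" property on each
map task's JobConf, and the TextInputFormat key is a byte offset, which is
unique and ordered within a file, so it can stand in for a line number.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FileNameMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private String fileName;

        @Override
        public void configure(JobConf job) {
            // The old mapred API sets "map.input.file" per map task
            // (single InputFormat case, i.e. not MultipleInputs).
            fileName = job.get("map.input.file");
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // Key each line by "filename:byteOffset" as a stand-in for
            // a line number.
            out.collect(new Text(fileName + ":" + offset), line);
        }
    }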
On 2010-10-20 21:16, Hieu Khac Le wrote:
How about using the line number as the key and the string at that line as
the value?
-------
Please excuse typos and brief nature of this email sent from my mobile device
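A sketch of that idea (my code, not Hieu's): assuming each record was
pre-numbered as "lineNo<TAB>text" during the rescan Shi mentions above, a
mapper can re-key every line by its number and tag it with its source file.
The "fileA" substring test is a placeholder for however the two inputs are
actually distinguished.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TaggingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {
        private String tag; // "A|" or "B|", illustrative per-file tags

        @Override
        public void configure(JobConf job) {
            // Tag each record with its source file ("map.input.file", old API).
            String file = job.get("map.input.file");
            tag = (file != null && file.contains("fileA")) ? "A|" : "B|";
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<LongWritable, Text> out,
                        Reporter reporter) throws IOException {
            // Record format assumed: "lineNo<TAB>text" from the numbering pass.
            String[] parts = line.toString().split("\t", 2);
            out.collect(new LongWritable(Long.parseLong(parts[0])),
                        new Text(tag + (parts.length > 1 ? parts[1] : "")));
        }
    }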
On Oct 20, 2010, at 9:07 PM, Shi Yu wrote:
Hi,
I have a problem comparing two huge files (100G each) consisting of string
sequences. It is much like a text file comparison problem. I would like to
find out how many strings differ between these two files in their natural
order. Can this task be modeled as a map/reduce job? Currently I have no
idea how to control the map splits and make sure the two input streams in
one map task are pointing to the same positions in the files.
Shi
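Combining the quoted suggestions into a hedged sketch (again mine): once
both files carry explicit line numbers, a reducer receiving both tagged
values per line number can count the differences. This pairs with the
TaggingMapper above; the "A|"/"B|" tags and counter names are illustrative.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class CompareReducer extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable lineNo, Iterator<Text> values,
                           OutputCollector<LongWritable, Text> out,
                           Reporter reporter) throws IOException {
            String a = null, b = null;
            while (values.hasNext()) {
                String v = values.next().toString();
                if (v.startsWith("A|")) a = v.substring(2);
                else if (v.startsWith("B|")) b = v.substring(2);
            }
            if (a == null || b == null || !a.equals(b)) {
                // Tally mismatches in a counter; emit the pair for inspection.
                reporter.incrCounter("Compare", "DIFFERENT_LINES", 1L);
                out.collect(lineNo, new Text("A=" + a + " B=" + b));
            }
        }
    }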
On 2010-10-26 14:43, Oleg Ruchovets wrote:
Hi,
I am running a Hadoop job which manipulates ~4000 files (the files are
gzipped), and suppose one of these gz files was corrupted. From the web
console / log files I can see which task got the exception, but isolating
which file was corrupted is really hard. Is there a way to know which files
were processed by which Hadoop task?
Thanks in advance
Oleg.
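One hedged way to get at this with the old API (a sketch, not a built-in
report): log the current input file when each map task starts, so a failed
attempt's task log names the gz file it was reading. "map.input.file" is
the same property used in the sketches above.

    @Override
    public void configure(JobConf job) {
        // Written to the task's stderr log; a failing attempt's log will
        // therefore identify the input file it was processing.
        System.err.println("Processing input file: " + job.get("map.input.file"));
    }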