FAQ
Hi,
I am running a Hadoop job which manipulates ~4000 files (the files are gzipped), and
suppose one of these gz files is corrupted. From the web console / log files I can see
which task got the exception, but isolating which file was corrupted is
really hard. Is there a way to know which files were processed by which Hadoop
task?

Thanks in advance
Oleg.


  • Shi Yu at Oct 26, 2010 at 7:59 pm
    Maybe this message can solve your problem as well:

    @Shi Yu:
    Yes, there are built-in functions to get the input file Path in the Mapper
    (you can use these for counters by putting the file name in the counter
    name); however, there are some issues if you use MultipleInputs in your job.
    Here's some sample code I wrote to work around the issue (execute it in a
    Mapper):
    // Imports assumed: java.io.IOException, java.lang.reflect.Method,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.FileSplit.
    Path filePath = null;
    Object obj = reporter.getInputSplit();
    if (!(obj instanceof FileSplit)) {
        // MultipleInputs wraps the real split in a package-private
        // TaggedInputSplit, so use reflection to reach the underlying FileSplit.
        Class<?> clazz = obj.getClass();
        try {
            Method inputSplitMethod = clazz.getDeclaredMethod(
                    "getInputSplit", new Class[0]);
            inputSplitMethod.setAccessible(true);
            Object inputSplit = inputSplitMethod.invoke(obj, new Object[0]);
            if (inputSplit instanceof FileSplit) {
                filePath = ((FileSplit) inputSplit).getPath();
            }
        } catch (Exception e) {
            throw new IOException(
                    "Could not find input FileSplit in Mapper", e);
        }
    } else {
        FileSplit fs = (FileSplit) obj;
        filePath = fs.getPath();
    }
    if (filePath == null) {
        throw new IOException(
                "Could not find input FileSplit in Mapper");
    }
    if (LOG.isDebugEnabled()) {
        LOG.debug("filePath: " + filePath);
    }

    Using Cloudera Hadoop 0.20.1+169.113
    Subversion -r 6c765a47a9291470d3d8814c98155115d109d71

    I also logged this with Cloudera, please vote for it if you want this fixed:
    http://getsatisfaction.com/cloudera/topics/hadoop_getting_taggedinputsplit_instead_of_filesplit_with_multipleinputs

    Cheers,
    Matt
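
    For the common case without MultipleInputs, where the split handed to the task
    really is a FileSplit, here is a minimal sketch (assuming the old
    org.apache.hadoop.mapred API; the class name and counter group below are
    illustrative, not from the thread) of reading the input file name from the split
    and bumping a per-file counter, so a corrupted gzip can be traced from the job's
    counter page:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FileTracingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Without MultipleInputs the split is a plain FileSplit.
            String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            // One counter per input file: the job's counter page then shows
            // which file each (failed or successful) task was reading.
            reporter.incrCounter("InputFiles", fileName, 1);
            output.collect(new Text(fileName), value);
        }
    }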

    On 10/22/10 6:01 PM, "Shi Yu" wrote:


    My late thanks for the nice advice. I have tried this, and it works. However,
    to produce the line numbers I had to rescan the files, add new line
    numbers, and then resave them as new files. It took a long time because
    they are very big. Are there any built-in functions that could
    automatically provide the current file name (if there are multiple files)
    and the line numbers in Map/Reduce?

    Shi
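
    There is no built-in global line number, but the file name and a per-record
    position are available without rescanning: with TextInputFormat the map key is
    already the byte offset of the line within its file, and the old mapred API sets
    the file name in the map.input.file job property. A minimal sketch (the class and
    variable names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class OffsetTaggingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private String fileName;

        @Override
        public void configure(JobConf job) {
            // Set by the framework for FileSplit-based inputs in the old API.
            fileName = job.get("map.input.file");
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // The key is the byte offset of the line in its file, not a line
            // number, but it identifies the position uniquely.
            output.collect(new Text(fileName + ":" + offset), line);
        }
    }
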
    On 2010-10-20 21:16, Hieu Khac Le wrote:

    How about using the line number as the key and the string at that line as
    the value?

    -------
    Please excuse typos and brief nature of this email sent from my mobile device
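
    Building on that suggestion, a minimal sketch of the reduce side of the comparison
    (old mapred API; it assumes each file has already been given explicit line numbers,
    as described above, and that the mappers emit the line number as the key and each
    value prefixed with a file tag "A" or "B" plus a tab; the tags and counter names
    are illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class LineCompareReducer extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {

        public void reduce(LongWritable lineNumber, Iterator<Text> values,
                           OutputCollector<LongWritable, Text> output, Reporter reporter)
                throws IOException {
            String fromA = null;
            String fromB = null;
            while (values.hasNext()) {
                // Each value is "<tag>\t<line>", where the tag names the source file.
                String tagged = values.next().toString();
                int tab = tagged.indexOf('\t');
                String tag = tagged.substring(0, tab);
                String line = tagged.substring(tab + 1);
                if ("A".equals(tag)) {
                    fromA = line;
                } else {
                    fromB = line;
                }
            }
            // Count and emit the positions where the two files disagree
            // (or where one file has no line at all).
            if (fromA == null || !fromA.equals(fromB)) {
                reporter.incrCounter("Compare", "DIFFERENT_LINES", 1);
                output.collect(lineNumber, new Text(fromA + " <> " + fromB));
            }
        }
    }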

    On Oct 20, 2010, at 9:07 PM, Shi Yu wrote:

    Hi,

    I have a problem comparing two huge files (100G each) consisting of string
    sequences. It is essentially a text-file comparison problem. I would like to
    find out how many strings differ between these two files in their natural
    order. Can this task be modeled as a map/reduce job? Currently I have no
    idea how to control the map splits and make sure the two input streams in
    one map task point to the same positions in the files.


    Shi
    On 2010-10-26 14:43, Oleg Ruchovets wrote:
    Hi,
    I am running a Hadoop job which manipulates ~4000 files (the files are gzipped), and
    suppose one of these gz files is corrupted. From the web console / log files I can see
    which task got the exception, but isolating which file was corrupted is
    really hard. Is there a way to know which files were processed by which Hadoop
    task?

    Thanks in advance
    Oleg.
  • Matt Pouttu-Clarke at Oct 27, 2010 at 2:54 pm
    Hi Oleg,

    We name the output directories according to a standard (currently using
    UUIDs) and then use the directory name to carry metadata like this.

    You can also look at the job log.

    Cheers,
    Matt
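
    A minimal sketch of that convention at job-submission time (old mapred API; the
    paths, job name, and class name are illustrative, not the actual setup described
    above): the output directory is named with a freshly generated UUID, so the
    directory name alone identifies which run produced its contents.

    import java.util.UUID;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitWithUuidOutput {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitWithUuidOutput.class);
            conf.setJobName("gz-processing");

            // Name the output directory with a UUID so the directory itself
            // records which run produced it; the mapping from UUID to inputs
            // can be kept in the job log or a side file.
            String runId = UUID.randomUUID().toString();
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1], runId));

            JobClient.runJob(conf);
        }
    }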

    On 10/26/10 12:43 PM, "Oleg Ruchovets" wrote:

    Hi,
    I am running a Hadoop job which manipulates ~4000 files (the files are gzipped), and
    suppose one of these gz files is corrupted. From the web console / log files I can see
    which task got the exception, but isolating which file was corrupted is
    really hard. Is there a way to know which files were processed by which Hadoop
    task?

    Thanks in advance
    Oleg.
