Hi,

I need to compare two huge files (100 GB each), each consisting of a
sequence of strings. It is much like a text file diff problem. I would
like to find out how many strings differ between the two files when they
are read in their natural order. Can this task be modeled as a map/reduce
job? Currently I have no idea how to control the map splits so that the
two input streams in one map task point to the same positions in the
two files.


Shi


  • Hieu Khac Le at Oct 21, 2010 at 2:16 am
    How about using the line number as the key and the string at that line as the value?

  • Shi Yu at Oct 23, 2010 at 1:01 am
    Belated thanks for the nice advice. I tried it and it works. However,
    to produce the line numbers I had to scan the files again, add the line
    numbers, and save the results as new files. That took a long time because
    the files are very big. Are there any built-in functions in Map/Reduce
    that automatically provide the current file name (if there are multiple
    input files) and the line numbers?

    Shi

    --
    Postdoctoral Scholar
    Institute for Genomics and Systems Biology
    Department of Medicine, the University of Chicago
    Knapp Center for Biomedical Discovery
    900 E. 57th St. Room 10148
    Chicago, IL 60637, US
    Tel: 773-702-6799
  • Matt Pouttu-Clarke at Oct 25, 2010 at 5:31 pm
    @Shi Yu:
    Yes, there are built-in functions to get the input file Path in the Mapper
    (you can use the path for counters by putting the file name in the counter
    name). However, there is an issue if your job uses MultipleInputs: the
    Mapper receives a TaggedInputSplit instead of a FileSplit. Here is some
    sample code I wrote to work around the issue (execute it in a Mapper):
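    // Needs: java.io.IOException, java.lang.reflect.Method,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.FileSplit
    Path filePath = null;
    Object obj = reporter.getInputSplit();
    if (!(obj instanceof FileSplit)) {
        // With MultipleInputs the split is a wrapper around the real split;
        // unwrap it by calling its non-public getInputSplit() via reflection.
        Class<?> clazz = obj.getClass();
        try {
            Method inputSplitMethod = clazz.getDeclaredMethod(
                    "getInputSplit", new Class[0]);
            inputSplitMethod.setAccessible(true);
            Object inputSplit = inputSplitMethod.invoke(obj, new Object[0]);
            if (inputSplit instanceof FileSplit) {
                filePath = ((FileSplit) inputSplit).getPath();
            }
        } catch (Exception e) {
            throw new IOException(
                    "Could not find input FileSplit in Mapper", e);
        }
    } else {
        // Normal case: the split is already a FileSplit.
        FileSplit fs = (FileSplit) obj;
        filePath = fs.getPath();
    }
    if (filePath == null) {
        // The wrapped split was not a FileSplit either.
        throw new IOException(
                "Could not find input FileSplit in Mapper");
    }
    if (LOG.isDebugEnabled()) LOG.debug("filePath: " + filePath);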

    Using Cloudera Hadoop 0.20.1+169.113
    Subversion -r 6c765a47a9291470d3d8814c98155115d109d71

    I also logged this with Cloudera, please vote for it if you want this fixed:
    http://getsatisfaction.com/cloudera/topics/hadoop_getting_taggedinputsplit_instead_of_filesplit_with_multipleinputs

    Cheers,
    Matt
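
The suggestion of keying each string by its line number can be sketched without any Hadoop dependency. The snippet below is only an in-memory illustration of the idea, not a Hadoop job: the class and method names (LineKeyedDiff, map, differs, countDifferences) are hypothetical, the "shuffle" is an ordinary sorted map, and it assumes the line numbers are already known for both files. A line number counts as different when the two files do not both supply the same string for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LineKeyedDiff {

    // "Map" step: emit (lineNumber, line) for every line of one input file.
    // The shuffle map stands in for the grouping a real job would do by key.
    static void map(List<String> lines, Map<Long, List<String>> shuffle) {
        long lineNumber = 0;
        for (String line : lines) {
            shuffle.computeIfAbsent(lineNumber, k -> new ArrayList<>()).add(line);
            lineNumber++;
        }
    }

    // "Reduce" step for one key: the line differs unless exactly two
    // values arrived (one per file) and they are equal.
    static boolean differs(List<String> values) {
        return values.size() != 2 || !values.get(0).equals(values.get(1));
    }

    // Runs both steps and counts the line numbers whose strings differ.
    static long countDifferences(List<String> fileA, List<String> fileB) {
        Map<Long, List<String>> shuffle = new TreeMap<>();
        map(fileA, shuffle);
        map(fileB, shuffle);
        long count = 0;
        for (List<String> values : shuffle.values()) {
            if (differs(values)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("ACGT", "TTGA", "CCCC");
        List<String> b = Arrays.asList("ACGT", "TTGG", "CCCC", "GGGG");
        // Line 1 differs and line 3 exists only in b.
        System.out.println(countDifferences(a, b)); // prints 2
    }
}
```

In a real job the count would more naturally be kept in a counter incremented by the reducer, and the two inputs would not need to be tagged by file because only equality of the two values matters.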

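On the line-number question: with Hadoop's TextInputFormat the key passed to the Mapper is the byte offset of the line within its file, not the line number. One possible way to get line numbers without rewriting the files, offered here as an assumption and not as a built-in feature, is a cheap first pass that counts the lines in each split, followed by a prefix sum over those counts ordered by split start offset. The sketch below shows only that arithmetic in plain Java; the names SplitLineNumbers, firstLineOfEachSplit, and globalLineNumber are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class SplitLineNumbers {

    // Input: split start offset in bytes -> number of lines counted in that
    // split by a first pass. Output: split start offset -> zero-based line
    // number of the first line in that split.
    static Map<Long, Long> firstLineOfEachSplit(Map<Long, Long> linesPerSplit) {
        Map<Long, Long> firstLine = new LinkedHashMap<>();
        long running = 0;
        // TreeMap iterates the splits in ascending offset order.
        for (Map.Entry<Long, Long> e : new TreeMap<>(linesPerSplit).entrySet()) {
            firstLine.put(e.getKey(), running);
            running += e.getValue();
        }
        return firstLine;
    }

    // Line number of the localIndex-th line (zero-based) read in a split.
    static long globalLineNumber(Map<Long, Long> firstLine,
                                 long splitStart, long localIndex) {
        return firstLine.get(splitStart) + localIndex;
    }

    public static void main(String[] args) {
        Map<Long, Long> counts = new TreeMap<>();
        counts.put(0L, 3L);    // split starting at byte 0 holds 3 lines
        counts.put(100L, 2L);  // split starting at byte 100 holds 2 lines
        counts.put(220L, 4L);  // split starting at byte 220 holds 4 lines
        Map<Long, Long> firstLine = firstLineOfEachSplit(counts);
        System.out.println(firstLine);
        System.out.println(globalLineNumber(firstLine, 220L, 1L));
    }
}
```

This only works if the counting pass and the comparison pass see the same splits for a given file, and the table has to be built once per input file, using the file path from the Mapper to pick the right one.
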
Discussion Overview
group: common-user @
categories: hadoop
posted: Oct 21, '10 at 2:07a
active: Oct 25, '10 at 5:31p
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
