Hi all.
I have a query about an application I have been asked to implement in
Hadoop MapReduce. It is the concordance application.

Specification of the Concordance application.
------------------
Given: Text file containing English text in ASCII encoding. An integer N.

Find: For each sequence of words, up to length N, occurring in the
input file, the number of occurrences of that sequence in the text,
together with a list of start indices. Optionally, sequences with only
one occurrence should be omitted.
------------------
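Not part of the original spec, but the task above can be sketched in plain Java as follows. This is only an illustration of what is being computed (whitespace tokenization assumed; the class and method names are mine, not from the C or Hadoop implementations linked below):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConcordanceSketch {
    // For each word sequence of length 1..n occurring in the text,
    // record the word indices at which it starts.
    public static Map<String, List<Integer>> concordance(String text, int n) {
        String[] words = text.trim().split("\\s+");
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder seq = new StringBuilder();
            for (int len = 1; len <= n && start + len <= words.length; len++) {
                if (len > 1) seq.append(' ');
                seq.append(words[start + len - 1]);
                index.computeIfAbsent(seq.toString(), k -> new ArrayList<>())
                     .add(start);
            }
        }
        // The optional part of the spec: drop sequences occurring only once.
        index.values().removeIf(v -> v.size() < 2);
        return index;
    }
}
```

For the text "a b a b a" with N=2, this yields e.g. "a" starting at indices 0, 2, 4 and "a b" at 0, 2.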
An implementation in C can be found here:
http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/concordance.c

I have created an initial implementation in Hadoop MapReduce, found here:
http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/Concordance.java

Problem Statement:
Because of the application requirements, I felt that I had to restrict
the file input to just one mapper. This is so that words at the end of
one line could be part of a sequence with the beginning of the next
line. The same is true for paragraphs (the end of one paragraph should
be considered as part of a sequence, completed by the start of the
next paragraph).

For small files, this is not a problem. But when the input text file
is large, or N is 30 or more (see the application requirements), having
only one mapper is very costly. It also means that there are a number
of map output spills.

Small input file: http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/WaD2.txt
Large input file: http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/bible.txt

Try with N=30.

So I really have two things to ask:
- Feedback on my Hadoop implementation. Is it bad? Are my assumptions
about needing one mapper wrong?
- Any alternative suggestions on how one could implement the
Concordance application in MapReduce.

thanks,

Rob Stewart


  • Christian Søttrup at Nov 11, 2010 at 10:25 am
    Hi Rob,

    Generally, using MapReduce when you can only use one mapper is inefficient.

    How about you implement a custom file splitter? Then, where you split the
    file, you spill N-1 words across the boundary.

    E.g., your first two sentences are split with N=3 into:

    1)
    I have a query about a application I have been asked to implement in
    Hadoop MapReduce. It is

    2)
    Hadoop MapReduce. It is the concordance application.

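A rough plain-Java sketch of that spilling idea (the class and method names, and the fixed chunkSize, are illustrative; a real Hadoop version would live in a custom InputFormat/RecordReader):

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSplitter {
    // Split a word list into chunks of chunkSize words, carrying the
    // last n-1 words of each chunk into the next chunk so that no
    // sequence of up to n words is lost at a split boundary.
    public static List<List<String>> split(List<String> words, int chunkSize, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < words.size(); start += chunkSize) {
            // Each split covers its own chunk plus n-1 overlap words.
            int end = Math.min(start + chunkSize + (n - 1), words.size());
            splits.add(words.subList(start, end));
        }
        return splits;
    }
}
```

For ten words a..j with chunkSize=4 and n=3, this produces [a..f], [e..j], [i, j]: adjacent splits share n-1 = 2 words.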

    Cheers,
    Christian Søttrup


  • Rob Stewart at Nov 11, 2010 at 9:57 pm

    > How about you implement a custom filesplitter. Then where you split the
    > file, you spill N -1 word across the boundary.
    > I.e:
    > Your first two sentences are split with N=3 to:
    > 1)
    > I have a query about a application I have been asked to implement in
    > Hadoop MapReduce. It is
    > 2)
    > Hadoop MapReduce. It is the concordance application.
    Hi Christian, thanks for replying.

    I like the idea, but wouldn't it give me duplicate keys? e.g.
    <Hadoop MapReduce. It>
    <MapReduce. It is>

    These keys should only occur once, but with 2 mappers like this,
    both of these keys will occur twice: once in mapper 1 and again in
    mapper 2.

    Have I misunderstood your example?

    Cheers, Christian.

  • Christian Ulrik Søttrup at Nov 11, 2010 at 10:12 pm
    No, you've got it right, you will count double. It was just a quick hint :-)

    You can just spill over in one direction, or you could adjust the count,
    because you know when you are counting double (the end of all mappers
    except the last, and the beginning of all mappers except the first). The
    trickiness of the solution also depends on whether you are counting only
    sequences of exactly length N, or all sequences of length N and less.
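One way to sketch the "spill over in one direction" fix in plain Java: each mapper sees its own chunk plus the n-1 overlap words, but only emits sequences that start inside its own chunk, so no key is produced by two mappers (class, method, and parameter names here are illustrative, not from the thread's code):

```java
import java.util.ArrayList;
import java.util.List;

public class NoDoubleCount {
    // 'split' holds this mapper's ownWords words followed by up to n-1
    // overlap words spilled from the next chunk. Emit every sequence of
    // up to n words that STARTS inside the first ownWords positions;
    // sequences starting in the overlap belong to the next mapper.
    public static List<String> emit(List<String> split, int ownWords, int n) {
        List<String> keys = new ArrayList<>();
        int limit = Math.min(ownWords, split.size());
        for (int start = 0; start < limit; start++) {
            StringBuilder seq = new StringBuilder();
            for (int len = 1; len <= n && start + len <= split.size(); len++) {
                if (len > 1) seq.append(' ');
                seq.append(split.get(start + len - 1));
                keys.add(seq.toString());
            }
        }
        return keys;
    }
}
```

With words a b c d, n=2, and two mappers owning [a, b] and [c, d], mapper 1 gets [a, b, c] and emits a, "a b", b, "b c"; mapper 2 emits c, "c d", d. The two key sets are disjoint, so nothing is counted twice.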

    Cheers,
    Christian


Discussion Overview
group: common-user
categories: hadoop
posted: Nov 11, '10 at 1:42a
active: Nov 11, '10 at 10:12p
posts: 4
users: 2
website: hadoop.apache.org...
irc: #hadoop
