Hi all.
I have a query about a application I have been asked to implement in
Hadoop MapReduce. It is the concordance application.
Specification of the Concordance application.
------------------
Given: Text file containing English text in ASCII encoding. An integer N.
Find: For all sequences of words, up to length N, occurring in the
input file, the number of occurrences of this sequence in the text,
together with a list of start indices. Optionally, sequences with only
1 occurrence should be omitted.
------------------
An implementation, in C can be found here:
http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/concordance.c
I have created an initial implementation in Hadoop MapReduce, found here:
http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/Concordance.java
Problem Statement:
Because of the application requirements, I felt that I had to restrict
the fileinput to just one mapper. This is so that words at the end of
one line could be part of a sequence with the beginning of the next
line. The same is true for paragraphs (the end of one paragraph should
be considered as part of a sequence, completed by the start of the
next paragrah.
For small files, this is not a problem. But when the input text file
is large, or N=30 or more (see application requirements), the fact
that there is only one mapper is very costly. It also means that there
are a number of map output spills.
small input file: http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/WaD2.txt
Large input file: http://www.macs.hw.ac.uk/~rs46/dropbox/concordance/bible.txt
Try with N=30.
So I really have two things to ask:
- Feedback on my Hadoop implementation. Is it bad? Are my presumptions
about one mapper, wrong?
- Any alternative suggestions on how one could implement the
Concordance application in MapReduce.
thanks,
Rob Stewart