Hi Dermot,
Off-topic, so I hope no one minds if I reply.
Perl is good at manipulating text strings, but that doesn't usually help
search engine implementations. A search engine (or information
retrieval system) has to be fast, and once it has tokenized the document
collection or query, it is basically comparing integers (i.e., it uses a
lookup table that maps an integer to a word in a dictionary). Actually,
even during the initial mapping, a C-style strcmp would be sufficient. I
doubt a fast search engine would actually perform string matching using
regular expressions.
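
To make that concrete, here is a rough sketch in C++ of what I mean by
"tokenize, then compare integers". It is only an illustration: the
Dictionary class and the tokenize and contains functions are names I
made up, the tokenizer just splits on non-alphanumeric bytes and
lowercases ASCII, and I use a hash map for the word lookup instead of
strcmp (either way it is exact string equality, no regular
expressions).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary: gives each distinct word a small integer ID.
class Dictionary {
public:
    // Return the ID for `word`, assigning the next free ID on first sight.
    std::uint32_t id_of(const std::string& word) {
        auto it = ids_.find(word);
        if (it != ids_.end()) return it->second;
        auto id = static_cast<std::uint32_t>(words_.size());
        ids_.emplace(word, id);
        words_.push_back(word);
        return id;
    }

    // Reverse lookup: integer ID back to the word it stands for.
    const std::string& word_of(std::uint32_t id) const {
        assert(id < words_.size());
        return words_[id];
    }

    std::size_t size() const { return words_.size(); }

private:
    std::unordered_map<std::string, std::uint32_t> ids_;
    std::vector<std::string> words_;
};

// Split `text` on non-alphanumeric bytes, lowercase each word (ASCII
// only), and return the sequence of word IDs.
std::vector<std::uint32_t> tokenize(const std::string& text, Dictionary& dict) {
    std::vector<std::uint32_t> ids;
    std::string word;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            word.push_back(static_cast<char>(std::tolower(c)));
        } else if (!word.empty()) {
            ids.push_back(dict.id_of(word));
            word.clear();
        }
    }
    if (!word.empty()) ids.push_back(dict.id_of(word));
    return ids;
}

// After tokenization, matching a term is plain integer comparison.
bool contains(const std::vector<std::uint32_t>& doc, std::uint32_t term) {
    for (std::uint32_t id : doc) {
        if (id == term) return true;
    }
    return false;
}
```

You would tokenize the query with the same Dictionary and then call
contains(doc, query_id). A real system would build an inverted index
rather than scan each document, but the point is the same: once the
words are integers, the string handling is over.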
Of course, a Perl implementation might be interesting as a learning tool
for students. But as an IR system that is supposed to be run in the
"real world" and not in the classroom... I don't think you will see a
Perl system anytime soon. I think if you wrote quick Perl and C/C++
implementations that merely tokenize a collection (let's say in the
range of GBs), you'll know what I am talking about. Of course, in the
classroom, a lecturer might just want the students to play with
something that is a MB or less...if so, I think Perl would be good and
students might even prefer it... :-)
Ray
Dermot wrote:
I am looking for a text search engine that has a Perl interface. I
have found a few: Lucene, OpenFTS and Swish-E. OpenFTS hasn't had a
release in the last 3 years. That makes me nervous about using it.
Lucene is Java based. I have zero Java experience, but there is a Perl
module for a 'C++ port API of Lucene'. There is also a thread on
perlmonks about the performance penalty of tying Perl to Java. I am a
bit surprised that there isn't a more native Perl text search
engine given Perl's agility with text strings.
Could anyone recommend any of the above or suggest an alternative?