Hi,

This is a slightly OT query.

I am looking for a text search engine that has a Perl interface. I
have found a few: Lucene, OpenFTS and Swish-E. OpenFTS hasn't had a
release in the last 3 years, which makes me nervous about using it.
Lucene is Java-based. I have zero Java experience, but there is a Perl
module for a 'C++ port API of Lucene'. There is also a thread on
perlmonks about the performance penalty of tying Perl to Java. I am a
bit surprised that there isn't a more native Perl text search
engine, given Perl's agility with text strings.

Could anyone recommend any of the above or suggest an alternative?
TIA,
Dp.

  • Rob Dixon at Aug 20, 2008 at 9:07 pm

    Dermot wrote:

    Could anyone recommend any of the above or suggest an alternative?

    Xapian?

    http://xapian.org/

    HTH,

    Rob
  • Joshua Hoblitt at Aug 20, 2008 at 9:11 pm
    It hasn't had a release for a few years either, but I've successfully used
    Plucene to build a search engine for in-house mailing lists.

    http://search.cpan.org/dist/Plucene/

    -J

  • Raymond Wan at Aug 21, 2008 at 2:12 am
    Hi Dermot,

    Off-topic, so I hope no one minds if I reply.

    Perl is good at manipulating text strings, but that doesn't usually help
    search engine implementations. A search engine (or information
    retrieval system) has to be fast, and after it has tokenized the document
    collection or query, you're basically comparing integers (i.e., a lookup
    table that maps an integer to a word in a dictionary). Actually, even
    during the initial mapping, a C-style strcmp would be sufficient. I
    doubt a fast search engine would actually perform string matching using
    regular expressions.
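
    To make the point about integer comparison concrete, here is a minimal
    sketch of an inverted index. It is written in Python purely for
    illustration; the function names (tokenize, build_index, search) and the
    sample documents are made up for this example and are not taken from any
    of the engines mentioned in this thread. Each word is mapped to an integer
    term ID once, and a query then only intersects sets of document numbers.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace (no regexes needed)."""
    return text.lower().split()


def build_index(docs):
    """Build the dictionary and postings for a list of document strings."""
    term_ids = {}                # word -> integer term ID (the "dictionary")
    postings = defaultdict(set)  # term ID -> set of document numbers
    for doc_no, text in enumerate(docs):
        for word in tokenize(text):
            # Assign the next free integer the first time a word is seen.
            term_id = term_ids.setdefault(word, len(term_ids))
            postings[term_id].add(doc_no)
    return term_ids, postings


def search(query, term_ids, postings):
    """Boolean AND: return the documents that contain every query word."""
    result = None
    for word in tokenize(query):
        term_id = term_ids.get(word)
        if term_id is None:
            return []            # unknown word, so nothing can match
        docs = postings[term_id]
        result = set(docs) if result is None else result & docs
    return sorted(result or [])


# Hypothetical three-document collection.
docs = [
    "Perl is good at manipulating text strings",
    "A search engine has to be fast",
    "Lucene is a text search engine",
]
term_ids, postings = build_index(docs)
print(search("text search", term_ids, postings))
```

    A real engine would add ranking, compression and on-disk storage; the
    sketch only shows that string handling happens once, at tokenization
    time, and that the query itself works on integers.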

    Of course, a Perl implementation might be interesting as a learning tool
    for students. But as an IR system that is supposed to be run in the
    "real world" and not in the classroom...I don't think you will see a
    Perl system anytime soon. I think if you wrote quick Perl and C/C++
    implementations that merely tokenize a collection (let's say in the
    range of GBs), you'll know what I am talking about. Of course, in the
    classroom, a lecturer might just want the students to play with
    something that is a MB or less...if so, I think Perl would be good and
    students might even prefer it... :-)

    Ray



  • Marc van Driel at Aug 21, 2008 at 7:41 am
    Hi Dermot/Ray,

    Please check out the MRS system (mrs.cmbi.ru.nl). It has a SOAP
    interface to Perl and other languages, and is extremely fast in indexing
    and retrieval. MRS is a generic tool: you can build indexes yourself, but
    you can also download indexed bio-databanks. The source code is in C++
    and is available as well.
    Teaching material is available but tailored towards biologists.

    Best,

    Marc


  • Dermot at Aug 21, 2008 at 12:22 pm

    2008/8/21 Marc van Driel <[email protected]>:
    Hope you enjoy it! I know the author appreciates feedback :)

    Cheers

    Raymond Wan wrote:
    Hi Marc,

    Yes, it seems we were both right :-). From Dermot's first post, I guess
    he was asking both about Perl interfacing with an IR system and about why
    Perl isn't used to build one. The MRS system demonstrates the first point,
    so thank you for pointing it out -- I did not know about it, either! The
    second part has to do with Perl being an interpreted rather than a compiled
    language; for that reason, I don't believe Perl could be used as an IR
    system backend (partly from my own experience of writing text processing
    in Perl and then giving up and doing it again in C/C++ because it was too
    slow :-) ).

    Thanks for the link to the system -- it was of benefit to me, as well!

    Ray

    Marc van Driel wrote:
    Hi Ray,

    My interpretation of Dermot's mail was that he was looking for a
    text-retrieval system with a Perl interface, but I only read the last mail
    of the thread. MRS is written in C++ and was originally designed to index
    and search the bio-databanks (usually semi-structured data), but it is not
    bio-specific. There is a SOAP interface/webservice/WSDL for e.g. Perl. So
    you can run a query, retrieve 1000 records (out of xxxxx records) and let
    Perl do whatever you want with those 1000 records. MRS has both a boolean
    and a ranked search mechanism. For more information visit the website
    (mrs.cmbi.ru.nl) or contact the author: [email protected] There is also
    a paper on the system:
    http://nar.oxfordjournals.org/cgi/content/full/33/suppl_2/W766?ijkey=1hM9Po54JADYz0b&keytype=ref

    Best regards,
    Marc

    Raymond Wan wrote:
    Hi Marc,
    (mailing list purposely removed)

    Thanks for the link!

    I think what Dermot was talking about is having a Perl system do the
    underlying work? But yes, if the underlying system is written in C/C++,
    then Perl would be "fast" since it is merely acting as a gateway to the work
    being done; in any case, it would mean that the text manipulation advantages
    of Perl are still not being used? Is that the case with MRS?

    Ray


    Yes, I was a bit confused because I didn't understand why there wasn't
    a pure Perl text search engine. I was aware of numerous Perl
    interfaces to other APIs (Lucene, KinoSearch, OpenFTS and Swish-E), but I
    wasn't aware of how they fundamentally work. I also note that Postgres
    has Tsearch. From the little bit of searching I've done, Lucene seems
    to have a great deal of support and there are a number of modules that
    use the Lucene API. Of course, a SOAP/REST interface would allow any
    language access.

    Again, thanks for the useful sources.
    Dp.
  • Dermot at Aug 21, 2008 at 7:59 am
    Thanks for all the suggestions. I am also very grateful for this
    heads-up on how a text search engine is actually implemented. Now that
    I understand that the engine is actually an indexed DB, and having done
    a bit more digging around, I guess I will have to try a couple to see
    what fits and what looks well supported/documented. Thanks for the
    replies.
    Dp.
  • Dr.Ruud at Aug 21, 2008 at 9:59 am

    Dermot wrote:

    Could anyone recommend any of the above or suggest an alternative?

    MySQL: MyISAM Fulltext searches.
    Plucene.

    --
    Affijn, Ruud

    "Gewoon is een tijger."

Discussion Overview
group: beginners @
categories: perl
posted: Aug 20, '08 at 8:46p
active: Aug 21, '08 at 12:22p
posts: 8
users: 6
website: perl.org
