Sorry, this is a bit off topic.

I am wondering how to build a search engine for a heavy-traffic site.
Should I use KinoSearch or Plucene, or just get the results from MySQL?

How do big sites like Facebook, hi5 and Digg build their search engines?
Do they use SQL to get results instead of building a file index as KinoSearch does?

Does anyone have any ideas? Thanks.

--
Fayland Lam // http://www.fayland.org/


  • Pedro Melo at Nov 8, 2007 at 8:35 am
    Hi,
    On Nov 8, 2007, at 8:23 AM, Fayland Lam wrote:

    Sorry, this is a bit off topic.

    I am wondering how to build a search engine for a heavy-traffic site.
    Should I use KinoSearch or Plucene, or just get the results from MySQL?

    If you are using MySQL, check out Sphinx also.

    They have some interesting numbers on their site. Also check the
    http://www.mysqlperformanceblog.com/ site for Sphinx-related posts.

    Best regards,
    --
    Pedro Melo
    Blog: http://www.simplicidade.org/notes/
    XMPP ID: melo@simplicidade.org
    Use XMPP!
  • Felix Antonius Wilhelm Ostmann at Nov 8, 2007 at 9:25 am
    Another solution is Xapian:

    Catalyst::Model::Xapian




    --
    Kind regards

    Felix Antonius Wilhelm Ostmann
    --------------------------------------------------
    Websuche Search Technology GmbH & Co. KG
    Martinistraße 3 - D-49080 Osnabrück - Germany
    Tel.: +49 541 40666-0 - Fax: +49 541 40666-22
    Email: info@websuche.de - Website: www.websuche.de
    --------------------------------------------------
    AG Osnabrück - HRA 200252 - Ust-Ident: DE814737310
    General partner: Websuche Search Technology
    Verwaltungs GmbH - AG Osnabrück - HRB 200359
    Managing director: Diplom Kaufmann Martin Steinkamp
    --------------------------------------------------
  • David Morel at Nov 8, 2007 at 9:33 am

    On 8 Nov 07 at 10:25, Felix Antonius Wilhelm Ostmann wrote:

    Another solution is Xapian:

    Catalyst::Model::Xapian

    There was a discussion on this topic not so long ago. I suggest having
    a look at the list archive; some interesting discussion went on at
    that time.

    David Morel
  • Fayland Lam at Nov 9, 2007 at 4:13 am

    David Morel wrote:
    On 8 Nov 07 at 10:25, Felix Antonius Wilhelm Ostmann wrote:

    Another solution is Xapian:

    Catalyst::Model::Xapian

    There was a discussion on this topic not so long ago. I suggest having
    a look at the list archive; some interesting discussion went on at that
    time.

    I searched http://lists.scsys.co.uk/pipermail/catalyst/ but didn't find
    anything. Could you tell me which month it was discussed in?

    Thanks.


    _______________________________________________
    List: Catalyst@lists.scsys.co.uk
    Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
    Searchable archive: http://www.mail-archive.com/catalyst@lists.rawmode.org/
    Dev site: http://dev.catalyst.perl.org/

    --
    Fayland Lam // http://www.fayland.org/
  • David Morel at Nov 9, 2007 at 8:00 am

    On 9 Nov 07 at 05:13, Fayland Lam wrote:

    David Morel wrote:
    On 8 Nov 07 at 10:25, Felix Antonius Wilhelm Ostmann wrote:
    Another solution is Xapian:

    Catalyst::Model::Xapian

    There was a discussion on this topic not so long ago. I suggest having
    a look at the list archive; some interesting discussion went on at
    that time.

    I searched http://lists.scsys.co.uk/pipermail/catalyst/ but didn't
    find anything. Could you tell me which month it was discussed in?

    Thanks.

    Have a look here: http://www.gossamer-threads.com/lists/catalyst/users/

    and search for lucene, plucene, xapian, swish-e, etc.

    Actually, the threads I found seem shorter than I remember, but I only
    had a very quick look.

    David Morel
  • Peter Karman at Nov 9, 2007 at 5:37 pm

    On 11/09/2007 02:00 AM, David Morel wrote:

    Have a look here: http://www.gossamer-threads.com/lists/catalyst/users/

    and search for lucene, plucene, xapian, swish-e, etc.

    Actually, the threads I found seem shorter than I remember, but I only
    had a very quick look.

    I posted this link a while back too:

    http://www.mail-archive.com/cgiapp@lists.erlbaum.net/msg06061.html

    --
    Peter Karman . peter@peknet.com . http://peknet.com/
  • Octavian Rasnita at Nov 9, 2007 at 6:48 pm
    Well, from all those discussions it seems that Xapian is the best (if you
    need its features).

    Do you know whether it can index HTML documents without parsing them with
    other tools, or possibly other types of files such as PDF or DOC?

    Octavian

  • Peter Karman at Nov 9, 2007 at 7:21 pm

    On 11/09/2007 12:48 PM, Octavian Rasnita wrote:
    Well, from all those discussions it seems that Xapian is the best (if
    you need its features).

    Do you know whether it can index HTML documents without parsing them with
    other tools, or possibly other types of files such as PDF or DOC?

    Xapian is a library. The related Omega project has support for parsing
    documents of various formats.

    --
    Peter Karman . peter@peknet.com . http://peknet.com/
  • Octavian Rasnita at Nov 9, 2007 at 9:47 pm
    From: "Peter Karman" <peter@peknet.com>
    Do you know whether it can index HTML documents without parsing them with
    other tools, or possibly other types of files such as PDF or DOC?

    Xapian is a library. The related Omega project has support for parsing
    documents of various formats.

    Oh yes, Omega seems nice. Too bad it only indexes static content and
    not auto-generated web pages.

    Do you have a recommendation for a good Perl module that can easily be
    used to create a spider that indexes a web site?

    Octavian
  • Peter Karman at Nov 9, 2007 at 10:11 pm

    On 11/09/2007 03:47 PM, Octavian Rasnita wrote:

    Do you have a recommendation for a good Perl module that can easily be
    used to create a spider that indexes a web site?

    If you don't need UTF-8, check out Swish-e. It has a spider and parser.

    --
    Peter Karman . peter@peknet.com . http://peknet.com/
  • Octavian Rasnita at Nov 9, 2007 at 10:24 pm
    From: "Peter Karman" <peter@peknet.com>
    Do you have a recommendation for a good Perl module that can easily be
    used to create a spider that indexes a web site?

    If you don't need UTF-8, check out Swish-e. It has a spider and parser.

    Unfortunately I need to use UTF-8. This is one of the reasons I said I like
    Xapian.

    Octavian
  • Adam Sjøgren at Nov 9, 2007 at 8:35 am

    On Fri, 09 Nov 2007 04:13:59 +0000, Fayland wrote:

    Another solution is Xapian:
    Catalyst::Model::Xapian

    I searched http://lists.scsys.co.uk/pipermail/catalyst/ but didn't
    find anything. Could you tell me which month it was discussed in?

    Another place to try searching the list is here:

    <http://search.gmane.org/search.php?group=gmane.comp.web.catalyst.general&query=xapian>

    (Incidentally, search.gmane.org is based on Xapian).


    Best regards,

    --
    Adam Sjøgren
    adsj@novozymes.com
  • Jon Schutz at Nov 8, 2007 at 10:16 am

    On Thu, 2007-11-08 at 08:35 +0000, Pedro Melo wrote:
    If you are using MySQL, check out Sphinx also.

    Can definitely recommend Sphinx for performance on large volumes of
    data. We were having searches take 10 secs typically, sometimes much
    longer, using MySQL fulltext indices. With Sphinx that went sub-second,
    and ranking was better than MySQL. I wrote the Sphinx::Search Perl
    interface; thought I might write a Catalyst model for it one day, just
    haven't had the driving need.

    MySQL fulltext can be dangerous on certain types of data because it
    automatically dismisses frequently occurring terms - so e.g. if you
    search on a word that appears in more than half of your records, you can
    get zero results!
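
The behaviour described above can be illustrated with a toy model. This is only a sketch in Python of the general idea (an inverted index that treats very common terms as stopwords); it is not MySQL's actual implementation. The function names, the sample documents, the 0.5 cutoff and the inclusive boundary are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, n_docs, term, threshold=0.5):
    """Return matching ids, ignoring terms found in >= threshold of all docs."""
    postings = index.get(term.lower(), set())
    if n_docs and len(postings) / n_docs >= threshold:
        return []  # term is considered too common, so it matches nothing
    return sorted(postings)

# Hypothetical sample data: "catalyst" appears in 3 of 4 records.
docs = [
    "catalyst search with sphinx",
    "catalyst search with xapian",
    "catalyst model for kinosearch",
    "mysql fulltext index",
]
index = build_index(docs)

print(search(index, len(docs), "catalyst"))  # common term: no results
print(search(index, len(docs), "sphinx"))    # rare term: matches one record
```

In this model a search for the most common word in the data set returns nothing, which is the surprise Jon warns about on small or homogeneous tables.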

    Have seen Xapian in action, producing slow results of poor relevance in
    these particular cases. That's not to say it needs to be like that -
    Xapian is a highly configurable and fairly complicated beast, so chances
    are it wasn't running optimally. In general, whatever product you use,
    getting search relevance right for your specific data set can be a
    challenge.

    HTH.

    --

    Jon
  • Octavian Rasnita at Nov 8, 2007 at 11:48 am
    From: "Jon Schutz" <jon+catalyst@youramigo.com>
    On Thu, 2007-11-08 at 08:35 +0000, Pedro Melo wrote:

    If you are using MySQL, check out Sphinx also.

    Can definitely recommend Sphinx for performance on large volumes of
    data. We were having searches take 10 secs typically, sometimes much
    longer, using MySQL fulltext indices. With Sphinx that went sub-second,
    and ranking was better than MySQL. I wrote the Sphinx::Search Perl
    interface; thought I might write a Catalyst model for it one day, just
    haven't had the driving need.

    What do you think about Swish-e, KinoSearch and Lucene?

    Aside from the speed and relevance of searches, I am also interested in
    ease of indexing, and I am looking for a program that also has tools for
    indexing web pages (rather than static files).

    Are there search programs that have a command line program (or a perl
    module) that can be used for indexing an entire site, specifying just the
    main URL and maybe a few other options?

    Thank you.

    Octavian
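
To make the question above concrete, here is a minimal sketch of a spider that starts from one main URL and collects page text for a later indexing step. It uses only the Python standard library and is not the spider shipped with any of the tools named in this thread. The names `LinkParser`, `fetch` and `crawl`, the `max_pages` limit and the same-host rule are all assumptions chosen for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href targets and text fragments from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Note: this also picks up script and style text; a real
        # indexer would filter those out.
        if data.strip():
            self.text.append(data.strip())

def fetch(url):
    """Default fetcher: download a page and decode it as UTF-8."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(start_url, fetch=fetch, max_pages=50):
    """Breadth-first crawl of pages on start_url's host; returns {url: text}."""
    host = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Because the crawl goes through HTTP, it sees dynamically generated pages the same way a browser does. The returned mapping of URL to text could then be handed to whichever indexer is chosen. A production spider would also need robots.txt handling, rate limiting and content-type checks, none of which are shown here.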
  • Jon Schutz at Nov 9, 2007 at 10:28 am

    On Thu, 2007-11-08 at 13:48 +0200, Octavian Rasnita wrote:

    What do you think about Swish-e, KinoSearch and Lucene?

    As it has been years since I reviewed these (Swish-e and Lucene),
    it wouldn't be fair for me to comment. There is lots of detail and links at
    http://www.searchtools.com/tools/tools.html


    --

    Jon
  • Ash Berlin at Nov 8, 2007 at 9:30 am

    Fayland Lam wrote:
    Sorry, this is a bit off topic.

    I am wondering how to build a search engine for a heavy-traffic site.
    Should I use KinoSearch or Plucene, or just get the results from MySQL?

    How do big sites like Facebook, hi5 and Digg build their search
    engines? Do they use SQL to get results instead of building a file
    index as KinoSearch does?

    Does anyone have any ideas? Thanks.

    Xapian is also worth looking at:

    http://search.cpan.org/search?query=Xapian&mode=beta

Discussion Overview
group: catalyst @
categories: catalyst, perl
posted: Nov 8, '07 at 8:23a
active: Nov 9, '07 at 10:24p
posts: 17
users: 9
website: catalystframework.org
irc: #catalyst
