
[Solr-user] solr application for website crawling and indexing html, pdf, word, ... files

Frank van Lingen
Jan 25, 2010 at 8:55 pm
I recently started working with Solr and find it easy to set up and tinker with.

I now want to scale up my setup and was wondering if there is an
application/component that can do the following (I was not able to
find documentation on this on the Solr site):

-Can I send Solr an XML document with a URL (HTML, PDF, Word, PPT,
etc.) and have Solr index it after analyzing it (can it analyze PDF and
other documents?). Solr would use some generic basic fields like
header and content when analyzing the files.

-Can I send Solr a site URL and have it index the whole site?

If the answer to the above is yes, are there some examples? If the
answer is no, is there a simple (basic) extractor for HTML, PDF, Word,
etc. files that would translate them into a basic XML document (e.g.
with field names url, header and content) that Solr can ingest, or
preferably an application that does this for a whole site?

The idea is to configure Solr for generic indexing and search of a website.

Frank.
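
For illustration, the basic XML document described above maps naturally onto Solr's XML update message (an `<add>` element wrapping one or more `<doc>` elements). The following is a minimal sketch, not a full extractor: the field names `url`, `header` and `content` are taken from the question and are assumptions that must match fields defined in your schema, and `build_add_document` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_add_document(url: str, header: str, content: str) -> str:
    """Return a Solr XML <add> message holding a single document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are assumptions taken from the question; they must
    # exist in the Solr schema for the update to be accepted.
    for name, value in (("url", url), ("header", header), ("content", content)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


message = build_add_document(
    "http://example.com/index.html",
    "Welcome",
    "Text & markup extracted from the page",
)
print(message)
```

A message like this would be sent to Solr's update handler in an HTTP POST and followed by a commit. Extracting the text from HTML, PDF or Word files is a separate step that this sketch does not cover.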

2 responses

  • Mike Anderson at Jan 25, 2010 at 9:16 pm
    I think you might be looking for Apache Tika.

  • Markus Jelsma at Jan 25, 2010 at 9:24 pm
    Hello Frank,

    Answers are inline:

    Frank van Lingen said:
    > I recently started working with solr and find it easy to setup and
    > tinker with.
    >
    > I now want to scale up my setup and was wondering if there is an
    > application/component that can do the following (I was not able to find
    > documentation on this on the solr site):
    >
    > -Can I send solr an xml document with a url (html, pdf, word, ppt,
    > etc..) and solr indexes it after analyzing (can it analyze pdf and other
    > documents?). Solr would use some generic basic fields like
    > header and content when analyzing the files.

    Yes, you can! Solr has an integration with Tika [1], yet another Apache
    Lucene project. It can index many different formats. Please see the Solr
    Cell wiki for more information [2].

    > -Can I send solr a site url and it indexes the whole site?

    No, you can't. But there is yet another fine Apache Lucene project called
    Nutch [3]. It offers a very convenient API and is very flexible. Since
    version 1.0 Nutch can integrate more easily with a standby Solr index, and
    together with Tika you can index almost anything you want with the
    greatest ease.

    You can find information on running Nutch with Solr on the wiki [4]. Our
    friends at Lucid Imagination have also written a very decent article on
    this subject [5]. You will find what you're looking for.

    Cheers

    [1]: http://lucene.apache.org/tika/index.html
    [2]: http://wiki.apache.org/solr/ExtractingRequestHandler
    [3]: http://lucene.apache.org/nutch/
    [4]: http://wiki.apache.org/nutch/RunningNutchAndSolr
    [5]: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
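
As a rough illustration of the Solr Cell route mentioned in the reply above, the sketch below builds the request URL for the extracting handler described on the wiki page [2]. It is a sketch under stated assumptions: the handler is assumed to be mounted at `/update/extract`, the Solr base URL is a local example, and the target field `content` and the `attr_` prefix for unknown fields are assumptions about the schema. `build_extract_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base: str, doc_id: str, commit: bool = True) -> str:
    """Build the URL for posting one file to the extracting handler."""
    params = {
        "literal.id": doc_id,       # supply the unique key for this document
        "uprefix": "attr_",         # prefix for extracted fields not in the schema
        "fmap.content": "content",  # map the extracted body text to 'content'
    }
    if commit:
        params["commit"] = "true"   # make the document searchable right away
    # Assumes the handler is registered at /update/extract in solrconfig.xml.
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


print(build_extract_url("http://localhost:8983/solr", "doc1"))
```

The file itself (PDF, Word, HTML and so on) is then sent as the body of an HTTP POST to that URL, and Tika does the parsing on the Solr side. This covers single documents only; crawling a whole site remains a job for Nutch, as the reply says.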
