I recently started working with Solr and find it easy to set up and tinker with.
I now want to scale up my setup and was wondering if there is an
application/component that can do the following (I was not able to
find documentation on this on the Solr site):
- Can I send Solr an XML document containing a URL (pointing to an
  HTML, PDF, Word, PowerPoint, etc. file) and have Solr index it after
  analyzing it (can it analyze PDF and other documents)? Solr would use
  some generic basic fields like header and content when analyzing the
  files.
- Can I send Solr a site URL and have it index the whole site?
If the answer to the above is yes, are there some examples? If the
answer is no, is there a simple (basic) extractor for HTML, PDF, Word,
etc. files that would translate them into a basic XML document (e.g.
with the field names url, header and content) that Solr can ingest, or
preferably an application that does this for a whole site?
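
To make the kind of basic XML document meant here concrete, below is a minimal sketch in Python for the HTML case only. It assumes the header is simply the page's <title> text and the content is all other visible text. The field names url, header and content are just the placeholders from the description above and would have to exist in the Solr schema; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text as the header and other visible text as content."""

    def __init__(self):
        super().__init__()
        self.header_parts = []
        self.content_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.header_parts.append(text)
        else:
            self.content_parts.append(text)


def html_to_solr_xml(url, html):
    """Builds an <add><doc>...</doc></add> message with url, header and content fields."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "url": url,  # assumed field name, taken from the description above
        "header": " ".join(parser.header_parts),
        "content": " ".join(parser.content_parts),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    page = "<html><head><title>Hello</title></head><body><p>Some text</p></body></html>"
    print(html_to_solr_xml("http://example.com/", page))
```

The string this returns is an <add><doc> message of the sort Solr's XML update handler accepts. PDF, Word and PowerPoint files are not covered by this sketch and would need a real extractor.
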
The idea is to configure Solr for generic indexing and search of a website.