|
Grant Ingersoll |
at Nov 28, 2007 at 11:48 am
|
⇧ |
| |
Seems reasonable to me, but I guess I wonder what kind of control you
have that you don't in Nutch? Maybe worth asking on Nutch. Also, it
is fairly easy in Nutch to separate the crawling aspect from the
indexing aspect, such that you could use all of Nutch's power for
crawling and extracting content, and then index in Lucene or Solr on
your own.
On Nov 27, 2007, at 6:13 PM, bbrown wrote:I was considering not using nutch for indexing web documents. I was
thinking
either extracting the full HTML document or through the use of some
kind of
web scraper html parser utility extracting only the text content
from a web
page and then indexing that.
I know it is strange, but I feel I have more control on what gets
indexed if I
use just Lucene. Eg, I can add more fields and also I guarantee I
will be
able to search what gets indexed.
Is this a bad approach or should I just use nutch?
--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?---------------------------------------------------------------------
To unsubscribe, e-mail:
[email protected]For additional commands, e-mail:
[email protected] --------------------------
Grant Ingersoll
http://lucene.grantingersoll.comLucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformancehttp://wiki.apache.org/lucene-java/LuceneFAQ---------------------------------------------------------------------
To unsubscribe, e-mail:
[email protected]For additional commands, e-mail:
[email protected]