I was considering not using Nutch for indexing web documents. Instead, I was
thinking of either indexing the full HTML document or using some kind of web
scraper or HTML parser utility to extract only the text content from a web
page and then indexing that.

I know it is strange, but I feel I have more control over what gets indexed if
I use just Lucene. E.g., I can add more fields, and I can guarantee that I
will be able to search what gets indexed.

Is this a bad approach, or should I just use Nutch?
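
The approach described in the question (strip a page down to its text, then index it under fields of your own choosing) can be sketched roughly as follows. This is a stand-in written with plain Java collections, not the Lucene API: the class `FieldedIndexSketch`, its methods, and the field names `url` and `body` are all hypothetical, and the regex-based tag stripping is only an assumption about what a simple "HTML parser utility" might do. A real HTML parser would be more robust.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a fielded index; this is NOT the Lucene API.
class FieldedIndexSketch {
    // Maps "field:term" to the ids of documents containing that term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Crude text extraction: drop script/style bodies, then all tags,
    // then collapse whitespace. HTML entities are left undecoded.
    static String extractText(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Indexes every field of the document; the caller decides which fields exist.
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String lowered = field.getValue().toLowerCase(Locale.ROOT);
            for (String term : lowered.split("[^\\p{L}\\p{Nd}]+")) {
                if (term.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                        .add(docId);
            }
        }
        return docId;
    }

    // Single-term lookup restricted to one field.
    Set<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, new TreeSet<>());
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Lucene Tips</title></head>"
                + "<body><p>Index <b>only</b> the text.</p>"
                + "<script>var x = 1;</script></body></html>";
        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/tips");
        doc.put("body", extractText(html));

        FieldedIndexSketch index = new FieldedIndexSketch();
        int id = index.addDocument(doc);
        System.out.println("doc " + id + " matches body:text -> " + index.search("body", "text"));
    }
}
```

The point of the sketch is only that whoever builds the document controls which fields exist and what text goes into each one, which is the kind of control the question is after.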

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


  • Grant Ingersoll at Nov 28, 2007 at 11:48 am
    Seems reasonable to me, though I wonder what kind of control you would
    gain that you don't have in Nutch? It may be worth asking on the Nutch
    list. Also, it is fairly easy in Nutch to separate the crawling aspect
    from the indexing aspect, so you could use all of Nutch's power for
    crawling and extracting content, and then index in Lucene or Solr on
    your own.
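
The separation suggested in this reply (one stage crawls and extracts, a second stage indexes) can be pictured with a small sketch. Everything named here is hypothetical: `Page`, `PageSource`, `PageIndexer`, and `run` are illustrative names, not Nutch, Lucene, or Solr APIs, and the sketch assumes the only thing passed between the stages is a URL plus already-extracted text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// All names here are hypothetical; none of them are Nutch, Lucene, or Solr APIs.
class CrawlIndexPipelineSketch {
    // What the crawl/extract stage hands over: a URL plus already-extracted text.
    static final class Page {
        final String url;
        final String text;

        Page(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    // Crawl side: in the setup described in the reply, Nutch would play this role.
    interface PageSource {
        Iterable<Page> pages();
    }

    // Index side: in the setup described in the reply, Lucene or Solr would sit here.
    interface PageIndexer {
        void index(Page page);
    }

    // The only coupling between the two stages is the Page hand-off.
    static int run(PageSource source, PageIndexer indexer) {
        int count = 0;
        for (Page page : source.pages()) {
            indexer.index(page);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Canned pages stand in for real crawl output.
        PageSource canned = () -> Arrays.asList(
                new Page("http://example.com/a", "first page text"),
                new Page("http://example.com/b", "second page text"));

        // A recording indexer stands in for a real index writer.
        List<String> indexedUrls = new ArrayList<>();
        PageIndexer recorder = page -> indexedUrls.add(page.url);

        int n = run(canned, recorder);
        System.out.println(n + " pages handed to the indexer: " + indexedUrls);
    }
}
```

Because the indexer only ever sees `Page` objects, either side can be swapped out without touching the other.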

    On Nov 27, 2007, at 6:13 PM, bbrown wrote:

    I was considering not using Nutch for indexing web documents. Instead,
    I was thinking of either indexing the full HTML document or using some
    kind of web scraper or HTML parser utility to extract only the text
    content from a web page and then indexing that.

    I know it is strange, but I feel I have more control over what gets
    indexed if I use just Lucene. E.g., I can add more fields, and I can
    guarantee that I will be able to search what gets indexed.

    Is this a bad approach, or should I just use Nutch?

    --------------------------
    Grant Ingersoll
    http://lucene.grantingersoll.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ





Discussion Overview
group: java-user @
categories: lucene
posted: Nov 27, '07 at 11:13p
active: Nov 28, '07 at 11:48a
posts: 2
users: 2
website: lucene.apache.org
