FAQ
Good day,

I am currently using lucene for my searches. And one of the problems that Im
facing is when keyword is a url. The tokens such as http, https, ://, index,
html, etc seems to be messing up with our search results. The focus was
supposed to be only on the url domain.

The idea that I have is modify the idf so that rare terms get boosted much
more than the default settings in lucene. Since there are probably a lot of
http, https://, etc, then matches to these terms should be really really
low, while matches to the domain (which is rare) should be high.

Would this work or am I totally misunderstanding lucene's tf/idf? :-)

Thanks,

--
Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

Search Discussions

  • Ian Lea at Jan 29, 2010 at 10:43 am
    Instead of playing around with tf/idf, how about just indexing and
    searching the domain.


    --
    Ian.


    On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See
    wrote:
    Good day,

    I am currently using lucene for my searches. And one of the problems that Im
    facing is when keyword is a url. The tokens such as http, https, ://, index,
    html, etc seems to be messing up with our search results. The focus was
    supposed to be only on the url domain.

    The idea that I have is modify the idf so that rare terms get boosted much
    more than the default settings in lucene. Since there are probably a lot of
    http, https://, etc, then matches to these terms should be really really
    low, while matches to the domain (which is rare) should be high.

    Would this work or am I totally misunderstanding lucene's tf/idf? :-)

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Franz Allan Valencia See at Jan 30, 2010 at 12:12 am
    How should I go about identifying the domain?

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea wrote:

    Instead of playing around with tf/idf, how about just indexing and
    searching the domain.


    --
    Ian.


    On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See
    wrote:
    Good day,

    I am currently using lucene for my searches. And one of the problems that Im
    facing is when keyword is a url. The tokens such as http, https, ://, index,
    html, etc seems to be messing up with our search results. The focus was
    supposed to be only on the url domain.

    The idea that I have is modify the idf so that rare terms get boosted much
    more than the default settings in lucene. Since there are probably a lot of
    http, https://, etc, then matches to these terms should be really really
    low, while matches to the domain (which is rare) should be high.

    Would this work or am I totally misunderstanding lucene's tf/idf? :-)

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ian Lea at Jan 30, 2010 at 5:59 pm
    Are you asking how to get lucene.apache.org out of
    http://lucene.apache.org/ or how to get apache.org out of
    lucene.apache.org? The getHost() method of java.net.URL will give you
    the former. Or use a regexp. I don't know an easy way to do the
    latter, but depending on your requirements you could split
    lucene.apache.org into tokens "lucene.apache.org" and "apache.org" and
    "org" and index all of them. You probably want to use an analyzer
    that doesn't split on the . character.


    --
    Ian.


    On Sat, Jan 30, 2010 at 12:12 AM, Franz Allan Valencia See
    wrote:
    How should I go about identifying the domain?

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea wrote:

    Instead of playing around with tf/idf, how about just indexing and
    searching the domain.


    --
    Ian.


    On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See
    wrote:
    Good day,

    I am currently using lucene for my searches. And one of the problems that Im
    facing is when keyword is a url. The tokens such as http, https, ://, index,
    html, etc seems to be messing up with our search results. The focus was
    supposed to be only on the url domain.

    The idea that I have is modify the idf so that rare terms get boosted much
    more than the default settings in lucene. Since there are probably a lot of
    http, https://, etc, then matches to these terms should be really really
    low, while matches to the domain (which is rare) should be high.

    Would this work or am I totally misunderstanding lucene's tf/idf? :-)

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Franz Allan Valencia See at Feb 1, 2010 at 1:05 pm
    Hmm....

    My Analyzer is a Dictionary-based Analyzer. And so, it only recognizes
    tokens in its dictionary. Adding every url (or domain) is not a viable
    solution.

    So how could I include that to my analyzer? Lucene Filter? FilterReader?

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    On Sun, Jan 31, 2010 at 1:58 AM, Ian Lea wrote:

    Are you asking how to get lucene.apache.org out of
    http://lucene.apache.org/ or how to get apache.org out of
    lucene.apache.org? The getHost() method of java.net.URL will give you
    the former. Or use a regexp. I don't know an easy way to do the
    latter, but depending on your requirements you could split
    lucene.apache.org into tokens "lucene.apache.org" and "apache.org" and
    "org" and index all of them. You probably want to use an analyzer
    that doesn't split on the . character.


    --
    Ian.


    On Sat, Jan 30, 2010 at 12:12 AM, Franz Allan Valencia See
    wrote:
    How should I go about identifying the domain?

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea wrote:

    Instead of playing around with tf/idf, how about just indexing and
    searching the domain.


    --
    Ian.


    On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See
    wrote:
    Good day,

    I am currently using lucene for my searches. And one of the problems
    that
    Im
    facing is when keyword is a url. The tokens such as http, https, ://, index,
    html, etc seems to be messing up with our search results. The focus
    was
    supposed to be only on the url domain.

    The idea that I have is modify the idf so that rare terms get boosted much
    more than the default settings in lucene. Since there are probably a
    lot
    of
    http, https://, etc, then matches to these terms should be really
    really
    low, while matches to the domain (which is rare) should be high.

    Would this work or am I totally misunderstanding lucene's tf/idf? :-)

    Thanks,

    --
    Franz Allan Valencia See | Java Software Engineer
    franz.see@gmail.com
    LinkedIn: http://www.linkedin.com/in/franzsee
    Twitter: http://www.twitter.com/franz_see
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedJan 29, '10 at 3:44a
activeFeb 1, '10 at 1:05p
posts5
users2
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase