Ian Lea at Jan 30, 2010 at 5:59 pm
Are you asking how to get lucene.apache.org out of
http://lucene.apache.org/ or how to get apache.org out of
lucene.apache.org? The getHost() method of java.net.URL will give you
the former. Or use a regexp. I don't know an easy way to do the
latter, but depending on your requirements you could split
lucene.apache.org into tokens "lucene.apache.org" and "apache.org" and
"org" and index all of them. You probably want to use an analyzer
that doesn't split on the . character.
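
A minimal sketch of both steps, using only the JDK: getHost() pulls the
host out of the URL, and a small loop produces the host plus each shorter
suffix. The class and method names (DomainTokens, hostOf, suffixesOf) are
made up for illustration and are not part of Lucene. How the tokens get
into the index is left out, since that depends on your Lucene version and
analyzer.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or the JDK.
class DomainTokens {

    // Returns the host part of a URL, e.g. "lucene.apache.org".
    static String hostOf(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    // Returns the host followed by each shorter dot-separated suffix.
    static List<String> suffixesOf(String host) {
        List<String> tokens = new ArrayList<>();
        String rest = host;
        while (!rest.isEmpty()) {
            tokens.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) {
                break; // no more labels to strip
            }
            rest = rest.substring(dot + 1);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String host = hostOf("http://lucene.apache.org/index.html");
        System.out.println(host);             // lucene.apache.org
        System.out.println(suffixesOf(host)); // [lucene.apache.org, apache.org, org]
    }
}
```

Each returned token would be indexed as a single untokenized term, so a
search for apache.org matches any document whose URL host ends in it.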
--
Ian.
On Sat, Jan 30, 2010 at 12:12 AM, Franz Allan Valencia See
wrote:
How should I go about identifying the domain?
Thanks,
--
Franz Allan Valencia See | Java Software Engineer
[email protected]
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see
On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea wrote:
Instead of playing around with tf/idf, how about just indexing and
searching the domain.
--
Ian.
On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See
wrote:
Good day,
I am currently using Lucene for my searches. One of the problems I'm
facing is when the keyword is a URL. Tokens such as http, https, ://,
index, html, etc. seem to be messing up our search results. The focus was
supposed to be only on the URL's domain.
The idea I have is to modify the idf so that rare terms get boosted much
more than with Lucene's default settings. Since there are probably a lot
of http, https://, etc. tokens, matches on these terms should score really
low, while matches on the domain (which is rare) should score high.
Would this work, or am I totally misunderstanding Lucene's tf/idf? :-)
Thanks,
--
Franz Allan Valencia See | Java Software Engineer
[email protected]
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]