I have a use case for comparing two given strings (attached to a specific
field) using Lucene and getting a similarity score.

I tried but could not find any built-in way to do so. Assuming that
Lucene only compares a Query against indexed documents, I came up with the
following approach (let the two strings be str1 and str2):

1) Create an IndexWriter using a RAMDirectory (I don't want to store those
strings on disk)
2) Index str1 and store it
3) Search for str2 in the index. (Should the IndexWriter be closed before
searching the index?)
4) Get the similarity score and publish it
5) Delete str1 from the index and make the index available for a new
comparison

Any comments or suggestions on making this process optimal?

Siddharth

--
View this message in context: http://www.nabble.com/Arbitrary-String-to-String-Similarity-Score-tp18020806p18020806.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Grant Ingersoll at Jun 20, 2008 at 2:12 am
    You might also have a look at MemoryIndex. The question, though, is
    what you are hoping to gain from running a Query against a single
    String. Are you doing a FuzzyQuery? You might look at the
    SecondString project on SourceForge for doing string comparisons.

    I guess I am a bit confused by your problem statement. Perhaps you
    can explain at a higher level what you are trying to do, as it
    sounds to me like you have str1 and str2, so why do you need to put
    an index in the middle?

    -Grant
    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ
  • Sangrish at Jun 20, 2008 at 4:20 am
    Given two text documents, I want to quantify how similar they are to
    each other. Say, I want to find the cosine similarity score between any
    two given documents. I am trying to use Lucene for this (is it a good
    fit for this purpose?).

    This use case is different from querying against a set of documents.

    I am not sure whether Lucene provides a direct API to compute this score.

    Siddharth
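
For the cosine-similarity reading of this question, an index is not strictly required. Below is a minimal sketch in plain Java, assuming simple lower-cased word tokens and raw term counts (no IDF weighting, stemming, or stop-word removal). It is not Lucene's scoring, and the class and method names are hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: cosine similarity over raw term counts.
final class CosineSimilarity {

    // Count how often each lower-cased token occurs.
    // Tokens are runs of letters or digits; everything else is a separator.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> tf) {
        double sumOfSquares = 0.0;
        for (int count : tf.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between the two term-frequency vectors.
    // Returns 0.0 when either text contains no tokens.
    static double cosine(String a, String b) {
        Map<String, Integer> tfA = termFrequencies(a);
        Map<String, Integer> tfB = termFrequencies(b);

        // Dot product: only terms present in both texts contribute.
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : tfA.entrySet()) {
            Integer other = tfB.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }

        double normA = norm(tfA);
        double normB = norm(tfB);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        // Three of the four distinct terms are shared between the two texts.
        System.out.println(cosine("the quick brown fox", "the quick red fox"));
    }
}
```

Because the score is symmetric in its two arguments, neither string plays the role of "query" or "document". If tokenization should match what an index would see, the same computation could be fed with tokens produced by a Lucene Analyzer instead of the regular-expression split assumed here.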





  • Otis Gospodnetic at Jun 20, 2008 at 5:59 am
    Hi,

    Have a look at MoreLikeThis:

    [otis@localhost trunk]$ ff \*MoreLikeThis\*.java
    ./contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java
    ./contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java


    I think that or something a lot like it is what you are after.

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

  • Sangrish at Jun 20, 2008 at 5:46 pm
    Yes, "MoreLikeThis" is more like what I want.

    But there's one problem: even here, one has to run the query against an
    indexed set of documents.

    What I would like instead is to create two queries through "MoreLikeThis"
    and get a score of how similar they are to each other.

    Siddharth







  • Otis Gospodnetic at Jun 21, 2008 at 6:04 am
    Then perhaps you should look into SecondString, as Grant suggested.


    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Discussion Overview
group: java-user
categories: lucene
posted: Jun 20, '08 at 12:33a
active: Jun 21, '08 at 6:04a
posts: 6
users: 3
website: lucene.apache.org
