Hi List,

I am pretty new to Lucene. Certainly, it is very exciting. I need to
implement a new Similarity class based on the Term Vector Space Model given
in http://www.miislita.com/term-vector/term-vector-3.html

Although that model is similar to Lucene’s model
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
I am having a hard time extending the Similarity class to calculate it.

In that model, “tf” is multiplied by “idf” for all terms in the index, but
in Lucene “tf” is calculated only for the terms in the given query. Because of
that, the norm calculation should also include “idf” for all terms.
Lucene calculates the norm during indexing by “just” counting the number
of terms per document. In the web formula (on miislita.com), a document norm
is calculated after multiplying “tf” and “idf”.
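
To make the weighting described above concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It computes w_i = tf_i * idf_i for every term of a document, takes the document norm over those weights, and compares a query with a document by cosine. The class name, the helper names, and the choice idf = log10(N / df) are assumptions made for illustration; substitute whatever idf variant the model you are following prescribes.

```java
import java.util.*;

/** Sketch of tf*idf weighting with a weighted document norm (no Lucene involved). */
final class TfIdfVsmSketch {

    /** Counts how often each term occurs in one tokenized document. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    /** Counts, for each term, the number of documents that contain it. */
    static Map<String, Integer> docFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String t : new HashSet<>(doc)) df.merge(t, 1, Integer::sum);
        }
        return df;
    }

    /** Assumed idf form: log10(N / df); terms absent from the collection get 0. */
    static double idf(int numDocs, int docFreq) {
        return docFreq <= 0 ? 0.0 : Math.log10((double) numDocs / docFreq);
    }

    /** Weights w_i = tf_i * idf_i for every term of one document or query. */
    static Map<String, Double> weights(Map<String, Integer> tf,
                                       Map<String, Integer> df, int numDocs) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            w.put(e.getKey(), e.getValue() * idf(numDocs, docFreq));
        }
        return w;
    }

    /** Euclidean norm taken over the tf*idf weights, not over raw term counts. */
    static double norm(Map<String, Double> w) {
        double sum = 0;
        for (double v : w.values()) sum += v * v;
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two weight vectors; 0 if either has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        double denom = norm(a) * norm(b);
        return denom == 0 ? 0 : dot / denom;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "search"),
                Arrays.asList("vector", "space", "search"));
        Map<String, Integer> df = docFrequencies(docs);
        Map<String, Double> query =
                weights(termFrequencies(Arrays.asList("lucene", "search")), df, docs.size());
        for (List<String> doc : docs) {
            Map<String, Double> w = weights(termFrequencies(doc), df, docs.size());
            System.out.println(doc + " norm=" + norm(w) + " cosine=" + cosine(query, w));
        }
    }
}
```

Note that the norm here needs the idf of every term in the document, so it depends on collection-wide statistics and not only on the document itself.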

FYI: I could implement “idf” according to the miislita.com formula, but not
“tf” and “norm”.

Could you please advise me on how to implement a new Similarity class that
fits into Lucene’s architecture but still implements the vector space
model given on miislita.com?

Thanks a lot for your comments,

Dharma

--
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15696719.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Grant Ingersoll at Feb 28, 2008 at 12:05 pm
    Not sure I am understanding what you are asking, but I will give it a
    shot. See below

    On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:

    > In that model, “tf” is multiplied with Idf for all terms in the
    > index, but in Lucene “tf” is calculated only for terms in the given
    > Query. Because of that effect, the norm calculation should also
    > include “idf” for all terms. Lucene calculates the norm, during
    > indexing, by “just” counting the number of terms per document. In
    > the web formula (in miislita.com), a document norm is calculated
    > after multiplying “tf” and “idf”.

    Are you wondering if there is a way to score all documents regardless
    of whether the document has the term or not? I don't quite get your
    statement: "In that model, “tf” is multiplied with Idf for all terms
    in the index, but in Lucene “tf” is calculated only for terms in the
    given Query."

    Isn't the result for those documents that don't have query terms just
    going to be 0, or am I not fully understanding? I briefly skimmed the
    paper you cite and it doesn't seem that different; it's just
    describing Salton's VSM, right?

    > FYI: I could implement “idf” according to miislita.com formula, but
    > not the “tf” and “norm”
    >
    > Could you please comment me how I can implement a new Similarity
    > class that will fit in the Lucene’s architecture, but still
    > implement the vector space model given in miislita.com

    In the end, you may need to implement some lower level Query classes,
    but I still don't fully understand what you are trying to do, so I
    wouldn't head down that path just yet.

    --------------------------
    Grant Ingersoll
    http://www.lucenebootcamp.com
    Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ






  • Dharmalingam at Feb 28, 2008 at 2:00 pm
    Thanks for the reply. Sorry if my explanation was not clear. Yes, you are
    correct: the model is based on Salton's VSM. However, the calculation of the
    term weight and the doc norm is, in my opinion, different from Lucene's. If
    you look at the table given in
    http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
    document norm based on the weight wi=tfi*idfi. I looked at the methods of
    the Similarity and DefaultSimilarity classes. I reproduce them below:

    public float lengthNorm(String fieldName, int numTerms) {
      return (float)(1.0 / Math.sqrt(numTerms));
    }

    You can see that this lengthNorm for a doc is quite different from the
    norm calculation on that website.

    Similarly, the queryNorm method of the DefaultSimilarity class is:

    /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
    public float queryNorm(float sumOfSquaredWeights) {
      return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    This again differs from the website model.

    I also have difficulties with the tf method of DefaultSimilarity:

    /** Implemented as <code>sqrt(freq)</code>. */
    public float tf(float freq) {
      return (float)Math.sqrt(freq);
    }

    In that website model, a tf refers to the frequency of a term within a doc.
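
    The gap between the two norms can be shown with a few lines of plain Java. The sketch below places the lengthNorm formula quoted above next to a norm taken over the weights wi=tfi*idfi. The class name, the sample numbers, and the idf form log10(N / df) are assumptions made for illustration, not part of Lucene or of the website model as published.

```java
/** Contrasts a term-count norm with a tf*idf-weighted norm for one document. */
final class NormComparisonSketch {

    /** Same formula as the lengthNorm quoted above: uses only the term count. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /**
     * Norm over w_i = tf_i * idf_i, with the assumed idf_i = log10(N / df_i).
     * tf[i] and df[i] describe the i-th distinct term of the document.
     */
    static double weightedNorm(int[] tf, int[] df, int numDocs) {
        double sum = 0;
        for (int i = 0; i < tf.length; i++) {
            double w = tf[i] * Math.log10((double) numDocs / df[i]);
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[] tf = {2, 1};   // hypothetical document: one term twice, another once
        int[] df = {1, 2};   // hypothetical document frequencies of those terms
        int numTerms = 3;    // total tokens in the document (2 + 1)

        // The term-count norm is identical no matter how large the collection is.
        System.out.println("lengthNorm          = " + lengthNorm(numTerms));

        // The weighted norm shifts when the collection grows, because idf changes.
        System.out.println("weightedNorm (N=2)  = " + weightedNorm(tf, df, 2));
        System.out.println("weightedNorm (N=4)  = " + weightedNorm(tf, df, 4));
    }
}
```

    The point of the comparison is that the weighted norm depends on N and df, which keep changing while documents are added, whereas the term-count norm can be fixed when a single document is indexed.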

    I hope I have explained it better. Please let me know if it is unclear. I am
    looking for an easy way to implement that table, and of course I want to
    integrate it with my Lucene setup (i.e., myIndexWriter.setSimilarity(new
    mySimilarity());). Will this be possible just by inheriting from Lucene's
    base classes?

    Thanks for your advice.

  • Grant Ingersoll at Feb 28, 2008 at 5:45 pm

    On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:

    > If you look at the table given in
    > http://www.miislita.com/term-vector/term-vector-3.html, they
    > calculate the document norm based on the weight wi=tfi*idfi. I
    > looked at the interfaces of Similarity and DefaultSimilarity class.
    > I place it below:
    >
    > public float lengthNorm(String fieldName, int numTerms) {
    > return (float)(1.0 / Math.sqrt(numTerms));
    > }
    >
    > You can see that this lengthNorm for a doc is quite different from
    > that website norm calculation.

    The lengthNorm method is different from the IDF calculation. In the
    Similarity class, that is handled by the idf() method. Length norm is
    an attempt to address one of the limitations listed further down in
    that paper:
    "Long Documents: Very long documents make similarity measures
    difficult (vectors with small dot products and high dimensionality)"

    > Similarly, the querynorm interface of DefaultSimilarity class is:
    >
    > /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
    > public float queryNorm(float sumOfSquaredWeights) {
    > return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    > }
    >
    > This is again different the website model.

    Query norm is an attempt to allow for comparison of scores across
    queries, but I don't think one should do that anyway.

    > I also have difficulties with tf interface of DefaultSimilarity:
    > /** Implemented as <code>sqrt(freq)</code>. */
    > public float tf(float freq) {
    > return (float)Math.sqrt(freq);
    > }

    These are all callback methods from within the Scorer classes that
    each Query uses. Have a look at TermScorer for how these things get
    called.


    Try this as an example:

    Set up a really simple index with 1 or 2 docs, each with a few words.
    Set up a simple Similarity class where you override all of these
    methods to return 1 (or some simple default),
    and then index your documents and do a few queries.

    Then, have a look at Searcher.explain() to see why a document scores
    the way it does. Then, you can work to modify from there.
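
    The idea behind that experiment can be previewed without Lucene. The sketch below uses a hypothetical Factors interface as a stand-in for the callbacks discussed in this thread; it is not Lucene's Similarity class, and the toy scorer leaves out coord and queryNorm entirely. With every factor fixed at 1, each matching query term contributes exactly 1 to the score, so restoring one factor at a time shows what that factor alone does.

```java
import java.util.*;

/** Toy scorer with pluggable factors, to illustrate neutralizing them one by one. */
final class NeutralFactorsSketch {

    /** Hypothetical stand-in for the scoring callbacks; not a Lucene type. */
    interface Factors {
        double tf(int freq);
        double idf(int docFreq, int numDocs);
        double lengthNorm(int numTerms);
    }

    /** Every factor returns 1, so the score is the number of matching query terms. */
    static final Factors ALL_ONES = new Factors() {
        public double tf(int freq) { return 1; }
        public double idf(int docFreq, int numDocs) { return 1; }
        public double lengthNorm(int numTerms) { return 1; }
    };

    /** Only tf is restored, as sqrt(freq); idf and lengthNorm stay neutral. */
    static final Factors SQRT_TF = new Factors() {
        public double tf(int freq) { return Math.sqrt(freq); }
        public double idf(int docFreq, int numDocs) { return 1; }
        public double lengthNorm(int numTerms) { return 1; }
    };

    /** Sums tf * idf^2 * lengthNorm over the query terms that occur in the doc. */
    static double score(List<String> query, List<String> doc,
                        Map<String, Integer> docFreq, int numDocs, Factors f) {
        double sum = 0;
        for (String term : query) {
            int freq = Collections.frequency(doc, term);
            if (freq == 0) continue;  // a term missing from the doc adds nothing
            double idf = f.idf(docFreq.getOrDefault(term, 0), numDocs);
            sum += f.tf(freq) * idf * idf * f.lengthNorm(doc.size());
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("lucene", "lucene", "lucene", "lucene", "index");
        List<String> query = Arrays.asList("lucene", "vector");
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 1);

        System.out.println("all factors = 1 : " + score(query, doc, docFreq, 1, ALL_ONES));
        System.out.println("sqrt(tf) only   : " + score(query, doc, docFreq, 1, SQRT_TF));
    }
}
```

    In a real index, the output of Searcher.explain() plays the role of these printouts.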

    Here's the bigger question: what is your ultimate goal here? Are you
    just trying to understand Lucene at an academic/programming level or
    do you have something you are trying to achieve in terms of relevance?

    -Grant

  • Dharmalingam at Feb 28, 2008 at 8:56 pm
    Thanks for your tips. My overall goal is to quickly implement 7 variants of
    the vector space model using Lucene. You can find these variants in the
    uploaded file.

    I am doing all of this for a much broader goal: I am trying to recover
    traceability links from requirements to source code files. I treat every
    requirement as a query. For this problem, I would like to compare this
    collection of algorithms for their relevance.




    http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf
  • Grant Ingersoll at Feb 28, 2008 at 10:09 pm
    FYI: The mailing list handler strips attachments.

    At any rate, sounds like an interesting project. I don't know how
    easy it will be for you to implement 7 variants of VSM in Lucene given
    the nature of the APIs, but if you do, it might be handy to see your
    changes as a patch. :-) Also not quite sure what all those variants
    will help with when it comes to your broader goal, but that isn't for
    me to decide :-) Seems like your goal is to find the traceability
    stuff, not see if you can figure out how to change Lucene's
    similarity! To that end, my two cents would be to focus on creating
    the right kinds of queries, analyzers, etc.


    -Grant
  • Dharmalingam at Feb 28, 2008 at 10:19 pm
    You can find those variants of the vector space model in this interesting
    article:
    http://ieeexplore.ieee.org/iel1/52/12658/00582976.pdf?tp=&isnumber=&arnumber=582976

    You have now confirmed for me that, given the current nature of the
    Similarity API, it will not be easy to realize these variants quickly.

    Actually, I implemented the earlier website model as a separate Java
    program, which uses Lucene classes but does not inherit from the Similarity
    class. It appears that inheriting from the Similarity class will not solve
    my problem of realizing these variants.


  • H t at Feb 29, 2008 at 2:19 am
    Compared with the classical VSM, Lucene just ignores the denominator
    (|Q|*|D|) of the similarity formula, but it adds norm(t,d) and coord(q,d)
    to account for the fraction of terms in the query and the doc, so in
    practice it is a modified implementation of the VSM.
    Do you just want to verify, using Lucene, which implementation of the VSM
    in "ieee-sw-rank" is more precise in practice?
    If so, it's a useful experiment.
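
    For reference, the two formulas being compared can be written side by side. The first is the textbook cosine measure with weights w = tf * idf. The second is the practical scoring function as given in the Similarity javadoc of Lucene releases from that period; it is reproduced here from memory, so please check it against the javadoc of the version you use.

```latex
% Classical VSM cosine similarity, with w_{t,x} = tf_{t,x} \cdot idf_t
\mathrm{sim}(q,d)
  = \frac{Q \cdot D}{|Q|\,|D|}
  = \frac{\sum_{t} w_{t,q}\, w_{t,d}}
         {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}

% Lucene's practical scoring function (per the Similarity javadoc)
\mathrm{score}(q,d)
  = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
    \sum_{t \in q} \mathrm{tf}(t \text{ in } d)\cdot \mathrm{idf}(t)^{2}\cdot
    \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)
```

    With the DefaultSimilarity methods quoted earlier in this thread, queryNorm(q) is 1/sqrt(sumOfSquaredWeights) and the length part of norm(t,d) is 1/sqrt(numTerms), so the document side is scaled by term count and not by the tf*idf-weighted norm |D|.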

Discussion Overview
group: java-user
category: lucene
posted: Feb 26, '08 at 8:46p
active: Feb 29, '08 at 2:19a
posts: 8
users: 3
website: lucene.apache.org
