Hi all!

I need to find out how important a word is across the entire document
collection indexed in a Lucene index. I want to extract some "representative
words": concepts that are common and can represent the whole collection, i.e.
collection "keywords". I did full-text indexing, and the only field I am
using is the text content, because the document titles are mostly not
representative (numbers, codes, etc.).

If I calculate tf-idf, it gives me the importance of a single term with
respect to a single document. But if that word is repeated across documents,
how can I calculate its total importance within the index?

All help appreciated!! Thank you!!!

--
View this message in context: http://lucene.472066.n3.nabble.com/Hot-to-get-word-importance-in-lucene-index-tp988836p988836.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Karl Wettin at Jul 23, 2010 at 7:57 am
    Hi,

    Please define "important". Important to do what?

    It would probably be helpful if you explained what it is you attempt
    to achieve by doing this. Perhaps there is something in MoreLikeThis
    that will help you?


    karl




  • Xaida at Jul 23, 2010 at 8:55 am
    Hi! Thanks for the reply! I will try to explain better, sorry if it was unclear.

    I have a collection of a user's text documents. It is not too big. The goal
    is to get the most "important" concepts, which would in a way represent the
    user's interests. That is what I mean when I say important :)

    So let's say my collection contains my school documents, some snowboarding
    articles, some backpacking and easy travelling guides, my favorite cooking
    recipes, and so on. The collection is more or less supervised, so the number
    of documents for each "area" is similar. Not equal, but there is some balance.

    As a result, I would like to get the terms that are important across the
    entire collection. For example, I think the term "cheese" should appear in
    my results, because I know there is a lot of cheese in my recipes. I would
    also like to get the term "database" from my school documents. And so on.

    Nothing smarter comes to my mind than this :)
    step 1: take one document
    step 2: calculate tf-idf for all its terms
    step 3: take the terms with the best tf-idf and save them somewhere
    step 4: go to step 1, and so on for all the documents

    And in the end, merge these results somehow :/

    I guess there is a better way :)

    Thank you!!
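
The four steps in the post above, plus one possible merge, can be sketched roughly as follows. This is a library-independent illustration in plain Python, not Lucene code: `tokenize`, `top_terms_per_document`, `merge_by_votes`, and the parameter `k` are hypothetical names, the whitespace tokenizer only stands in for a real analyzer, and merging by "votes" (how many documents ranked a term in their top k) is just one assumed way to combine the per-document results.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()

def top_terms_per_document(docs, k=3):
    """Steps 1-4: for each document, return its k highest tf-idf terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Plain tf * log(N / df); a term found in every document scores 0.
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result

def merge_by_votes(per_doc_terms):
    """Merge step (an assumption): count how many documents picked each term."""
    votes = Counter(term for terms in per_doc_terms for term in terms)
    return votes.most_common()

docs = [
    "cheese pasta cheese cheese",
    "cheese pizza cheese cheese",
    "database index database database",
    "database query database database",
]
per_doc = top_terms_per_document(docs, k=1)
merged = merge_by_votes(per_doc)
```

With this toy collection, "cheese" and "database" each win in two documents, so both surface in the merged list even though neither appears in every document. Voting favors terms that are distinctive in several documents; summing the scores instead would favor terms that are very frequent in a few long documents.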





  • Karl Wettin at Jul 23, 2010 at 9:23 am
    Are you perhaps looking for this:

    http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

    ?

    karl

  • Xaida at Jul 23, 2010 at 11:30 am
    Thanks!

    I am not sure yet. I have to study this class in more depth today; it is a
    bit complex, and I am not an advanced enough user to understand all of it.
    But this part of the class description is important to me:

    "An efficient, effective "more-like-this" query generator would be a great
    contribution, if anyone's interested. I'd imagine that it would take a
    Reader or a String (the document's text), analyzer Analyzer, and return a
    set of representative terms using heuristics like those above. "

    I know that "more like this" finds a set of similar documents (but I don't
    need documents, I need the "important" terms from the whole index, to
    collect them into a list, for example). So in my case, would the important
    terms I need actually be the terms generated for this query?

  • Grant Ingersoll at Jul 23, 2010 at 11:44 am
    Couple of thoughts inline...

    On Jul 22, 2010, at 10:44 PM, Xaida wrote:

    > So, if i calculate tfidf, it gives me importance of single term with respect
    > to single document.

    TF gives you the importance in a single document.
    IDF gives you the inverse of importance across the collection.

    > But if that word is repeating in the documents, how can
    > i calculate its total importance within index?

    Lucene can also normalize by length, which is often part of these calculations too.

    The term and document frequencies can be retrieved from TermDocs, TermEnum, etc.

    Also, as a related item, you may be interested in important phrases, which can often be more helpful. Check out https://cwiki.apache.org/confluence/display/MAHOUT/Collocations for one way of doing that.

    -Grant

    ---------------------
    Grant Ingersoll
    http://www.lucidimagination.com
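
As a rough illustration of how the three ingredients named in the reply above (TF, IDF, and length normalization) might be combined into one collection-wide number per term, here is a minimal sketch in plain Python. It does not use Lucene and does not reproduce Lucene's scoring formula: the function name `collection_scores`, the whitespace tokenization, the division by document length, and the smoothed `log(1 + N/df)` weight are all assumptions chosen for the example.

```python
import math
from collections import Counter

def collection_scores(docs):
    """Sum length-normalized tf over all documents, weighted by a smoothed idf."""
    # Assumed tokenization: lowercase and split on whitespace.
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        tf = Counter(tokens)
        for term, count in tf.items():
            # Smoothed idf (an assumption): stays positive even when a term
            # occurs in every document.
            idf = math.log(1 + n_docs / df[term])
            # count / len(tokens) is a simple form of length normalization.
            scores[term] += (count / len(tokens)) * idf
    return scores

scores = collection_scores([
    "cheese pasta cheese",
    "cheese pizza",
    "database index",
])
best_term = scores.most_common(1)[0][0]
```

Summing over documents rewards terms that recur across the collection, while the idf factor dampens terms that occur almost everywhere. Very common words can still rank high under this scheme, so stop-word removal at analysis time is assumed.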
