Lucene developers,

We’ve been working on an undergraduate college project that modifies
Apache Nutch (which uses Lucene to index its web pages) to include a
category filter, and we are having problems with the query part. We want
the application to perform well, so we thought this would be the best
place to ask this kind of question. The idea is that the user can restrict
a search to the stored pages of a single category, so the number of results
shown should be the number of matching pages actually classified in that
category.

The problem is how to add the category information to the Lucene index
and how to filter the search on it. We looked through the Nutch
mailing list (Nabble) and asked for help there, but people there think
that we should use a plug-in like Carrot, which takes around 100 pages
and classifies them at query time. We are not confident that this is
the best solution. We thought of two other ideas: #1 Classify the pages,
store that information in a DB, and at query time use that DB to filter
the results. #2 Use different index servers, one for each category and
one for searching without a category filter.

We have seen that the project at http://search-lucene.com/ has
pre-defined categories. We think these are assigned at indexing time,
which is what we want.

Do you have any other ideas about how to do this?

Sincerely,

Daniel Costa Gimenes & Luan Cestari
Undergraduate students of University Center of FEI
Brazil
--
View this message in context: http://lucene.472066.n3.nabble.com/Using-categories-with-Lucene-tp1049232p1049232.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Otis Gospodnetic at Aug 9, 2010 at 5:04 am
    Hello Luan,

    I think you are looking for facets and faceted search. In short, it means
    storing the category of a document (web page) in a Field of the Document in
    the Lucene index. Then, at search time, you count how many matches fall in
    each category. You can implement this yourself or you can use Solr, which has
    this functionality built in. If you want to stick with Lucene and don't want
    Solr, you can use Bobo Browse with Lucene. Lucene in Action 2 has a case study
    about Bobo Browse, where you can learn how this is done. Slick stuff.

    Thanks for using http://search-lucene.com :)

    Otis
    ----
    Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
    Lucene ecosystem search :: http://search-lucene.com/
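
The faceting idea described above can be sketched without any search library. This is a minimal illustration in plain Python, not Lucene, Solr or Bobo Browse code: the `DOCS` corpus, the `search` helper and the `facet_counts` helper are all hypothetical names made up for the example. Each document carries a category assigned at indexing time, and the per-category counts are computed over the hits of a query.

```python
from collections import Counter

# Hypothetical toy corpus: each "document" carries a category field that
# was assigned at indexing time, alongside its text.
DOCS = [
    {"url": "http://example.org/a", "category": "sports", "text": "football match results"},
    {"url": "http://example.org/b", "category": "sports", "text": "tennis match schedule"},
    {"url": "http://example.org/c", "category": "science", "text": "football physics explained"},
    {"url": "http://example.org/d", "category": "science", "text": "quantum physics news"},
]


def search(docs, term):
    """Return the documents whose text contains the query term as a word."""
    return [d for d in docs if term in d["text"].split()]


def facet_counts(hits, field="category"):
    """Count how many of the matching documents fall under each field value."""
    return Counter(d[field] for d in hits)


hits = search(DOCS, "football")
counts = facet_counts(hits)
```

Here `counts` holds one entry per category among the pages matching "football", which is the number a category filter would display next to each category name.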


  • Findbestopensource at Aug 9, 2010 at 5:13 am
    Hello Daniel & Luan

    1. Carrot is not required for your purpose. Carrot helps to
    consolidate the results from multiple search results.

    2. You need to add a category to the pages at index time and
    filter the results at search time. If you want to use Lucene,
    then you could store the category information in a separate DB or XML
    file and use it to filter the search results. If you want an automatic
    system to display the category data, then you could consider using Solr.
    You could check out our website, where we have used Solr to display the
    category / tagged field data.

    Regards
    Aditya
    www.findbestopensource.com
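
Adding the category at index time and filtering at search time can be illustrated with a toy inverted index. This is only a sketch in plain Python under stated assumptions: `TinyIndex` and its methods are hypothetical and are not part of Lucene or Solr. The category is stored as one more indexed term, so the filter becomes an intersection of posting sets and the hit count already reflects the chosen category.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index mapping (field, term) to the set of matching doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text, category):
        # Index time: the category is stored as one more indexed term.
        for word in text.lower().split():
            self.postings[("text", word)].add(doc_id)
        self.postings[("category", category)].add(doc_id)

    def search(self, word, category=None):
        # Search time: intersect the text matches with the category postings.
        hits = set(self.postings.get(("text", word.lower()), set()))
        if category is not None:
            hits &= self.postings.get(("category", category), set())
        return sorted(hits)


index = TinyIndex()
index.add(1, "Football match results", "sports")
index.add(2, "Football physics explained", "science")
index.add(3, "Tennis match schedule", "sports")

all_hits = index.search("football")
sports_hits = index.search("football", category="sports")
```

With this layout no external DB is consulted at query time; `len(sports_hits)` is the number of matching pages in the selected category.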



  • Luan Cestari at Aug 10, 2010 at 2:22 am
    We would like to say thanks for the replies.

    We found a plugin in Nutch (the Creative Commons plugin) that does what Otis
    described. It adds information to the index and then uses it to filter the
    results at query time.

    Thanks again for the help.

    Best Regards,
    Daniel & Luan
  • Glen Newton at Aug 10, 2010 at 6:05 am
    Hi Luan,

    Could you tell us the name and/or URL of this plugin so that the list
    might know about it?
    Thanks,
    Glen

  • Luan Cestari at Aug 11, 2010 at 4:52 pm
    Hi Glen,

    The URL of the Creative Commons plugin package is
    http://netlikon.de/docs/javadoc-nutch/trunk/org/creativecommons/nutch/package-summary.html

    The CCIndexingFilter class adds a field to the index, and the
    CCQueryFilter class uses that new field to filter the results.

    Regards,
    Luan

    --
    Luan cestari

    "All the gold which is under or upon the earth is not enough to give in
    exchange for virtue."
    Plato
    "At his best, man is the noblest of all animals; separated from law and
    justice he is the worst."
    "A true friend is one soul in two bodies."
    Aristotle
  • Julien Nioche at Aug 11, 2010 at 5:17 pm
    BTW I don't remember anyone on the Nutch list suggesting that you use Carrot
    for this (see: http://search-lucene.com/?q=luan+carrot) or classify at
    query time.

    What I suggested in http://search-lucene.com/m/JWZTj1q4lB92 was
    classifying during parsing or indexing and generating a field for Lucene
    or Solr. As Otis pointed out, you can of course use Solr for faceting. Since
    you will be using Nutch anyway, you might as well avoid an external DB just
    for storing the results of the classification and simply keep the labels,
    e.g. in the parse metadata.

    Julien
    --
    DigitalPebble Ltd

    Open Source Solutions for Text Engineering
    http://www.digitalpebble.com
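
The flow suggested above (classify while parsing, keep the label in the parse metadata, then emit it as an index field) might look roughly like the following. This is a plain-Python sketch, not the Nutch plugin API: `KEYWORDS`, `classify`, `parse` and `to_index_document` are hypothetical names, and the keyword-overlap classifier is only a stand-in for whatever real classifier the project would use.

```python
# Hypothetical keyword lists standing in for a real classifier.
KEYWORDS = {
    "sports": {"football", "tennis", "match"},
    "science": {"physics", "quantum", "experiment"},
}


def classify(text):
    """Pick the category whose keywords overlap the text most ("other" if none)."""
    words = set(text.lower().split())
    scores = {label: len(words & kws) for label, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


def parse(url, raw_text):
    """Parsing step: extract the text and attach the label as parse metadata."""
    return {"url": url, "text": raw_text, "metadata": {"category": classify(raw_text)}}


def to_index_document(parsed):
    """Indexing step: copy the label from the metadata into an index field."""
    return {
        "url": parsed["url"],
        "content": parsed["text"],
        "category": parsed["metadata"]["category"],
    }


doc = to_index_document(parse("http://example.org/a", "Quantum physics experiment"))
```

Because the label travels with the parsed page, nothing outside the crawl data has to be consulted when the index is built.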
  • Shuai Weng at Aug 11, 2010 at 5:22 pm
    Hey,

    I'm new to Lucene... I was wondering if we can use Lucene/Solr for word
    frequency counting (e.g., in a subset of full-text papers).

    Thanks for any info you may provide.
    Shuai
  • Greg Gershman at Aug 13, 2010 at 7:13 pm
    Absolutely!

    Index your documents, then open an IndexReader and take a look at the terms()
    method. You can grab each term, and pass it to the IndexReader using the
    docFreq(Term t) method and get back the number of documents that term appears
    in.

    Greg
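
Note that docFreq(Term t) reports how many documents contain a term, not how many times the term occurs in total. The difference can be seen in a small plain-Python sketch; it does not use Lucene, and the `PAPERS` data and both helper functions are hypothetical. Restricting the count to a subset of papers is done here simply by passing only those papers in.

```python
from collections import Counter

# Hypothetical mini-corpus of "papers".
PAPERS = {
    "paper1": "gene expression in yeast gene networks",
    "paper2": "yeast genome sequencing",
    "paper3": "protein folding and gene regulation",
}


def doc_freq(docs):
    """Number of documents each term appears in (one count per document)."""
    df = Counter()
    for text in docs.values():
        df.update(set(text.lower().split()))
    return df


def total_term_freq(docs):
    """Total number of occurrences of each term across all documents."""
    tf = Counter()
    for text in docs.values():
        tf.update(text.lower().split())
    return tf


df = doc_freq(PAPERS)
tf = total_term_freq(PAPERS)

# Counting within a subset: only pass the papers of interest.
subset = {name: PAPERS[name] for name in ("paper1", "paper2")}
subset_df = doc_freq(subset)
```

In this data "gene" appears in two papers but three times overall, so the two counters disagree for that term.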



  • Luan Cestari at Aug 11, 2010 at 7:44 pm
    Julien,

    You're right. We discovered carrot by searching the mailing-list and thought
    it was mentioned in one of our conversations. We are sorry for our mistake.

    Best Regards,
    Daniel Gimenes
    Luan Cestari

Discussion Overview
group: java-user
categories: lucene
posted: Aug 8, '10 at 11:16p
active: Aug 13, '10 at 7:13p
posts: 10
users: 7
website: lucene.apache.org
