Grokbase Groups Lucene dev July 2011
Is it faceting per-segment?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


  • Shai Erera at Jul 9, 2011 at 3:45 am
    Currently it doesn't facet per segment, because the approach it uses
    is independent of segments.

    It maintains a count array in the size of the taxonomy and every
    matching document contributes to the weight of the categories it is
    associated with, regardless of the segment it is found in.
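
A minimal sketch of the count-array idea described above, in plain Java rather than the module's real classes. The names `GlobalCountArraySketch`, `countFacets` and `ordinalsPerDoc` are hypothetical, and the per-document ordinals are assumed to be available as a simple array; the actual module reads them from the index.

```java
import java.util.Arrays;

/** Toy model of counting facets against a global taxonomy (illustrative only). */
final class GlobalCountArraySketch {

    /**
     * @param taxonomySize   number of categories in the global taxonomy
     * @param ordinalsPerDoc ordinalsPerDoc[doc] lists the category ordinals of that document
     * @param matchingDocs   ids of the documents that matched the query
     * @return one counter per taxonomy ordinal
     */
    static int[] countFacets(int taxonomySize, int[][] ordinalsPerDoc, int[] matchingDocs) {
        int[] counts = new int[taxonomySize];
        for (int doc : matchingDocs) {
            // Every matching document bumps the counter of each category it carries;
            // the segment the document lives in plays no role here.
            for (int ordinal : ordinalsPerDoc[doc]) {
                counts[ordinal]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] ordinalsPerDoc = { {0, 2}, {1}, {0, 1}, {2} };
        int[] counts = countFacets(3, ordinalsPerDoc, new int[] {0, 2, 3});
        System.out.println(Arrays.toString(counts)); // prints [2, 1, 2]
    }
}
```

Because the ordinals are global, the array can be indexed directly and no per-segment translation is needed in this toy model.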

    The taxonomy is global to the index, but I think it will be
    interesting to explore per-segment taxonomy, and how it can be used to
    improve indexing or search perf (hopefully both).

    Shai
  • Jason Rutherglen at Jul 9, 2011 at 3:56 am

    > The taxonomy is global to the index, but I think it will be
    > interesting to explore per-segment taxonomy, and how it can be used to
    > improve indexing or search perf (hopefully both)

    Right so with NRT this'll be an issue. Is there a write up on this?
    It sounds fairly radical in design. Eg, I'm curious as to how it
    compares with the bit set and un-inverted field cache based faceting
    systems.
  • Shai Erera at Jul 9, 2011 at 4:41 am
    Well, the approach is entirely different, and the new module
    introduces features not available in the other impls (and I imagine
    vice versa).

    The taxonomy is managed on the side, hence why it is global to the
    'content' index. It plays very well with NRT, and we in fact have
    several apps that use the module in an NRT environment.

    The taxonomy index supports NRT by itself, by using the IR.open(IW)
    API and then it's up to the application to manage its content index
    search as NRT.

    I think you should read the high-level description I put on
    LUCENE-3079 and the userguide I put on LUCENE-3261. As I said, the
    approach is quite different than the bitset and FieldCache ones.

    Shai
  • Michael McCandless at Jul 9, 2011 at 11:13 am
    Actually I think the faceting module is per-segment?

    The facets are encoded into payloads, and then it visits the payload
    of each hit, per segment, and aggregates the counts.

    Like, on reopen (NRT or not) of a reader, there are no global data
    structures that must be recomputed. EG, this facets impl doesn't use
    FieldCache on the global reader (leading to insanity....).
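
A rough sketch of what "visit each hit per segment and aggregate" could look like, in plain Java. `Segment`, `aggregate` and `hitsPerSegment` are hypothetical stand-ins and not Lucene classes; the per-document ordinals are assumed to be held in arrays here, where the module would decode them from payloads.

```java
import java.util.List;

/** Toy model: facet ordinals are stored with each segment and read per hit. */
final class PerSegmentAggregationSketch {

    /** One segment: category ordinals per document, addressed by segment-local doc id. */
    static final class Segment {
        final int[][] ordinalsPerDoc;

        Segment(int[][] ordinalsPerDoc) {
            this.ordinalsPerDoc = ordinalsPerDoc;
        }
    }

    /** hitsPerSegment.get(i) holds the segment-local ids of the hits in segments.get(i). */
    static int[] aggregate(int taxonomySize, List<Segment> segments, List<int[]> hitsPerSegment) {
        int[] counts = new int[taxonomySize];
        for (int i = 0; i < segments.size(); i++) {
            Segment segment = segments.get(i);
            for (int localDoc : hitsPerSegment.get(i)) {
                // Only data owned by this segment is touched; nothing is keyed on a
                // global (top-level) doc id, so a reopen has nothing to rebuild.
                for (int ordinal : segment.ordinalsPerDoc[localDoc]) {
                    counts[ordinal]++;
                }
            }
        }
        return counts;
    }
}
```

In this model, a reopen that adds a segment just means one more element in `segments`; the existing segments are reused unchanged.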

    Mike McCandless

    http://blog.mikemccandless.com
  • Jason Rutherglen at Jul 11, 2011 at 6:28 am
    > Actually I think the faceting module is per-segment?

    That would be very cool. I reviewed the user guide and it is
    ambiguous on this topic. E.g., why does the facet taxonomy need to
    be committed for every IW commit? Mapping that to [N]RT will be
    tricky.

    Page 17:

    "In faceted search, we complicate things somewhat by adding a second
    index – the taxonomy index. The taxonomy API also follows
    point-in-time semantics, but this is not quite enough. Some attention
    must be paid by the user to keep those two indexes consistently in
    sync:"

    "The main index refers to category numbers defined in the taxonomy
    index. Therefore, it is important that we open the TaxonomyReader
    after opening the IndexReader. Moreover, every time an IndexReader is
    reopen()ed, the TaxonomyReader needs to be refresh()ed as well."

    "But there is one extra caution: whenever the application deems it has
    written enough information worthy a commit, it must first call
    commit() for the TaxonomyWriter and only after that call commit() for
    the IndexWriter. Closing the indices should also be done in this
    order – first close the taxonomy, and only after that close the
    index."


  • Shai Erera at Jul 11, 2011 at 6:46 am
    Hi Jason,

    The reason why the taxonomy and content indexes need to be in sync is
    because the taxonomy index manages the categories and their ordinals. The
    ordinals are written in a special posting list in the content index (I think
    we should cut over this part to use DocValues).

    Now imagine that you only commit to the content index, but not to the
    taxonomy index. If the system crashes, the content index will refer to
    ordinals which the taxonomy index does not know about.
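
The crash scenario can be illustrated with a toy model, sketched here under simplifying assumptions: each "index" is reduced to a pending state and a committed state, and a crash keeps only what was committed. `CommitOrderSketch` and all of its methods are hypothetical names, not the TaxonomyWriter or IndexWriter API.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of two indexes that are committed separately. */
final class CommitOrderSketch {
    // Durable state: this is all that survives a crash.
    private int committedTaxonomySize = 0;
    private List<int[]> committedDocs = new ArrayList<>();

    // In-memory state: lost on a crash.
    private int pendingTaxonomySize = 0;
    private final List<int[]> pendingDocs = new ArrayList<>();

    /** Adds a category to the taxonomy and returns its new ordinal. */
    int addCategory() {
        return pendingTaxonomySize++;
    }

    /** Adds a content document that refers to the given category ordinals. */
    void addDocument(int... ordinals) {
        pendingDocs.add(ordinals);
    }

    void commitTaxonomy() {
        committedTaxonomySize = pendingTaxonomySize;
    }

    void commitContent() {
        committedDocs = new ArrayList<>(pendingDocs);
    }

    /** True if no committed document refers to an ordinal the committed taxonomy lacks. */
    boolean consistentAfterCrash() {
        for (int[] doc : committedDocs) {
            for (int ordinal : doc) {
                if (ordinal >= committedTaxonomySize) {
                    return false; // dangling ordinal
                }
            }
        }
        return true;
    }
}
```

Committing the taxonomy first means a crash between the two commits leaves, at worst, categories that no document uses yet, which is harmless; the reverse order can leave documents that point at unknown ordinals.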

    NRT-wise, the taxonomy index is much smaller than the content index. Imagine
    what it takes to manage NRT over regular content. Every document you add
    includes probably some moderate size of text that's parsed, stored fields,
    term vectors and what not. Flushing that data (during getReader()), whether
    to FSDir or RAMDir is much more expensive than flushing the information in
    the taxonomy index, where every document contains a single term with the
    category label.

    So I don't think we should be worried too much about the taxonomy index's
    NRT support and performance. It is orders of magnitude smaller than the
    other index.

    There is one thing we should improve about per-segment faceting in the new
    module -- by default categories are read from the posting list's payload,
    but there is a way to load all categories into RAM and fetch them from there
    during search. Today that code is not per-segment, and I think it should be.
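
One way a per-segment in-RAM cache might be organised, as a sketch only: ordinals are cached under a per-segment key so that a reopen loads just the segments that are new. The class, the `SegmentLoader` interface and the use of segment names as keys are all assumptions made for illustration; this is not how the module is implemented today.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy per-segment cache of category ordinals. */
final class PerSegmentCategoryCacheSketch {

    /** Stand-in for reading one segment's ordinals from the index (hypothetical). */
    interface SegmentLoader {
        int[][] load(String segmentName);
    }

    private final Map<String, int[][]> cache = new HashMap<>();
    private final SegmentLoader loader;
    int loads = 0; // number of segments actually loaded, for illustration

    PerSegmentCategoryCacheSketch(SegmentLoader loader) {
        this.loader = loader;
    }

    /** Called on every (re)open with the names of the segments currently in the index. */
    void refresh(List<String> currentSegments) {
        // Drop entries for segments that no longer exist (for example after a merge).
        cache.keySet().retainAll(currentSegments);
        for (String name : currentSegments) {
            if (!cache.containsKey(name)) {
                cache.put(name, loader.load(name)); // only new segments cost anything
                loads++;
            }
        }
    }

    /** Ordinals of one document, addressed by segment name and segment-local doc id. */
    int[] ordinals(String segmentName, int localDoc) {
        return cache.get(segmentName)[localDoc];
    }
}
```

With a single index-wide array instead, every reopen would have to rebuild the whole structure, which is the cost a per-segment layout avoids.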

    Shai
  • Toke Eskildsen at Jul 11, 2011 at 7:05 am

    On Sat, 2011-07-09 at 05:44 +0200, Shai Erera wrote:
    > The taxonomy is global to the index, but I think it will be
    > interesting to explore per-segment taxonomy, and how it can be used to
    > improve indexing or search perf (hopefully both).

    I have struggled with this for some time and still haven't found a real
    solution. Distributed faceting, of which segment-based faceting is a
    special case, is hard to do without a central taxonomy.

    The new faceting module is explicit about the central taxonomy. My
    experiments with https://issues.apache.org/jira/browse/LUCENE-2369
    compute it at index open time. Neither approach works very well, if at
    all, in a real distributed environment.

    The problem is the same for flat faceting but is magnified with
    hierarchical faceting: When the sorting order of facet elements is
    popularity based, computing the correct counts for a top-X might
    potentially involve comparison of the whole result from each part.

    A pathological case for flat faceting is
    Part 1: A1(2), A2(2)... An(2)
    Part 2: B1(3), B2(2), B3(2)... Bn(2), An(1)
    where the correct top 3 answer is An(3), B1(3), A2(2), which requires
    the full part results to get to the An(2) and An(1) as they are the last
    elements.
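
The pathological case can be reproduced with a small sketch, assuming n = 4 and ties broken alphabetically by label (both are choices made for this illustration, as are all class and method names). It compares an exact merge of the full per-part results with a merge where each part ships only its local top 3.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy comparison of exact versus truncated top-k merging of facet counts. */
final class TopKMergeSketch {

    /** Top-k entries by count (descending); ties are broken by label (ascending). */
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    /** Sums the per-part counts label by label. */
    static Map<String, Integer> merge(List<Map<String, Integer>> parts) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> part : parts) {
            for (Map.Entry<String, Integer> e : part.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    /** Exact answer: every part ships its full result. */
    static List<Map.Entry<String, Integer>> exactTopK(List<Map<String, Integer>> parts, int k) {
        return topK(merge(parts), k);
    }

    /** Shortcut: every part ships only its local top-k before merging. */
    static List<Map.Entry<String, Integer>> truncatedTopK(List<Map<String, Integer>> parts, int k) {
        List<Map<String, Integer>> shipped = new ArrayList<>();
        for (Map<String, Integer> part : parts) {
            Map<String, Integer> local = new HashMap<>();
            for (Map.Entry<String, Integer> e : topK(part, k)) {
                local.put(e.getKey(), e.getValue());
            }
            shipped.add(local);
        }
        return topK(merge(shipped), k);
    }

    /** The two parts from the example above, with n = 4. */
    static List<Map<String, Integer>> example() {
        Map<String, Integer> part1 = new HashMap<>();
        for (int i = 1; i <= 4; i++) {
            part1.put("A" + i, 2);
        }
        Map<String, Integer> part2 = new HashMap<>();
        part2.put("B1", 3);
        for (int i = 2; i <= 4; i++) {
            part2.put("B" + i, 2);
        }
        part2.put("A4", 1);
        List<Map<String, Integer>> parts = new ArrayList<>();
        parts.add(part1);
        parts.add(part2);
        return parts;
    }

    public static void main(String[] args) {
        System.out.println("exact:     " + exactTopK(example(), 3));
        System.out.println("truncated: " + truncatedTopK(example(), 3));
    }
}
```

With these inputs the exact merge ranks A4 first with a count of 3, while the truncated merge never sees A4 at all, because it is the last element in both parts' local rankings.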

    For real world use, we can do clever counting so that we only return
    what is necessary, but it does not change the worst case. To ensure that
    we don't hit any million entries merge situations, we must cheat and
    make a cutoff point.

    With a multi-level faceting result (state/town/street expanded to top 5
    elements on all levels) we must resolve quite a lot of elements to
    ensure a high chance of getting the right elements with the right
    counts. We can avoid this by drilling down one level at a time, but that
    is just replacing bulk transfers with multiple requests: 1*5*5 is the
    unrealistically low minimum for the address case.

    - Toke



Discussion Overview
group: dev @
categories: lucene
posted: Jul 9, '11 at 1:30a
active: Jul 11, '11 at 7:05a
posts: 8
users: 4
website: lucene.apache.org
