Next steps towards flexible indexing
------------------------------------

Key: LUCENE-1426
URL: https://issues.apache.org/jira/browse/LUCENE-1426
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9


In working on LUCENE-1410 (PFOR compression) I tried to prototype
switching the postings files to use PFOR instead of vInts for
encoding.

But it quickly became difficult. EG we currently mux the skip data
into the .frq file, which messes up the int blocks. We inline
payloads with positions which would also mess up the int blocks.
Skipping offsets and TermInfo offsets hardwire the file pointers of
frq & prox files yet I need to change these to block + offset, etc.
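
For readers unfamiliar with the vInt encoding mentioned above, here is a minimal standalone sketch (the class name `VIntSketch` is made up for illustration and is not one of Lucene's classes). Each value takes 7 data bits per byte, with the high bit signalling that more bytes follow, so values occupy a variable number of bytes and are decoded one at a time. A block-oriented scheme such as PFOR instead wants runs of plain ints, which is why skip data or payloads interleaved in the same file get in the way.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a standalone variable-length int (vInt) codec, not Lucene's own classes.
final class VIntSketch {

    // Writes an int using 7 data bits per byte; a set high bit means "more bytes follow".
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads the single vInt stored at the start of the given bytes.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte of the value
            }
            shift += 7;
        }
        return value;
    }
}
```

Small values cost one byte and larger ones up to five, so the byte offset of the N-th value is not known without decoding everything before it.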

Separately this thread also started up, on how to customize how Lucene
stores positional information in the index:

http://www.gossamer-threads.com/lists/lucene/java-user/66264

So I decided to make a bit more progress towards "flexible indexing"
by first modularizing/isolating the classes that actually write the
index format. The idea is to capture the logic of each (terms, freq,
positions/payloads) into separate interfaces and switch the flushing
of a new segment as well as writing the segment during merging to use
the same APIs.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Michael McCandless (JIRA) at Oct 20, 2008 at 12:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Michael McCandless updated LUCENE-1426:
    ---------------------------------------

    Attachment: LUCENE-1426.patch

    Attached patch. I think it's ready to commit... I'll wait a few days.

    This factors the writing of postings into separate Format* classes.
    The approach I took is similar to what I did for DocumentsWriter,
    where there is a hierarchical consumer interface (abstract class) for
    each of fields, terms, docs, and positions writing. Then there's a
    corresponding set of concrete classes (the "codec chain") that write
    today's index format. There is no change to the index format.

    Here are the details:

    * This only applies to postings (not stored fields, term vectors,
    norms, field infos)

    * Both SegmentMerger & FreqProxTermsWriter now use the same codec
    API to write postings. I think this is a big step forward: we now
    have a single set of classes that ever write the postings.

    * You can't yet customize this codec chain; we can add that at some
    point. It's all package private.

    * I don't yet allow the codec to override SegmentInfo.files(); at
    some point (when I first try to make a codec that uses different
    files) I will add this.
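
The patch itself is not reproduced in this thread, so the following is only a rough sketch of what a hierarchical consumer chain for fields, terms, docs, and positions could look like. Every class and method name here is hypothetical; the real Format* classes may differ in shape and naming. The concrete chain records calls as text instead of writing index files, to show how a single driver (a segment flush or a merge) would walk the hierarchy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; the patch's real Format* classes are not shown in this thread.
abstract class SketchFieldsConsumer {
    abstract SketchTermsConsumer addField(String fieldName);
    abstract void finish();
}

abstract class SketchTermsConsumer {
    abstract SketchDocsConsumer startTerm(String termText);
    abstract void finishTerm(int docFreq);
}

abstract class SketchDocsConsumer {
    abstract SketchPositionsConsumer addDoc(int docID, int termFreq);
}

abstract class SketchPositionsConsumer {
    abstract void addPosition(int position);
}

// A toy concrete chain that logs each call rather than encoding postings.
final class LoggingFieldsConsumer extends SketchFieldsConsumer {
    final List<String> log = new ArrayList<>();

    @Override
    SketchTermsConsumer addField(String fieldName) {
        log.add("field " + fieldName);
        return new SketchTermsConsumer() {
            @Override
            SketchDocsConsumer startTerm(String termText) {
                log.add("term " + termText);
                return new SketchDocsConsumer() {
                    @Override
                    SketchPositionsConsumer addDoc(int docID, int termFreq) {
                        log.add("doc " + docID + " freq " + termFreq);
                        return new SketchPositionsConsumer() {
                            @Override
                            void addPosition(int position) {
                                log.add("pos " + position);
                            }
                        };
                    }
                };
            }

            @Override
            void finishTerm(int docFreq) {
                log.add("end term docFreq " + docFreq);
            }
        };
    }

    @Override
    void finish() {
        log.add("finish");
    }
}
```

Because the caller only sees the abstract classes, swapping in a different concrete chain would change the on-disk encoding without touching the flush or merge code that drives it.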

    I ran a quick performance test, indexing wikipedia, and found
    negligible performance cost of this.

    The next step, which is trickier, is to modularize/genericize the
    classes that read from the index, and then refactor
    SegmentTerm{Enum,Docs,Positions} to use that codec API.

    Then, finally, I want to make a codec that uses PFOR to encode
    postings.
  • Paul Elschot (JIRA) at Oct 20, 2008 at 7:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641121#action_12641121 ]

    Paul Elschot commented on LUCENE-1426:
    --------------------------------------

    bq. We inline payloads with positions which would also mess up the int blocks.

    Which begs the question whether we should also allow compression of these payloads.
    I think we should do that because normally only one or two bytes will be used as payload per position.
    Thinking about this: position+payload actually looks a lot like docId+freq, could that
    be used to simplify future index formats for inverted terms?
    Btw. allowing a payload to accompany the field norms would allow storing a kind of
    dictionary for the position payloads. This could help to keep the position payloads small
    so they would compress nicely.

    bq. Both SegmentMerger & FreqProxTermsWriter now use the same codec API to write postings.

    That is indeed a big step.

    bq. It's all package private.

    Good for now, making it public might actually reduce flexibility for new index formats.


  • Paul Elschot (JIRA) at Oct 20, 2008 at 7:08 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641125#action_12641125 ]

    Paul Elschot commented on LUCENE-1426:
    --------------------------------------

    bq. Skipping offsets and TermInfo offsets hardwire the file pointers of frq & prox files yet I need to change these to block + offset, etc.

    Does the offset imply that there is also a need for random access into each block?
    For such blocks PFOR patching might better be avoided.
    Even with patching random access is possible, but it is not available yet at LUCENE-1410.

  • Eks Dev (JIRA) at Oct 20, 2008 at 7:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641128#action_12641128 ]

    Eks Dev commented on LUCENE-1426:
    ---------------------------------

    Just a few random thoughts on this topic

    - I am sure I read somewhere in these pdfs that were floating around that it would make sense to use VInts for very short postings and PFOR for the rest. I just do not remember the rationale behind it.

    - During omitTf() discussion, we came up with cool idea to actually inline very short postings into term dict instead of storing offset. This way we spare one seek per term in many cases, as well as some space for storing offset. I do not know if this is a problem, but sounds reasonable. With standard Zipfian distribution, a lot of postings should get inlined. Use cases where we have query expansion on many terms (think spell checker, synonyms ...) should benefit from that heavily. These postings are small but there is a lot of them, so it adds up... seek is deadly :)
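
The inlining idea in the second point can be pictured with a small sketch. Everything in it is an assumption made for illustration (the class name, the cutoff of two documents, the fields); it is not how the terms dictionary is laid out today. The entry either carries the doc IDs itself, so no seek into the postings file is needed, or it carries an offset into that file.

```java
// Hypothetical term dictionary entry: either inlined doc IDs or a pointer into the postings file.
final class SketchTermEntry {
    static final int INLINE_LIMIT = 2; // arbitrary cutoff chosen for this illustration

    final int[] inlineDocs;    // non-null only when the postings are inlined
    final long postingsOffset; // meaningful only when inlineDocs is null

    private SketchTermEntry(int[] inlineDocs, long postingsOffset) {
        this.inlineDocs = inlineDocs;
        this.postingsOffset = postingsOffset;
    }

    // Short posting lists are stored in the entry; longer ones keep the file offset.
    static SketchTermEntry create(int[] docs, long offsetIfWritten) {
        if (docs.length <= INLINE_LIMIT) {
            return new SketchTermEntry(docs.clone(), -1L);
        }
        return new SketchTermEntry(null, offsetIfWritten);
    }

    // True when reading this term's postings requires a seek into the postings file.
    boolean needsSeek() {
        return inlineDocs == null;
    }
}
```

With a Zipfian term distribution most terms have very few postings, so under this scheme most lookups would take the no-seek branch.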

    I am sorry to miss the party here with PFOR, but let us hope this credit crunch gets over soon so that I could dedicate some time to fun things like this :)

    cheers, eks



  • Doug Cutting (JIRA) at Oct 20, 2008 at 7:52 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641132#action_12641132 ]

    Doug Cutting commented on LUCENE-1426:
    --------------------------------------

    +1 This sounds like a great way to approach flexible indexing: incrementally.
  • Michael McCandless (JIRA) at Oct 20, 2008 at 7:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641137#action_12641137 ]

    Michael McCandless commented on LUCENE-1426:
    --------------------------------------------

    bq. During omitTf() discussion, we came up with cool idea to actually inline very short postings into term dict instead of storing offset.

    Yes, there's this issue:

    https://issues.apache.org/jira/browse/LUCENE-1278

    And you had found this one:

    http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

    And then Doug referenced this:

    http://citeseer.ist.psu.edu/cutting90optimizations.html

    I think the idea makes tons of sense (saving a seek) and one of my
    goals in phase 2 (genericizing the reading of an index) is to make
    pulsing a drop-in codec as an example & litmus test. Terms iteration
    may suffer, though, unless we put this in a separate file.

    I also think, at the opposite end of the spectrum, it would make sense
    for very common terms to use simple n-bit packing (PFOR minus the
    exceptions). For massive terms we need the fastest search we can
    get, since that gates when you have to start sharding.

    bq. I am sorry to miss the party here with PFOR, but let us hope this credit crunch gets over soon so that I could dedicate some time to fun things like this

    Well the stock market seems to think the credit crunch is improving,
    today... of course who knows what'll happen tomorrow! Good luck :)

    Also, I'd like to explore improving the terms dict indexing -- I don't
    think we need to load a TermInfo instance for every indexed term, into
    RAM. I think we just need the term & seek data (into the tis file),
    then you seek there and skip to the TermInfo you need. This should
    save a good amount of RAM for large indices with odd terms, since each
    TermInfo instance requires a pointer to it (4 or 8 bytes), an object
    header (8 bytes at least) then 20 bytes for the members.
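
To put rough numbers on the figures just quoted, the arithmetic can be written out as below. The class name and the count of one million indexed terms in the usage note are assumptions for illustration only, and the result is a lower bound since the object header is "8 bytes at least" and the term text itself is not counted.

```java
// Back-of-the-envelope estimate using only the per-entry figures quoted in the message above.
final class TermInfoRamSketch {

    // Reference to the instance + object header + members, in bytes.
    static long bytesPerIndexedTerm(int referenceBytes) {
        final int objectHeaderBytes = 8; // "8 bytes at least"
        final int memberBytes = 20;      // "20 bytes for the members"
        return referenceBytes + objectHeaderBytes + memberBytes;
    }

    // Total overhead for a given (hypothetical) number of indexed terms held in RAM.
    static long totalBytes(long indexedTerms, int referenceBytes) {
        return indexedTerms * bytesPerIndexedTerm(referenceBytes);
    }
}
```

That is 32 bytes per indexed term with 4-byte references and 36 bytes with 8-byte references, so a hypothetical one million indexed terms would cost at least 36,000,000 bytes on a 64-bit JVM for the TermInfo instances alone.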

    All these explorations should become simple drop-in codecs, once I can
    finish phase 2.

  • Michael McCandless (JIRA) at Oct 20, 2008 at 8:08 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641139#action_12641139 ]

    Michael McCandless commented on LUCENE-1426:
    --------------------------------------------


    {quote}
    Does the offset imply that there is also a need for random access into each block?
    For such blocks PFOR patching might better be avoided.
    Even with patching random access is possible, but it is not available yet at LUCENE-1410.
    {quote}

    Yeah this is one of the reasons why I'm thinking for frequent terms we
    may want to fall back to pure n-bit packing (which would make random
    access simple).

    But for starters we could simply implement random access as "load
    & decode the entire block, then look at the part you want" and then
    assess the cost. While it will clearly increase the cost of queries
    that do a lot of skipping (eg AND query of N terms), it may not matter
    so much since these queries should be fairly fast now. It's the OR of
    frequent term queries that we need to improve since that limits how
    big an index you can put on one box.
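
As an illustration of why pure n-bit packing makes random access simple, here is a minimal sketch (the class `NBitPackSketch` is hypothetical and is not the LUCENE-1410 code). Every value occupies exactly `bitsPerValue` bits, so the position of the i-th value is just `i * bitsPerValue` and it can be read without decoding its neighbours.

```java
// Illustrative fixed-width bit packing; not the LUCENE-1410 implementation.
final class NBitPackSketch {
    final int bitsPerValue;
    final long[] blocks;

    NBitPackSketch(int[] values, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..32");
        }
        this.bitsPerValue = bitsPerValue;
        long totalBits = (long) values.length * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            set(i, values[i]);
        }
    }

    // Stores the low bitsPerValue bits of value; higher bits are dropped
    // (those are the cases PFOR would handle as exceptions).
    private void set(int index, int value) {
        long mask = (1L << bitsPerValue) - 1;
        long v = value & mask;
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= v << shift;
        int spill = shift + bitsPerValue - 64; // bits that overflow into the next long
        if (spill > 0) {
            blocks[block + 1] |= v >>> (bitsPerValue - spill);
        }
    }

    // Random access: computes the bit position directly, no sequential decode needed.
    int get(int index) {
        long mask = (1L << bitsPerValue) - 1;
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (v & mask);
    }
}
```

The cost of this simplicity is that the bit width must cover the largest value in the block, which is the trade-off against a patched scheme.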

  • Michael McCandless (JIRA) at Oct 20, 2008 at 8:12 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641140#action_12641140 ]

    Michael McCandless commented on LUCENE-1426:
    --------------------------------------------


    bq. Which begs the question whether we should also allow compression of these payloads.

    I think that's interesting, but would probably be rather application dependent.

    {quote}
    Btw. allowing a payload to accompany the field norms would allow storing a kind of
    dictionary for the position payloads. This could help to keep the position payloads small
    so they would compress nicely.
    {quote}

    Couldn't stored fields, once they are faster (with column-stride
    fields, LUCENE-1231) solve this?

  • Michael Busch (JIRA) at Oct 21, 2008 at 8:27 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641574#action_12641574 ]

    Michael Busch commented on LUCENE-1426:
    ---------------------------------------

    {quote}
    +1 This sounds like a great way to approach flexible indexing: incrementally.
    {quote}

    Couldn't agree more. This is great!

    {quote}
    The next step, which is trickier, is to modularize/genericize the
    classes that read from the index, and then refactor
    SegmentTerm{Enum,Docs,Positions} to use that codec API.
    {quote}

    Yes this is definitely the tricky part. I've been thinking a bit about this and was wondering if for the read APIs we could do something similar to the new Token API (LUCENE-1422)? TermDocs could have a list of Attributes that the posting list offers. If for example no payloads are stored in the posting list, then TermDocs should not offer that corresponding Attribute.
    This approach should be just as fast as the current API. When the application opens a TermDocs, it could check for the offered Attributes before it starts iterating the posting list, and keep references to the Attributes. (In fact, that's exactly the same approach as the TokenStream/Token/Consumer approach in LUCENE-1422.)

    Thoughts?
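
One way to picture the attribute lookup proposed above is the sketch below. It is purely hypothetical: the class names are invented here, and this is neither the LUCENE-1422 API nor an existing TermDocs method. The point it illustrates is that the caller asks once, before iterating, and gets `null` for anything the posting list does not store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; neither the LUCENE-1422 API nor an existing TermDocs method.
final class SketchPostings {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // Called by whatever reads the posting list, once per kind of data it actually stores.
    <T> void offer(Class<T> type, T attribute) {
        attributes.put(type, attribute);
    }

    // Returns the attribute, or null when this posting list does not offer it.
    <T> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }
}

// Example attribute holders; the reader would update their fields while iterating.
final class SketchFreqAttribute {
    int freq;
}

final class SketchPayloadAttribute {
    byte[] payload = new byte[0];
}
```

Since the lookup happens once up front and the application keeps the returned reference, the per-document loop would involve no map access.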
  • Paul Elschot (JIRA) at Oct 21, 2008 at 9:19 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641599#action_12641599 ]

    Paul Elschot commented on LUCENE-1426:
    --------------------------------------

    bq. ... it would make sense to use VInts for very short postings and PFOR for the rest. I just do not remember the rationale behind it.
    bq. ... cool idea to actually inline very short postings into term dict instead of storing offset.

    Iirc the rationale was that PFOR has most performance benefits on integer arrays of more than 100 elements.
    Shorter lists of numbers might also benefit from using (P)FOR instead of VInt, I don't know how big the break even size is.

    bq. for starters (we) could simply implement random access as "load & decode the entire block, then look at the part you want" and then assess the cost.

    I've just started some performance tests on PFOR patching (i.e. filling in the exceptions), and I'm not happy with what I'm seeing. More on this later at 1410.
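
For readers who have not followed LUCENE-1410, the "patching" step can be illustrated with a deliberately simplified sketch. It is an assumption-laden toy, not the layout used in that issue: the low bits are kept in a plain `int[]` rather than bit-packed, and exceptions sit in two side arrays. What it shows is the two-phase decode, a bulk pass over all slots followed by a pass that overwrites the slots whose values did not fit.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of fixed-width coding with exceptions; not the LUCENE-1410 layout.
final class PatchedBlockSketch {
    final int bitsPerValue;
    final int[] lowBits;        // every value truncated to bitsPerValue bits (left unpacked for clarity)
    final int[] exceptionIndex; // slots whose value did not fit in bitsPerValue bits
    final int[] exceptionValue; // the full values for those slots

    PatchedBlockSketch(int[] values, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 31) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..31");
        }
        this.bitsPerValue = bitsPerValue;
        int mask = (1 << bitsPerValue) - 1;
        lowBits = new int[values.length];
        List<Integer> slots = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            lowBits[i] = values[i] & mask;
            if ((values[i] & ~mask) != 0) {
                slots.add(i); // value needs more bits than the chosen width
            }
        }
        exceptionIndex = new int[slots.size()];
        exceptionValue = new int[slots.size()];
        for (int k = 0; k < slots.size(); k++) {
            exceptionIndex[k] = slots.get(k);
            exceptionValue[k] = values[slots.get(k)];
        }
    }

    int[] decode() {
        int[] out = lowBits.clone(); // phase 1: bulk decode of every slot
        for (int k = 0; k < exceptionIndex.length; k++) {
            out[exceptionIndex[k]] = exceptionValue[k]; // phase 2: patch the exceptions
        }
        return out;
    }
}
```

The second phase is extra work on top of the bulk decode, and it is also what complicates reading a single slot on its own, since a reader must first find out whether that slot is an exception.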


    On allowing a payload to accompany the field norms:
    bq. Couldn't stored fields, once they are faster (with column-stride fields, LUCENE-1231) solve this?

    Yes.

  • Michael McCandless (JIRA) at Oct 22, 2008 at 9:21 am
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641747#action_12641747 ]

    Michael McCandless commented on LUCENE-1426:
    --------------------------------------------

    bq. TermDocs could have a list of Attributes that the posting list offers.

    I like this approach.

    Though unlike LUCENE-1422, where Token remains separate from
    TokenStream (and I'm still not sure it should be...?), I think for
    TermDocs there would not be the analog of a separate Token.
    Ie, it would look something like this:

    myPerDocAttr = termDocs.getAttribute(MyPerDoc.class);

    while (termDocs.next()) {
      x = myPerDocAttr.getValue(...);
    }
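
    A minimal, self-contained sketch of that pattern follows. Every name in
    it (PerDocAttribute, BoostAttribute, ToyTermDocs) is hypothetical and
    stands in for an API that does not exist in Lucene; the postings are
    held in plain arrays rather than read from an index. It only
    illustrates the shape: fetch the attribute once, then read it on each
    call to next().

```java
import java.util.HashMap;
import java.util.Map;

class AttributeSketch {
    // Consumer side: look the attribute up once, then read it per document.
    static float sumBoosts(ToyTermDocs termDocs) {
        BoostAttribute attr = termDocs.getAttribute(BoostAttribute.class);
        float sum = 0f;
        while (termDocs.next()) {
            sum += attr.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        ToyTermDocs td = new ToyTermDocs(new int[] {1, 5, 9},
                                         new float[] {0.5f, 1.5f, 2.0f});
        System.out.println(sumBoosts(td));
    }
}

// Hypothetical marker interface for per-document attributes (not a Lucene type).
interface PerDocAttribute {}

// Hypothetical attribute carrying one float per document.
final class BoostAttribute implements PerDocAttribute {
    private float value;
    float getValue() { return value; }
    void setValue(float v) { value = v; }
}

// Toy stand-in for TermDocs backed by arrays instead of an index.
final class ToyTermDocs {
    private final int[] docs;
    private final float[] boosts;
    private int pos = -1;
    private final BoostAttribute boostAttr = new BoostAttribute();
    private final Map<Class<? extends PerDocAttribute>, PerDocAttribute> attrs =
        new HashMap<>();

    ToyTermDocs(int[] docs, float[] boosts) {
        this.docs = docs;
        this.boosts = boosts;
        attrs.put(BoostAttribute.class, boostAttr);
    }

    // Returns the shared attribute instance; its value changes as next() advances.
    <T extends PerDocAttribute> T getAttribute(Class<T> clazz) {
        PerDocAttribute a = attrs.get(clazz);
        if (a == null) {
            throw new IllegalArgumentException("no such attribute: " + clazz.getName());
        }
        return clazz.cast(a);
    }

    // Advances to the next document and refreshes the attribute in place.
    boolean next() {
        if (++pos >= docs.length) {
            return false;
        }
        boostAttr.setValue(boosts[pos]);
        return true;
    }

    int doc() { return docs[pos]; }
}
```

    Because the attribute instance is reused, there is no per-document
    object comparable to a separate Token: the posting list itself is the
    only thing that advances.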

    However, this form of flexibility is actually beyond what I'm aiming
    for, for the first step of reader flexibility (there are so many
    facets of "flexible indexing"!).

    For starters I'd like to allow flexibility in how you encode the
    existing postings (doc/freq/positions/payloads). The attribute
    approach, by contrast, adds flexibility in what is actually stored
    into & read from the index. I think we should do both, but my focus
    now is on the first one, specifically to be able to drop in a codec
    that uses pulsing, a less RAM-intensive terms dict index, and/or
    PFOR, etc.
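
    To make "drop in a codec" concrete, here is a rough sketch in which the
    same doc IDs go through two interchangeable encoders. The DocIdEncoder
    interface and both implementations are assumptions for illustration,
    not classes from the patch. The first writes delta-encoded vInts (7
    data bits per byte, high bit set when more bytes follow); the second
    is a naive fixed-width encoder standing in for any block-oriented
    scheme. It is not PFOR.

```java
import java.io.ByteArrayOutputStream;

class CodecSketch {
    // Hypothetical plug-in point: how a sorted doc ID list becomes bytes.
    interface DocIdEncoder {
        byte[] encode(int[] sortedDocIds);
    }

    // Delta-encodes doc IDs and writes each delta as a vInt.
    static final class VIntDeltaEncoder implements DocIdEncoder {
        public byte[] encode(int[] sortedDocIds) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int prev = 0;
            for (int doc : sortedDocIds) {
                int delta = doc - prev;
                prev = doc;
                // Emit 7 bits at a time, low bits first; the high bit marks continuation.
                while ((delta & ~0x7F) != 0) {
                    out.write((delta & 0x7F) | 0x80);
                    delta >>>= 7;
                }
                out.write(delta);
            }
            return out.toByteArray();
        }
    }

    // Stand-in for a block-style codec: every delta takes exactly 4 bytes, big-endian.
    static final class FixedIntDeltaEncoder implements DocIdEncoder {
        public byte[] encode(int[] sortedDocIds) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int prev = 0;
            for (int doc : sortedDocIds) {
                int delta = doc - prev;
                prev = doc;
                out.write(delta >>> 24);
                out.write(delta >>> 16);
                out.write(delta >>> 8);
                out.write(delta);
            }
            return out.toByteArray();
        }
    }

    // The writer depends only on the interface, so encoders can be swapped freely.
    static int encodedSize(DocIdEncoder encoder, int[] sortedDocIds) {
        return encoder.encode(sortedDocIds).length;
    }

    public static void main(String[] args) {
        int[] docs = {3, 200, 201};
        System.out.println("vInt bytes:  " + encodedSize(new VIntDeltaEncoder(), docs));
        System.out.println("fixed bytes: " + encodedSize(new FixedIntDeltaEncoder(), docs));
    }
}
```

    The caller never sees which encoding is in use, which is the property
    the writer-side refactoring is meant to establish.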

  • Michael McCandless (JIRA) at Oct 25, 2008 at 10:42 am
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Michael McCandless resolved LUCENE-1426.
    ----------------------------------------

    Resolution: Fixed

Discussion Overview
group: java-dev
categories: lucene
posted: Oct 20, '08 at 12:40p
active: Oct 25, '08 at 10:42a
posts: 13
users: 1
website: lucene.apache.org
