Grokbase Groups Lucene dev June 2011
FAQ
Hi. While I was playing with automata recently, I had a use case
scenario when I could really use an external sort of a large list of
unicode strings. I know I could simply emulate this by creating
synthetic documents, index, etc., but is there a more "direct" way of
achieving this using Lucene's internals?

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Search Discussions

  • Michael McCandless at Jun 14, 2011 at 10:56 am
    Not that I know of.

    And, if you create & index such Lucene documents, and then do a
    MatchAllDocsQuery sorting by your field, this is (unfortunately) not
    an external sort! Ie, Lucene loads all terms data in RAM as packed
    byte[], for merging the per-segment results.

    It even does this, unnecessarily, for an optimized segment, even
    though we only need ords in that case (there's an issue open for
    this).

    Doing a sort-by-String-field without loading the String data even when
    there are multiple segments in the index would be a nice addition :)

    Mike McCandless

    http://blog.mikemccandless.com
    On Tue, Jun 14, 2011 at 6:31 AM, Dawid Weiss wrote:
    Hi. While I was playing with automata recently, I had a use case
    scenario when I could really use an external sort of a large list of
    unicode strings. I know I could simply emulate this by creating
    synthetic documents, index, etc., but is there a more "direct" way of
    achieving this using Lucene's internals?

    Dawid

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss at Jun 14, 2011 at 11:07 am

    And, if you create & index such Lucene documents, and then do a
    MatchAllDocsQuery sorting by your field, this is (unfortunately) not
    I was thinking about an optimized segment -- then the terms enum on a
    given field should be sorted, right?


    Dawid

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless at Jun 14, 2011 at 12:01 pm

    On Tue, Jun 14, 2011 at 7:06 AM, Dawid Weiss wrote:
    And, if you create & index such Lucene documents, and then do a
    MatchAllDocsQuery sorting by your field, this is (unfortunately) not
    I was thinking about an optimized segment -- then the terms enum on a
    given field should be sorted, right?
    Ahh, right.

    So actually it would work if you just enum'd the terms yourself, after
    indexing and optimizing. And this does amount to an external sort, I
    think!

    Mike McCandless

    http://blog.mikemccandless.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss at Jun 14, 2011 at 12:19 pm

    So actually it would work if you just enum'd the terms yourself, after
    indexing and optimizing.  And this does amount to an external sort, I
    think!
    Yep. I was just curious if there's a way to do it without the overhead
    of creating fields, documents, etc. If I have a spare minute I'll try
    to write a merge sort from disk splits. It'd be neat to write FST
    merging too (so that, given to FSTs you could merge them into one by
    creating a new FST and adding sequences in order from one or the other
    source).

    Dawid

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless at Jun 14, 2011 at 12:24 pm
    In theory, you could use the codec API directly, adding "chunks" of
    pre-sorted terms, and then fake up a SegmentInfo to make it look like
    some kind of degenerate segment, and then merge them?

    But it's gonna be a lot of work to do that :)

    Merging FSTs sounds cool!

    Mike McCandless

    http://blog.mikemccandless.com

    On Tue, Jun 14, 2011 at 8:18 AM, Dawid Weiss
    wrote:
    So actually it would work if you just enum'd the terms yourself, after
    indexing and optimizing.  And this does amount to an external sort, I
    think!
    Yep. I was just curious if there's a way to do it without the overhead
    of creating fields, documents, etc. If I have a spare minute I'll try
    to write a merge sort from disk splits. It'd be neat to write FST
    merging too (so that, given to FSTs you could merge them into one by
    creating a new FST and adding sequences in order from one or the other
    source).

    Dawid

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss at Jun 14, 2011 at 12:36 pm
    Merging FSTs sounds cool!
    Yep, it would be quite neat and relatively simple too. I looked at the
    FST API again, after doing some work on those compounds and there are
    several things that just hurt my eyes... I'll see if I can figure out
    a nicer API... how emotionally attached are you to that simulation of
    terminal states, Mike? :)

    Dawid

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless at Jun 14, 2011 at 12:45 pm

    On Tue, Jun 14, 2011 at 8:35 AM, Dawid Weiss wrote:
    Merging FSTs sounds cool!
    Yep, it would be quite neat and relatively simple too. I looked at the
    FST API again, after doing some work on those compounds and there are
    several things that just hurt my eyes... I'll see if I can figure out
    a nicer API... how emotionally attached are you to that simulation of
    terminal states, Mike? :)
    I'm not at all! Fix away :) FST is very new and it needs some good
    iterating...

    Just make sure the *FSTEnum work ok -- I think that's why I added the END_LABEL.

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupdev @
categorieslucene
postedJun 14, '11 at 10:32a
activeJun 14, '11 at 12:45p
posts8
users3
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase