Grokbase Groups Lucene dev June 2011
FAQ
FST package API refactoring
---------------------------

Key: LUCENE-3206
URL: https://issues.apache.org/jira/browse/LUCENE-3206
Project: Lucene - Java
Issue Type: Improvement
Components: core/FSTs
Affects Versions: 3.2
Reporter: Dawid Weiss
Assignee: Dawid Weiss
Priority: Minor
Fix For: 3.3, 4.0


The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Search Discussions

  • Dawid Weiss (JIRA) at Jun 16, 2011 at 10:56 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Dawid Weiss updated LUCENE-3206:
    --------------------------------

    Attachment: LUCENE-3206.patch

    An empty (but compiling and consistent) take at the FST/FSA API.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 16, 2011 at 11:02 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050347#comment-13050347 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    This is my take at the revamped FST API. My changes are mostly aiming at having a bit clearer code (especially wrt. to loops), but also detach the "algebra" of a transition's output from the actual output. This should allow us to create an output algebra that would work directly on mutable integers, for example (to save on autoboxing). I also just like the way it reads after the changes:
    {code}
    FST<Integer> fst = FSTBuilder.fst(FST.ArcLabel.BYTE2, PositiveInt.class)
    .add("abc", 10)
    .add("abc, 5)
    .add("def", 0, 3), 2)
    .build();
    {code}
    or a loop over all arcs of a state:
    {code}
    Arc<Integer> arc = fst.getRoot();
    for (Arc<Integer> tmp = arc.copy(); tmp.hasNext(); tmp.next()) {
    int label = tmp.getLabel(); // transition label here.
    Integer output = tmp.getOutput(); // FSAs have a constant empty output.
    }
    {code}

    I definitely didn't consider all the use cases that FSTs are used for currently (in particular the "stop" bit indicating non-accepted input sequences that are also dead ends), but I think these could be integrated... I think :)

    Arcs now also store the pointer to the FST object, which may seem like an overhead, but I doubt it really will be (it's a single pointer and we buffer arcs whenever we can; a larger waste is having an object on each arc's output, even if it can be a primitive type or reused buffer).



    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 16, 2011 at 11:02 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050347#comment-13050347 ]

    Dawid Weiss edited comment on LUCENE-3206 at 6/16/11 11:01 AM:
    ---------------------------------------------------------------

    This is my take at the revamped FST API. My changes are mostly aiming at having a bit clearer code (especially wrt. to loops), but also detach the "algebra" of a transition's output from the actual output. This should allow us to create an output algebra that would work directly on mutable integers, for example (to save on autoboxing). I also just like the way it reads after the changes:
    {code}
    FST<Integer> fst = FSTBuilder.fst(FST.ArcLabel.BYTE2, PositiveInt.class)
    .add("abc", 10)
    .add("abc, 5)
    .add("def", 0, 3), 2)
    .build();
    {code}
    or a loop over all arcs of a state:
    {code}
    Arc<Integer> arc = fst.getRoot();
    for (Arc<Integer> tmp = arc.copy(); tmp.hasNext(); tmp.next()) {
    int label = tmp.getLabel(); // transition label here.
    Integer output = tmp.getOutput();
    }
    {code}

    I definitely didn't consider all the use cases that FSTs are used for currently (in particular the "stop" bit indicating non-accepted input sequences that are also dead ends), but I think these could be integrated... I think :)

    Arcs now also store the pointer to the FST object, which may seem like an overhead, but I doubt it really will be (it's a single pointer and we buffer arcs whenever we can; a larger waste is having an object on each arc's output, even if it can be a primitive type or reused buffer).




    was (Author: dweiss):
    This is my take at the revamped FST API. My changes are mostly aiming at having a bit clearer code (especially wrt. to loops), but also detach the "algebra" of a transition's output from the actual output. This should allow us to create an output algebra that would work directly on mutable integers, for example (to save on autoboxing). I also just like the way it reads after the changes:
    {code}
    FST<Integer> fst = FSTBuilder.fst(FST.ArcLabel.BYTE2, PositiveInt.class)
    .add("abc", 10)
    .add("abc, 5)
    .add("def", 0, 3), 2)
    .build();
    {code}
    or a loop over all arcs of a state:
    {code}
    Arc<Integer> arc = fst.getRoot();
    for (Arc<Integer> tmp = arc.copy(); tmp.hasNext(); tmp.next()) {
    int label = tmp.getLabel(); // transition label here.
    Integer output = tmp.getOutput(); // FSAs have a constant empty output.
    }
    {code}

    I definitely didn't consider all the use cases that FSTs are used for currently (in particular the "stop" bit indicating non-accepted input sequences that are also dead ends), but I think these could be integrated... I think :)

    Arcs now also store the pointer to the FST object, which may seem like an overhead, but I doubt it really will be (it's a single pointer and we buffer arcs whenever we can; a larger waste is having an object on each arc's output, even if it can be a primitive type or reused buffer).



    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Jun 16, 2011 at 2:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050459#comment-13050459 ]

    Michael McCandless commented on LUCENE-3206:
    --------------------------------------------

    This new FST API looks *sweet*! Nice work :)

    So with this we no longer need static Util methods right? (Since each
    arc can .follow a sequence of inputs).

    I like OutputAlgebra ... better matches what this class actually does,
    and if this means we can not create a new Object for every arc transition
    that would be great (this makes FST lookups costly now).

    I don't know if this is possible, but, one thing I don't like about
    the current API is that the BYTE1/2/4 is an enum and not parameterized
    into the Builder/FST. Ie, Builder/FST should really take the input
    type as a type param too, since really an FST acts like a SortedMap<K,V>.
    But I fear this could get scary-hairy w/ the required generics...

    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 17, 2011 at 7:13 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050913#comment-13050913 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    Thanks Mike. I agree it'd be nice to have a flexible label type as well, but I have no idea how to make it efficient (and code-clean) yet. You could do a similar thing as with the outputs (using either a boxed type if you don't care about performance that much or a mutable wrapper if you do care about GC), but how this would affect the API I have no idea right now. There is also the lexicographic order that one would need to consider (a comparator would need to be passed as part of the construction process and then for traversals). It'll get complicated.

    I was also thinking of just dropping support for BYTE1/2 and leaving fixed int labels... This would bloat byte-labeled automata a little bit (if they're ASCII they'd v-code into a single byte anyway), but would strip down the ugliness of BYTE1/2/4... All methods accepting BytesRef and CharSequence would still be there, translated on the fly, but the representation of labels would always be an int.

    One more question: can you give me traversal use cases you're using FSTs for now? I'll try to implement them and see how the new API works out in practice. I looked at the FSTEnum and it has next(), seekCeil() and seekFloor().

    I'm also a bit terrified by the about of changes this would introduce if we decided to switch the APIs (tests, scattered use cases...). Don't know if I'll have the time to update this all.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Jun 17, 2011 at 4:27 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051170#comment-13051170 ]

    Michael McCandless commented on LUCENE-3206:
    --------------------------------------------

    bq. Thanks Mike. I agree it'd be nice to have a flexible label type as well, but I have no idea how to make it efficient (and code-clean) yet. You could do a similar thing as with the outputs (using either a boxed type if you don't care about performance that much or a mutable wrapper if you do care about GC), but how this would affect the API I have no idea right now. There is also the lexicographic order that one would need to consider (a comparator would need to be passed as part of the construction process and then for traversals). It'll get complicated.

    Yeah this was my fear :)

    bq. I was also thinking of just dropping support for BYTE1/2 and leaving fixed int labels... This would bloat byte-labeled automata a little bit (if they're ASCII they'd v-code into a single byte anyway), but would strip down the ugliness of BYTE1/2/4... All methods accepting BytesRef and CharSequence would still be there, translated on the fly, but the representation of labels would always be an int.

    Hmm, that makes me nervous -- this could be a non-negligible increase
    in FST size for the non-ascii case I think?

    bq. One more question: can you give me traversal use cases you're using FSTs for now? I'll try to implement them and see how the new API works out in practice. I looked at the FSTEnum and it has next(), seekCeil() and seekFloor().

    I think SimpleText codec is a good example? Also
    VariableGapTermsIndexReader, and MemoryCodec? Each of these use the
    BytesRefFSTEnum, I believe.

    bq. I'm also a bit terrified by the about of changes this would introduce if we decided to switch the APIs (tests, scattered use cases...). Don't know if I'll have the time to update this all.

    I think it's still fairly contained at this point? (Ie the number of
    tests that directly use the FST APIs).

    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 18, 2011 at 8:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051475#comment-13051475 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    bq. this could be a non-negligible increase in FST size for the non-ascii case I think?

    I don't know. If the non-ASCII is encoded as UTF8 for the BytesRef, then storing full unicode points on transitions shouldn't really account for much more (in fact it may create fewer states/ transitions because multibyte UTF8 sequences will require multiple transitions)? This we would need to check, of course. And I assume input sequences ARE text, which in general may not be the case... I think I'll leave BYTE1/BYTE4 an option for now and see if I can improve on it once I have a working test suite.

    bq. I think SimpleText codec is a good example? Also VariableGapTermsIndexReader, and MemoryCodec? Each of these use the BytesRefFSTEnum, I believe.

    I wasn't clear -- I can find the places where they're used, but I wanted to clarify the nature of stored keys and values (are they UTF8 text, utf16, unicode, random bytes)? I can go through the code, but you're probably a faster source of information on this one. Robert, if you're reading this -- anything you envision could be stored as transition labels?

    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Jun 18, 2011 at 9:58 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051491#comment-13051491 ]

    Michael McCandless commented on LUCENE-3206:
    --------------------------------------------

    {quote}
    bq. this could be a non-negligible increase in FST size for the non-ascii case I think?

    I don't know. If the non-ASCII is encoded as UTF8 for the BytesRef, then storing full unicode points on transitions shouldn't really account for much more (in fact it may create fewer states/ transitions because multibyte UTF8 sequences will require multiple transitions)? This we would need to check, of course. And I assume input sequences ARE text, which in general may not be the case... I think I'll leave BYTE1/BYTE4 an option for now and see if I can improve on it once I have a working test suite.
    {quote}

    Ahh, yes I agree it'd be a more interesting comparison if you use
    UTF32 instead of UTF8.

    The case I was worried about is if you must use UTF8 (ie because
    TermsEnum speaks only BytesRef), then writing those bytes as a vInt
    instead of a fixed byte is a penalty to non-ascii.

    {quote}
    bq. I think SimpleText codec is a good example? Also VariableGapTermsIndexReader, and MemoryCodec? Each of these use the BytesRefFSTEnum, I believe.

    I wasn't clear -- I can find the places where they're used, but I wanted to clarify the nature of stored keys and values (are they UTF8 text, utf16, unicode, random bytes)? I can go through the code, but you're probably a faster source of information on this one. Robert, if you're reading this -- anything you envision could be stored as transition labels?
    {quote}

    Ahh... I think all uses have BytesRef (UTF8 encoded term) as the key,
    and various things as the values.

    I don't think we've used FST during analysis yet but we should try;
    then I suspect we'd use UTF16 labels?

    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 18, 2011 at 10:42 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051496#comment-13051496 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    I think I know how to compare storing byte[] of UTF8 as compared to vint-encoded codepoints in UTF32 -- I'll encode the wikipedia terms list in both ways and we will see what comes out. Theoretically they should be very, very similar (and full unicode codepoints should generate fewer arcs) because UTF8 uses an encoding scheme with similar overhead to vint encoding... os if something is a single-byte sequence in UTF8, will remain single byte vint. Double-byte UTF8 character will remaing double-byte vint (last double byte codepoint is 0x7ff=2047, whereas the last double byte vint is 2^14=16384. And so on. So for text, vint-encoded UTF32 should be more compact than UTF8... The gain is of course when your "labels" are not text, but arbitrary bytes -- then byte[] representation would be nicer.


    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 18, 2011 at 9:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051597#comment-13051597 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    I encoded wikipedia termslist in UTF32 (int4) and UTF8 (int1). Interesting results:
    {noformat}
    271,461,850 utf32.fst
    Arcs: 64.485.082
    Nodes: 36.624.613

    270,137,939 utf8.fst
    Arcs: 66.478.193
    Nodes: 38.687.637
    {noformat}

    So... the files are pretty much the same size... UTF32 is slighly bigger, but (as predicted) it has fewer arcs and fewer nodes. I checked and ALL input UTF8 strings are the same or longer than vint-coded UTF32 sequences... So how come UTF32 automaton is larger? I have no clue -- I assume it may be something with the size of v-coded pointers... but I have no clue. In any case, the size gain from using int1 to encode UTF8 is minimal over just using full unicode codepoints and v-coded int4. Performance-wise it may be a hit (because one would need to convert UTF8/UTF16 to full unicode codepoints), but size-wise it seems to be relatively the same.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 18, 2011 at 9:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051598#comment-13051598 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    Oh, a wild guess: with int4 more nodes will be expanded into bsearch arrays (fixed size arcs). This may account for the observed size difference. And it may matter for traversals too (because int4 nodes will have a higher fanout, especially at root and first levels... something to consider).
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 19, 2011 at 8:51 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051658#comment-13051658 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    I confirmed the above, out of sheer curiosity, by compiling without expanded nodes (just linear packing).
    {noformat}
    261,820,296 utf32-noexp.fst
    271,461,850 utf32.fst
    263,415,558 utf8-noexp.fst
    270,137,939 utf8.fst
    {noformat}
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Jun 19, 2011 at 1:52 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051684#comment-13051684 ]

    Michael McCandless commented on LUCENE-3206:
    --------------------------------------------

    OK, these results make sense! UTF32 (vInt labels) is more compact than UTF8, if you disable array'd arcs. These wiki terms are from the en export right? So the differences are due to the smallish number of random terms that are not English... it should be more extreme if we used non-English content.

    I wonder how lookup time would compare... I think UTF32 should be faster?

    And yes for truly binary terms (eg collated fields, and maybe eventually numeric fields but not yet because they still avoid the 8th bit I think) I think we want to keep BYTE1.

    We need some good use cases of FSTs during analysis... there we are free to make the alphabet non-byte (vs the index, where terms are a BytesRef).
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Uwe Schindler (JIRA) at Jun 19, 2011 at 2:22 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051688#comment-13051688 ]

    Uwe Schindler commented on LUCENE-3206:
    ---------------------------------------

    Just one question: Is sort order for suppl chars in UTF32 compatible to UTF8 or will there be more problems? As we have a special case for UTF16 (surrogates dance), so what happens with a third "encoding"?
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 19, 2011 at 7:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051725#comment-13051725 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    UTF32 is basically codepoint representation, so there are no surrogates (as in UTF16) and there is no special encoding of higher codepoints (as in UTF8). I don't know what sort order is used inside Lucene (is it UTF8 byte-to-byte values or decoded codepoints?). If it is codepoint order then no problem -- this should be preserved.

    I'll stick to BYTE1/BYTE4 inputs then for now and I'll try to push this patch forward in my spare time.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Uwe Schindler (JIRA) at Jun 19, 2011 at 8:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051728#comment-13051728 ]

    Uwe Schindler commented on LUCENE-3206:
    ---------------------------------------

    The term index is sorted by utf8 bytes natively. So a FST build on term index must assume that order, because the terms must be presorted for Builder. Lucene internally only works on byte[] and uses per default this order. Also most queries rely on it.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 20, 2011 at 6:56 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051824#comment-13051824 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    This is fine -- "Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.", http://en.wikipedia.org/wiki/UTF-8.

    Indeed, when you look at how UTF8 encodes multibyte codepoints the codepoint order will be preserved.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.3, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Jun 24, 2011 at 7:31 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Dawid Weiss updated LUCENE-3206:
    --------------------------------

    Fix Version/s: (was: 3.3)
    3.4

    Moving to 3.4, if at all :)
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.4, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • David Smiley (JIRA) at Aug 30, 2011 at 3:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093830#comment-13093830 ]

    David Smiley commented on LUCENE-3206:
    --------------------------------------

    I'm looking forward to these FST API improvements. It's a bit obtuse for something that is basically a SortedMap.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.4, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (JIRA) at Aug 30, 2011 at 4:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093852#comment-13093852 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    I haven't had the time to work on that, sorry. I'll publish what I have on github, if you want to give it a spin. Essentially I wanted to tackle the following issues:

    - clean and well defined algebra of operations and input types,
    - GC efficiency during construction so that we don't create intermediate objects if there's no need to (inherently connected to algebra definition),
    - a small, lean and clean API for traversals, with utility methods of higher complexity (suffix collectors, matchers, etc.)

    Only once this is implemented it makes sense to incorporate add-ons Mike worked on (during-the-construction pruning, different arc storage formats). So, even if it's essentially a sortedmap, there is a fair bit of complexity involved if you want to make it efficient :)
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.4, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Sep 14, 2011 at 10:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Michael McCandless updated LUCENE-3206:
    ---------------------------------------

    Fix Version/s: (was: 3.4)
    3.5
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.5, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Robert Muir (Commented) (JIRA) at Mar 6, 2012 at 2:06 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222910#comment-13222910 ]

    Robert Muir commented on LUCENE-3206:
    -------------------------------------

    3.6 pruning: can we push this out to 4.0?
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.6, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (Commented) (JIRA) at Mar 6, 2012 at 7:53 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223068#comment-13223068 ]

    Dawid Weiss commented on LUCENE-3206:
    -------------------------------------

    Absolutely. This is far-future. I wouldn't even schedule it for 4.0 - I simply won't have the time for rewriting all the code that Mike and you created on top of the FST class :)
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Affects Versions: 3.2
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 3.6, 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (Updated) (JIRA) at Mar 20, 2012 at 8:08 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Dawid Weiss updated LUCENE-3206:
    --------------------------------

    Affects Version/s: (was: 3.2)
    Fix Version/s: (was: flexscoring branch)
    4.0
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Fix For: 4.0

    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org
  • Dawid Weiss (Resolved) (JIRA) at Apr 20, 2012 at 7:17 am
    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Dawid Weiss resolved LUCENE-3206.
    ---------------------------------

    Resolution: Later
    Fix Version/s: (was: 4.0)

    Won't have the time for this in any near future.
    FST package API refactoring
    ---------------------------

    Key: LUCENE-3206
    URL: https://issues.apache.org/jira/browse/LUCENE-3206
    Project: Lucene - Java
    Issue Type: Improvement
    Components: core/FSTs
    Reporter: Dawid Weiss
    Assignee: Dawid Weiss
    Priority: Minor
    Attachments: LUCENE-3206.patch


    The current API is still marked @experimental, so I think there's still time to fiddle with it. I've been using the current API for some time and I do have some ideas for improvement. This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
    --
    This message is automatically generated by JIRA.
    If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
    For more information on JIRA, see: http://www.atlassian.com/software/jira



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: dev-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupdev @
categorieslucene
postedJun 16, '11 at 6:55a
activeApr 20, '12 at 7:17a
posts26
users1
websitelucene.apache.org

1 user in discussion

Dawid Weiss (Resolved) (JIRA): 26 posts

People

Translate

site design / logo © 2021 Grokbase