Token type as BitSet: typeBits()
--------------------------------

Key: LUCENE-1137
URL: https://issues.apache.org/jira/browse/LUCENE-1137
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 2.4


It is sometimes useful to have a more compact, easy-to-parse type representation for Token than the current type() String. This patch adds a BitSet onto Token, defaulting to null, with accessors for setting bit flags on a Token. This is useful for communicating information about a token to TokenFilters further down the chain.

For example, in the WikipediaTokenizer, a token could be both a category and bold (or many other variations), yet it is difficult to communicate this without adding a lot of different Strings for type. Unlike the payload information (which could serve this purpose), the BitSet does not get added to the index (although one could easily convert it to a payload).
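To make the description above concrete, here is a minimal sketch of how a token could be flagged as both a category and bold with a java.util.BitSet, and how those bits could be packed into bytes for a payload. The class name WikiTypeBits and the bit positions are assumptions for illustration only; they are not taken from the patch, and BitSet.toByteArray() requires a newer JDK than Lucene 2.4 targeted.

```java
import java.util.BitSet;

/** Hypothetical bit positions; the patch itself does not define these. */
final class WikiTypeBits {
    static final int CATEGORY = 0;
    static final int BOLD = 1;
    static final int ITALICS = 2;

    private WikiTypeBits() {}

    /** Builds the type bits for a token that is both a category and bold. */
    static BitSet boldCategory() {
        BitSet bits = new BitSet();
        bits.set(CATEGORY);
        bits.set(BOLD);
        return bits;
    }

    /** Packs the bits into bytes, for example to store them as a payload. */
    static byte[] toPayloadBytes(BitSet bits) {
        return bits.toByteArray();
    }
}
```

A downstream filter would then test individual positions with bits.get(WikiTypeBits.BOLD) rather than parsing a type String.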

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Grant Ingersoll (JIRA) at Jan 16, 2008 at 4:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Grant Ingersoll updated LUCENE-1137:
    ------------------------------------

    Attachment: LUCENE-1137.patch

    Added get/setTypeBits() method and underlying storage and constructors.
  • Yonik Seeley (JIRA) at Jan 16, 2008 at 5:04 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559588#action_12559588 ]

    Yonik Seeley commented on LUCENE-1137:
    --------------------------------------

    Gack! I recommended a bitset on Token previously, but I meant an elemental one... an int (32 bits) or a long (64 bits).
    Half of the bits could be reserved for use by Lucene tokenizers, and half could be reserved for users. I think an actual BitSet is too heavy-weight.

    Just provide an int or long Token.getFlags() and a corresponding Token.setFlags(), and nothing more (we don't need to do bit twiddling for users, IMO).
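The split suggested in this comment can be sketched with two masks over a single int. The exact layout (low 16 bits for Lucene tokenizers, high 16 bits for users), the class name FlagBits, and the two example flags are all assumptions; the comment only says that half of the bits could go to each side.

```java
/** Sketch of an int flag space split between Lucene and user code (assumed layout). */
final class FlagBits {
    /** Assumed split: low 16 bits for Lucene tokenizers, high 16 bits for users. */
    static final int LUCENE_MASK = 0x0000FFFF;
    static final int USER_MASK = 0xFFFF0000;

    // Hypothetical flags, one from each half.
    static final int LUCENE_ACRONYM = 1 << 2;
    static final int USER_PRODUCT_NAME = 1 << 16;

    private FlagBits() {}

    /** True if any bit of the given flag is set in flags. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    /** Keeps only the user-owned half of the flags. */
    static int userBits(int flags) {
        return flags & USER_MASK;
    }
}
```

The helpers live in caller code here, in keeping with the point that Token itself would expose only the raw getter and setter.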
  • Steven Rowe (JIRA) at Jan 16, 2008 at 5:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559589#action_12559589 ]

    Steven Rowe commented on LUCENE-1137:
    -------------------------------------

    I see two problems with this patch:

    1. Although in the patch you say that the "type bits" field added by the patch is completely separate from the String type information, the two names are not different enough to distinguish them.

    2. The information encoded by BitSet is a set of <int,boolean> tuples. These are opaque values. In order for this to work, every tokenizer in the chain has to be aware of every other one's use of these. This makes sharing hard.

    At a minimum, there should be some way to declare who's using what bit for what purpose - maybe through a static hash table or something?
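One possible reading of the "static hash table" idea is a small registry in which each component declares the bit position it uses. This is only a sketch: FlagRegistry, claim() and ownerOf() are hypothetical names, and nothing like this appears in the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry recording which component uses which bit position. */
final class FlagRegistry {
    private static final Map<Integer, String> OWNERS = new HashMap<>();

    private FlagRegistry() {}

    /** Claims a bit position for an owner; fails if another owner already holds it. */
    static synchronized void claim(int position, String owner) {
        if (position < 0) {
            throw new IllegalArgumentException("negative bit position: " + position);
        }
        String existing = OWNERS.putIfAbsent(position, owner);
        if (existing != null && !existing.equals(owner)) {
            throw new IllegalStateException(
                "bit " + position + " already claimed by " + existing);
        }
    }

    /** Returns the owner of a bit position, or null if it is unclaimed. */
    static synchronized String ownerOf(int position) {
        return OWNERS.get(position);
    }
}
```

A tokenizer could call FlagRegistry.claim(0, "WikipediaTokenizer") once at class-load time, so that a conflicting claim fails early instead of silently corrupting another component's bit.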
  • Grant Ingersoll (JIRA) at Jan 16, 2008 at 5:27 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559595#action_12559595 ]

    Grant Ingersoll commented on LUCENE-1137:
    -----------------------------------------

    {quote}
    The information encoded by BitSet is a set of <int,boolean> tuples. These are opaque values. In order for this to work, every tokenizer in the chain has to be aware of every other one's use of these. This makes sharing hard.
    {quote}

    To some extent, though, the same is true for the current type() functionality. One may decide to change the type, based on the value of the current type.

    While I agree that sharing is hard, it is not impossible: one just needs to communicate which bits are available. I suppose I could look at adding an isClaimed(int position) method or something like that, whereby one can query the chain to see whether anyone claims ownership of that position. I'll give that a try. However, to some extent, I also think it is buyer beware: TokenFilters further down the chain just need to be aware of what is going on. This is part of constructing an Analyzer that works.

    As for the naming, I suppose we could do Flags, as Yonik suggests.
  • Grant Ingersoll (JIRA) at Jan 16, 2008 at 5:41 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559606#action_12559606 ]

    Grant Ingersoll commented on LUCENE-1137:
    -----------------------------------------

    Never mind the isClaimed() idea; I don't see a good way to make that work.
  • Yonik Seeley (JIRA) at Jan 16, 2008 at 5:43 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559607#action_12559607 ]

    Yonik Seeley commented on LUCENE-1137:
    --------------------------------------

    If we go with the bitset (int or long!!!), "type" could be deprecated... there's no reason to have both.

    StandardTokenizer could define constants to replace

        public static final String [] TOKEN_TYPES = new String [] {
            "<ALPHANUM>",
            "<APOSTROPHE>",
            "<ACRONYM>",
            "<COMPANY>",
            "<EMAIL>",
            "<HOST>",
            "<NUM>",
            "<CJ>"
        };

    StandardTokenizer.ALPHANUM, etc.
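As a rough illustration of the constants proposed here, each legacy type string could map to one bit, using its index in TOKEN_TYPES as the bit position. The class StandardTypeFlags, the bit assignments, and the fromTypeString() helper are assumptions made for this sketch, not part of StandardTokenizer.

```java
/** Sketch: one bit per legacy StandardTokenizer type string (assumed assignment). */
final class StandardTypeFlags {
    static final String[] TOKEN_TYPES = {
        "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>",
        "<EMAIL>", "<HOST>", "<NUM>", "<CJ>"
    };

    // Bit position equals the index of the string in TOKEN_TYPES.
    static final int ALPHANUM = 1 << 0;
    static final int APOSTROPHE = 1 << 1;
    static final int ACRONYM = 1 << 2;
    static final int COMPANY = 1 << 3;
    static final int EMAIL = 1 << 4;
    static final int HOST = 1 << 5;
    static final int NUM = 1 << 6;
    static final int CJ = 1 << 7;

    private StandardTypeFlags() {}

    /** Maps a legacy type string to its flag, or 0 if the string is unknown. */
    static int fromTypeString(String type) {
        for (int i = 0; i < TOKEN_TYPES.length; i++) {
            if (TOKEN_TYPES[i].equals(type)) {
                return 1 << i;
            }
        }
        return 0;
    }
}
```

A mapping like fromTypeString() would let existing code that still produces type strings feed a flags-based filter during a deprecation period.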
  • Grant Ingersoll (JIRA) at Jan 16, 2008 at 6:43 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Grant Ingersoll updated LUCENE-1137:
    ------------------------------------

    Attachment: LUCENE-1137.patch

    Per feedback from Yonik, changed this to use an int. The clear() method sets the flags value back to 0.
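The behaviour described in this update can be pictured with a stand-in class. FlaggedToken below is a hypothetical simplification and not Lucene's Token; only the int flags field, its accessors, and the reset in clear() reflect what the comment states.

```java
/** Hypothetical stand-in for a token carrying int flags; not Lucene's Token class. */
final class FlaggedToken {
    private String term;
    private int flags; // defaults to 0, meaning no flags set

    FlaggedToken(String term) {
        this.term = term;
    }

    String getTerm() {
        return term;
    }

    int getFlags() {
        return flags;
    }

    void setFlags(int flags) {
        this.flags = flags;
    }

    /** Resets the token for reuse; the flags value goes back to 0. */
    void clear() {
        term = null;
        flags = 0;
    }
}
```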
  • Steven Rowe (JIRA) at Jan 16, 2008 at 7:18 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559637#action_12559637 ]

    Steven Rowe commented on LUCENE-1137:
    -------------------------------------

    Looks like the constructors still take a BitSet???

    My vote is for long instead of int, to maximize forward compatibility...
  • Grant Ingersoll (JIRA) at Jan 16, 2008 at 7:31 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Grant Ingersoll updated LUCENE-1137:
    ------------------------------------

    Attachment: LUCENE-1137.patch

    Let's try a patch that actually compiles
  • Grant Ingersoll (JIRA) at Jan 24, 2008 at 3:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Grant Ingersoll resolved LUCENE-1137.
    -------------------------------------

    Resolution: Fixed
    Lucene Fields: (was: [New])

    Committed on 614891

Discussion Overview
group: dev @
categories: lucene
posted: Jan 16, '08 at 4:32p
active: Jan 24, '08 at 3:03p
posts: 11
users: 1
website: lucene.apache.org