Hi,

I found a blog post from 2008 that says there will be additional custom attributes for tokens in the future, and that they will be searchable.
What is the status of these?

Jan

  • Simon Willnauer at Nov 23, 2010 at 3:43 pm
    Attribute Serialization is not implemented yet, not even in trunk. You
    can use payloads instead.

    Simon
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jan Kurella at Nov 23, 2010 at 3:51 pm
    Yes, I will use payloads. But they are applied at scoring time, not at search time. I just wanted to know whether anything like that exists.

    "Not even in trunk": does this mean there is an ongoing discussion about this somewhere? I'm just curious.

    Jan

  • Simon Willnauer at Nov 23, 2010 at 4:49 pm

    On Tue, Nov 23, 2010 at 4:50 PM, wrote:
    > Yes, I will use payloads. But they are applied at scoring time, not at search time. I just wanted to know whether anything like that exists.

    So what is the difference? Maybe you can elaborate a little on what
    you are trying to do?

    simon
  • Jan Kurella at Nov 24, 2010 at 8:13 am
    Of course:

    We are trying to search documents that contain text in several languages. We are also investigating other approaches*, so this is not about finding alternatives.
    The goal is to match only tokens from one or more given languages, and not to match a token that happens to be spelled the same in another language.

    For the payloads, my plan is to attach the correct language to every token during indexing (I'm not sure how best to solve this, but I'm sure it can be done with Lucene directly).
    On the search side, my current idea is to wrap a TermPositions and skip all docs where the current payload is not one of the requested languages.
    I probably need to use my own Query/Weight for this?
    Another approach would be to just override the Similarity, but this will only influence scoring and, depending on the underlying query, not completely skip the token. I still have to test how the final score differs between these approaches.

    This one blog post made me curious whether there is already something similar that skips TermPositions based on given attributes. I could imagine something like the current token attribute concept at index time, but also available during search and controlled by a Similarity...

    Jan

  • Simon Willnauer at Nov 25, 2010 at 9:41 am
    Hi Jan,
    On Wed, Nov 24, 2010 at 9:12 AM, wrote:
    > On the search side, my current idea is to wrap a TermPositions and skip all docs where the current payload is not one of the requested languages.
    > I probably need to use my own Query/Weight for this?

    You don't need to start from nothing here. I suggest you look at
    SpanTermQuery and TermSpans, which uses DocsAndPositionsEnum (or rather
    TermPositions in non-trunk versions). TermSpans gives you the ability
    to override #next() and #skipTo(), which, from what I understand, is what
    you are looking for, right?

    > Another approach would be to just override the Similarity, but this will only influence scoring and, depending on the underlying query, not completely skip the token. I still have to test how the final score differs between these approaches.

    Well, as you figured correctly, this is really rather for scoring.

    > This one blog post made me curious whether there is already something similar that skips TermPositions based on given attributes. I could imagine something like the current token attribute concept at index time, but also available during search and controlled by a Similarity...

    Actually, in Lucene 4.0 each Flex-Enum has an AttributeSource that
    allows you to add custom attributes to your enumerations. There is
    no logic yet that skips based on that, though.

    Simon
  • Jan Kurella at Nov 25, 2010 at 2:26 pm
    Hi Simon,
    On 25.11.2010 10:40, ext Simon Willnauer wrote:
    > You don't need to start from nothing here. I suggest you look at
    > SpanTermQuery and TermSpans, which uses DocsAndPositionsEnum (or rather
    > TermPositions in non-trunk versions). TermSpans gives you the ability
    > to override #next() and #skipTo(), which, from what I understand, is what
    > you are looking for, right?

    Just to get it right: I only subclass SpanTermQuery to override the
    getSpans(Reader) method so that it returns MyTermSpans.
    MyTermSpans is a subclass of TermSpans in which I just extend #next() and
    #skipTo() to keep advancing until my desired payload is found.

    Sounds pretty easy and straightforward.

    > Well, as you figured correctly, this is really rather for scoring.

    So if I'm also going to use the scoring part, I'd rather subclass
    PayloadTermQuery then.

    > Actually, in Lucene 4.0 each Flex-Enum has an AttributeSource that
    > allows you to add custom attributes to your enumerations. There is
    > no logic yet that skips based on that, though.

    Lucene 4.0 is still a little far away today? If the above approach performs
    well (and it sounds like it will), it should be good enough for now.

    Jan
  • Simon Willnauer at Nov 25, 2010 at 3:56 pm

    On Thu, Nov 25, 2010 at 3:25 PM, Jan Kurella wrote:
    > Just to get it right: I only subclass SpanTermQuery to override the
    > getSpans(Reader) method so that it returns MyTermSpans.
    > MyTermSpans is a subclass of TermSpans in which I just extend #next() and
    > #skipTo() to keep advancing until my desired payload is found.

    That sounds about right...

    > So if I'm also going to use the scoring part, I'd rather subclass
    > PayloadTermQuery then.

    Hmm, I am not a span expert, but I guess that would make it easier.

    > Lucene 4.0 is still a little far away today? If the above approach performs
    > well (and it sounds like it will), it should be good enough for now.

    I was just saying that this is on the way... and yeah, you might need
    to wait a bit until 4.0 :)

    simon
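
The approach agreed on in this thread (subclass TermSpans and have #next() and #skipTo() keep advancing until the payload names a requested language) can be sketched without Lucene on the classpath. What follows is a minimal, hypothetical model in plain Java, not Lucene's API: Posting, ListSpans, LanguageFilteredSpans, and the one-byte language codes EN and DE are all invented for illustration. ListSpans stands in for TermSpans, under the assumption that skipTo() moves to the first occurrence beyond the current one whose document number is at least the target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** One occurrence of a term: document, position, and a one-byte language payload (hypothetical). */
final class Posting {
    final int doc;
    final int position;
    final byte language;

    Posting(int doc, int position, byte language) {
        this.doc = doc;
        this.position = position;
        this.language = language;
    }
}

/** Hypothetical stand-in for TermSpans: walks one term's occurrences in (doc, position) order. */
class ListSpans {
    private final List<Posting> postings;
    private int index = -1;

    ListSpans(List<Posting> postings) {
        this.postings = postings;
    }

    /** Steps forward by one occurrence; returns false once the list is exhausted. */
    private boolean advance() {
        if (index < postings.size()) {
            index++;
        }
        return index < postings.size();
    }

    boolean next() {
        return advance();
    }

    /** Moves to the first occurrence beyond the current one whose doc is >= target. */
    boolean skipTo(int target) {
        do {
            if (!advance()) {
                return false;
            }
        } while (doc() < target);
        return true;
    }

    int doc() { return postings.get(index).doc; }
    int position() { return postings.get(index).position; }
    byte language() { return postings.get(index).language; }
}

/** Keeps advancing until the payload is one of the requested languages. */
class LanguageFilteredSpans extends ListSpans {
    private final Set<Byte> allowed;

    LanguageFilteredSpans(List<Posting> postings, Set<Byte> allowed) {
        super(postings);
        this.allowed = allowed;
    }

    @Override
    boolean next() {
        while (super.next()) {
            if (allowed.contains(language())) {
                return true;
            }
        }
        return false;
    }

    @Override
    boolean skipTo(int target) {
        if (!super.skipTo(target)) {
            return false;
        }
        // The landing occurrence may be in the wrong language; fall through to the filtered next().
        return allowed.contains(language()) || next();
    }
}

final class LanguageFilterDemo {
    static final byte EN = 1, DE = 2; // made-up language codes

    /** Occurrences of a token such as "die", which exists in both English and German. */
    static List<Posting> sample() {
        return List.of(
            new Posting(0, 3, DE), new Posting(0, 7, EN), new Posting(1, 2, DE),
            new Posting(2, 1, EN), new Posting(4, 5, EN), new Posting(4, 9, DE));
    }

    /** Drains the spans into "doc:position" strings. */
    static List<String> collect(ListSpans spans) {
        List<String> hits = new ArrayList<>();
        while (spans.next()) {
            hits.add(spans.doc() + ":" + spans.position());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [0:7, 2:1, 4:5]
        System.out.println(collect(new LanguageFilteredSpans(sample(), Set.of(EN))));
    }
}
```

Document 1 contains only a German occurrence, so it is never reported at all, which is the difference from a Similarity-based approach that would merely change its score. In a real implementation the language byte would come from the payload stored at index time rather than from an in-memory list.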

Discussion Overview
group: java-user
category: lucene
posted: Nov 23, '10 at 1:44p
active: Nov 25, '10 at 3:56p
posts: 8
users: 2 (Simon Willnauer: 4 posts, Jan Kurella: 4 posts)
website: lucene.apache.org
