not indexing analyzed field
I used KeywordAnalyzer and KeywordTokenizer as templates for
a new analyzer.
The analyzer works fine but the result never reaches the index.

My analyzer is called in "DocInverterPerField.processFields"
with "stream.incrementToken()".
...
try {
    boolean hasMoreTokens = stream.incrementToken();

    fieldState.attributeSource = stream;

    OffsetAttribute offsetAttribute = fieldState.attributeSource.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posIncrAttribute = fieldState.attributeSource.addAttribute(PositionIncrementAttribute.class);

    consumer.start(field);
...

The result goes to "fieldState.attributeSource" but is not in "field".
So "field.fieldsData" still has the old content from before my analyzer
was called. And when calling "consumer.start(field)" the old content goes
into the index, not the newly analyzed one.
Does the analyzer have to care about "Fieldable field.fieldsData",
or who is responsible for it?
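
For reference, a rough sketch of what such a digest analyzer might look like. This is a hypothetical reconstruction, not the actual code from this thread: it assumes Lucene 3.1+ style attributes (CharTermAttribute; older releases would use TermAttribute), MD5 as the digest, and the filter class name is invented.

import java.io.IOException;
import java.io.Reader;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class TextMessageDigestAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // the whole field value becomes a single token, which is then digested
        return new MessageDigestFilter(new KeywordTokenizer(reader));
    }
}

final class MessageDigestFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    MessageDigestFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(termAtt.toString().getBytes("UTF-8"));
            // 32 zero-padded, lower-case hex characters
            String hex = String.format("%032x", new BigInteger(1, hash));
            termAtt.setEmpty().append(hex);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
        return true;
    }
}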

Regards
Bernd


  • Erick Erickson at Nov 25, 2010 at 5:19 pm
    What is your evidence that "the result never reaches the index?"

    Are you sure:
    1> you commit afterwards
    2> you reopen the underlying reader to see the changes
    3> if you don't store the value for the field, how are you sure?
    4> If you search and don't find it, did you index it?

    First, I'd be sure the value in question is in the document just before
    sending it to be added to your index, to see if the value you think is in
    there really is. Something like Document.get() and see if the value is
    actually what you expect.

    Best
    Erick
  • Bernd Fehling at Nov 26, 2010 at 7:11 am
    Hi Erick,

    my evidence is that I load a single document into an empty index
    with a field "id" and a second field "dcdocid". The field "dcdocid"
    has the word "foo". This goes through my analyzer and is changed to
    the MD5 string "acbd18db4cc2f85cedef654fccc4a4d8".
    After indexing and committing, a search for *:* shows me "foo" for
    the field "dcdocid" and not my MD5.

    my fieldType:
    <fieldType name="text_md" class="solr.TextField" omitNorms="true" >
    <analyzer type="index" class="de.ubbielefeld.solr.analysis.TextMessageDigestAnalyzer" />
    </fieldType>

    <!-- UNIQUE ID -->
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="dcdocid" type="text_md" indexed="true" stored="true" />
    <copyField source="id" dest="dcdocid" />

    Using the debugger shows that the value in question goes through the
    TextMessageDigestAnalyzer and comes out as the MD5 hash, but it does not
    appear as MD5 in the index.

    I also tried a filter, but with no success.
    So why is something that is analyzed (and whose value has changed
    due to analysis) not stored with its new value in the index?

    Best regards,
    Bernd


  • Erick Erickson at Nov 26, 2010 at 2:05 pm
    So, you're using Solr, right? And have a custom analyzer? If that's the
    case, Uwe pointed you in the right direction and I think everything may
    be working fine, or at least as I'd expect.

    Specifying stored="true" puts a verbatim, unanalyzed copy of the data
    in the index. When you display a field in a document (i.e. query on *:*),
    the *stored* value is returned, *not* the result of analysis. The stored
    value has nothing to do with what's searched.

    To see if I've got it right, go into the Solr admin page, click on
    "schema browser", and then the field dcdocid should have the MD5 in it.
    If that's true, then Solr is working as expected.

    If the analyzed values were returned, humans would be in a world of hurt,
    since all the transformations would be applied and results pages would be
    gibberish. Imagine applying lowercasing and stemming to the input
    "Running on Empty"; your display would be something like "run on empti".

    And if you're doing pure Lucene, you can see this by enumerating the terms
    in your dcdocid field.
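
    As a rough illustration of that check against the Lucene 3.x API (the index
    path is a placeholder and this is not code from the thread):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class DumpDcdocidTerms {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
            // position the enumeration at the first term of the "dcdocid" field
            TermEnum terms = reader.terms(new Term("dcdocid", ""));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !"dcdocid".equals(t.field())) {
                        break; // ran past the dcdocid field
                    }
                    System.out.println(t.text()); // should print the MD5 hash, not "foo"
                } while (terms.next());
            } finally {
                terms.close();
                reader.close();
            }
        }
    }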

    Best
    Erick
  • Bernd Fehling at Nov 26, 2010 at 2:55 pm
    Hi Erick,

    I see my problem now; it was caused by a misunderstanding of how Lucene
    indexing works. I guess it's due to the fact that FAST Data Search has real
    processing pipelines.

    You're right, I use Solr, but in this special case I really do want to
    change the indexed _and_ stored data.
    For security reasons I have to MD5 (or, even better, SHA-256) the strings
    which go into a field. So what is the point of hashing the string with
    SHA-256 if the plain text of the stored field is still displayed?

    So the Analyzer or Filter should apply a MessageDigest to the content and
    then index _and_ store the digest.

    Can this be achieved somehow with an Analyzer or Filter?
    What is your opinion?

    Maybe a hint which classes to use?

    Kind regards,
    Bernd



  • Erick Erickson at Nov 26, 2010 at 4:28 pm
    Can you "define the problem away"? That is, why do you want to store it at
    all?
    If there's no value to the users in seeing the encoded value, just don't
    store it.
    You can still search on the encoded value if in that case....

    Which is a way of saying that I don't know, off the top of my head, how
    you'd index one thing and store the result of analysis...
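
    In schema terms, "just don't store it" would roughly be the following change
    to the field definition quoted earlier (a sketch, not a tested config):

    <field name="dcdocid" type="text_md" indexed="true" stored="false" />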

    Best
    Erick
  • Bernd Fehling at Nov 26, 2010 at 11:01 pm
    Hi Erick,

    the "problem" can be described as follows:
    - we have a database for users
    - users can search and mark/store records for watching
    - the record marker is the unique path to the source and also the unique record id of the database
    - therefore we decided to sha256 the id as backreference from the watched record database
    to the solr database holding more than 25 mio. records.
    - this way we don't have to care about special characters of the unique id because of the sha256
    - we have a direct "connection" between document database and user database.
    - if the record of the document database has been deleted the customer gets notified by the
    sha256 identifier via the user database.
    ... and so on.

    So we are really interested in getting a MessageDigest which changes the value of a field
    _and_ stores the changed value.
    This is how it currently works in our FAST search engine, and when switching
    to Solr we wanted to keep this feature instead of redesigning our current system.

    That said, I'm still in need of a solution for "processing" and storing the changed value
    of a field.

    Anyway, thanks a lot for your help, Erick, and for giving me some ideas about how
    the Lucene index is "thinking" :-)

    Does anyone else have an idea how to achieve this???

    Best regards,
    Bernd

    Can you "define the problem away"? That is, why do you want to
    store it at
    all?
    If there's no value to the users in seeing the encoded value,
    just don't
    store it.
    You can still search on the encoded value if in that case....

    Which is a way of saying that I don't know, off the top of my
    head, how
    you'd
    index one thing and store the result of analysis...

    Best
    Erick

    On Fri, Nov 26, 2010 at 9:54 AM, Bernd Fehling <
    [email protected]> wrote:
    Hi Erik,

    I see my problem, caused by a misunderstanding of the indexing
    by lucene.
    I guess its due to the fact that FAST Data Search has real
    processing> pipelines.
    Youre right I use Solr but, as a matter of fact, in this
    special case
    I really want to change the indexed _and_ stored data.
    For security reasons I have to MD5 or even better SHA256
    strings which
    go into a field. So where is the sense if I SHA256 the string but
    still display the plain text of the stored field?

    So the Analyzer or Filter should MessageDigest the content and index and
    store it as MessageDigest.

    Can this be achived somehow with Analyzer or Filter,
    what is your opinion?

    May be a hint which classes to use?

    Kind regards,
    Bernd



    Am 26.11.2010 15:05, schrieb Erick Erickson:
    So, you're using Solr, right? And have a custom analyzer? If
    that's the
    case, Uwe pointed you in the right direction and I think
    everything may
    be working fine, or at least as I'd expect.

    Specifying stored="true" puts a verbatim, unanalyzed copy of
    the data
    in the index. When you display a field in a document (i.e. query
    on *:*) the *stored* value is returned, *not* the results of
    analysis.> The
    stored
    value has nothing to do with what's searched.

    To see if I've got it right, go into the admin page of solr,
    click on
    "schema browser",
    and then the field dcdocid should have the MD5 in it. If
    that's true,
    then
    Solr
    is working as expected.

    If the analyzed values were returned, humans would be in a
    world of hurt
    since
    all the transformations would be applied and results pages
    would have
    gibberish.
    Imagine applying lowercase and stemming to input for
    "Running on Empty",
    your
    display would be something like "run on empti".

    And if you're doing pure lucene, you can see this by
    enumerating the
    terms
    in your
    dcdocid field.

    Best
    Erick

    On Fri, Nov 26, 2010 at 2:10 AM, Bernd Fehling <
    [email protected]> wrote:
    Hi Erik,

    my evidence is that I load a single document into an empty index
    with a field "id" and a second field "dcdocid". The field
    "dcdocid"> >> has the word "foo". This goes through my analyzer
    and changes to
    MD5 string which is then "acbd18db4cc2f85cedef654fccc4a4d8".
    After indexing and commit a search for *:* shows me "foo" for
    the field "dcdocid" and not my MD5.

    my fieldType:
    <fieldType name="text_md" class="solr.TextField"
    omitNorms="true" >
    <analyzer type="index"
    class="de.ubbielefeld.solr.analysis.TextMessageDigestAnalyzer" />
    </fieldType>

    <!-- UNIQUE ID -->
    <field name="id" type="string" indexed="true" stored="true"
    required="true"
    />
    <field name="dcdocid" type="text_md" indexed="true"
    stored="true" />
    <copyField source="id" dest="dcdocid" />

    Using the debugger shows that the value in question is going
    through the TextMessageDigestAnalyzer and coming out as MD5
    but it is not as MD5 in the index.

    I also tried a filter but no success.
    So why is something that is analyzed (and the value has changed
    due to analysis) not stored with its new value in the index?

    Best regards,
    Bernd


    Am 25.11.2010 18:18, schrieb Erick Erickson:
    What is your evidence that "the result never reaches the index?"

    Are you sure:
    1> you commit afterwards
    2> you reopen the underlying reader to see
    3> if you don't store the value for the field, how are you sure?
    4> If you search and don't find it, did you index it?

    First, I'd be sure the value in question is in the
    document just before
    sending it to be added to your index to see if the value
    you think
    is in there really is. Something like Document.get() and
    see if
    Best
    Erick

    On Thu, Nov 25, 2010 at 8:08 AM, Bernd Fehling <
    [email protected]> wrote:
    I used KeywordAnalyzer and KeywordTokenizer as templates for
    a new analyzer.
    The analyzer works fine but the result never reaches the index.

    My analyzer is called in "DocInverterPerField.processFields"
    with "stream.incrementToken()".
    ...
    try {
    boolean hasMoreTokens =
    stream.incrementToken();> >>>>
    fieldState.attributeSource = stream;

    OffsetAttribute offsetAttribute =
    fieldState.attributeSource.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute
    posIncrAttribute =
    fieldState.attributeSource.addAttribute(PositionIncrementAttribute.class);> >>>>
    consumer.start(field);
    ...

    The result goes to "fieldState.attributeSource" but is
    not in "field".
    So "field.fieldsData" still has the old content before
    calling my
    analyzer. And when calling "consumer.start(field)" the
    old content
    is going to the index and not the new analyzed one.
    Does the analyzer has to care about "Fieldable
    field.fieldsData"> >>>> or who is responsible for it?
    Regards
    Bernd
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]
  • Uwe Schindler at Nov 25, 2010 at 5:33 pm
    field.fieldsData is used for the stored field contents and so is only *stored*
    in the index, of course not analyzed (why would you analyze a stored field?).
    The indexed tokens of course go through your analyzer, and the returned tokens
    are indexed as terms. Where is the problem?

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: [email protected]

  • Bernd Fehling at Nov 26, 2010 at 7:23 am
    Hi Uwe,

    my fieldType and fields are as follows:

    <fieldType name="text_md" class="solr.TextField" omitNorms="true" >
    <analyzer type="index" class="de.ubbielefeld.solr.analysis.TextMessageDigestAnalyzer" />
    </fieldType>

    <!-- UNIQUE ID -->
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="dcdocid" type="text_md" indexed="true" stored="true" />
    <copyField source="id" dest="dcdocid" />

    So the field dcdocid has the attribute *stored*, which I can also see
    in the debugger.
    Why should I analyze a stored field?
    I don't know if I need to analyze it; I also tried a filter, but also with no success.

    My understanding is that you send something to a field and the field has a processing
    chain. The processing chain analyzes, filters, ... does something to the
    content and then stores the content in that field in the index.

    Maybe it is a misunderstanding on my side about field-based processing,
    because I normally work with FAST search engines, which are document-based.

    Best regards
    Bernd



  • Uwe Schindler at Nov 27, 2010 at 10:34 am
    You have to first understand the difference between "stored" and "indexed":

    - For stored fields no analysis is done, as they are only stored (e.g. for
    display of retrieval results). These are simply copied unchanged into the
    index - and you cannot search on them.
    - Analysis is done on the "indexed" side, where the text is split up into tokens.
    So you won't see analysis occurring on the stored field contents (e.g. when
    you display the results using Lucene's IndexReader.document(int) or Solr's
    result structure). The indexed fields are used when you query the index. For
    queries to work, the search query is also analyzed and split up into tokens,
    and these tokens are searched in the "index". This also implies that the query
    and index analyzers need to be compatible. And that's another problem in
    your schema: you don't have a query analyzer! So your searches would never
    hit any result!
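
    If the field is meant to be searched with the same hashing applied, the
    schema would presumably need a query analyzer as well, for example (a
    sketch reusing the analyzer class from the schema quoted above, not a
    tested config):

    <fieldType name="text_md" class="solr.TextField" omitNorms="true">
        <analyzer type="index" class="de.ubbielefeld.solr.analysis.TextMessageDigestAnalyzer" />
        <analyzer type="query" class="de.ubbielefeld.solr.analysis.TextMessageDigestAnalyzer" />
    </fieldType>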

    I'd suggest reading a book about Lucene/Solr first :-)

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: [email protected]
  • Bernd Fehling at Nov 29, 2010 at 7:03 am
    Hi Uwe,

    Erick explained it pretty well and I got it now, but generally you're right, RTFM ;-)

    Nevertheless I'm in need of the functionality to change the stored
    value during analysis or tokenization or filtering (whatever works).

    That's how it can be done in FAST FDS/ESP (full processing) compared
    to Lucene/Solr (sparse processing).
    Sure, I can make a branch of the trunk, enhance the
    "DocInverterPerField.processFields" class/method and change the line
    "boolean hasMoreTokens = stream.incrementToken();",
    but my hope is still that it is possible without touching the basic
    Lucene code.

    Do you have any idea how to change the stored value during analysis,
    tokenization, or filtering?

    Best regards,
    Bernd


  • Uwe Schindler at Nov 29, 2010 at 7:41 am
    Hi Bernd,

    Nevertheless I'm in need of the functionality to change the stored value
    during analysis or tokenization or filtering (whatever works).

    That's how it can be done in FAST FDS/ESP (full processing) compared to
    Lucene/Solr (sparse processing).

    Do you have any idea how to change the stored value during analysis,
    tokenization, or filtering?

    This is simply not possible, nor even wanted. Stored fields in Lucene are for
    storing arbitrary values in the index (they don't even need to be strings).
    By RTFM I mean that you need to completely separate the two in your head and
    look at them differently. It's indeed not the best idea in Lucene (and Solr)
    to provide them together in one API to the user.

    As stored fields are not processed at all, you can simply process them
    *before* you put them into Lucene! Why do you want to do that in the
    processing pipeline?
    Sure I can make a branch of the trunk and enhance the
    "DocInverterPerField.processFields" class/method and change the line
    "boolean hasMoreTokens = stream.incrementToken();"
    but my hope is still that it is possible without touching the basic Lucene
    code.

    I would never do this, and as noted above, it's unneeded.

    To e.g. store something different from what was being analyzed in Lucene,
    just add the field two times with the same name to o.a.l.d.Document: once as
    stored-only, once as indexed-only. (Internally that's exactly what Lucene
    does when you pass a combined stored/indexed field; stored fields and
    indexed fields are completely different - and that's why you want to change
    DocInverter.)
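
    A minimal sketch of that idea against the Lucene 3.x Field API (the values
    and the SHA-256 helper are illustrative only, not code from this thread):

    import java.math.BigInteger;
    import java.security.MessageDigest;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class StoredDigestExample {
        public static void main(String[] args) throws Exception {
            String raw = "foo";
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(raw.getBytes("UTF-8"));
            String digest = String.format("%064x", new BigInteger(1, hash));

            Document doc = new Document();
            // stored-only copy: this is what IndexReader.document() / Solr results return
            doc.add(new Field("dcdocid", digest, Field.Store.YES, Field.Index.NO));
            // indexed-only copy: this is what becomes the searchable term
            // (alternatively, add the raw value here and let a digesting analyzer handle it)
            doc.add(new Field("dcdocid", digest, Field.Store.NO, Field.Index.NOT_ANALYZED));
            // ... then writer.addDocument(doc) as usual
        }
    }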

    If you are using Solr, there are also possibilities to do this: just
    implement your own FieldType that handles analysis and storing differently.
    E.g. TrieField in Solr works exactly like that (it indexes tokens as binary
    terms and also stores them in a different format). Look at the methods
    toInternal/toExternal.
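
    A very rough sketch of that direction, not verified against any particular
    Solr release (it assumes FieldType.toInternal is applied to the incoming
    value before both storing and indexing; the class name is invented):

    import java.math.BigInteger;
    import java.security.MessageDigest;

    import org.apache.solr.schema.TextField;

    public class MessageDigestTextField extends TextField {
        @Override
        public String toInternal(String val) {
            try {
                // hash the external value once; the same hash is then stored and indexed
                byte[] hash = MessageDigest.getInstance("SHA-256").digest(val.getBytes("UTF-8"));
                return String.format("%064x", new BigInteger(1, hash));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }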

    Uwe

