Hi,

I'm using Lucene 2.9.1 patched with
http://issues.apache.org/jira/browse/LUCENE-1260
For a particular reason I need to find all documents that contain at
least one term in a certain field.
Iterating over the norms array works for this only as long as the field
exists on every document.
For documents without the field, the norms array holds the byte value 124.
Where does 124 come from, and is there a way to change it to another
value, such as -128 (0x80), for non-existent fields?


Benjamin


  • Michael McCandless at Dec 3, 2009 at 5:14 pm
    This isn't easy to change; it's hardcoded to 1.0 in oal.index.NormsWriter,
    and also to 1.0 in SegmentReader (when the field doesn't have norms
    stored, but, e.g., someone requests them anyway). 1.0 encodes to the
    byte 124. I suppose we could empower Similarity to define what the
    "undefined norm value" should be? Wanna make a patch?

    Mike
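
    For reference, a minimal sketch of where 124 comes from and why it can't
    distinguish a missing field from a field with the default norm. It assumes a
    stock Lucene 2.9.x index, that Similarity.encodeNorm/decodeNorm is the
    encode/decode pair in play, and uses placeholder names ("myField",
    "/path/to/index", NormProbe):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Similarity;
    import org.apache.lucene.store.FSDirectory;

    public class NormProbe {
        public static void main(String[] args) throws Exception {
            // Similarity's default norm encoding is a lossy 8-bit float:
            // 1.0f encodes to the byte 124, which is why "no norm stored"
            // reads back as 124 rather than as a distinct sentinel value.
            System.out.println("encodeNorm(1.0f) = " + Similarity.encodeNorm(1.0f));       // 124
            System.out.println("decodeNorm(124)  = " + Similarity.decodeNorm((byte) 124)); // ~1.0

            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
            byte[] norms = reader.norms("myField");  // may be null if the field has no norms at all
            if (norms != null) {
                int nonDefault = 0;
                for (int doc = 0; doc < norms.length; doc++) {
                    // 124 here can mean "field absent" or "field present with the
                    // default norm" -- exactly the ambiguity discussed above.
                    if (norms[doc] != Similarity.encodeNorm(1.0f)) {
                        nonDefault++;
                    }
                }
                System.out.println(nonDefault + " docs have a non-default norm for myField");
            }
            reader.close();
        }
    }
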
  • Christopher Condit at Dec 3, 2009 at 8:04 pm
    The SnowballAnalyzer works well for certain constructs but not for
    others. In particular, I'm having problems with pairs like "colossal"
    vs. "colossus" and "hippocampus" vs. "hippocampal", which don't reduce
    to the same stem.
    Is there a way to customize the analyzer to handle cases like these?
    Thanks,
    -Chris

  • Otis Gospodnetic at Dec 4, 2009 at 2:45 am
    Chris,

    You could look at KStem to see if it does a better job.
    Or perhaps WordNet could be used to get the lemma of those terms
    instead of relying on stemming.
    Finally, using synonyms may be another way to handle this.

    Otis
    --
    Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch


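    If the synonym/override route is taken, one option is a small TokenFilter on
    Lucene 2.9's attribute API that rewrites selected terms so related forms index
    to the same token. This is purely illustrative: StemOverrideFilter and the
    hard-coded mappings are hypothetical, not an existing Lucene class.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    /** Hypothetical filter that maps specific forms to a shared root. */
    public final class StemOverrideFilter extends TokenFilter {
        private final Map<String, String> overrides = new HashMap<String, String>();
        private final TermAttribute termAtt;

        public StemOverrideFilter(TokenStream input) {
            super(input);
            termAtt = addAttribute(TermAttribute.class);
            // Example mappings only -- these roots are made up for illustration.
            overrides.put("colossal", "coloss");
            overrides.put("colossus", "coloss");
            overrides.put("hippocampal", "hippocamp");
            overrides.put("hippocampus", "hippocamp");
        }

        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String replacement = overrides.get(termAtt.term());
            if (replacement != null) {
                termAtt.setTermBuffer(replacement);  // rewrite the token in place
            }
            return true;
        }
    }

    Whether such a filter sits before or after the SnowballFilter in the analyzer
    chain determines whether the map keys should be surface forms (as above) or
    the stemmer's output.
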
  • Erick Erickson at Dec 3, 2009 at 6:33 pm
    It would be clumsier, but you could create a Filter by spinning
    through all the terms in a field and setting the appropriate bit
    for every document that contains at least one of them.

    You could even do this at startup and keep the filters around for
    all the fields you care about, or cache them when first used.

    The advantage I see here is that it wouldn't depend on what looks
    like a peculiarity of the field norms.

    The disadvantage is that I bet it's slower.

    FWIW
    Erick
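
    A minimal sketch of that idea, assuming Lucene 2.9's TermEnum/TermDocs and
    OpenBitSet APIs (FieldExistsFilter is a made-up name, not an existing Lucene
    class): walk every term of the field once, mark each document that carries at
    least one of them, and expose the result as a Filter.

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    /** Matches every document that has at least one term in the given field. */
    public class FieldExistsFilter extends Filter {
        private final String field;

        public FieldExistsFilter(String field) {
            this.field = field;
        }

        public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            OpenBitSet bits = new OpenBitSet(reader.maxDoc());
            TermEnum terms = reader.terms(new Term(field, ""));
            TermDocs termDocs = reader.termDocs();
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field)) {
                        break;                    // ran past the last term of this field
                    }
                    termDocs.seek(t);
                    while (termDocs.next()) {
                        bits.set(termDocs.doc()); // doc has at least one term in the field
                    }
                } while (terms.next());
            } finally {
                terms.close();
                termDocs.close();
            }
            return bits;                          // OpenBitSet is itself a DocIdSet in 2.9
        }
    }

    Building this per field at startup, as suggested, costs one pass over the
    field's postings plus one bit per document.
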
  • Benjamin Heilbrunn at Dec 4, 2009 at 8:41 am
    Erick, I'm not sure I understand you correctly.
    What do you mean by "spinning through all the terms on a field"?

    One option would be to load all unique terms of a field using TermEnum,
    then use TermDocs to get the docs for those terms.
    The remaining docs don't contain any term, so you know that the field
    doesn't exist, or is empty, on those docs.
    Btw: does Lucene distinguish between empty and non-existent fields?

    I think that method would work very well, but it would require building
    and holding an extra data structure.
    My index has about 20 fields and 4 million docs, so the overhead would
    be too large.

    I think using the norms array (which is already there for most of the
    fields) would be a nicer approach.


    Benjamin

  • Erick Erickson at Dec 4, 2009 at 1:54 pm
    The word "Filter" as part of a class is overloaded in Lucene <G>....

    See: http://lucene.apache.org/java/2_9_1/api/all/index.html

    The above filter is just a DocIdSet, one bit per document. So
    in your example, you're only talking 12M or so, even if you
    create one filter for every field and keep it around.

    You *might* get some joy from, say, QueryWrapperFilter, although
    I don't know if it handles pure wildcard terms (e.g. field:*)...

    If that doesn't work out of the box, I *think* you can use TermDocs
    with a term like field:"" and just keep marching until next() returns
    false, merrily setting your Filter bits for each Doc returned by
    the enumerator.....

    HTH
    Erick

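    To illustrate the keep-it-around / cache-on-first-use idea: a sketch using the
    hypothetical FieldExistsFilter from the earlier message together with Lucene
    2.9's CachingWrapperFilter and MatchAllDocsQuery ("title" and the index path
    are placeholders):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class FieldExistsSearch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            // Wrap once and reuse: CachingWrapperFilter computes the DocIdSet the
            // first time it is used against a reader and caches it afterwards.
            Filter hasTitle = new CachingWrapperFilter(new FieldExistsFilter("title"));

            // "All docs that have at least one term in 'title'":
            TopDocs hits = searcher.search(new MatchAllDocsQuery(), hasTitle, 10);
            System.out.println(hits.totalHits + " documents have the field");

            searcher.close();
            reader.close();
        }
    }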
