Hi,

I would like to build a Lucene index over a collection of text files.
The index should be built so that, at search time, I can require that
query term A and query term B occur within the same sentence.

How should I implement the index? Should every sentence get its own
field (making sure it is not a multi-valued field, because the values
would be merged, which is exactly what I do not want)?
Would it be a problem that different documents would then end up with
different numbers of fields?

Thank you in advance!

Best,
Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Ian Lea at Mar 4, 2011 at 10:35 am
    You can use multi-valued fields if you adjust the position
    increment gap. See e.g.
    http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-td1501269.html

    A Google search for "lucene indexing sentences" or similar finds that, and more.


    Different docs can have different fields/different numbers of fields,
    but the position gap approach is probably better.


    --
    Ian.
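
As a rough illustration of the position increment gap idea, here is a small plain-Java model. It does not use any Lucene classes; the class name PositionGapSketch, the whitespace tokenizer, the gap of 100 and the window of 10 are all assumptions made for the sketch. Each sentence is treated as one value of a multi-valued field, and `gap` positions are skipped between consecutive values, so a proximity check with a window smaller than the gap cannot match across a sentence boundary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of token positions in a multi-valued field; not Lucene code.
final class PositionGapSketch {

    // Assigns a position to every whitespace-separated token. Each sentence is
    // one value of the field; `gap` extra positions are skipped before every
    // value that follows already indexed tokens.
    static Map<String, List<Integer>> index(List<String> sentences, int gap) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int next = 0;
        for (String sentence : sentences) {
            if (next > 0) {
                next += gap;
            }
            for (String token : sentence.toLowerCase().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(next);
                next++;
            }
        }
        return postings;
    }

    // True if some occurrence of a and some occurrence of b are at most
    // `window` positions apart.
    static boolean near(Map<String, List<Integer>> postings, String a, String b, int window) {
        for (int p : postings.getOrDefault(a, List.of())) {
            for (int q : postings.getOrDefault(b, List.of())) {
                if (Math.abs(p - q) <= window) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("the quick fox jumps", "a lazy dog sleeps");
        Map<String, List<Integer>> gapped = index(doc, 100);
        System.out.println(near(gapped, "quick", "jumps", 10)); // true: same sentence
        System.out.println(near(gapped, "jumps", "lazy", 10));  // false: gap separates them
    }
}
```

In this model the gap has to be at least as large as the widest window you intend to query with, otherwise a match can still straddle two sentences.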


  • Michael Wiegand at Mar 4, 2011 at 2:41 pm
    Thank you for all these useful hints!

    If I use multi-valued fields together with "modified" position
    increments, I actually distort the positional layout of a document.
    Suppose I want to compare two kinds of retrieval: one that requires
    the query terms to co-occur within the same sentence, and one using
    PhraseQuery (or SpanNearQuery) that only requires the query terms to
    appear within a text window of n words, where the window may also
    cross sentence boundaries. For the second kind I would need to create
    another index in which the position increments are not modified.
    Is that right?

    Best,
    Michael

  • Ian Lea at Mar 4, 2011 at 3:18 pm
    Another index, or a different field in the same index but without the
    modified gaps. Maybe PerFieldAnalyzerWrapper would help - one
    Analyzer for field x with modified gaps and a different one for field
    y with standard gaps.


    --
    Ian.
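
The two-field suggestion can be pictured with another plain-Java toy model, again without any Lucene classes. The field names "sentences" and "plain", the gap of 1000, and the class name TwoFieldGapSketch are hypothetical choices for illustration. The same text is positioned once with a large gap between sentences and once with no gap, so the same window check gives a sentence-restricted answer on one field and an unrestricted answer on the other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of one document indexed under two fields with different gaps;
// it mirrors the idea only and uses no Lucene classes.
final class TwoFieldGapSketch {

    // Hypothetical field names and gap sizes, chosen for illustration.
    static final Map<String, Integer> GAP_BY_FIELD = Map.of("sentences", 1000, "plain", 0);

    // Returns the positions of `term` when `sentences` are indexed under `field`.
    static List<Integer> positions(String field, List<String> sentences, String term) {
        int gap = GAP_BY_FIELD.getOrDefault(field, 0);
        List<Integer> found = new ArrayList<>();
        int next = 0;
        for (String sentence : sentences) {
            if (next > 0) {
                next += gap; // skip positions between consecutive values
            }
            for (String token : sentence.toLowerCase().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                if (token.equals(term)) {
                    found.add(next);
                }
                next++;
            }
        }
        return found;
    }

    // Window match: some occurrence of a lies within `window` positions of some b.
    static boolean matches(String field, List<String> sentences, String a, String b, int window) {
        for (int p : positions(field, sentences, a)) {
            for (int q : positions(field, sentences, b)) {
                if (Math.abs(p - q) <= window) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("the quick fox jumps", "a lazy dog sleeps");
        System.out.println(matches("sentences", doc, "jumps", "lazy", 5)); // false
        System.out.println(matches("plain", doc, "jumps", "lazy", 5));     // true
    }
}
```

Keeping both fields in one index costs extra space, but it lets the two retrieval strategies be compared against the same documents.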


  • Michael Wiegand at Mar 10, 2011 at 4:34 pm
    Conceptually, I think I know what to do. Unfortunately, I am having some
    difficulty with the interfaces Lucene provides.

    If I add the content of a document sentence by sentence, i.e. line by
    line (using a multi-valued field), there are only two constructors I
    can use:
    Field(String name, String value, Field.Store store, Field.Index index)
    or
    Field(String name, String value, Field.Store store, Field.Index index,
    Field.TermVector termVector)
    Each sentence arrives as a String that I read from a BufferedReader
    object with the readLine() method.

    But as far as I understand, I need access to some TokenStream object in
    order to set the PositionIncrementAttribute. So how should that work?

    Thank you in advance.

  • Ian Lea at Mar 11, 2011 at 10:04 am
    The example code in
    http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-td1501269.html
    reads

    custom standard analyzer:

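    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // IndexFields is that poster's own interface; it presumably defines IFIELD_TEXT.
    public class MyStandardAnalyzer extends StandardAnalyzer implements IndexFields {
        public MyStandardAnalyzer(Version matchVersion) {
            super(matchVersion);
        }

        // Gap applied between successive values of a multi-valued field.
        @Override
        public int getPositionIncrementGap(String fieldName) {
            int incrementGap = super.getPositionIncrementGap(fieldName);
            if (fieldName.equals(IFIELD_TEXT)) {
                incrementGap += 10; // widen the gap only for this field
            }
            return incrementGap;
        }
    }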

    So if you use this analyzer and create

    new Field(IFIELD_TEXT, value, ...) and

    new Field("someothername", value, ...)

    the first field gets the modified gaps and the second one does not.


    Hope that helps.


    --
    Ian.


Discussion Overview
group: java-user @
categories: lucene
posted: Mar 4, '11 at 7:06a
active: Mar 11, '11 at 10:04a
posts: 6
users: 2
website: lucene.apache.org

2 users in discussion

Ian Lea: 3 posts
Michael Wiegand: 3 posts
