large document with multiple fields performance
Hello,



I am new to Lucene and am building an application that requires searching
documents with many fields.

A "project" ID is stored (not_analyzed), and all matching project IDs
are returned and used to join other data from a database.

Will it provide better performance to store each comment field in a
separate document with the project ID and a comment ID, or to store all
the comments for a single project in a single document with multiple
fields?



Thanks,



Steve Greene


  • Anshum at Sep 8, 2009 at 12:47 pm
    Hi Stephen,
    Could you clarify the requirement? Do you intend to have the data in the
    index as:
    Document {
    String Comment;
    String CommentId;
    String ProjectId;
    }

    How do you intend to index it, in terms of the document structure? Is
    there a primary key? What would you search on, and what would you want
    as the result?
    All said and done, it's not really an overhead as long as the number of
    fields is within normal bounds.


    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


  • Stephen Greene at Sep 8, 2009 at 1:27 pm
    Hi Anshum,

    Thank you for your reply. I have two options I am considering.
    One would be:
    Document {
    String projectID;
    String generalComment;
    String workHistoryComment;
    String environmentalComment;
    String claimsComment;
    ...
    }

    And the document may contain upwards of 20 comment fields.

    The other option would be to normalize the data:
    Document {
    String projectID;
    String commentType;
    String comment;
    }

    I will need to return only the projectID for all found documents. I have
    implemented a custom Collector to capture the projectID for each
    document. Then it occurred to me that I might be better served by the
    normalized document model. But I am wondering which method will have
    better performance: possibly returning 20 documents per hit, or having
    to search 20 fields per document? (This also has implications for the
    query: each search term will always search all fields, which is
    somewhat easier in the normalized model than creating 20 "or"
    queries.)

    Thanks,

    Steve
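
With the normalized model described above, several comment documents can match for the same project, so whatever collects the results has to return each projectID only once. The following is a minimal plain-Java sketch of that deduplication step, not Lucene code: the `CommentDoc` class, the `matchingProjectIds` helper, and the substring match are all assumptions made for illustration, standing in for an index and a custom Collector.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NormalizedModelSketch {
    // Hypothetical stand-in for one normalized index document.
    static final class CommentDoc {
        final String projectID;
        final String commentType;
        final String comment;

        CommentDoc(String projectID, String commentType, String comment) {
            this.projectID = projectID;
            this.commentType = commentType;
            this.comment = comment;
        }
    }

    // Returns each matching project ID once, in first-seen order.
    static Set<String> matchingProjectIds(List<CommentDoc> docs, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Set<String> ids = new LinkedHashSet<>();
        for (CommentDoc doc : docs) {
            // Naive case-insensitive substring match (assumption);
            // a real index would match analyzed terms instead.
            if (doc.comment.toLowerCase(Locale.ROOT).contains(needle)) {
                ids.add(doc.projectID);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<CommentDoc> docs = Arrays.asList(
            new CommentDoc("P1", "general", "Asbestos found on site"),
            new CommentDoc("P1", "environmental", "Asbestos removal scheduled"),
            new CommentDoc("P2", "claims", "No open claims"));
        // Two comment documents match, but they share one project ID.
        System.out.println(matchingProjectIds(docs, "asbestos"));
    }
}
```

In the single-document model this step is unnecessary, because each project matches at most once.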

  • Anshum at Sep 8, 2009 at 5:08 pm
    Hey Steve,

    I'd suggest you go with the 20-field (non-normalized) model. I've used much
    larger models and they work just fine. There wouldn't be any point in
    increasing the complexity.
    Hope that clarifies things at least a little :)
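
For the 20-field model, the concern raised earlier in the thread was having to write one "or" clause per field. As a rough sketch, the clause list can be generated rather than written by hand. The helper below builds a query string in Lucene's classic `field:term` syntax; the `anyFieldQuery` name is hypothetical, and the sketch assumes the term is a single word with no query-syntax special characters. Lucene also ships a `MultiFieldQueryParser` intended for this kind of expansion, which may be the better choice depending on the version in use.

```java
import java.util.Arrays;
import java.util.List;

final class MultiFieldQuerySketch {
    // Builds "f1:term OR f2:term ..." for the given field names.
    // Assumes `term` needs no escaping (single plain word).
    static String anyFieldQuery(List<String> fields, String term) {
        StringBuilder query = new StringBuilder();
        for (String field : fields) {
            if (query.length() > 0) {
                query.append(" OR ");
            }
            query.append(field).append(':').append(term);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Field names taken from the document layout proposed in the thread.
        List<String> fields = Arrays.asList(
            "generalComment", "workHistoryComment",
            "environmentalComment", "claimsComment");
        System.out.println(anyFieldQuery(fields, "asbestos"));
    }
}
```

The resulting string would still need to be parsed by a query parser before it can be run against an index.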
  • Stephen Greene at Sep 14, 2009 at 1:49 am
    Hi Anshum,

    Thanks for your insight. I will stick with the 20 fields.
    I realized that I had neglected to mention that, in a separate query, I
    will search on the primary key and a search term to return details about
    how many hits come from each field. Is it safe to assume that this will
    also not be a problem, and that implementing a custom hit collector will
    do the trick?

    Thanks again,

    Steve
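
The follow-up question asks for hits per field for one project and one search term. The plain-Java sketch below only illustrates the shape of that result (field name mapped to hit count); it is not Lucene code. The `hitsPerField` helper, the field map, and the simple tokenization on non-alphanumeric characters are assumptions standing in for the index and its analyzer.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class PerFieldHitsSketch {
    // Counts how often `term` occurs in each field of one project's document.
    // Fields without a hit are left out of the result.
    static Map<String, Integer> hitsPerField(Map<String, String> fields, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            int hits = 0;
            // Naive tokenization (assumption): lowercase, then split on
            // anything that is not a letter or a digit.
            String text = field.getValue().toLowerCase(Locale.ROOT);
            for (String token : text.split("[^\\p{L}\\p{N}]+")) {
                if (token.equals(needle)) {
                    hits++;
                }
            }
            if (hits > 0) {
                counts.put(field.getKey(), hits);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> project = new LinkedHashMap<>();
        project.put("generalComment", "Asbestos found; asbestos removed");
        project.put("claimsComment", "No open claims");
        project.put("environmentalComment", "Asbestos survey pending");
        System.out.println(hitsPerField(project, "asbestos"));
    }
}
```

Because the primary key restricts the search to a single document, the per-field work stays small regardless of how many projects are indexed.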


Discussion Overview
group: java-user
category: lucene
posted: Sep 8, '09 at 11:58a
active: Sep 14, '09 at 1:49a
posts: 5
users: 2
website: lucene.apache.org

2 users in discussion

Stephen Greene: 3 posts Anshum: 2 posts
