Hi, I was recently looking to incorporate Lucene for a simple
"parametric"/"faceted" type search. The documents are very small:
roughly 15 fields of short length (5-15 characters, generally strings
and padded integers). I profiled the query performance of our
application, which inserts 1 million documents and then
1) filters on 1-3 fields with simple boolean/term matches
2) stores these docids in a BitSet
3) calls IndexSearcher.doc() to retrieve all matching documents (all
fields, 100 - 1,000,000 results per call)

It turns out that 98% of the query time was spent not actually doing the
query, but within the IndexSearcher.doc() call.
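
For readers following along, the access pattern described above can be sketched in plain Java without Lucene. This is only an illustration: `BitSetFetch` and the `loader` callback are hypothetical names, and the callback stands in for whatever per-document fetch is used (IndexSearcher.doc() in the original application). The point is that the loader runs once per set bit, so with up to 1,000,000 hits its cost dominates.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntFunction;

final class BitSetFetch {
    /**
     * Calls the loader once for every set bit (matching doc id), in
     * ascending order. The loader is a stand-in for a per-document
     * fetch; it is not a Lucene API.
     */
    static <T> List<T> fetchAll(BitSet hits, IntFunction<T> loader) {
        List<T> out = new ArrayList<>(hits.cardinality());
        for (int docId = hits.nextSetBit(0); docId >= 0; docId = hits.nextSetBit(docId + 1)) {
            out.add(loader.apply(docId));
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(3);
        hits.set(7);
        hits.set(42);
        // Hypothetical loader that just formats the doc id.
        System.out.println(fetchAll(hits, id -> "doc-" + id)); // prints [doc-3, doc-7, doc-42]
    }
}
```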

My first question is, is there any way to more efficiently get
(all/most) of the fields for a set of documents, other than iterating
and calling doc()?

Additionally, is there any way (or planned feature) to index *binary*
data? Using a profiler, I have determined that String decoding is a
significant performance limiter for my use-case:

90% of the application time is spent in this method:
---------------------------------------
org.apache.lucene.index.FieldsReader.addField(Document, FieldInfo, boolean, boolean, boolean)

46% of the application time is spent decoding strings (half of the above
addField() time):
---------------------------------------
org.apache.lucene.store.IndexInput.readString()
java.lang.String.<init>(byte[], int, int, String)
java.lang.StringCoding.decode(String, byte[], int, int)
java.lang.StringCoding$StringDecoder.decode(byte[], int, int)

(YJP profiler output available if needed)

String.intern() used to be my top hot spot, but my patch for it was
accepted and fixed this: https://issues.apache.org/jira/browse/LUCENE-1600.
I'm not familiar enough with the Lucene codebase to figure out the above
though, so I thought I would ask.



// Ideally I'd be able to add a binary field like this:
doc.add(new Field("f1", new byte[]{1,2,3,4}, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

// and then query like this:
Query q = new TermQuery(new Term("f1", new byte[]{1,2,3,4}));
searcher.search(q,...);

This would allow me to avoid the Integer -> String -> Padded String ->
String -> Integer coding/decoding to index an integer, and avoid the
Object -> String -> Object conversion (which per above is quite expensive).
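
To make the two round trips concrete, here is a minimal plain-Java sketch, not Lucene code. `IntTermEncoding`, the 10-digit pad width, and the 4-byte big-endian layout are assumptions chosen for illustration; the thread does not say how the integers are actually padded.

```java
import java.nio.ByteBuffer;

final class IntTermEncoding {
    /**
     * Zero-padded decimal form. For non-negative ints, padding to a fixed
     * width makes lexicographic order agree with numeric order.
     */
    static String toPadded(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        return String.format("%010d", value);
    }

    /** Parses the padded form back; leading zeros are accepted by parseInt. */
    static int fromPadded(String term) {
        return Integer.parseInt(term);
    }

    /** Fixed 4-byte big-endian form, with no text conversion involved. */
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    static int fromBytes(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        System.out.println(toPadded(42));                // prints 0000000042
        System.out.println(fromPadded(toPadded(42)));    // prints 42
        System.out.println(fromBytes(toBytes(42)));      // prints 42
    }
}
```

The string path formats and parses decimal text on every round trip; the byte path only copies four bytes.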


Thanks for any help!


Regards,

Patrick

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Khawaja Shams at Apr 14, 2009 at 10:40 pm
    Hi,
    It is not a good idea to extract each document. You can be more efficient
    by only looking at the fields you are interested in. Depending on the size
    of your index, you can try:

    String[] codes = FieldCache.DEFAULT.getStrings(indexReader, fieldName);


    This returns a String[] whose length is the number of documents in
    your index. If you are doing faceted searching, you may want to try:

    StringIndex stringIndex = FieldCache.DEFAULT.getStringIndex(indexReader,
    fieldName);

    The StringIndex class has a lookup array and an order array. The order array
    contains a value for each document id, and you can use this value to extract
    the string from the lookup array once you are done counting.
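
The lookup/order idea can be sketched in plain Java as below. This is a simplified model under stated assumptions, not Lucene's StringIndex implementation: `OrdinalFacets` is a hypothetical class, every document is assumed to have exactly one non-null value, and the arrays are built from an in-memory String[] rather than from an index.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

final class OrdinalFacets {
    final String[] lookup; // sorted distinct values
    final int[] order;     // order[docId] = position of that doc's value in lookup

    OrdinalFacets(String[] valuePerDoc) {
        lookup = new TreeSet<>(Arrays.asList(valuePerDoc)).toArray(new String[0]);
        order = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            order[doc] = Arrays.binarySearch(lookup, valuePerDoc[doc]);
        }
    }

    /** Counts hits per value using int ordinals; strings are touched only at the end. */
    Map<String, Integer> count(BitSet hits) {
        int[] counts = new int[lookup.length];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            counts[order[doc]]++;
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int ord = 0; ord < lookup.length; ord++) {
            if (counts[ord] > 0) {
                result.put(lookup[ord], counts[ord]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        OrdinalFacets colors = new OrdinalFacets(new String[] {"red", "blue", "red", "green"});
        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(2);
        hits.set(3);
        System.out.println(colors.count(hits)); // prints {green=1, red=2}
    }
}
```

The counting loop works on ints only, which is why this approach avoids the per-document string decoding seen in the profile.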


    Perhaps the Lucene experts can shed light on a better approach.

    You may also want to look at SOLR for faceted searching support :). HTH.


    Regards,
    Khawaja Shams
  • Eks dev at Apr 14, 2009 at 11:19 pm
    You can store binary values, e.g. with:
    Field(String name, byte[] value, Field.Store store)

    You could store all your fields as byte[], so you get them back as byte[]. How you index them is a separate problem, but since search speed is not the issue in your case, leave that part as it is.

    Try simply creating a pair of fields for each field you have now: one stored and not indexed, the other indexed and not stored. Or keep the fields you search on as indexed only, plus one big byte[] field (a blob) in which you encode the whole document... if the encoding gets complex, you could try protobuf, thrift...
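
A minimal sketch of the "one big byte[] field" idea, in plain Java. The record layout and the field names (`year`, `price`, `model`) are hypothetical, invented only for illustration; the resulting byte[] is what would be handed to the stored binary field mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RecordBlob {
    final int year;
    final int price;
    final String model;

    RecordBlob(int year, int price, String model) {
        this.year = year;
        this.price = price;
        this.model = model;
    }

    /** Layout: year (4 bytes), price (4 bytes), model length (4 bytes), model as UTF-8. */
    byte[] encode() {
        byte[] modelBytes = model.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(12 + modelBytes.length);
        buf.putInt(year).putInt(price).putInt(modelBytes.length).put(modelBytes);
        return buf.array();
    }

    /** Reads the fields back in the same order they were written. */
    static RecordBlob decode(byte[] blob) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        int year = buf.getInt();
        int price = buf.getInt();
        byte[] modelBytes = new byte[buf.getInt()];
        buf.get(modelBytes);
        return new RecordBlob(year, price, new String(modelBytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] blob = new RecordBlob(2009, 15000, "Civic").encode();
        RecordBlob back = RecordBlob.decode(blob);
        System.out.println(blob.length + " bytes: " + back.year + " " + back.price + " " + back.model);
        // prints 17 bytes: 2009 15000 Civic
    }
}
```

With this layout the numeric fields never pass through a String, and only fields that really are text pay for decoding.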

    Anyhow, your idea of byte[] as an indexed, searchable unit is maybe not all that bad, but it does not look like you need it, and it is not an easy thing to change (I guess).

    ----- Original Message ----
    From: "Eger, Patrick" <peger@automotive.com>
    To: java-user@lucene.apache.org
    Sent: Tuesday, 14 April, 2009 20:12:34
    Subject: Binary indexing / query efficiency

