FAQ
Hi,

I have been experimenting with using a int payload as a unique identifier, one per Document. I have successfully loaded them in using the TermPositions API with something like:

public static void loadPayloadIntArray(IndexReader reader, Term term, int[] intArray, int from, int to) throws Exception {
TermPositions tp = null;
byte[] dataBuffer = new byte[4];
int intVal = -1;

//System.out.println("loadPayloadIntArray(" + from + " -> " + to + ")");
try {
tp = reader.termPositions(term);
tp.skipTo(from);
do {
tp.nextPosition();
tp.getPayload(dataBuffer, 0);
intVal = PayloadHelper.decodeInt(dataBuffer, 0);
intArray[from] = intVal;
} while (++from <= to && tp.next());
}
finally {
if (tp != null) {
tp.close();
}
}
}


This works fine, it even allows me to use several threads in parallel to each load a part of the array, thus reducing the time taken.
However, performance is not as great as I would like - with indexes of ~4m docs it can take over 3 seconds. (These are cold start times, no reader cacheing and OS read caches completely flushed). Increasing the number of loader threads (calling the method shown) doesn't help much, which makes me suspect there is some synchronized stuff going on inside TermPositions.

So ... onto my question(s):

1) Because this approach is still faster than using FieldCache, my belief is that payloads are stored sequentially in the index - is this true?
2) If so, is there some raw, lower level I/O I can call which will allow me to load much bigger chunks of payload rather than a single int at a time?

Thanks for any help in this area.

Chris

Search Discussions

  • Michael McCandless at May 3, 2011 at 10:31 am

    On Tue, May 3, 2011 at 5:35 AM, Chris Bamford wrote:
    Hi,

    I have been experimenting with using a int payload as a unique identifier, one per Document.  I have successfully loaded them in using the TermPositions API with something like:

    public static void loadPayloadIntArray(IndexReader reader, Term term, int[] intArray, int from, int to) throws Exception {
    TermPositions tp = null;
    byte[] dataBuffer = new byte[4];
    int intVal = -1;

    //System.out.println("loadPayloadIntArray(" + from + " -> " + to + ")");
    try {
    tp = reader.termPositions(term);
    tp.skipTo(from);
    do {
    tp.nextPosition();
    tp.getPayload(dataBuffer, 0);
    intVal = PayloadHelper.decodeInt(dataBuffer, 0);
    intArray[from] = intVal;
    } while (++from <= to && tp.next());
    }
    finally {
    if (tp != null) {
    tp.close();
    }
    }
    }


    This works fine, it even allows me to use several threads in parallel to each load a part of the array, thus reducing the time taken.
    However, performance is not as great as I would like - with indexes of ~4m docs it can take over 3 seconds.  (These are cold start times, no reader cacheing and OS read caches completely flushed).   Increasing the number of loader threads (calling the method shown) doesn't help much, which makes me suspect there is some synchronized stuff going on inside TermPositions.
    Since it's cold, likely your bottleneck is the IO system, and that's
    why adding threads doesn't help?
    1) Because this approach is still faster than using FieldCache, my belief is that payloads are stored sequentially in the index - is this true?
    They are... they are encoded into the positions (*.prx) file.
    2) If so, is there some raw, lower level I/O I can call which will allow me to load much bigger chunks of payload rather than a single int at a time?
    Not really... I mean you could try to parse the bytes yourself from
    the prx file, but Lucene doesn't have existing APIs to bulk load from
    here.

    In trunk we are working towards faster bulk-load APIs, but, how to
    make payloads work "well" there is an open question...

    Furthermore, there is a new feature branch (docvalues), which aims to
    handle your use case ("store X per document") well. Maybe check out
    that branch?

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedMay 3, '11 at 9:36a
activeMay 3, '11 at 10:31a
posts2
users2
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase