Hi,

I am trying to figure out how to solve this problem:

I have about 500,000 files that I would like to index. The files are structured; each file has the following layout:

doc1
token1, weight11, frequency1, weight21
token2, weight12, frequency2, weight22
.
.
.

etc. for all 500,000 docs.

Basically, I would like to index the tokens for each doc. When I search for a token, I would like to be able to return the top docs sorted by weight1, frequency, or weight2.

So, in my naive setup, I loop through the files in the directory, then through the lines of each file. Inside the per-line loop, I call this function:

public Document processKeywords(Document doc, String keyword, Float weight1, Float weight2, Integer frequency) {
    // fields accumulate on the Document passed in (one Document per file);
    // the keyword itself goes into the shared "keywords" field
    doc.add(new Field("keywords", keyword, Field.Store.NO, Field.Index.ANALYZED));
    // one numeric field per stat, with the field name derived from the keyword;
    // all three are stored as floats so they sort with SortField.FLOAT
    doc.add(new NumericField(keyword + "weight1", Field.Store.YES, true).setFloatValue(weight1));
    doc.add(new NumericField(keyword + "weight2", Field.Store.YES, true).setFloatValue(weight2));
    doc.add(new NumericField(keyword + "frequency", Field.Store.YES, true).setFloatValue(frequency));
    return doc;
}

So, for each token, I create three new numeric fields. Notice how I index the keyword itself in the "keywords" field; for the weights and frequency, I create new fields whose names are based on the keyword. On average, I have 100 tokens per document, so each document ends up with about 300 distinct fields.

When running my program, the Lucene portion eats up tons of memory, and when it reaches the maximum allotted to the JVM (I have tried allowing up to 4 GB), the program slows to a crawl. I assume it is spending all of its time in garbage collection due to all these fields.

My code above seems like a very hacky way of accomplishing what I want (sorting documents based on keyword search using different numeric fields associated with that keyword).

FYI, here is the main search code, where q is the token I am searching for and sortby is the field suffix I want to sort on. I set up a QueryParser to search for the keyword in the "keywords" field; then I can extract the stats that I indexed for the given query keyword.

private static final QueryParser parser = new QueryParser(Version.LUCENE_30, "keywords", new StandardAnalyzer(Version.LUCENE_30));

public void search(String q, String sortby) throws IOException, ParseException {
    Query query = parser.parse(q);
    long start = System.currentTimeMillis();
    // sort on the per-keyword numeric field, e.g. "foo" + "weight1"
    TopDocs hits = this.is.search(query, null, 10,
            new Sort(new SortField(q + sortby, SortField.FLOAT, true)));
    long end = System.currentTimeMillis();
    System.out.println("Found " + hits.totalHits +
            " document(s) (in " + (end - start) +
            " milliseconds) that matched query '" + q + "':");
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = this.is.doc(scoreDoc.doc);
        String hash = doc.get("hash");  // stored id field, added elsewhere during indexing
        System.out.println(hash + " " + doc.get(q + sortby));
    }
}

I am pretty new to Lucene, so I hope this makes sense; I tried to pare my problem down as much as possible. Like I said, the main problem is that after processing about 30,000 documents, the indexing slows to a crawl and seems to spend all of its time in the garbage collector. I am looking for a more efficient/effective way of solving this problem. Code tidbits would help, but are not necessary :)

Thanks for your help,
Chris S.


  • Mike Sokolov at May 5, 2011 at 10:01 pm
    Are the tokens unique within a document? If so, why not store a document
    for every doc/token pair with fields:

    id (doc#/token#)
    doc-id (doc#)
    token
    weight1
    weight2
    frequency

    Then search for token, sort by weight1, weight2 or frequency.

    If the token matches are unique within a document you will only get each
    document listed once. If they aren't unique, it's not clear what you
    want to sort by anyway....
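
    A minimal sketch of that layout against the Lucene 3.0 API (the class
    name and the "docId" field are placeholders, not anything from your
    code):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;

    public class PairDocFactory {
        /** Builds one Lucene Document per (source doc, token) pair; the
         *  field names are fixed, so the whole index uses only five fields. */
        public static Document pairDoc(String docId, String token,
                                       float weight1, float weight2, int frequency) {
            Document d = new Document();
            // stored identifier of the source file, indexed as-is
            d.add(new Field("docId", docId, Field.Store.YES, Field.Index.NOT_ANALYZED));
            // exact-match token; use ANALYZED plus a QueryParser instead
            // if you want the token run through an analyzer
            d.add(new Field("token", token, Field.Store.NO, Field.Index.NOT_ANALYZED));
            // the three sortable stats, same field names in every document
            d.add(new NumericField("weight1", Field.Store.YES, true).setFloatValue(weight1));
            d.add(new NumericField("weight2", Field.Store.YES, true).setFloatValue(weight2));
            d.add(new NumericField("frequency", Field.Store.YES, true).setIntValue(frequency));
            return d;
        }
    }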

    -Mike
  • Chris Schilling at May 5, 2011 at 10:09 pm
    Hey Mike,

    Let me clarify:

    The tokens are not unique across documents. Let's say doc1 contains the
    token foo with properties weight1 = 0.75, weight2 = 0.90, frequency = 10.

    Now, let's say doc2 also contains the token foo, with properties
    weight1 = 0.8, weight2 = 0.75, frequency = 5.

    If I search for all the documents that contain foo sorted by frequency,
    I would get doc1, doc2.

    If I search for all the documents that contain foo sorted by weight1,
    I would get doc2, doc1.

    Does that clarify?

  • Mike Sokolov at May 5, 2011 at 10:12 pm
    I think the solution I gave you will work. The only problem would be if a
    token appeared twice in the same doc:

    doc1 has foo with two different sets of weights and frequencies...

    but I think you're saying that doesn't happen.
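
    And the matching search side, as a sketch under the same assumptions
    (the IndexSearcher setup is elided; field names match the indexing
    sketch above):

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class PairSearch {
        /** Top 10 docs containing token, sorted descending by one stats
         *  field: SortField.FLOAT for "weight1"/"weight2", SortField.INT
         *  for "frequency". */
        public static TopDocs topDocsForToken(IndexSearcher searcher, String token,
                                              String sortField, int sortType) throws IOException {
            TermQuery query = new TermQuery(new Term("token", token));
            Sort sort = new Sort(new SortField(sortField, sortType, true)); // true = descending
            return searcher.search(query, null, 10, sort);
        }
    }

    For your foo example, topDocsForToken(searcher, "foo", "frequency",
    SortField.INT) gives doc1, doc2, and topDocsForToken(searcher, "foo",
    "weight1", SortField.FLOAT) gives doc2, doc1.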
  • Chris Schilling at May 5, 2011 at 11:00 pm
    Hey Mike,

    My only concern is that I am replacing a large number of fields inside a Document with a very large number of Documents (~50 million). Won't I run into the same memory issues? Or do I create only one Document object and reuse it? And with so many doc/token pairs, won't searching the index take a lot more time?

    Thanks for your help,
    Chris
  • Michael Sokolov at May 6, 2011 at 10:34 am
    I believe creating a large number of fields is not a good match with the
    underlying architecture, and you'd be better off with a large number of
    documents and a small number of fields, where the same field occurs in
    every document. There is some discussion here:
    http://markmail.org/message/hcmt5syca7zdeac6
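
    On the memory question: the per-pair documents don't all need to live
    on the heap; IndexWriter flushes its buffer to disk as it fills. A
    rough sketch of the indexing loop, assuming Lucene 3.0 and the pairDoc
    helper sketched earlier (the 48 MB buffer and the TokenStats record
    are illustrative, not prescribed):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PairIndexer {
        /** One parsed "token, weight1, frequency, weight2" line. */
        public static class TokenStats {
            public final String docId, token;
            public final float weight1, weight2;
            public final int frequency;
            public TokenStats(String docId, String token,
                              float weight1, float weight2, int frequency) {
                this.docId = docId; this.token = token;
                this.weight1 = weight1; this.weight2 = weight2;
                this.frequency = frequency;
            }
        }

        public static void index(File indexDir, Iterable<TokenStats> records)
                throws IOException {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(indexDir),
                    new StandardAnalyzer(Version.LUCENE_30),
                    true,  // create a new index
                    IndexWriter.MaxFieldLength.UNLIMITED);
            // flush buffered docs to disk past ~48 MB instead of letting
            // them accumulate on the heap
            writer.setRAMBufferSizeMB(48);
            for (TokenStats r : records) {
                // one small Document per pair; nothing is retained afterwards
                writer.addDocument(PairDocFactory.pairDoc(
                        r.docId, r.token, r.weight1, r.weight2, r.frequency));
            }
            writer.close();  // commits and releases the buffer
        }
    }

    Each Document becomes garbage once addDocument returns, so steady-state
    heap use tracks the RAM buffer size rather than the total number of
    pairs.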

    -Mike
  • Chris Schilling at May 5, 2011 at 10:11 pm
    Oh, yes, they are unique within a document. I was also thinking about something like this, but I would be replacing a large number of fields within a document with a large number of documents. Let me see if I can work that out.

