Lucene.Net Indexing Large Databases
Hello everyone,

I have been trying to get Lucene.Net to index a large number of documents,
but I have been having trouble when batch processing around 800,000
documents at a time. At first I tried indexing with an FSDirectory as my
index directory, but that process takes incredibly long. I also tried
using a RAMDirectory as my index directory and then merging into an
FSDirectory after every 10,000 new documents. That process would take a
while, use about 1.8 GB of RAM, and eventually crash with an out-of-memory
error. Right now each document is no more than a few words, yet the batch
processing uses up a huge chunk of the system resources.

I would appreciate some help with batch processing large numbers of
documents at a time. Should I batch process using a RAMDirectory and then
merge indexes into an FSDirectory? Does Lucene indexing usually take large
amounts of RAM? Can Lucene.Net handle batch processing millions of
documents at a time?



Thanks,

Chris

Snapstream Media


  • Kauler, Leto S at Aug 29, 2006 at 12:31 am

    Have you checked out IndexWriter's MergeFactor and Min/MaxMergeDocs? We
    only index up to a couple of hundred thousand docs, but we use those
    properties to manage RAM usage versus disk usage, i.e. use more RAM to
    get less frequent disk writes.

    This page is good (or google "lucene mergefactor"):
    http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
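
    As an illustration (a rough sketch, not code from this thread), trading a
    little more RAM for less frequent disk writing looks something like the
    following; the lowercase property names follow the Lucene.Net 1.9
    snippets that appear later in the discussion:

    Lucene.Net.Index.IndexWriter writer = new Lucene.Net.Index.IndexWriter(
        Lucene.Net.Store.FSDirectory.GetDirectory("C:\\Index", true),
        new Lucene.Net.Analysis.Standard.StandardAnalyzer(),
        true );                       // true = create a new index
    writer.mergeFactor = 10;          // segments per merge level; higher = fewer, larger disk merges
    writer.minMergeDocs = 1000;       // documents buffered in RAM before a segment is flushed
    writer.maxMergeDocs = 100000;     // cap on documents in any single merged segment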

    --Leto


  • Chris David at Aug 29, 2006 at 7:00 pm
    Thanks, Leto, for the info. I have tried adjusting mergeFactor and
    minMergeDocs/maxMergeDocs, but I have not had much success. I used the
    settings maxFieldLength = 10, mergeFactor = 10, minMergeDocs = 50, and
    maxMergeDocs = 100000, and ran a batch process on 800,000 documents.
    When I checked on it, it had generated 30 MB in its index folder and
    thrown a System.OutOfMemoryException. I would appreciate any help with
    setting the values so that I can index this large number of documents.
    Also, is there any knowledge of memory problems specific to Lucene.Net?
    When the batch process threw the System.OutOfMemoryException, my basic
    indexing program had used 1.8 GB of RAM. Is this normal, and do I need
    to tweak my mergeFactor or minMergeDocs/maxMergeDocs numbers to fix it?

    Thanks,

    Chris

    Snapstream Media
  • Michael Garski at Aug 29, 2006 at 7:18 pm
    Chris,

    I've never encountered any memory issues with Lucene.Net, but I have
    found that batching into a RAMDirectory and then merging was more of a
    drain on performance than an improvement, due to the optimizations
    inside the merge process. I found it much better to go straight to
    disk. I did some experimentation with a 300,000-document test set and
    found the following settings to give the best performance:

    MergeFactor: 100
    MaxMergeDocs: 1000000
    MinMergeDocs: 100

    What you set the MaxFieldLength to depends on how much information you
    need out of the items you are indexing - I've stuck with the default of
    10000.

    An index creation approach that I have found works great on
    multi-processor machines when you need to index several data sources is
    to create an FSDirectory for each data source, then start a thread for
    each data source to retrieve data and add it to a sub-index. Once all
    the data has been retrieved, merge the file system directories into one
    index. I found that approach to be faster than having multiple threads
    adding documents to a single index, as the single index becomes the
    performance bottleneck. On a four-CPU server with 5 threads creating
    sub-indexes I can index ~2 million documents/hour, and never use more
    than a few hundred MB of RAM at any given time.
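
    A rough sketch of that approach (not code from this thread), assuming
    the Lucene 1.9-era API used elsewhere in the discussion, in particular
    FSDirectory.GetDirectory and IndexWriter.AddIndexes(Directory[]); the
    paths, field names and dummy data are made up for illustration:

    using System.Threading;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    class SubIndexWorker
    {
        private Directory dir;
        private string[] titles;   // stand-in for a real data source

        public SubIndexWorker(Directory dir, string[] titles)
        {
            this.dir = dir;
            this.titles = titles;
        }

        public void Run()
        {
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.mergeFactor = 100;      // settings from the experiment above
            writer.minMergeDocs = 100;
            writer.maxMergeDocs = 1000000;
            foreach (string title in titles)
            {
                Document doc = new Document();
                doc.Add(Field.Text("Title", title));
                writer.AddDocument(doc);
            }
            writer.Close();
        }
    }

    class ParallelIndexer
    {
        static void Main()
        {
            string[][] sources = {
                new string[] { "Show A", "Show B" },
                new string[] { "Show C", "Show D" } };

            Directory[] subDirs = new Directory[sources.Length];
            Thread[] threads = new Thread[sources.Length];

            for (int i = 0; i < sources.Length; i++)
            {
                // one sub-index and one worker thread per data source
                subDirs[i] = FSDirectory.GetDirectory("C:\\Index\\sub" + i, true);
                SubIndexWorker worker = new SubIndexWorker(subDirs[i], sources[i]);
                threads[i] = new Thread(new ThreadStart(worker.Run));
                threads[i].Start();
            }
            foreach (Thread t in threads)
                t.Join();   // wait for every sub-index to finish

            // Merge the finished sub-indexes into one optimized master index.
            IndexWriter master = new IndexWriter(
                FSDirectory.GetDirectory("C:\\Index\\master", true),
                new StandardAnalyzer(), true);
            master.AddIndexes(subDirs);
            master.Optimize();
            master.Close();
        }
    }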

    Hope that helps!

    Mike




  • Chris David at Aug 29, 2006 at 8:58 pm
    Michael, thanks for the suggestion. I have stopped using the
    RAMDirectory and have been using only the FSDirectory. I also used the
    MergeFactor and Min/MaxMergeDocs numbers you suggested. It seems to be
    indexing faster, but I still have the problem of RAM usage going up to
    1.8 GB and then crashing with a System.OutOfMemoryException.

    This is a snippet of the code that I am using to index. I call this in
    a function and sit back and wait and hope that it finishes the batch of
    800,000 documents.


    Lucene.Net.Store.Directory diskDirectory =
        Lucene.Net.Store.FSDirectory.GetDirectory( indexDirectory, true );
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriter diskIndex = new IndexWriter( diskDirectory, analyzer, true );
    diskIndex.maxFieldLength = 100;
    diskIndex.mergeFactor = 100;
    diskIndex.maxMergeDocs = 100000;
    diskIndex.minMergeDocs = 100;

    while( reader.Read() ) {
        // pull the field values for this row
        string title = "";     // get title
        string key = "";       // get key
        string epiTitle = "";  // get episode title

        Document doc = new Document();
        doc.Add( Field.Text("Title", title) );
        doc.Add( Field.Text("Key", key) );
        doc.Add( Field.Text("EpisodeTitle", epiTitle) );
        diskIndex.AddDocument( doc );
        count++;
    }
    diskIndex.Optimize();
    diskIndex.Close();

    Is there something that I am missing? I have tried taking out my data
    reading and just indexing hard-coded strings, and memory usage still
    gets huge. I would appreciate any suggestions.

    Thanks,

    Chris
    Snapstream Media
  • Michael Garski at Aug 29, 2006 at 9:24 pm
    Chris -

    Couple of questions for you:

    What version of Lucene.Net are you using? What Framework version (1.1
    vs. 2.0)?

    How large are the data elements you are indexing? Field.Text stores the
    string inside the index in addition to tokenizing it and indexing it.

    How large is the index on disk when the program crashes?

    When you are indexing hard-coded strings, are you still making the call
    to your data source?

    I've only seen memory go that high when I had a huge volume of documents
    in a RAMDirectory; when indexing directly to disk it stays rather low.
    I doubt this would do much for your memory woes, but it may cut some
    overhead to declare your strings outside of your loop and set the
    values inside of it.
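
    On the Field.Text point above: a hedged sketch of the Lucene 1.9 field
    helpers and what each one does (the field names and values here are
    made up):

    // Hedged sketch of the Lucene 1.9 field helpers (made-up values):
    Document doc = new Document();

    // Stored in the index AND tokenized/indexed (what the snippet above uses):
    doc.Add(Field.Text("Title", "Some show title"));

    // Tokenized/indexed but NOT stored -- smaller index, value not retrievable:
    doc.Add(Field.UnStored("EpisodeTitle", "Some episode title"));

    // Stored and indexed as one untokenized term -- useful for database keys:
    doc.Add(Field.Keyword("Key", "ABC1234567"));

    // Stored only, never searched:
    doc.Add(Field.UnIndexed("Notes", "free-form text kept for display"));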

    Mike

  • Chris David at Aug 29, 2006 at 9:54 pm
    Michael,

    - I am using Lucene.Net 1.9.1 from
    http://incubator.apache.org/lucene.net/download/
    - Using .NET 1.1 framework
    - Developing with VS.NET 2003
    - Each data element is a string. The strings are television show and
    episode titles that range in length up to 50 characters, and database
    keys which are 10 characters long.
    - The index was up to 36 MB the last time I ran it.
    - When I was indexing hard-coded strings, I removed all calls to the
    data source and looped until a counting variable reached 800,000.

    I am wondering if the problem is with .NET's garbage collector. I have
    tried many things and have just kind of hit a wall here.

    I appreciate the help.

    Chris
    Snapstream Media


  • Michael Garski at Aug 29, 2006 at 10:16 pm
    Chris,

    With the very small size of your data sources, a 36 MB index, and memory
    going to 1.8 GB, there is something going on outside of the realm of
    Lucene.

    From your reply it sounds like you were able to index 800,000 strings
    when no data source calls were being made. If that is the case, it
    sounds like there may be an issue in how the data is retrieved - I
    retrieve data in chunks of 5000 from my data source using a
    SqlDataReader, as opposed to retrieving a disconnected DataSet.
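
    For reference, a minimal sketch of streaming rows straight from a
    SqlDataReader into the index (the connection string, SQL and column
    layout are hypothetical, and the writer is assumed to be an
    already-configured IndexWriter):

    using System.Data.SqlClient;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    class DbIndexer
    {
        // writer is assumed to be an already-configured IndexWriter.
        public static void IndexFromDatabase(IndexWriter writer, string connectionString)
        {
            using (SqlConnection conn = new SqlConnection(connectionString))
            {
                conn.Open();
                SqlCommand cmd = new SqlCommand(
                    "SELECT ShowKey, Title, EpisodeTitle FROM Shows", conn);
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())   // one row at a time, never a full DataSet
                    {
                        Document doc = new Document();
                        doc.Add(Field.Keyword("Key", reader.GetString(0)));
                        doc.Add(Field.Text("Title", reader.GetString(1)));
                        doc.Add(Field.Text("EpisodeTitle", reader.GetString(2)));
                        writer.AddDocument(doc);
                    }
                }
            }
        }
    }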

    Hope that helps,

    Mike

  • Chris David at Aug 29, 2006 at 10:39 pm
    Mike,
    I actually was not able to successfully index 800,000 dummy strings;
    I was running into the same memory usage problem. I commented out the
    diskIndex.AddDocument( doc ); line and it ran fine without any memory
    problems. When I add it back, memory usage gets huge. The SQL data
    reader we use grabs the SQL data one row at a time, so I don't think
    that the memory problem is coming from the DataReader.


    Would there be a problem with me trying to index everything all at once
    with this one while loop?

    Thanks,
    Chris

  • Michael Garski at Aug 29, 2006 at 11:36 pm
    Chris -

    I'm unable to duplicate your issue when using either a RAM or a file
    system directory. I used a similar format to your code snippet, adding
    317,000 documents of a similar size to yours to the index, and memory
    never went over 50 MB for a file system directory or 100 MB for a RAM
    directory.

    Mike

  • René de Vries at Aug 30, 2006 at 12:04 pm
    We're also indexing a (large) database. We spent quite a bit of time trying out the various params;
    this is the result from a run that just indexed 500,000 SQL Server records. Once we're done, we expect to run this against a 16-million-record database.

    We index one Keyword field on the primary key so we can quickly do a lookup in the original table, a key for the date field (to use in a range query), an UnStored Title, and an UnStored Story. We do use term vectors.

    We grab 1,000 records at a time, but keep the IndexWriter open.

    ' Parameters for the IndexWriter
    objIndexWriter.mergeFactor = 10
    objIndexWriter.maxMergeDocs = 9999999
    objIndexWriter.minMergeDocs = 1000

    With this, we never get over 80 MB of memory used, indexing to a file system directory.
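
    A rough C# sketch of the document layout described above (the field
    names, the date format, and the Field(..., TermVector) constructor
    reflect my reading of the Lucene 1.9 API rather than René's actual
    code; objIndexWriter is assumed to be the writer that stays open
    across batches):

    // Hedged sketch only -- one document per database record, as described above.
    string primaryKey = "12345";                      // sample values
    System.DateTime recordDate = System.DateTime.Now;
    string title = "Example title";
    string story = "Example story text";

    Document doc = new Document();

    // Untokenized, stored primary key so the original row can be looked up:
    doc.Add(Field.Keyword("Id", primaryKey));

    // Sortable/range-queryable date key; the "yyyyMMdd" format is an assumption:
    doc.Add(Field.Keyword("Date", recordDate.ToString("yyyyMMdd")));

    // Title and Story indexed but not stored, with term vectors enabled
    // (the Field(name, value, Store, Index, TermVector) constructor is the
    // Lucene 1.9 way to request term vectors):
    doc.Add(new Field("Title", title,
        Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
    doc.Add(new Field("Story", story,
        Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));

    objIndexWriter.AddDocument(doc);   // the writer stays open across 1,000-record batches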

    René

  • George Aroush at Aug 30, 2006 at 4:17 am
    Hi Mike,

    You will need to debug this, and a good place to start is to simplify
    your index creation. That is, change your code so that it still extracts
    the data from the DB, but what you add to Lucene is constant text for
    "Title", "Key" and "EpisodeTitle". If the problem still exists, comment
    out the extraction, etc. After this experiment, try indexing only one of
    the fields.

    The basic idea is to narrow down the problem, at which point you will
    have better luck debugging it.

    Let us know how it goes.

    Regards,

    -- George Aroush

  • George Aroush at Aug 30, 2006 at 4:19 am
    Ugh. I meant to say "Chris" and not "Mike". Sorry about this, it's time
    for me to be in bed!

    -- George Aroush

  • Chris David at Aug 30, 2006 at 9:57 pm
    Alrighty everyone, thanks for all your suggestions; I really do
    appreciate all the help. All day I have still been banging my head
    against the wall on this one. I've removed all references to everything
    except Lucene.Net 1.9.1. I have a basic console program whose only job
    is to create an index with 800,000 blank documents. At around 350,000
    documents it has used 1.8 GB of RAM and then crashes with a
    System.OutOfMemoryException.

    I have tried to isolate the problem, and the only thing that fixes it is
    commenting out the indexWriter.AddDocument(doc) call. Creating documents
    does not cause this problem, adding fields does not cause this problem,
    and stepping through SQL does not cause this problem.

    static void Main(string[] args) {
        IndexWriter diskIndex;
        int count;
        Lucene.Net.Store.Directory directory;
        Analyzer analyzer;
        Document doc;
        string indexDirectory;
        System.IO.FileInfo fi;

        indexDirectory = "C:\\Index";

        fi = new System.IO.FileInfo( indexDirectory );
        directory = Lucene.Net.Store.FSDirectory.GetDirectory( fi, true );

        analyzer = new SimpleAnalyzer();
        diskIndex = new IndexWriter( directory, analyzer, true );
        diskIndex.mergeFactor = 10;
        diskIndex.maxMergeDocs = 100000;
        diskIndex.minMergeDocs = 1000;

        count = 0;

        // Add 800,000 completely empty documents -- no fields, no data source.
        while( count < 800000 ) {
            doc = new Document();
            diskIndex.AddDocument( doc );
            count++;
        }
        diskIndex.Optimize();
        diskIndex.Close();
    }

    This is what I finally stripped it down to. Here, commenting out the
    AddDocument( doc ) call does not create memory problems, but with it in
    place the memory usage climbs until the crash.

    I am currently using Visual Studio .NET 2003 with the .NET 1.1
    Framework. The version of Lucene.Net I am using is 1.9.1. Do I need to
    be working with .NET 2.0, or using an older version of Lucene?

    I'm hoping that someone might be able to help with this one.

    Thanks Everybody,
    Chris
    Snapstream Media
  • Spam at Aug 31, 2006 at 8:03 am
    At the risk of embarrassing myself: on a much smaller scale I had this
    problem. Then I realised that each time I created a new document I was
    appending new record values to a variable I was passing in to the
    Document object. The field in each record therefore became increasingly
    large and took longer to analyse, eating RAM.
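
    In code terms, the bug looked roughly like this (a made-up illustration,
    not Doug's actual code; reader and writer stand for the data reader and
    IndexWriter):

    // Buggy version: 'body' is never reset, so each document re-indexes all
    // previous records plus its own, and analysis gets slower and hungrier.
    string body = "";
    while (reader.Read())
    {
        body += reader.GetString(1);          // value keeps growing
        Document doc = new Document();
        doc.Add(Field.Text("Story", body));
        writer.AddDocument(doc);
    }

    // Fixed version: build the value fresh for every record.
    while (reader.Read())
    {
        string story = reader.GetString(1);   // per-record value only
        Document doc = new Document();
        doc.Add(Field.Text("Story", story));
        writer.AddDocument(doc);
    }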

    Anyway, I figured I'd either help, or at least give you a silly anecdote
    worthy of a DailyWTF!

    Regards,
    Doug

