Big data Suggestions
Guys,
I am working on a Lucene index to give some backend processes access to some post-processing-type data. The resulting document looks something like this:


* ProfileID (Long.ToString())
* Delimited array of FKs (int.ToString(), delimited and tokenized)
* Multiple delimited arrays of strings (each array in its own field, delimited and tokenized)
* Delimited array of about 150 ints between 0 and 1600 (int.ToString(), delimited and tokenized)

(This is a bolt-on to an existing app, so we have limited control over its data model; the above document is the best we could come up with to describe our data in a way Lucene might like.)

We have about 130 million of these records. We don't "need" the ProfileID to be indexed except for doing updates, and we won't be storing the array of unique ints.
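
To make that concrete, here is roughly how we build each document (a sketch only; the field names, helper class, and whitespace delimiting are illustrative, not our exact schema):

using System;
using Lucene.Net.Documents;

static class ProfileDocBuilder
{
    public static Document Build(long profileId, int[] fks, int[] codes)
    {
        Document doc = new Document();

        // ProfileID: one un-analyzed term, kept only so updates/deletes can
        // target it; it is never searched directly.
        doc.Add(new Field("profileId", profileId.ToString(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));

        // FK list: space-delimited and analyzed so each FK becomes its own term.
        doc.Add(new Field("fks",
            string.Join(" ", Array.ConvertAll(fks, i => i.ToString())),
            Field.Store.NO, Field.Index.ANALYZED));

        // The ~150 ints between 0 and 1600: indexed for searching, not stored.
        doc.Add(new Field("codes",
            string.Join(" ", Array.ConvertAll(codes, i => i.ToString())),
            Field.Store.NO, Field.Index.ANALYZED));

        return doc;
    }
}

Updates then go through IndexWriter.UpdateDocument(new Term("profileId", id.ToString()), doc) rather than a manual delete-and-re-add.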

My concern is the term dictionary: with all those ProfileIDs in the terms data, that's 130 million terms before we even look at what we actually care to search (the list of FKs and the list of ints between 0 and 1600).

I was wondering if anyone had suggestions on this model, or on ways to manage the potential size of our term list.

Thanks in advance.
Josh Handel
Senior Lead Consultant
512.328.8181 | Main
512.328.0584 | Fax
512.577.6568 | Cell
www.catapultsystems.com

CATAPULT SYSTEMS INC.
ENABLING BUSINESS THROUGH TECHNOLOGY


  • Josh Handel at May 24, 2010 at 6:57 pm
    I hate to ping multiple times in the same day on this, but I wanted to add something real quick.

    With all those unique terms, I am now running out of memory (on my dev box) when indexing. I think this is being caused by the hundreds of thousands of unique terms (about 120,000 according to Luke in the index after it crashed locally). Is there a way to control the memory Lucene uses for caching?
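
    (The only knob I have spotted so far is the writer's RAM buffer; a rough sketch of what I mean, assuming that's even the right lever, though it hasn't seemed to help much:)

    // Sketch only: the path, analyzer, and buffer size below are placeholders.
    var dir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\work\index"));
    var writer = new Lucene.Net.Index.IndexWriter(dir,
        new Lucene.Net.Analysis.WhitespaceAnalyzer(),
        Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED);
    writer.SetRAMBufferSizeMB(64);   // ask Lucene to flush buffered docs/terms at ~64 MB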

    (FYI: I am using Lucene.NET 2.9.2.)

    Thanks
    Josh Handel


  • Digy at May 24, 2010 at 6:59 pm
    Try calling "Commit" periodically (for example, every 10,000 docs).
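
    Roughly like this (just a sketch; the batch size is an example, and 'writer'/'docs' stand in for your IndexWriter and however you enumerate your documents):

    using System.Collections.Generic;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    // Sketch: add documents and commit every N of them so buffered state
    // gets flushed to disk instead of piling up in RAM.
    static void IndexAll(IndexWriter writer, IEnumerable<Document> docs)
    {
        int count = 0;
        foreach (Document doc in docs)
        {
            writer.AddDocument(doc);
            if (++count % 10000 == 0)
                writer.Commit();   // flush buffered docs/terms to disk periodically
        }
        writer.Commit();           // final commit for the remaining documents
    }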
    DIGY

  • Josh Handel at May 25, 2010 at 1:55 pm
    I set the commit to every 1,000 documents and changed to just one thread writing indexes; it still marches right up to an out-of-memory exception.

    I'm also using IndexWriter.SetRAMBufferSizeMB, but it doesn't seem to make a bit of difference either.
    (Here is how I am newing up my writer:)

    Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, new IndexWriter.MaxFieldLength(10000));
    indexWriter.SetRAMBufferSizeMB(128);
    LogByteSizeMergePolicy lbsmp = new LogByteSizeMergePolicy(indexWriter);
    lbsmp.SetMaxMergeMB(5);
    lbsmp.SetMinMergeMB(10240);
    lbsmp.SetMergeFactor(10);
    indexWriter.SetMergePolicy(lbsmp);
    ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
    scheduler.SetMaxThreadCount(15);
    indexWriter.SetMergeScheduler(scheduler);

    This is some pretty tweaky stuff I am doing here (based on Lucene in Action and what I can figure out from the API docs), so if I am doing it wrong I am all ears to learn the right way :-)

    So what other options are there to keep these indexes from blowing up in memory? I don't mind this taking a lot of RAM, as long as it doesn't take so much that it crashes. A way to gate the upper limit of the RAM it uses would be awesome! :-)
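
    (One more knob I have not tried yet, assuming it behaves as documented, is flushing by document count in addition to RAM size:)

    // Sketch: flush whenever 5,000 docs are buffered OR the RAM buffer fills,
    // whichever trigger is hit first (the values here are just examples).
    indexWriter.SetMaxBufferedDocs(5000);
    indexWriter.SetRAMBufferSizeMB(128);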


    Thanks!
    Josh
    PS: I will be out of pocket most of the day, so any suggestions that come in I won't be able to try until tomorrow morning.

  • Digy at May 25, 2010 at 4:44 pm
    It may be related to this issue:
    https://issues.apache.org/jira/browse/LUCENENET-358
    http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/201005.mbox/%3CAANLkTinwf5JCjSqZmBBNCsQ_jHxQJgH4Ktehr-0UyGWF@mail.gmail.com%3E

    Can you try Lucene.Net 2.9.2.2 (in trunk or the 2.9.2 tag; I updated it yesterday)?

    DIGY

  • Josh Handel at May 25, 2010 at 4:57 pm
    Yep, that was it, Digy :-)

    Memory went from blowing past a gig (even with commits) within 30,000 documents to about 60 megs.

    Thanks
    Josh

  • Hans Merkl at May 25, 2010 at 3:46 pm
    I call GC.GetTotalMemory after each document has been added, and if memory
    consumption is above a certain threshold I commit the index. This has worked
    very well for me.
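
    Roughly like this (a sketch; 'writer' is your open IndexWriter and the threshold is whatever fits your box):

    // Sketch: after each add, commit once managed memory crosses a threshold.
    static void AddWithMemoryGuard(Lucene.Net.Index.IndexWriter writer,
                                   Lucene.Net.Documents.Document doc)
    {
        const long thresholdBytes = 500L * 1024 * 1024;   // example: ~500 MB
        writer.AddDocument(doc);
        if (System.GC.GetTotalMemory(false) > thresholdBytes)
            writer.Commit();   // flush buffered docs/terms so memory can be reclaimed
    }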



