Still working on some of the things you ask, namely searching without indexing. I need to modify our code and the general indexing process takes 2 hours, so I won't have a quick turn around on that. We also have a hard time answering the question about items that are normal but do they use more than the 16 MB. The heap dump does not allow us to quickly identify specifics on objects like we show in images below, so we really don't know what the amount of memory is used up in objects of this sort. We only know that byte total for all is at 197891887.
However, I have provided another image that breaks down the memory usage from the heap. A big question we have is that we talk about the 16 mb buffer, but is there other memory used by Lucene beyond that that we should expect to see?http://i39.tinypic.com/o0o560.jpg
we have 197891887 used in byte (anyone we look at is related in some way to the index writer)
we have 169263904 used in char (these are related to the index writer too)
we have 72658944 used in FreqProxTermsWriter$PostingList
we have 37722668 used in RawPostingList
All of these are well over the 16mb. So we are a little lost as to what we should expect to see when we look at the memory usage
I've attached the patch and the CheckIndex files. Unfortunately on the patch I guess my editor made some space line changes, so you get a lot of extra items in the patch that really are not any changes other than tab/space.
If you are open to a live share again, then maybe you could look at this data quicker than the screen shots I send.
From: Michael McCandless
Sent: Monday, May 10, 2010 2:27 AM
To: [email protected]
Subject: Re: IndexWriter and memory usage
Your usage (searching for old doc & updating it, to add new fields) is fine.
But: what memory usage do you see if you open a searcher, and search
for all docs, but don't open an IndexWriter? We need to tease apart
the IndexReader vs IndexWriter memory usage you are seeing. Also, can
you post the output of CheckIndex (java
org.apache.lucene.index.CheckIndex /path/to/index) of your fully built
index? That may give some hints about expected memory usage of IR (eg
if # unique terms is large).
More comments below:
On Thu, May 6, 2010 at 1:03 PM, Woolf, Ross wrote:
Sorry to be so long in getting back on this. The patch you provided has improved the situation but we are still seeing some memory loss. The following are some images from the heap dump. I'll share with you what we are seeing now.
This first image shows the memory pattern. Our fist commit takes place at about 3:54 when the steady trend up takes a drop and the new cycle begins. What we have found is the 2422 fix has made the memory in the first phase before the commit much better (and I'm sure throughout the entire run). But as you can see after the commit then we then again begin to lose memory. One of the pieces of info to know about this is what you are seeing, we have 5 threads that are pushing data to our Lucene plugin. If we drop it down to 1 thread then we are much more successful and can actually index all of our data without running out of memory but at 5 threads it gets into trouble. We still see a trend up in memory usage, but not as severe as when we use the multiple threads.http://tinypic.com/view.php?pic=2w6bf68&s=5
Can you post the output of "svn diff" on the 2.9 code base you're
using? I just want to look & verify all issues we've discussed are
included in your changes. The fact that 1 thread is fine and 5
threads are not still sounds like a symptom of LUCENE-2283.
Also, does that heap usage graph exclude garbage? Or, alternatively,
can you provoke an OOME w/ 512 MB heap and then capture the heap dump
at that point?
There is another piece of the picture that I think might be coming into play. We have plugged Lucene into a legacy app and are subject to how we can get it to deliver the data that we are indexing. In some scenarios (like the one we are having this problem with) we are building our documents progressively (adding fields to the document through the process). What you see before the first commit is the legacy system handing us the first field for many documents. Once we have gotten all of "field 1" for all documents then we commit that data into the index. Then the system starts feeding us "field 2." So we perform a search to see if the document already exists (for the scenario you are seeing it does) and so it retrieves the original document (we store a document ID) and it then adds the new field of data to the existing document and we "update" the document in the index. After the first commit, the rest of the process is one where a document already exist and so the new field is added and and the document is updated. It is in this process that we start rapidly losing memory. The following images show some examples of common areas where memory is being held.http://tinypic.com/view.php?pic=11vkwnb&s=5
This looks like "normal" memory usage of IndexWriter -- these are the
recycled buffers used for holding stored fields. However: the net RAM
used by this allocation should not exceed your 16 MB IW ram buffer
size -- does it?
This one is the byte buffer used by CompoundFileReader, opened by
IndexReader. It's odd that you have so many of these (if I'm reading
this correctly) -- are you certain all opened readers are being
closed? How many segments do you have in your index? Or... are there
many unique threads doing the searching? EG do you create a new
thread for every search or update?
This one is also normal memory used by IndexWriter, but as above, the
net RAM used by this allocation (summed w/ the above one) should not
exceed your 16 MB IW ram buffer size.
As mentioned, we are subject to how we can have the legacy app feed us the data and so this is why we do it this way. We treat this system as a real time system and at anytime the legacy system may send us a field that needs to be added or updated to a document. So we search for the document and if found we either add or update a field if the field is already existing in the document. So I started to wonder if a clue in this memory loss comes from the fact that we are retrieving an existing document and then adding to it and updating.
Now, if we eliminate the updating and simply add each item as a new document (which we did just to test but won't be adequate for our running system), then we still see a slight trend upward in memory usage and the following images show that now most of the memory is consumed in char rather than the byte we saw before. We don't know if this is normal and expected, or if it is something to be concerned about as well.http://tinypic.com/view.php?pic=vfgkyt&s=5
That memory usage is normal -- it's used by the in-RAM terms index of
your opened IndexReader. But I'd like to see the memory usage of
simply opening your IndexReader and searching for documents to update,
but not opening an IndexWriter at all.
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]