On 2/25/2011 12:26 AM, Lokendra Singh wrote:
I am seeking guidelines on directly converting an already existing
index into a Lucene index.
The index available to me is a set of <value1, value2> pairs, where
each pair is:
< word , fileName >
i.e. 'value1' is a word, and 'value2' is the fileName containing that
word.
A word might appear in several files, and the same file can contain
multiple copies of a word. For example, the following index is possible:
< "my" , "file1" >
< "you" , "file2" >
< "my", "file2" >
< "my", "file1">
My actual problem is that the index available to me is very large, so I
am a bit reluctant to create a 'Document' object for each file: to do
that I would have to read through all the pairs first and hold them in
memory. Alternatively, I would have to 'update' the 'Document' object
of a particular file while iterating through the pairs of my index, and
this 'update', again, is a costly operation.
Please correct me if my understanding of Lucene is wrong, or suggest
other approaches.
Er, sorry for the blank email, hit the wrong button!
There are basically two ways to do this:
1) Buffer everything in RAM and then write it all at once - this is
probably the quickest way to do it, but also the most resource-intensive
and the most prone to failure (an OOM will lose all work, for example).
2) Iterate through the list, collecting some number of values, and
periodically commit them to the index.
There's not really any other way: you either write it out in chunks or
you write it out all at once. However, there is some leeway in how you
iterate through your old index. Iterating through the entire index,
buffering everything in RAM, and writing it all out at once is, as you
said, probably prohibitively resource-intensive. You could, on the
other hand, iterate through the index collecting values for only one
particular file, commit that file, then iterate again. I would imagine
this is a much slower approach, but it will be less memory-intensive.
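In rough Java, that per-file loop might look something like this (a
fragment only - the open IndexWriter, the distinctFileNames() helper,
and the oldIndexPairs() reader are placeholders for however you access
your old index, with Pair being a trivial (word, fileName) holder):

    // Per-file variant: re-scan the pair list once per file, gathering
    // that file's words, then commit one Document at a time.
    // O(files x pairs) time, but only one file's words in RAM at once.
    for (String file : distinctFileNames()) {
        StringBuilder words = new StringBuilder();
        for (Pair p : oldIndexPairs()) {      // full re-scan per file
            if (p.fileName.equals(file)) {
                // keep duplicates so per-file term frequencies survive
                words.append(p.word).append(' ');
            }
        }
        Document doc = new Document();
        doc.add(new Field("file", file, Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("words", words.toString(), Field.Store.NO,
                Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.commit();                      // one commit per file
    }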
Personally, the way I'd approach this problem is to iterate through the
old index in one pass. Every time I encountered a new file, I'd create
a new Document and store it somewhere (something trivial like a
Map<String, Document> where the key is the filename). I'd also ensure
that the Documents have a field called "file" so that I could easily
query them later. On every iteration I'd continue adding to the
Documents, and every n iterations I'd flush all the Documents to the
index (presumably by calling IndexWriter.updateDocument). By tuning n,
the number of iterations that triggers an index write, you can adjust
the balance between RAM usage and CPU/IO time spent. n=1 would
obviously be the most CPU/IO-intensive, n=infinity would be the most
RAM-intensive, and the "sweet spot" for your requirements is very
probably somewhere between those two points.
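To make that concrete, here's a rough, self-contained sketch against
the Lucene 3.0 API. The Pair class, the oldIndexPairs() reader, the
index path, and the choice of N are all placeholders - adapt them to
however you actually stream your pairs:

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PairIndexConverter {

        /** Placeholder for one (word, fileName) entry from the old index. */
        static class Pair {
            final String word;
            final String fileName;
            Pair(String word, String fileName) {
                this.word = word;
                this.fileName = fileName;
            }
        }

        /** Placeholder: wire up however you actually read your old pairs. */
        static Iterable<Pair> oldIndexPairs() {
            throw new UnsupportedOperationException("supply your pair reader");
        }

        // Flush every N pairs; tune to trade commit overhead against how
        // much unflushed work you'd lose on a crash.
        static final int N = 100000;

        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/new/index")), // placeholder
                    new StandardAnalyzer(Version.LUCENE_30),
                    true,                                             // create fresh index
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // One growing word buffer per file name; duplicates are kept
            // so per-file term frequencies survive the conversion.
            Map<String, StringBuilder> buffers =
                    new HashMap<String, StringBuilder>();

            long i = 0;
            for (Pair p : oldIndexPairs()) {
                StringBuilder words = buffers.get(p.fileName);
                if (words == null) {
                    words = new StringBuilder();
                    buffers.put(p.fileName, words);
                }
                words.append(p.word).append(' ');
                if (++i % N == 0) {
                    flush(writer, buffers);
                }
            }
            flush(writer, buffers);   // whatever is left over
            writer.close();
        }

        static void flush(IndexWriter writer,
                          Map<String, StringBuilder> buffers) throws Exception {
            for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
                Document doc = new Document();
                // "file" is stored and unanalyzed: usable both as the update
                // key and as a value you can read back from search hits.
                doc.add(new Field("file", e.getKey(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("words", e.getValue().toString(),
                        Field.Store.NO, Field.Index.ANALYZED));
                // updateDocument deletes any earlier Document with the same
                // "file" term before adding this one, so re-flushing a
                // buffer that has grown since the last flush is safe.
                writer.updateDocument(new Term("file", e.getKey()), doc);
            }
            writer.commit();
        }
    }

One caveat worth noting: because the buffers survive each flush (the
"words" field isn't stored, so updateDocument needs the complete word
list each time it rebuilds a file's Document), in this form n mostly
tunes how often you pay commit overhead rather than peak RAM. If even
the per-file buffers won't fit in RAM, you'd want to sort the pairs by
file name first, or evict buffers between flushes and accept multiple
Documents per file via addDocument.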
How big is this old index, by the way? Have you run tests to confirm
that the memory or CPU cost of either method is actually a problem? If
you haven't run tests already, I think you may be surprised at the
speeds you get.