Hi all,

I am looking for guidelines on converting an existing index directly into a
Lucene index.
The index available to me is a set of <value1, value2> pairs, where each
pair is:
< word , fileName >
i.e. 'value1' is a word and 'value2' is the name of a file containing
that word.

A word may appear in several files, and a single file may contain
multiple occurrences of a word. For example, the following index is possible:
< "my" , "file1" >
< "you" , "file2" >
< "my", "file2" >
< "my", "file1">

My actual problem is that the index available to me is very large, so I am
a bit reluctant to create a 'Document' object for each file: to do that I
would have to read through all the pairs first and hold them in memory.
Alternatively, I would have to 'update' the 'Document' object of a particular
file while iterating through the pairs of my index, and this 'update' is
also a costly operation.

Please correct me if my understanding of Lucene is wrong, or suggest
alternative approaches.

Regards
Lokendra

  • Findbestopensource at Feb 25, 2011 at 7:05 am
    Hello Lokendra,

    You could update frequently. Anyway, I think it is a one-time job.

    My advice would be to do the insertions and updates in batches:
    1. Parse your file and read 1000 lines.
    2. Do some aggregation, then insert / update with Lucene.
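
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `PairBatcher` and its methods are hypothetical names, and the on-disk line format (`< "word" , "fileName" >`, one pair per line) is guessed from the example pairs in the question. The sketch reads up to a fixed number of pairs and aggregates them per file; the actual Lucene insert / update step is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class PairBatcher {

    // Parses one line such as: < "my" , "file1" >  into {word, fileName}.
    // The line format is an assumption based on the example pairs.
    static String[] parsePair(String line) {
        String inner = line.trim();
        inner = inner.substring(1, inner.length() - 1); // drop '<' and '>'
        String[] parts = inner.split(",", 2);
        return new String[] { unquote(parts[0]), unquote(parts[1]) };
    }

    private static String unquote(String s) {
        return s.trim().replace("\"", "");
    }

    // Reads up to batchSize pairs and aggregates them as
    // fileName -> (word -> occurrences within this batch).
    // Returns an empty map once the input is exhausted.
    static Map<String, Map<String, Integer>> readBatch(BufferedReader reader, int batchSize) {
        Map<String, Map<String, Integer>> byFile = new LinkedHashMap<>();
        try {
            String line;
            int read = 0;
            while (read < batchSize && (line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] pair = parsePair(line);
                byFile.computeIfAbsent(pair[1], f -> new LinkedHashMap<>())
                      .merge(pair[0], 1, Integer::sum);
                read++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return byFile;
    }
}
```

Calling `readBatch(reader, 1000)` in a loop until it returns an empty map walks the whole input while holding only one batch in memory. Each returned map covers only the current batch, so a file that also appeared in an earlier batch would still need its earlier words merged in when it is updated.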

    Regards
    Aditya
    www.findbestopensource.com


  • Edward Drapkin at Feb 25, 2011 at 8:58 am

    Er, sorry for the blank email, hit the wrong button!

    There are basically two ways to do this:

    1) Buffer everything in RAM and then write all at once - this is
    probably the quickest way to do it, but the most resource intensive and
    prone to failure (OOM will lose all work, for example).
    2) Iterate through the list, collecting some number of values and then
    periodically committing them to the index.

    There's not really any other way: you either write it out in chunks or
    you write it out all at once. However, there is some leeway in how you
    iterate through your old index. Iterating through the entire index and
    buffering everything in RAM and writing it all out at once is, like you
    said, probably prohibitively resource intensive. You could, on the
    other hand, iterate through the index and only collect values for a
    particular file, then commit that, then iterate again. I would imagine
    this is a much slower approach, but it will be less memory intensive.
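
A minimal sketch of the "one file per pass" variant described above, assuming the old index can be re-read from the start as often as needed. `PerFilePass` and `wordsFor` are hypothetical names, and each pair is modelled as a two-element array `{word, fileName}`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class PerFilePass {

    // Scans all pairs once and keeps only the words recorded for targetFile.
    // Every call restarts the scan via pairs.iterator().
    static List<String> wordsFor(Iterable<String[]> pairs, String targetFile) {
        List<String> words = new ArrayList<>();
        for (String[] pair : pairs) {
            if (pair[1].equals(targetFile)) {
                words.add(pair[0]); // duplicates are kept, one per occurrence
            }
        }
        return words;
    }
}
```

Only the words of one file are held in memory at a time, but the old index is scanned once per file, so the total work grows with (number of files) x (number of pairs).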

    Personally, the way I'd approach this problem: I'd iterate through the
    old index in one pass. Every time I encountered a new file, I'd create
    a new Document and store it somewhere (something trivial like
    Map<String, Document> where the key is the filename). I'd also ensure
    that the Documents have a field called "file" so that I could easily
    query them later. On every iteration I'd continue to add to the
    Documents, and every n iterations I'd commit all the Documents to the
    index (presumably by calling IndexWriter.updateDocument). By tuning n,
    the number of iterations between index writes, you can adjust the
    balance between RAM usage and CPU/IO time spent. n=1 would obviously be
    the most CPU/IO intensive and n=inf the most RAM intensive; the "sweet
    spot" for your requirements is very probably somewhere between those
    two points.
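
The collect-and-commit loop described above might look roughly like the sketch below. It deliberately avoids the Lucene API so that it stays self-contained: `CollectAndCommit` is a hypothetical class, the per-file word list stands in for a Document, and the `sink` callback stands in for whatever write is actually used (for example a call to IndexWriter.updateDocument keyed on the "file" field). It also assumes the write replaces any earlier version of the same file, so only files touched since the last flush are rewritten.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

// Hypothetical sketch; the sink stands in for the real index write.
class CollectAndCommit {
    private final Map<String, List<String>> wordsByFile = new HashMap<>();
    private final Set<String> dirty = new HashSet<>(); // files changed since last flush
    private final BiConsumer<String, List<String>> sink;
    private final int n;
    private int sinceFlush = 0;

    CollectAndCommit(int n, BiConsumer<String, List<String>> sink) {
        this.n = n;
        this.sink = sink;
    }

    // Records one <word, fileName> pair and flushes after every n pairs.
    void add(String word, String fileName) {
        wordsByFile.computeIfAbsent(fileName, f -> new ArrayList<>()).add(word);
        dirty.add(fileName);
        if (++sinceFlush >= n) {
            flush();
        }
    }

    // Writes the full word list of every changed file; call once more at the end.
    void flush() {
        for (String fileName : dirty) {
            sink.accept(fileName, new ArrayList<>(wordsByFile.get(fileName)));
        }
        dirty.clear();
        sinceFlush = 0;
    }
}
```

Because each write carries the complete word list for a file, this version keeps every file's words in memory and trades only write frequency against n. If the pairs happened to be sorted by file name, an entry could instead be dropped as soon as the next file starts, which would also bound memory.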

    How big is this old index, by the way? Have you run tests to ensure
    that the memory limit or cpu cost in either method is actually a
    problem? I think you may be surprised at the speeds you get, if you
    haven't run tests already.

    Thanks,
    Eddie

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Lokendra Singh at Feb 25, 2011 at 9:56 am
    Eddie and Aditya: Thanks a lot for your suggestions!

    The collect-and-commit approach looks good.

    @Eddie: My old index is on the order of ~1 million pairs. I haven't
    run any such tests yet, but I will run them soon for analysis.

    Regards
    Lokendra


Discussion Overview
group: java-user
categories: lucene
posted: Feb 25, '11 at 6:27a
active: Feb 25, '11 at 9:56a
posts: 4
users: 3
website: lucene.apache.org
