I'm trying to index 10 million small XML-like documents, extracted
from an Oracle database.

Lucene version is 1.2, and I'm using RedHat 7.0 Advanced Server,
on an AMD XP1800+ with 1GB RAM and 46GB+120GB hard disks. The
database is on a separate machine, connected by thin JDBC.

Each document consists of around 10 fields of the form


<key>0001234567</key>
<field1>some text</field1>
<field2>more text</field2>
...
<field10> again</field10>

Each document is typically only around 2K in size. Each field is
free-text indexed, but only the "key" field is stored.
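In Lucene terms, each record becomes a Document roughly along the lines of
the sketch below. This is a simplified illustration rather than my exact
code; the field names are just the generic ones from the example, and the
Map holder is invented for the sketch.

import java.util.Iterator;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// One database record -> one Lucene Document.  Only "key" is stored,
// so it can be read back from search hits; the free-text fields are
// indexed and tokenised but not stored.
static Document makeDocument(String key, Map freeTextFields) {
    Document doc = new Document();
    doc.add(Field.Keyword("key", key));                 // stored + indexed, untokenised
    for (Iterator it = freeTextFields.entrySet().iterator(); it.hasNext();) {
        Map.Entry e = (Map.Entry) it.next();
        doc.add(Field.UnStored((String) e.getKey(),     // indexed + tokenised,
                               (String) e.getValue())); // not stored
    }
    return doc;
}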

After experimenting, I've set:
  - Java memory to 750MB
  - writer.mergeFactor = 10000
and run an optimize every 50,000 documents to try to keep down the
number of open files.
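
In outline, the indexing loop looks something like the sketch below, reusing
the makeDocument() helper sketched above. The JDBC URL, SQL, column names and
index path are placeholders, not the real ones.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public static void main(String[] args) throws Exception {
    // Classic Oracle thin driver; connection details are placeholders
    Class.forName("oracle.jdbc.driver.OracleDriver");
    Connection con = DriverManager.getConnection(
            "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(
            "SELECT key, field1, field2 FROM documents");

    IndexWriter writer =
            new IndexWriter("/index/docs", new StandardAnalyzer(), true);
    writer.mergeFactor = 10000;        // public field in Lucene 1.2

    int count = 0;
    while (rs.next()) {
        Map fields = new HashMap();
        fields.put("field1", rs.getString("field1"));
        fields.put("field2", rs.getString("field2"));
        writer.addDocument(makeDocument(rs.getString("key"), fields));

        if (++count % 50000 == 0) {
            writer.optimize();         // merge segments, keep open file count down
        }
    }

    writer.optimize();
    writer.close();
    stmt.close();
    con.close();
}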

So far, the furthest I've managed to get is 3 million docs, which used:

  - 22,913 simultaneous open files (the maximum on Windows is 2035!)
  - Up to 79 GB of disk space
  - Around 18 hours of indexing time

Is what I'm trying to do realistic? Does anyone else have experience
of indexing collections this large?

My indexing code, which is based on the simple file demo, can be seen
at http://www.rogerford.info/docs/IndexDBData_java.txt

- Many thanks,

Roger.


  • Martin Sevigny at Jul 28, 2003 at 4:18 pm
    Roger,

    Just to double-check...
    Each document is typically only around 2K in size. Each field is
    free-text indexed, but only the "key" field is stored.
    After experimenting, I've set
    Java memory to 750MB
    writer.mergeFactor = 10000
    - and run an optimize every 50,000 documents to try to keep down the
    number of open files.

    So far, the furthest I've managed to get is 3 million docs,
    which used

    22,913 simultaneous open files (the maximum on Windows is 2035!)
    Up to 79 GB of disk space
    Around 18 hours of indexing time
    You get a 79GB index from 3 million x 2KB documents? The documents
    themselves come to 6GB or so, which means that your index takes 13 times
    the space needed to store the documents (if they were stored as files).

    For a single stored field, this factor is pretty hard to believe. Am I
    missing something?

    Martin Sévigny


