indexing 100GB of data
hello all

We've got 100 GB of data in various formats (doc, txt, pdf, ppt, etc.), with a
separate parser for each file format, and we're going to index the data with
Lucene. (We didn't use Nutch because we were scared of its setup.) My
question is: will it scale when I index those documents? We plan to build a
separate index for each file format, and to use a multi-index reader for
searching. Please advise:

1. Are we going about this the right way?
2. Please advise on mergeFactor and segments.
3. How large an index can Lucene handle?
4. Will it cause a Java OOM?
--
View this message in context: http://www.nabble.com/indexing-100GB-of-data-tp24600563p24600563.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
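On question 2, mergeFactor controls how many same-sized segments accumulate before Lucene merges them into one larger segment, so the number of live segments grows roughly logarithmically with the number of flushes. This toy simulation of a logarithmic merge scheme (a simplified model for intuition, not Lucene's actual merge policy) shows why even millions of flushes leave few segments:

```java
public class MergeFactorModel {
    // Toy model: each flush creates one level-0 segment; whenever
    // mergeFactor segments pile up at a level, they merge into a single
    // segment at the next level. The live segment count is then the
    // base-mergeFactor digit sum of the number of flushes.
    static int segmentCount(long flushes, int mergeFactor) {
        int count = 0;
        while (flushes > 0) {
            count += flushes % mergeFactor; // leftover segments at this level
            flushes /= mergeFactor;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(segmentCount(1_000_000, 10)); // prints 1
        System.out.println(segmentCount(999, 10));       // prints 27
    }
}
```

In this model a small mergeFactor keeps fewer segments on disk (faster searches, but more merge I/O while indexing), while a large one defers merge cost at the price of searching across more segments.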


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


  • Shai Erera at Jul 22, 2009 at 7:25 am
From my experience, you shouldn't have any problems indexing that amount of
content, even into one index. I've successfully indexed 450 GB of data with
Lucene, and I believe it can scale much higher when rich-text documents are
indexed. Though I haven't tried it yet, I believe it can scale into the 1-5 TB
range on a modern CPU and disk with enough RAM.

Usually, when rich-text documents are involved, considerable time is spent
converting them into raw text. The raw text extracted from a rich-text
document (PDF, DOC, HTML) is usually (based on my measurements) 15-20% of its
original size, and that is compressed even further when added to Lucene.
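Applying that 15-20% figure to the 100 GB corpus gives a back-of-envelope bound on the extracted-text volume (an estimate only, before Lucene's own compaction shrinks it further):

```java
public class SizeEstimate {
    // Extracted raw text as a fraction of the rich-text original,
    // using the 15-20% range quoted above.
    static double rawTextGb(double richTextGb, double ratio) {
        return richTextGb * ratio;
    }

    public static void main(String[] args) {
        double corpusGb = 100.0;
        // roughly 15-20 GB of raw text to index
        System.out.println(rawTextGb(corpusGb, 0.15));
        System.out.println(rawTextGb(corpusGb, 0.20));
    }
}
```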

I hope this helps. BTW, you can always just try indexing that amount of
content into one index on your machine and decide whether the machine can
handle that amount of data.

    Shai
  • M.harig at Jul 22, 2009 at 8:29 am
Thanks, Shai.

    So there won't be a problem when searching that kind of large index,
    am I right?

    Can anyone tell me whether it's possible to use Hadoop with Lucene?
  • Prashant ullegaddi at Jul 22, 2009 at 8:39 am
Yes, you can use Hadoop with Lucene. Borrow some code from Nutch: look at
    org.apache.nutch.indexer.IndexerMapReduce and
    org.apache.nutch.indexer.Indexer.

    Prashant.
  • Shai Erera at Jul 22, 2009 at 10:06 am
There shouldn't be a problem searching such an index. It depends on the
    machine you use; if it's a strong enough machine, I don't think you'll
    have any problems.

    But like I said, you can always try it out on your machine before you
    make a decision.

    Also, Lucene has a benchmark package which includes some indexing and
    search algorithms through which you can test the performance on your
    machine.
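For reference, the benchmark package (contrib/benchmark in Lucene's source tree) is driven by small ".alg" algorithm files. Below is a minimal sketch; the task names come from the byTask framework, but treat the exact property values and document source as assumptions you'd adapt to your own corpus:

```
# minimal contrib/benchmark algorithm file (a sketch, not verified config)
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

ResetSystemErase
CreateIndex
{ "AddDocs" AddDoc > : 10000
CloseIndex
OpenReader
{ "SearchSameRdr" Search > : 500
CloseReader
RepSumByName
```

You would run it with the benchmark driver class (org.apache.lucene.benchmark.byTask.Benchmark), passing the .alg file as an argument, and read the per-task timing report it prints.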
  • M.harig at Jul 22, 2009 at 12:45 pm
Is there any article or forum about using Hadoop with Lucene? Please, can
    anyone help me?
  • Phil Whelan at Jul 22, 2009 at 4:45 pm

    Hi M,

Katta is a project that combines Lucene and Hadoop. Check it out here:
    http://katta.sourceforge.net/

    Thanks,
    Phil

  • Steven A Rowe at Jul 22, 2009 at 4:56 pm
    You may also be interested in Andrzej Bialecki's patch to Solr that provides distributed indexing using Hadoop:

    https://issues.apache.org/jira/browse/SOLR-1301

    Steve
  • M.harig at Jul 23, 2009 at 7:41 am
Thanks, all.

    I'm very grateful to everyone. I'm tired of the Hadoop setup. Is it
    reasonable to read such a large index with Lucene alone? Will it run
    into OOM? Anyone, please advise.
  • Shai Erera at Jul 23, 2009 at 8:26 am
Generally, you shouldn't hit OOM, but it depends on how you use the index.
    For example, if you have millions of documents spread across the 100 GB
    and you sort on various fields, that will consume lots of RAM. Likewise,
    if you run hundreds of queries in parallel, each with a dozen terms, that
    will also consume a considerable amount of RAM.

    But if you don't do anything extreme with it, and you can allocate enough
    heap, then you should be OK.
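To put a rough number on the sorting cost: sorting on a field keeps one cached value per document per sorted field in memory. A back-of-envelope lower bound (an assumption for illustration; string fields cost considerably more per entry than the fixed width used here) can be computed as:

```java
public class SortMemoryEstimate {
    // Rough lower bound: one cached value per document, per sort field.
    static long sortCacheBytes(long numDocs, int bytesPerValue, int sortFields) {
        return numDocs * (long) bytesPerValue * sortFields;
    }

    public static void main(String[] args) {
        // 10M docs, 4-byte numeric values, 3 sorted fields
        long bytes = sortCacheBytes(10_000_000L, 4, 3);
        System.out.println(bytes); // prints 120000000 (~120 MB of heap)
    }
}
```

Estimates like this help size the heap before testing: with millions of documents and several sort fields, the cache alone can claim hundreds of megabytes.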

The way I make such decisions is to design a test which mimics the
    typical/common scenario I expect to face, run it on a machine as close as
    possible to the one that will be used in production, and analyze the
    results.

    If you choose to do that and you're not satisfied with the results,
    you're welcome to post back with the machine statistics and your exact
    use case; I believe there are plenty of folks here who'd be willing to
    help you optimize your app's usage of Lucene. Or at least then we'll be
    able to tell you: "on this machine, you cannot run a 100 GB index".

    Shai
  • Jamie at Jul 22, 2009 at 12:50 pm
Hi there,

    We have Lucene searching across several terabytes of email data with
    no problem at all.

    Regards,

    Jamie





    --
    Stimulus Software - MailArchiva
    Email Archiving And Compliance
    USA Tel: +1-713-343-8824 ext 100
    UK Tel: +44-20-80991035 ext 100
    Email: [email protected]
    Web: http://www.mailarchiva.com
    To receive MailArchiva Enterprise Edition product announcements, send a message to: <[email protected]>


  • Dan OConnor at Jul 22, 2009 at 2:12 pm
    Hi Jamie,

I would appreciate it if you could provide details on the hardware/OS you
    are running this system on, what kind of search response times you are
    getting, and how you add email data to your index.

    Thanks,
    Dan






Discussion overview: group java-user @ lucene.apache.org · posted Jul 22, 2009 at 6:07 AM · last activity Jul 23, 2009 at 8:26 AM · 12 posts · 7 users.