I am indexing different document formats with Lucene 1.9. One of the PDF
files I am indexing is 300MG. Whenever the index writer hits that file,
indexing stops with an "Out of Memory" exception. I am using the PDFBox
library to extract the text. I have set the following merge factors in my code.

writer.setMergeFactor(1000);
writer.setMaxMergeDocs(9999999);
writer.setMaxBufferedDocs(1000);
writer.setMaxFieldLength(Integer.MAX_VALUE);

I would appreciate any help or suggestions.

thanks,
suba suresh.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Rob Staveley (Tom) at Jul 13, 2006 at 2:23 pm
    If you are using
    http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument),
    you are going to get a large String and may need a 1G heap.

    If, however, you are using
    http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer)
    to go via a temporary file, you will not need so much RAM, but you need to use
    http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader)
    to construct your Lucene field (rather than
    http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
  • Suba Suresh at Jul 13, 2006 at 2:31 pm
    Thanks.

    I am using the getText(PDDocument) method of the PDFTextStripper. I will
    try the other suggestion.

    suba suresh.
  • Rob Staveley (Tom) at Jul 13, 2006 at 3:17 pm
    Let us know how you get on. There are a lot of people fighting very similar
    battles on this list.
  • Suba Suresh at Jul 13, 2006 at 3:39 pm
    Definitely. Thanks for both the suggestions. Yes, it is 300MB (typo).

    suba suresh.
  • Suba Suresh at Jul 26, 2006 at 5:15 pm
    Sorry for my late response. It took us some time to run it again. We
    increased the heap to 1G as you suggested and it works. The
    indexer is no longer crashing. (We are running into another problem with a
    PowerPoint file. That is for another email.)

    The code change to use
    PDFTextStripper.writeText(org.pdfbox.pdmodel.PDDocument, java.io.Writer)
    did not work for us.


    Thanks for all the help.

    suba suresh.
  • Ben Litchfield at Jul 13, 2006 at 3:15 pm
    By 300MG I assume you mean 300MB.

    You can also try extracting the text outside of Lucene by using a
    PDFBox command line app.

    java org.pdfbox.ExtractText <pdffile>

    You may need to increase the JRE memory like this:

    java -Xmx512m org.pdfbox.ExtractText <pdffile>

    OR

    java -Xmx1024m org.pdfbox.ExtractText <pdffile>


    If this still gives you an out of memory error, then it is possibly
    an issue with PDFBox. If that is the case, please create an issue
    and attach/upload the PDF on the PDFBox site.


    Ben
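
The heap advice in this thread (-Xmx512m, -Xmx1024m, "may need a 1G heap") can be sanity-checked from inside the JVM. The following is only a rough sketch: the class name HeapCheck and its methods are hypothetical, the 2-bytes-per-char figure is a lower bound that ignores object overhead and intermediate copies, and the character count is an assumed input because the thread does not say how much text the PDF contains.

```java
/** Hypothetical helper: reports the heap limit and a rough String size estimate. */
final class HeapCheck {
    private HeapCheck() {}

    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Rough lower bound: 2 bytes per char (UTF-16), ignoring overhead and copies. */
    static long estimateStringBytes(long charCount) {
        return charCount * 2L;
    }

    /** True if the estimated String size is below the heap limit. */
    static boolean likelyFits(long charCount) {
        return estimateStringBytes(charCount) < maxHeapBytes();
    }

    public static void main(String[] args) {
        // Assumed input: number of characters of extracted text.
        long chars = args.length > 0 ? Long.parseLong(args[0]) : 100_000_000L;
        long mb = 1024L * 1024L;
        System.out.printf("max heap: %d MB%n", maxHeapBytes() / mb);
        System.out.printf("text of %d chars needs at least %d MB%n",
                chars, estimateStringBytes(chars) / mb);
        System.out.println("likely fits: " + likelyFits(chars));
    }
}
```

Running it as `java -Xmx1024m HeapCheck 300000000` shows whether the -Xmx flag actually took effect and whether a single String of that length could plausibly fit.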
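
The temporary-file approach Rob Staveley describes can be sketched as follows. This is a minimal sketch using only the Java standard library, not the code used in the thread: TextSink, TempFileSpool, spool and open are hypothetical names. With PDFBox, the TextSink would wrap the PDFTextStripper.writeText(PDDocument, Writer) call from Rob's link, and the returned Reader would be passed to the Field(String, Reader) constructor. Note that Suba Suresh reported this change did not work for them, and the thread does not say why.

```java
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical stand-in for any extractor that can stream text to a Writer. */
interface TextSink {
    void writeTo(Writer out) throws IOException;
}

final class TempFileSpool {
    private TempFileSpool() {}

    /**
     * Streams extracted text into a temporary file, so the full text
     * never has to exist as one String on the heap.
     */
    static File spool(TextSink source) throws IOException {
        File tmp = File.createTempFile("extracted-", ".txt");
        tmp.deleteOnExit();
        try (Writer out = Files.newBufferedWriter(tmp.toPath(), StandardCharsets.UTF_8)) {
            // With PDFBox this would be: stripper.writeText(pdDocument, out)
            source.writeTo(out);
        }
        return tmp;
    }

    /** Opens the spooled text as a Reader, e.g. for Lucene's Field(String, Reader). */
    static Reader open(File spooled) throws IOException {
        return Files.newBufferedReader(spooled.toPath(), StandardCharsets.UTF_8);
    }
}
```

A caller would spool the text first, then open the Reader and keep it open until the document has been added to the index, since a Reader-based field is consumed while the document is being indexed. A field built from a Reader is indexed but not stored.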

Discussion Overview
group: java-user
category: lucene
posted: Jul 13, '06 at 1:55p
active: Jul 26, '06 at 5:15p
posts: 7
users: 3
website: lucene.apache.org