Hi everybody,

I am running into a problem during the indexing process. I am indexing large
amounts of text, most of it in PDF format, using PDFBox version 0.6. Before
the indexing process begins I have around 120 GB of free disk space, but even
though my Lucene index is not yet 300 MB, the disk runs out of free space.
Even more surprising, when I stop the indexing process the free space quickly
goes back up to 120 GB. How can this happen if I am not copying the documents
to disk? I am indexing on a Linux machine. I have been thinking it could be
temporary files from something, maybe PDFBox?
Could you help me, please?
Greetings


  • Ben Litchfield at Dec 4, 2006 at 2:45 pm
    PDFBox version 0.6 is quite old and there have been many improvements
    since; you should look at moving to the newest version, 0.7.3, although
    from the description of your problem that alone probably would not
    resolve it.

    If there are a large number of temp files with "pdfbox" in the name, then
    you are most likely not calling close() on the PDDocument object. How
    are you adding the documents to the index? There is a simple helper
    class called org.pdfbox.searchengine.lucene.LucenePDFDocument that you
    may find useful.
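
    For illustration, a minimal sketch of that close() pattern (assuming the
    0.7.x API, where PDDocument.load(InputStream) and
    PDFTextStripper.getText(PDDocument) are available):

    // Sketch (assumed PDFBox 0.7.x API): close the PDDocument in a finally
    // block so PDFBox can delete the scratch/temporary files it creates.
    PDDocument pdDoc = null;
    try {
        pdDoc = PDDocument.load(is);
        String text = new PDFTextStripper().getText(pdDoc);
        // ... build and add the Lucene Document from "text" here ...
    } finally {
        if (pdDoc != null) {
            pdDoc.close();
        }
    }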

    Ben


  • Ariel Isaac Romero Cartaya at Dec 5, 2006 at 3:53 pm
    Here is the source code where I convert PDF files to text for indexing. I
    took it from the Lucene in Action examples and adapted it for my needs. I
    hope you can help me fix this problem; if you know a more efficient way
    to do it, please tell me:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.pdfbox.cos.COSDocument;
    import org.pdfbox.encryption.DecryptDocument;
    import org.pdfbox.exceptions.CryptographyException;
    import org.pdfbox.exceptions.InvalidPasswordException;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdmodel.PDDocumentInformation;
    import org.pdfbox.util.PDFTextStripper;

    import cu.co.cenatav.kernel.parser.DocumentHandler;
    import cu.co.cenatav.kernel.parser.DocumentHandlerException;
    import cu.co.cenatav.kernel.parser.schema.SchemaExtractor;

    public class PDFBoxPDFHandler implements DocumentHandler {

        public static String password = "-password";

        public Document getDocument(InputStream is)
                throws DocumentHandlerException {

            COSDocument cosDoc = null;
            try {
                cosDoc = parseDocument(is);
            }
            catch (IOException e) {
                closeCOSDocument(cosDoc);
                throw new DocumentHandlerException("Cannot parse PDF document", e);
            }

            // decrypt the PDF document, if it is encrypted
            try {
                if (cosDoc.isEncrypted()) {
                    DecryptDocument decryptor = new DecryptDocument(cosDoc);
                    decryptor.decryptDocument(password);
                }
            }
            catch (CryptographyException e) {
                closeCOSDocument(cosDoc);
                throw new DocumentHandlerException("Cannot decrypt PDF document", e);
            }
            catch (InvalidPasswordException e) {
                closeCOSDocument(cosDoc);
                throw new DocumentHandlerException("Cannot decrypt PDF document", e);
            }
            catch (IOException e) {
                closeCOSDocument(cosDoc);
                throw new DocumentHandlerException("Cannot decrypt PDF document", e);
            }

            // extract the PDF document's textual content; a single PDDocument
            // is created here, reused for the metadata below, and closed at
            // the end so PDFBox can delete its temporary files
            PDDocument pdDoc = null;
            String bodyText = null;
            try {
                pdDoc = new PDDocument(cosDoc);
                PDFTextStripper stripper = new PDFTextStripper();
                bodyText = stripper.getText(pdDoc);
            }
            catch (IOException e) {
                closePDDocument(pdDoc);
                closeCOSDocument(cosDoc);
                throw new DocumentHandlerException("Cannot parse PDF document", e);
            }

            Document doc = new Document();
            if (bodyText != null) {

                // extract the PDF document's metadata
                PDDocumentInformation docInfo = null;
                try {
                    docInfo = pdDoc.getDocumentInformation();
                }
                catch (Exception e) {
                    System.err.println("Cannot extract metadata from PDF: "
                            + e.getMessage());
                }

                SchemaExtractor schemaExtractor = new SchemaExtractor(bodyText);

                String author = null;
                if (docInfo != null) {
                    author = docInfo.getAuthor();
                }

                if (author == null || author.equals("")) {
                    // TODO: implement the schemaExtractor component
                    List authors = schemaExtractor.getAuthor();
                    Iterator it = authors.iterator();
                    while (it.hasNext()) {
                        String authorName = (String) it.next();
                        doc.add(new Field("author", authorName, Field.Store.YES,
                                Field.Index.TOKENIZED, Field.TermVector.YES));
                    }
                }
                else {
                    doc.add(new Field("author", author, Field.Store.YES,
                            Field.Index.TOKENIZED, Field.TermVector.YES));
                }

                String title = null;
                if (docInfo != null) {
                    title = docInfo.getTitle();
                }
                if (title == null || title.equals("")) {
                    title = schemaExtractor.getTitle();
                }

                String keywords = null;
                if (docInfo != null) {
                    keywords = docInfo.getKeywords();
                }
                if (keywords == null) {
                    keywords = "";
                }

                String summary = null;
                if (docInfo != null) {
                    summary = docInfo.getProducer() + " "
                            + docInfo.getCreator() + " " + docInfo.getSubject();
                }
                if (summary == null || summary.equals("")) {
                    summary = schemaExtractor.getAbstract();
                }

                String content = schemaExtractor.getContent();

                Field fieldTitle = new Field("title", title, Field.Store.YES,
                        Field.Index.TOKENIZED, Field.TermVector.YES);
                //fieldTitle.setBoost(1.5f);
                doc.add(fieldTitle);

                Field fieldSumary = new Field("sumary", summary, Field.Store.YES,
                        Field.Index.TOKENIZED, Field.TermVector.YES);
                //fieldSumary.setBoost(1.3f);
                doc.add(fieldSumary);

                doc.add(new Field("content", content, Field.Store.YES,
                        Field.Index.TOKENIZED, Field.TermVector.YES));

                doc.add(new Field("keywords", keywords, Field.Store.YES,
                        Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            }

            // always release PDFBox resources so temporary files are removed
            closePDDocument(pdDoc);
            closeCOSDocument(cosDoc);

            return doc;
        }

        private static COSDocument parseDocument(InputStream is)
                throws IOException {
            PDFParser parser = new PDFParser(is);
            parser.parse();
            return parser.getDocument();
        }

        private void closeCOSDocument(COSDocument cosDoc) {
            if (cosDoc != null) {
                try {
                    cosDoc.close();
                }
                catch (IOException e) {
                    // eat it, what else can we do?
                }
            }
        }

        private void closePDDocument(PDDocument pdDoc) {
            if (pdDoc != null) {
                try {
                    pdDoc.close();
                }
                catch (IOException e) {
                    // eat it, what else can we do?
                }
            }
        }

        public static void main(String[] args) throws Exception {
            PDFBoxPDFHandler handler = new PDFBoxPDFHandler();
            Document doc = handler.getDocument(
                    new FileInputStream(new File(args[0])));
            System.out.println(doc);
        }
    }

    Could you help me, please?
  • Dan Armbrust at Dec 7, 2006 at 3:38 pm

    It would be helpful to know what is filling your hard disk. What files
    are filling the 120 GB? Where are they located?

    Dan

    --
    ****************************
    Daniel Armbrust
    Biomedical Informatics
    Mayo Clinic Rochester
    daniel.armbrust(at)mayo.edu
    http://informatics.mayo.edu/

  • Aigner, Thomas at Dec 7, 2006 at 6:15 pm
    Howdy all,



    I have a question about reading many documents and how long it takes.
    I have a loop over the Hits object that reads a record and then writes it
    to a file. When there is only one user on the IndexSearcher, reading,
    say, 100,000 hits takes around 3 seconds. This is slow, but acceptable.
    When a few more users run searches, the time just to read from the Hits
    object grows to well over 10 seconds, sometimes even 30+ seconds. Is
    there a better way to read through the hits and do something with the
    information? And yes, I have to read all of them for this particular
    task.



    for (int i = 0; i < hits.length(); i++)
    {
        if (fw == null)
        {
            fw = new BufferedWriter(new FileWriter(searchWriteSpec), 8196);
        }

        // write out records
        String tmpHold = hits.doc(i).get("somefield1")
                + hits.doc(i).get("somefield2");

        fw.write(tmpHold + "\n");
    }



    Any ideas on how to speed this up, especially with multiple users? Each
    user gets their own class which contains the above code.



    Thanks,

    Tom
  • Grant Ingersoll at Dec 7, 2006 at 6:24 pm
    Have you done any profiling to identify hotspots in Lucene versus your
    application?

    You might look into the FieldSelector code (used in IndexReader) in the
    trunk version of Lucene; it can be used to load only the fields you are
    interested in when getting a document from disk. This can be useful if
    large stored fields are being loaded that you don't actually need (they
    are simply skipped).

    Also, do you need the BufferedWriter construction and check inside the
    loop? It is probably small in comparison to loading the documents, but it
    seems it is only created once, so why have it in the loop?
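
    For illustration, here is a rough sketch of what that could look like. It
    assumes the trunk FieldSelector API (MapFieldSelector and
    IndexReader.document(int, FieldSelector)) and reuses the variable and
    field names from the earlier post (searcher, hits, searchWriteSpec):

    // Rough sketch (assumed Lucene trunk API): load only the two stored
    // fields that are written out, instead of every stored field per hit,
    // and create the writer once, outside the loop.
    FieldSelector selector =
            new MapFieldSelector(new String[] { "somefield1", "somefield2" });
    IndexReader reader = searcher.getIndexReader();

    BufferedWriter fw =
            new BufferedWriter(new FileWriter(searchWriteSpec), 8196);
    for (int i = 0; i < hits.length(); i++) {
        // hits.id(i) is the internal document number for hit i
        Document d = reader.document(hits.id(i), selector);
        fw.write(d.get("somefield1") + d.get("somefield2") + "\n");
    }
    fw.close();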


    ------------------------------------------------------
    Grant Ingersoll
    http://www.grantingersoll.com/



