FAQ
Hello Luceners

I have started a new project and need to index PDF documents.
Several projects can extract the content,
such as pdfbox, xpdf and pjclassic.

As far as I can tell from the FAQs and examples, all these
tools support simple text extraction.

Which of these open source tools would you recommend most?

My PDF documents are quite long (on average more than 60 pages).
Therefore I would like to index additional structure information,
so that the user not only gets the whole document as a result
but also learns the page or chapter where
the relevant information is.

Does anyone have similar requirements? Which of these tools
best fits my requirements?

Thanks for your help
Thomas


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Mathieu Lecarme at Aug 14, 2007 at 1:05 pm

    Thomas Arni wrote:
    > Hello Luceners
    >
    > I have started a new project and need to index PDF documents.
    > Several projects can extract the content,
    > such as pdfbox, xpdf and pjclassic.
    >
    > As far as I can tell from the FAQs and examples, all these
    > tools support simple text extraction.
    >
    > Which of these open source tools would you recommend most?

    pdftk or iText?

    > My PDF documents are quite long (on average more than 60 pages).
    > Therefore I would like to index additional structure information,
    > so that the user not only gets the whole document as a result
    > but also learns the page or chapter where
    > the relevant information is.

    The page is simple to extract. The chapter should be trickier,
    depending on whether the document has internal links.
    PDF readers accept an argument, as in an HTTP URL, to open a given page.

    > Does anyone have similar requirements? Which of these tools
    > best fits my requirements?

    Have a look at "PDF Hacks" (ISBN: 0596006551). Once your document is
    split, it will be easy to index.

    M.
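
The split-then-index idea in the reply above could be sketched roughly as follows. This is only an illustration, not anything posted in the thread: it assumes the text of each page has already been extracted by one of the PDF tools mentioned, and it uses a tiny in-memory map as a stand-in for a real Lucene index. The class name `PageIndexSketch`, its methods, and the file name `manual.pdf` are hypothetical. The `#page=N` suffix is an assumption about what "argument like in http" refers to; many PDF viewers honour it, but not all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch only: a minimal in-memory stand-in for a real index such as Lucene.
// Each page is indexed as its own unit, so a hit can report the page number.
class PageIndexSketch {
    // Maps a lower-cased term to the page locations containing it,
    // kept in insertion order. A location looks like "file#page=N".
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Builds a location such as "manual.pdf#page=3" (pages are 1-based).
    // Assumption: the PDF viewer understands the "#page=N" fragment.
    static String pageLocation(String file, int page) {
        return file + "#page=" + page;
    }

    // Indexes one document; pageTexts.get(0) holds the text of page 1.
    void addDocument(String file, List<String> pageTexts) {
        for (int i = 0; i < pageTexts.size(); i++) {
            String location = pageLocation(file, i + 1);
            String text = pageTexts.get(i).toLowerCase(Locale.ROOT);
            // Split on anything that is not a letter or a digit.
            for (String term : text.split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new LinkedHashSet<>())
                            .add(location);
                }
            }
        }
    }

    // Returns the location of every page that contains the exact term.
    List<String> search(String term) {
        Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        index.addDocument("manual.pdf", Arrays.asList(
                "Installation and setup",
                "Indexing long documents",
                "Searching the index"));
        System.out.println(index.search("index")); // prints [manual.pdf#page=3]
    }
}
```

With Lucene itself, the same idea would presumably mean adding one Lucene document per PDF page, with the file name and page number stored as fields next to the page text. Chapter information would need a separate mapping from page ranges to chapter titles, which this sketch does not attempt.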



Discussion Overview
group: java-user
categories: lucene
posted: Aug 14, '07 at 6:29a
active: Aug 14, '07 at 1:05p
posts: 2
users: 2
website: lucene.apache.org
