  • Cescy at Mar 6, 2011 at 1:31 pm
    Hi,

    I am developing a PDF search engine for use on a local computer, to search a large collection of PDF documents.

    I use PDFBox and Lucene to index and search. Now I need to show the user the context of each match within the PDF file in the user interface. How can I achieve this?

    Thanks
  • James Wilson at Mar 7, 2011 at 6:48 pm

    I have completed a project that does exactly the same thing. I put the PDF
    text in XML files. After a Lucene search, I read the text back from
    the XML files. I do not store the text in the Lucene index, because that
    would bloat the index and slow down my searches. FYI -- I use PDFBox to
    extract the "searchable" text and tesseract (OCR) to extract the
    text from the images within the PDFs. To make tesseract work
    correctly, I have to use ImageMagick to make many modifications to the
    images first. Image modification and OCR are slow and extremely
    resource intensive (CPU utilization specifically -- disk I/O to a
    lesser extent).
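
A minimal sketch of the XML side-store idea described above, using only the JDK. The class name `PageTextStore`, the `<document>`/`<page>` element names, and the one-file-per-PDF layout are assumptions made for illustration; the post does not say how the XML files were actually structured. The sketch also assumes the page text has already been extracted and contains no characters that are illegal in XML.

```java
import java.io.InputStream;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical side-store: one XML file per PDF, one <page> element per page. */
final class PageTextStore {

    /** Writes already-extracted page texts to xmlFile, in page order. */
    static void write(Path xmlFile, List<String> pages) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("document");
        doc.appendChild(root);
        for (int i = 0; i < pages.size(); i++) {
            Element page = doc.createElement("page");
            page.setAttribute("number", Integer.toString(i + 1)); // 1-based page number
            page.setTextContent(pages.get(i));                    // escaped on serialization
            root.appendChild(page);
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        try (Writer out = Files.newBufferedWriter(xmlFile)) {     // UTF-8 by default
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    /** Reads the text of every page back, in document order. */
    static List<String> read(Path xmlFile) throws Exception {
        try (InputStream in = Files.newInputStream(xmlFile)) {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            NodeList nodes = doc.getElementsByTagName("page");
            List<String> pages = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                pages.add(nodes.item(i).getTextContent());
            }
            return pages;
        }
    }
}
```

With a layout like this, the Lucene index would presumably only need to keep an identifier for each hit (for example the XML file path), and the text for display is loaded from the file after the search.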

    As for displaying the extracted text, I would use an AJAX framework
    that provides a nice pop-up view of the text. This pop-up should
    also have built-in paging. I use Lucene's built-in highlighting of
    matches as well.
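
To illustrate what highlighting and paging do to the text before it reaches the pop-up, here is a small plain-Java stand-in. It is not Lucene's highlighter and does no query analysis; it only wraps literal, case-insensitive matches of one term in `<b>` tags. The class name `SnippetView` and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: a plain-string stand-in, not Lucene's highlighter. */
final class SnippetView {

    /** Escapes the characters that are special in HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps every case-insensitive literal occurrence of term in <b> tags. */
    static String highlight(String text, String term) {
        if (term.isEmpty()) {
            return escape(text); // nothing to highlight
        }
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE).matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(escape(text.substring(last, m.start())));
            out.append("<b>").append(escape(m.group())).append("</b>");
            last = m.end();
        }
        out.append(escape(text.substring(last)));
        return out.toString();
    }

    /** Returns the requested 1-based page of items, pageSize items per page. */
    static <T> List<T> page(List<T> items, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        int from = Math.min((pageNumber - 1) * pageSize, items.size());
        int to = Math.min(from + pageSize, items.size());
        return new ArrayList<>(items.subList(from, to));
    }
}
```

In a real application the highlighted fragments would more likely come from Lucene itself, as the post says; the sketch only shows the shape of the output a pop-up view would receive.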

    Oh, I almost forgot -- I use PDFBox to extract the images from the PDFs.

    James

    --
    James J. Wilson II
    Systems Engineer
    U.S. District Court
    District of New Mexico
    333 Lomas Blvd., NW
    Albuquerque, NM 87102
    Phone: (505) 348-2081
    Fax: (505) 348-2028

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Bill Janssen at Mar 7, 2011 at 9:28 pm

    James Wilson wrote:

    > I have completed a project that does exactly the same thing. I put the PDF
    > text in XML files. After a Lucene search, I read the text back from
    > the XML files. I do not store the text in the Lucene index, because that
    > would bloat the index and slow down my searches. FYI -- I use PDFBox to
    > extract the "searchable" text and tesseract (OCR) to extract the
    > text from the images within the PDFs. To make tesseract work
    > correctly, I have to use ImageMagick to make many modifications to the
    > images first. Image modification and OCR are slow and extremely
    > resource intensive (CPU utilization specifically -- disk I/O to a
    > lesser extent).

    I've built a pipeline in UpLib (open source at http://uplib.parc.com/)
    to extract both the page images and the text (along with wordboxes and
    font size, etc.) from PDFs, along with various metadata items. It also
    includes a converter (ToPDF) which will convert Web pages, Word,
    PowerPoint, email, etc. to PDF first, and then do the extraction.

    uplib-add-document --noupload mydoc

    will create a temporary directory with all the pieces in it and output
    the name of that directory to stdout.

    > As for displaying the extracted text, I would use an AJAX framework
    > that provides a nice pop-up view of the text. This pop-up should
    > also have built-in paging. I use Lucene's built-in highlighting of
    > matches as well.

    Actually, with HTML and CSS you can do just what "searchable PDF" does.
    Put up the text in an HTML file, using "span" tags with absolute
    positioning, and using the special color "transparent". Use CSS to make
    the page image the "background-image" for the HTML, and you have a
    browser-displayable object which looks like a page image with selectable
    text.
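
A rough sketch of the overlay technique just described, generating the HTML from Java. The `PageOverlay` and `WordBox` names, the pixel units, and the use of a sized `div` (rather than the whole page body) to carry the background image are all assumptions for illustration; the word boxes are assumed to come from whatever extractor supplies word positions.

```java
import java.util.List;

/** Hypothetical sketch: a page image overlaid with invisible, selectable text. */
final class PageOverlay {

    /** One word and its bounding box in pixels, measured from the image's top-left corner. */
    static final class WordBox {
        final String text;
        final int left, top, width, height;

        WordBox(String text, int left, int top, int width, int height) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    /** Escapes characters that are special in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders one page: the image as background, each word as a transparent span. */
    static String render(String imageUrl, int pageWidth, int pageHeight, List<WordBox> words) {
        StringBuilder html = new StringBuilder();
        // The container is the positioning context for the absolutely placed spans.
        html.append("<div style=\"position:relative;width:").append(pageWidth)
            .append("px;height:").append(pageHeight)
            .append("px;background-image:url('").append(escape(imageUrl)).append("')\">\n");
        for (WordBox w : words) {
            // Transparent text stays selectable but lets the page image show through.
            html.append("  <span style=\"position:absolute;color:transparent;left:").append(w.left)
                .append("px;top:").append(w.top)
                .append("px;width:").append(w.width)
                .append("px;height:").append(w.height)
                .append("px;font-size:").append(w.height)
                .append("px\">").append(escape(w.text)).append("</span>\n");
        }
        return html.append("</div>\n").toString();
    }
}
```

Setting the font size to the box height is only a crude way to make the selection rectangle roughly cover the word in the image; a real viewer would probably use the extracted font size instead.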

    Bill

  • Ian Lea at Mar 7, 2011 at 10:41 am
    Please don't cross-post to multiple lists.

    Look back through the Lucene user email archive and you'll find people
    talking about this, or use your favourite search engine to find hits
    for something like "lucene pdf highlighting".

    If you don't find an answer, post again to the most appropriate list
    only, with the specific problem.
    Is the problem with searching? Highlighting text? Generating a PDF?
    Highlighting within a generated PDF?



    --
    Ian.


Discussion Overview
group: java-user
categories: lucene
posted: Mar 6, '11 at 1:31p
active: Mar 7, '11 at 9:28p
posts: 5
users: 5
website: lucene.apache.org
