James Wilson wrote:
> I have completed a project to do the exact same thing. I put the PDF
> text in XML files. Then, after I do a Lucene search, I read the text
> from the XML files. I do not store the text in the Lucene index; that
> would bloat the index and slow down my searches. FYI -- I use PDFBox to
> extract the "searchable" text, and I use Tesseract (OCR) to extract the
> text from the images within the PDFs. To make Tesseract OCR the images
> correctly, I have to use ImageMagick to apply many modifications to
> them first. Image modification/OCR is a slow process, and it is
> extremely resource intensive (CPU utilization specifically -- disk I/O
> to a lesser extent).
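Something like the following is the basic shape of that indexing step -- a
rough sketch only, assuming PDFBox 2.x and a recent Lucene, with placeholder
path and field names. Note that the contents field is indexed but not stored,
so the index stays small and the full text stays in the XML files:

import java.io.File;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class IndexPdf {
    public static void main(String[] args) throws Exception {
        String pdfPath = args[0];
        // Extract the "searchable" text layer with PDFBox.
        String text;
        try (PDDocument pdf = PDDocument.load(new File(pdfPath))) {
            text = new PDFTextStripper().getText(pdf);
        }
        // Index with Lucene; the text is tokenized but NOT stored in the
        // index -- the full text lives outside (e.g. in the XML files).
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("path", pdfPath, Field.Store.YES));
            doc.add(new TextField("contents", text, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}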
I've built a pipeline in UpLib (open source at
http://uplib.parc.com/) to extract both the page images and the text
(along with word boxes, font sizes, etc.) from PDFs, as well as various
metadata items. It also includes a converter (ToPDF) which will convert
Web pages, Word, PowerPoint, email, etc. to PDF first, and then do the
extraction.
uplib-add-document --noupload mydoc
will create a temporary directory with all the pieces in it and output
the name of that directory to stdout.
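If you drive that from Java, a minimal sketch looks like the following
(it assumes the directory name is the only line printed to stdout):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExtractWithUplib {
    public static void main(String[] args) throws Exception {
        // Run the UpLib extractor without uploading; as described above,
        // it writes the name of the temporary directory (holding the page
        // images, text, word boxes, etc.) to stdout.
        Process p = new ProcessBuilder("uplib-add-document", "--noupload", "mydoc")
                .start();
        String tempDir;
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            tempDir = out.readLine();  // assumes the first line is the directory name
        }
        p.waitFor();
        System.out.println("Extracted pieces are in: " + tempDir);
    }
}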
> As far as displaying the extracted text goes, I would use an AJAX
> framework that provides a nice pop-up view of the text. This pop-up
> should also have built-in paging. I use Lucene's built-in highlighting
> of matches as well.
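For the match highlighting, the Lucene highlighter module can produce the
snippets for such a pop-up. A rough sketch (the tag choice and field name
here are arbitrary, and the text is read back from the XML file rather
than from the index):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class SnippetHighlighter {
    // Wrap query matches in <b> tags for display in the pop-up view.
    static String highlight(Query query, String pageText) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        Highlighter highlighter = new Highlighter(
                new SimpleHTMLFormatter("<b>", "</b>"),
                new QueryScorer(query));
        String fragment = highlighter.getBestFragment(analyzer, "contents", pageText);
        return fragment != null ? fragment : pageText;  // no match: show text as-is
    }
}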
Actually, with HTML and CSS you can do just what "searchable PDF" does.
Put up the text in an HTML file, using "span" tags with absolute
positioning and the special color "transparent". Use CSS to make the
page image the "background-image" for the HTML, and you have a
browser-displayable object that looks like a page image but has
selectable text.
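A rough sketch of generating such a page, assuming you already have word
boxes (text plus pixel coordinates) from the extraction step; the class
and method names here are made up for illustration:

import java.util.List;

public class SearchablePageHtml {

    // Hypothetical word-box holder; whatever extraction tool you use
    // (UpLib, PDFBox, etc.) will have its own representation.
    public static class WordBox {
        final String text;
        final int x, y, w, h, fontSize;  // pixel coordinates on the page image
        WordBox(String text, int x, int y, int w, int h, int fontSize) {
            this.text = text; this.x = x; this.y = y;
            this.w = w; this.h = h; this.fontSize = fontSize;
        }
    }

    // Emit one page: the page image as the CSS background, and each word
    // as an absolutely positioned span in the "transparent" color, so the
    // text is selectable but invisible -- just like a "searchable PDF".
    public static String toHtml(String pageImageUrl, int pageW, int pageH,
                                List<WordBox> boxes) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(
            "<div style=\"position:relative; width:%dpx; height:%dpx; "
            + "background-image:url('%s'); background-repeat:no-repeat;\">\n",
            pageW, pageH, pageImageUrl));
        for (WordBox b : boxes) {
            // Real code should HTML-escape b.text before emitting it.
            html.append(String.format(
                "  <span style=\"position:absolute; left:%dpx; top:%dpx; "
                + "width:%dpx; height:%dpx; font-size:%dpx; "
                + "color:transparent;\">%s</span>\n",
                b.x, b.y, b.w, b.h, b.fontSize, b.text));
        }
        html.append("</div>\n");
        return html.toString();
    }
}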
Bill