Hi All:

Has any study or research been done on using scanned paper documents as
images (possibly PDF), extracting the text with OCR or some other
technique, and measuring the quality of the resulting index?


Thanks in advance,
Sithu D Sudarsan

sithu.sudarsan@fda.hhs.gov
sdsudarsan@ualr.edu


  • Renaud Waldura at Feb 26, 2009 at 9:59 pm
    There is quite a bit of literature available on this topic. The paper
    below presents a summary, though nothing immediately applicable, I'm afraid.

    Retrieving OCR Text: A survey of current approaches
    Steven M. Beitzel, Eric C. Jensen, David A Grossman
    Illinois Institute of Technology

    It lists a number of other papers that are easy to find online. Let me know
    what you find, I'm interested in this too.

    --Renaud



    -----Original Message-----
    From: Sudarsan, Sithu D.
    Sent: Thursday, February 26, 2009 8:29 AM
    To: solr-user@lucene.apache.org; java-user@lucene.apache.org
    Subject: Use of scanned documents for text extraction and indexing


    Hi All:

    Has any study or research been done on using scanned paper documents as
    images (possibly PDF), extracting the text with OCR or some other
    technique, and measuring the quality of the resulting index?


    Thanks in advance,
    Sithu D Sudarsan

    sithu.sudarsan@fda.hhs.gov
    sdsudarsan@ualr.edu





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Bastian Buch at Feb 27, 2009 at 12:57 pm
    You can use Tesseract, an open-source OCR engine maintained by Google. It
    is native C/C++ code, so to use it from Java you need either JNI or to
    launch it as a separate process. It has no PDF support, but you can use
    ImageMagick to convert those documents on the fly. The engine scans
    documents line by line without trying to detect text boxes, which is a
    problem for multi-column texts. With some image preprocessing you can
    work around that as well.


    Cheers Bastian.

    http://bastian-buch.de
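
The "direct process creation" route suggested above could look roughly like the sketch below. It is only an illustration, not a tested recipe: it assumes the `convert` (ImageMagick, with Ghostscript available for PDF input) and `tesseract` executables are on the PATH, that the PDF is a single scanned page, and that 300 dpi is a reasonable resolution. The class name `OcrPipeline` and its method names are hypothetical and do not come from this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Sketch: PDF -> TIFF (ImageMagick) -> text (Tesseract), each run as an external process. */
class OcrPipeline {

    /** Command line for ImageMagick: rasterize the PDF at 300 dpi (assumed value) into a TIFF. */
    static List<String> convertCommand(Path pdf, Path tiff) {
        return List.of("convert", "-density", "300", pdf.toString(), tiff.toString());
    }

    /** Command line for Tesseract; the tool itself appends ".txt" to the output base name. */
    static List<String> tesseractCommand(Path tiff, Path outputBase, String language) {
        return List.of("tesseract", tiff.toString(), outputBase.toString(), "-l", language);
    }

    /** Runs one external command and fails loudly on a non-zero exit code. */
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // pass the tool's own messages through to our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed (" + exitCode + "): " + command);
        }
    }

    /** Converts the PDF, runs OCR on the image, and returns the recognized text. */
    static String extractText(Path pdf, Path workDir, String language)
            throws IOException, InterruptedException {
        Path tiff = workDir.resolve("page.tif");
        Path outputBase = workDir.resolve("page");
        run(convertCommand(pdf, tiff));
        run(tesseractCommand(tiff, outputBase, language));
        return Files.readString(workDir.resolve("page.txt"));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java OcrPipeline <scanned.pdf>");
            return;
        }
        Path workDir = Files.createTempDirectory("ocr");
        System.out.println(extractText(Path.of(args[0]), workDir, "eng"));
    }
}
```

The returned string can then be fed to whatever indexing code is already in place. Keeping the command construction separate from process execution makes it easy to add a preprocessing step (for example, splitting columns) between the two calls.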


