Hi,

I need to read PDF files and extract data from them. Is there any way to do
this through Python?

Thanks & regards,
Maneesh KB

  • Gabriel Genellina at Mar 28, 2009 at 6:17 am
    On Thu, 26 Mar 2009 18:31:31 -0300, M Kumar <tomanishkb at gmail.com>
    wrote:
    > I need to read PDF files and extract data from them. Is there any way
    > to do this through Python?

    If you are interested in the text, I'd use Ghostscript pdf2text (you may
    invoke it from inside Python).

    Actually, extracting text from a PDF is rather difficult. It's a
    "presentation" format (or "display" format); every word in the document
    might be absolutely positioned, and there is no paragraph structure you
    can rely on.

    --
    Gabriel Genellina
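To illustrate why absolute positioning makes extraction hard, here is a minimal sketch of the kind of heuristic a text extractor has to apply. It assumes the words have already been pulled out of the page as (x, y, text) tuples; the function name, the tuple layout, and the `y_tolerance` value are all made up for illustration and do not come from any particular library.

```python
def group_words_into_lines(words, y_tolerance=2.0):
    """Rebuild text lines from absolutely positioned words.

    `words` is an iterable of (x, y, text) tuples (a hypothetical layout).
    A word joins the current line when its y coordinate is within
    `y_tolerance` of the first word placed on that line.
    """
    lines = []  # each entry: (y of the line's first word, [(x, text), ...])
    # PDF coordinates grow upward, so the largest y is the top of the page.
    for x, y, text in sorted(words, key=lambda w: (-w[1], w[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within a line, order words left to right by x.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Words in arbitrary order, with slightly uneven baselines.
sample = [
    (115, 686, "line"),
    (72, 700, "Hello"),
    (72, 686, "second"),
    (110, 700.5, "world"),
]
print(group_words_into_lines(sample))
```

Even this toy version has to guess a tolerance, and it says nothing about columns, paragraphs, or tables, which is the point being made above.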
  • Cameron Laird at Mar 28, 2009 at 5:05 pm
    In article <mailman.2823.1238221222.11746.python-list at python.org>,
    Gabriel Genellina wrote:
    > On Thu, 26 Mar 2009 18:31:31 -0300, M Kumar <tomanishkb at gmail.com>
    > wrote:
    >> I need to read PDF files and extract data from them. Is there any way
    >> to do this through Python?
    >
    > If you are interested in the text, I'd use Ghostscript pdf2text (you may
    > invoke it from inside Python).
    >
    > Actually, extracting text from a PDF is rather difficult. It's a
    > "presentation" format (or "display" format); every word in the document
    > might be absolutely positioned, and there is no paragraph structure you
    > can rely on.
    .
    .
    .
    I reinforce Gabriel's good advice with a few points of my own:
    A. I used to try to index PDF text extractors at
       <URL: http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >.
       While I haven't maintained this page in years, it would take only a
       little motivation for me to freshen it considerably.
    B. My current favorite is pdftotext.
    C. There are multiple "pdf2txt"-s, that is, different products which
       share a name. Notice Gabriel's qualification that he is thinking of
       the *GS* one.
    D. Many times the best way to automate a business process involving PDF
       demands a trek farther "upstream", that is, identification of the
       source of a text *before* it was rendered as PDF. Do you have access
       to such sources?
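
As a rough sketch of invoking an external extractor from inside Python, the following calls the pdftotext command-line tool through `subprocess`. It assumes pdftotext (from Xpdf or Poppler) is installed and on the PATH; the helper names and the file name `report.pdf` are hypothetical. The `-layout` and `-enc` options and the use of `-` to write to standard output are pdftotext options, but check your installed version's man page before relying on them.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # Ask pdftotext to try to preserve the physical page layout.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext on pdf_path and return the extracted text as a str."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "report.pdf" is a placeholder; substitute a real file.
    print(pdftotext_command("report.pdf", layout=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.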

Discussion Overview
group: python-list
category: python
posted: Mar 26, '09 at 9:31p
active: Mar 28, '09 at 5:05p
posts: 3
users: 3
website: python.org
