Hi,
I have a problem :) I just want to extract text from a PDF file with
Python. There are different libraries for that, but they don't work...

pyPdf and pdfTools: I don't know why, but they don't work with some
PDFs... For example, space characters are deleted from the text.
Pdf playground: I don't understand how it works.

If you have an idea, a tutorial, a library or anything that can help me
do that, please let me know.


  • Rene Pijlman at Apr 21, 2006 at 12:50 pm
    Julien ARNOUX:
    I have a problem :) I just want to extract text from a PDF file with
    Python. There are different libraries for that, but they don't work...

    pyPdf and pdfTools: I don't know why, but they don't work with some
    PDFs...
    Text can be represented in different ways in PDF: as tagged text, bitmap
    and vector images, and even algorithms (IIRC). Most tools will only be
    able to retrieve text represented as tagged text. So some tools may work
    on some texts in some files and fail on others.

    --
    René Pijlman

    What do you want to learn? http://www.leren.nl
  • Avishay at Apr 21, 2006 at 5:03 pm
    You can use Ghostscript for that purpose. Look at the ps2ascii script (or
    batch file) in the Ghostscript distribution. You can either call
    Ghostscript from the command line or use its DLL (I don't know whether a
    Python binding already exists...). The limitations the previous poster
    mentioned, however, still apply.

    Avishay
  • Jim at Apr 21, 2006 at 9:33 pm
    There is a pdftotext executable, at least on Linux.
  • Julien ARNOUX at Apr 24, 2006 at 7:03 am
    Hi,
    Thanks, I used that and it works fine :)

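    import subprocess  # replaces the Python 2-only commands module
    # Run Ghostscript's ps2ascii on the PDF and capture its output as text
    txt = subprocess.getoutput('ps2ascii tmp.pdf')
    print(txt)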
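
For anyone who wants to try both command-line tools suggested in this thread, here is a minimal sketch that runs whichever of pdftotext or ps2ascii is found on the PATH. The helper names build_command and extract_text are made up for illustration and are not part of any library. It assumes that pdftotext writes to standard output when the output file is given as "-", and that ps2ascii writes to standard output when called with only an input file, as in the post above.

```python
import shutil
import subprocess


def build_command(tool, pdf_path):
    """Return the argument list for one of the supported extractors.

    Hypothetical helper for illustration only.
    """
    if tool == "pdftotext":
        # "-" asks pdftotext to send the text to standard output.
        return ["pdftotext", pdf_path, "-"]
    if tool == "ps2ascii":
        # With only an input file, ps2ascii prints to standard output.
        return ["ps2ascii", pdf_path]
    raise ValueError("unsupported tool: %s" % tool)


def extract_text(pdf_path, tools=("pdftotext", "ps2ascii")):
    """Extract text using the first listed tool that is found on PATH."""
    for tool in tools:
        if shutil.which(tool):
            result = subprocess.run(
                build_command(tool, pdf_path),
                capture_output=True,
                text=True,
                check=True,  # raise CalledProcessError if the tool fails
            )
            return result.stdout
    raise RuntimeError(
        "none of these tools were found on PATH: " + ", ".join(tools)
    )
```

Calling extract_text('tmp.pdf') should then return the text as a string. The limitation mentioned earlier in the thread still applies: text stored as images in the PDF will not be recovered by either tool.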

Discussion Overview
group: python-list @
categories: python
posted: Apr 21, '06 at 12:18p
active: Apr 24, '06 at 7:03a
posts: 5
users: 4
website: python.org
