I can't find a general PDF library for Python that can perform arbitrary
operations on PDFs.

I want to automatically highlight certain words (matched by a regex) in a
PDF. Could somebody let me know if there is a tool to do this in Python?


  • Aahz at Mar 16, 2010 at 11:47 pm
    Did you Google at all? "python pdf" finds this as the first link, though
    I have no clue whether it does what you want:

    http://pybrary.net/pyPdf/
    --
    Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/

    "Many customs in this life persist because they ease friction and promote
    productivity as a result of universal agreement, and whether they are
    precisely the optimal choices is much less important." --Henry Spencer
  • David Boddie at Mar 17, 2010 at 10:40 pm

    The original poster might also be interested in displaying the highlighted
    words without modifying the original file. In which case, the Poppler
    library is worth investigating:

    http://poppler.freedesktop.org/

    David
  • Patrick Maupin at Mar 17, 2010 at 4:12 am

    The problem with PDFs is that they can be quite complicated. There is
    the outer container structure, which isn't too bad (unless the
    document author applied encryption or fancy multi-object compression),
    but then inside the graphics elements, things could be stored as
    regular ASCII, or as fancy indexes into font-specific tables. Not
    rocket science, but the only industrial-strength solution for this is
    probably reportlab's pagecatcher.
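
    The encoding issue can be illustrated with a minimal sketch using only
    the standard library. The two content-stream fragments below are made
    up for illustration, and `shown_strings` is a hypothetical helper, not
    part of any library named in this thread. Both fragments pass the same
    bytes to the text-showing operator `Tj`, but only the literal-string
    form contains the bytes a naive search would look for.

```python
import re

# Two hypothetical content-stream fragments that draw the same word.
# PDF allows a string operand to be written literally, (Hello), or in
# hexadecimal, <48656C6C6F>; both feed the same bytes to the Tj operator.
literal_stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
hex_stream = b"BT /F1 12 Tf 72 720 Td <48656C6C6F> Tj ET"

STRING_OPERAND = re.compile(rb"\(([^()\\]*)\)\s*Tj|<([0-9A-Fa-f\s]*)>\s*Tj")


def shown_strings(stream):
    """Return the raw string operands of Tj in a simple content stream.

    Ignores escapes, nested parentheses, TJ arrays, and font encodings,
    so it is only good enough for this demonstration.
    """
    results = []
    for match in STRING_OPERAND.finditer(stream):
        literal, hexdigits = match.group(1), match.group(2)
        if literal is not None:
            results.append(literal)
        else:
            digits = b"".join(hexdigits.split())
            if len(digits) % 2:
                digits += b"0"  # odd digit count: the last digit is padded with 0
            results.append(bytes.fromhex(digits.decode("ascii")))
    return results


# A naive byte search finds the word in one stream but not the other.
naive_hits = [bool(re.search(rb"Hello", s)) for s in (literal_stream, hex_stream)]
# Decoding the string operands first recovers the same bytes from both.
decoded = [shown_strings(s) for s in (literal_stream, hex_stream)]
```

    Even the decoded bytes are only character codes. With a font-specific
    encoding they may not correspond to ASCII at all, which is the harder
    part of the problem described above.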

    I have a library called pdfrw that reads and writes PDFs, working
    primarily with the outer container. I also maintain a list of other
    PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries. It
    may be that pdfminer (link on that page) will do what you want -- it
    is certainly trying to be complete as a PDF reader. But I've never
    personally used pdfminer.

    One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools
    will read in preexisting PDFs and write them out to a reportlab
    canvas. This works quite well on a few very simple ASCII PDFs, but
    the font handling needs a lot of work and probably won't work at all
    right now on Unicode text. (But if you wanted to improve it, I would
    certainly accept patches or give you commit rights!)

    That pdfrw example does graphics reasonably well. I was actually
    going down that path to get better vector graphics into rst2pdf
    (both uniconvertor and svglib were broken for my purposes), but then I
    realized that the PDF spec allows you to include a page from another
    PDF quite easily (the spec calls it a form XObject), so you don't
    actually need to parse down into the graphics stream for that. So,
    right now, the best way to do vector graphics with rst2pdf is either
    to give it a preexisting PDF (which it passes off to pdfrw for
    conversion into a form XObject), or to give it an .svg file and invoke
    it with -e inkscape, in which case it uses Inkscape to convert the SVG
    to a PDF and then follows the same path.

    HTH,
    Pat
  • Peng Yu at Mar 17, 2010 at 2:53 pm

    Thank you for your long reply! But I'm not sure whether you understood
    my question.

    Acrobat can highlight certain words in PDFs, and I can add notes to the
    highlighted words as well. However, I find that I frequently end up
    highlighting words that could be matched by a regular expression.

    To improve my productivity, I don't want to do this manually in Acrobat
    but rather automatically, if such a tool is available. People on the
    reportlab mailing list said this is not possible with reportlab, and I
    don't see that pyPdf can do it either. If you know of an API for this
    purpose, please let me know. Thank you!

    Regards,
    Peng
  • Patrick Maupin at Mar 17, 2010 at 3:11 pm

    I do not know of any API specific to this purpose, no. But I
    mentioned three libraries (pagecatcher, pdfminer, and pdfrw) that are
    capable, to a greater or lesser extent, of reading in PDFs and giving
    you the data from them, which you can then do your replacement on and
    then write back out. I would imagine this would be a piece of cake
    with pagecatcher. (I noticed you just posted on the reportlab mailing
    list, but you did not specifically mention pagecatcher.) It will
    probably take more work with either of the other two. It is probable
    that none of them do exactly what you want, but also that any of them
    is a better starting point than coding what you want from scratch.
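
    The "read the data, match, write back" approach can be sketched
    independently of any particular library. This assumes some extractor
    has already produced words with page coordinates; the `Word` tuple and
    the `highlight_annotations` function are hypothetical names, and the
    sample coordinates are invented. The dictionaries only mirror keys that
    the PDF specification defines for highlight annotations (`Subtype`,
    `Rect`, `QuadPoints`, `C`); writing them into a file is left to
    whichever library is used.

```python
import re
from collections import namedtuple

# Hypothetical extractor output: text plus bounding box in PDF user space
# (origin at bottom-left, default unit 1/72 inch).
Word = namedtuple("Word", "text x0 y0 x1 y1")


def highlight_annotations(words, pattern):
    """Build one highlight-annotation dict per word matching `pattern`."""
    regex = re.compile(pattern)
    annotations = []
    for w in words:
        if not regex.search(w.text):
            continue
        annotations.append({
            "Type": "Annot",
            "Subtype": "Highlight",
            "Rect": [w.x0, w.y0, w.x1, w.y1],
            # Corner order used here: top-left, top-right, bottom-left,
            # bottom-right. Viewers differ on this, so treat it as an assumption.
            "QuadPoints": [w.x0, w.y1, w.x1, w.y1, w.x0, w.y0, w.x1, w.y0],
            "C": [1, 1, 0],  # yellow, as RGB components in the range 0..1
        })
    return annotations


# Invented sample data standing in for real extractor output.
words = [
    Word("Python", 72, 700, 110, 712),
    Word("and", 114, 700, 132, 712),
    Word("PDFs", 136, 700, 164, 712),
]
annots = highlight_annotations(words, r"(?i)^pdfs?$")
```

    Matching word by word keeps the sketch short, but it cannot handle a
    pattern that spans several words or a line break; that would need the
    matched text mapped back to several boxes.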

    Regards,
    Pat
  • TP at Mar 18, 2010 at 7:36 pm

    Take a look at the Acrobat SDK
    (http://www.adobe.com/devnet/acrobat/?view=downloads). In particular,
    see the Acrobat Interapplication Communication information at
    http://www.adobe.com/devnet/acrobat/interapplication_communication.html.

    "Spell-checking a document" shows how to spell-check a PDF using
    Visual Basic at
    http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.17.html

    "Working with annotations" shows how to add an annotation with Visual
    Basic at http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.16.html.

    Presumably combining the two examples with Python's win32com should
    allow you to do what you want.

Discussion Overview
group: python-list
category: python
posted: Mar 4, '10 at 11:57 pm
active: Mar 18, '10 at 7:36 pm
posts: 7
users: 5
website: python.org
