I'm using the code below to read a PDF document, and the imported text has no
line feeds or carriage returns. I'm therefore trying to replace the symbol that
looks like it marks the end of a line, unichr(167), which I found by examining
the characters in the "for" loop.
Unfortunately, the replace isn't working. Does anyone know what I'm
doing wrong? I tried a number of things, so I left some of the failed
attempts in place as comments.

Any help?
Kurt

#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyPdf import PdfFileWriter, PdfFileReader
import unicodedata
fileencoding = "utf-16-LE" #"iso-8859-1" # "utf-8"
doc = PdfFileReader(file(r"C:\Documents and Settings\kpeters\My Documents\SUA.pdf", "rb"))

# print the title of document1.pdf
print "title = %s" % (doc.getDocumentInfo().title)
print "Subject:", doc.getDocumentInfo().subject
print "PDF Version:", doc.getDocumentInfo().producer
page4 = doc.getPage(3)
textu= page4.extractText()
#textu=textu.decode(fileencoding)
print type(textu)
#print type(textu.encode(fileencoding))
#textu=textu.encode(fileencoding) #Converts to str
fn = unichr(167)
print('The char is %s' % fn)
textu.replace(unichr(167),'\n')
#print unicodedata.bidirectional(fn) unichr(167)
for i, c in enumerate(textu):
    if (i != 02):
        print('# %d has char %s, ord: %d , char: %s, category %s, and Name: %s' % (i, c, ord(c), unichr(ord(c)), unicodedata.category(c), unicodedata.name(c)))

#if (ord(c)==167):
#    print('Found it!')
#textu[i]='\n'
print('----------------------------------------------------')
print textu
print textu.encode(fileencoding)
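
The inspection loop in the question can be written a little more defensively. The following is only a sketch, in Python 3 syntax (where chr replaces unichr and every str is Unicode); the helper name describe_chars and the sample string are made up for illustration. One detail worth knowing: unicodedata.name() raises ValueError for characters that have no name, such as control characters, unless a default is supplied.

```python
import unicodedata

def describe_chars(text):
    """Yield (index, escaped char, code point, category, name) per character."""
    for i, c in enumerate(text):
        # A default avoids ValueError for unnamed characters such as "\n".
        name = unicodedata.name(c, "<unnamed>")
        # ascii() escapes non-ASCII characters, so printing is terminal-safe.
        yield i, ascii(c), ord(c), unicodedata.category(c), name

sample = "Page \xa773.49\n"  # hypothetical stand-in for extracted PDF text
for row in describe_chars(sample):
    print("#%d has char %s, ord: %d, category: %s, name: %s" % row)
```

Printing the escaped form makes it easy to spot which code point (here 167, the section sign) is standing in for a line break.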


  • John Machin at Oct 11, 2008 at 11:43 pm

    On Oct 12, 7:05 am, Kurt Peters wrote:
    > I'm using the code below to read a pdf document, and it has no line feeds
    > or carriage returns in the imported text. I'm therefore trying to just
    > replace the symbol that looks like it would be an end of line (found by
    > examining the characters in the "for loop") unichr(167).
    > Unfortunately, the replace isn't working, does anyone know what I'm
    > doing wrong? I tried a number of things so I left comments in place as a
    > subset of the bunch of things I tried to no avail.

    This is the first time I've ever looked inside a PDF file, and *only*
    one file, but:

    import pyPdf, sys
    filename = sys.argv[1]
    doc = pyPdf.PdfFileReader(open(filename, "rb"))
    for pageno in range(doc.getNumPages()):
        page = doc.getPage(pageno)
        textu = page.extractText()
        print "pageno", pageno
        print type(textu)
        print repr(textu)

    gives me <type 'unicode'> and text with lots of \n at places where
    you'd expect them.

    The only problem I can see is that where I see (and expect) quotation
    marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
    the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
    and apostrophes. I had a bit of a poke around:

    1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
    \x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
    into \x93 and \x94).

    2. Then pyPdf appears to push these through a fixed transformation
    table (_pdfDocEncoding in generic.py) and they become \ufb01 and
    \ufb02.

    3. However:

    >>> '\x93\x94'.decode('cp1252') # as suspected
    u'\u201c\u201d' # as expected

    AFAICT there is only one reference to encoding in the pyPdf docs: "if
    pyPdf was unable to decode the string's text encoding" ...

    Cheers,
    John
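
The mismatch John describes can be reproduced without pyPdf. This is only a sketch, in Python 3 syntax, assuming (as his step 2 suggests) that bytes 0x93 and 0x94 ended up as the ligatures U+FB01 and U+FB02. The table QUOTE_FIX and the function fix_quotes are hypothetical, not part of pyPdf, and the repair is only safe for text that contains no genuine ligatures.

```python
# Bytes 0x93 and 0x94 are the curly double quotes in Windows-1252.
raw = b"\x93quoted\x94"
print(ascii(raw.decode("cp1252")))

# Hypothetical repair table: map the ligatures back to the intended quotes.
QUOTE_FIX = str.maketrans({"\ufb01": "\u201c", "\ufb02": "\u201d"})

def fix_quotes(text):
    """Replace U+FB01/U+FB02 with U+201C/U+201D (assumes no real ligatures)."""
    return text.translate(QUOTE_FIX)

print(ascii(fix_quotes("\ufb01quoted\ufb02")))
```

A translation table is used instead of chained replace calls so that further substitutions (em-dashes, apostrophes) could be added in one place once their code points are confirmed.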
  • Kurt Peters at Oct 13, 2008 at 1:56 am
    Thanks.
    Clearly, though, my "for" loop shows a character with ord(c) == 167, and
    print repr(textu) shows the character \xa7 (as does Peter Otten's post).
    So you can see what I see, here's the document I'm using: the Special Use
    Airspace document, JO 7400.8P (PDF), at
    http://www.faa.gov/airports_airtraffic/air_traffic/publications/

    If you just look at page three, it shows those unusual characters.
    Once again, a "simple" replace doesn't seem to work. I can't
    figure out how to get it to work, despite all the great posts attempting to
    shed some light on the subject.

    Regards,
    Kurt

  • Mark Tolonen at Oct 13, 2008 at 2:53 am
    In your original code:

    textu.replace(unichr(167),'\n')

    as Dennis suggested (but maybe you were distracted by his 'fn' replacement,
    so I'll leave it out):

    textu = textu.replace(unichr(167),'\n')

    .replace does not modify the string in place. It returns the modified
    string, so you have to reassign it.

    -Mark
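
Mark's point can be checked in a few lines. A minimal sketch in Python 3 syntax; the sample string is made up, and "\xa7" is the section sign that unichr(167) produces in Python 2.

```python
text = "a\xa7b\xa7c"  # hypothetical sample containing two section signs

# Strings are immutable: replace() returns a new string, and here it is discarded.
text.replace("\xa7", "\n")
unchanged = "\xa7" in text

# Rebinding the name keeps the result.
text = text.replace("\xa7", "\n")

print(unchanged, ascii(text))
```

The same applies to every other str method that looks like it changes the string (strip, upper, and so on): each returns a new object and leaves the original alone.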
  • Kurt Peters at Oct 18, 2008 at 10:47 pm
    Thanks,
    The "distraction" was my problem. I replaced the textu.replace as you
    suggested and it works fine.
    Kurt
  • Kurt Peters at Oct 12, 2008 at 4:24 am
    I had done that about 21 revisions ago. Nevertheless, why would you think
    that would work, when the code as shown doesn't?
    kurt


    "Dennis Lee Bieber" <wlfraed at ix.netcom.com> wrote in message
    news:msGdnVOGt6i2kmzVnZ2dnUVZ_rDinZ2d at earthlink.com...
    On Sat, 11 Oct 2008 15:05:43 -0500, Kurt Peters
    <nospampetersk at bigfoot.com> declaimed the following in comp.lang.python:

    > textu.replace(unichr(167),'\n')

    Might I suggest:

    textu = textu.replace(fn, "\n") #you already created fn as the character
    --
    Wulfraed Dennis Lee Bieber KD6MOG
    wlfraed at ix.netcom.com wulfraed at bestiaria.com
    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: web-asst at bestiaria.com)
    HTTP://www.bestiaria.com/
  • Peter Otten at Oct 12, 2008 at 8:00 am

    Kurt Peters wrote:

    > I had done that about 21 revisions ago.

    If you litter your module with code that is commented out it is hard to keep
    track of what works and what doesn't.

    > Nevertheless, why would you think
    > that would work, when the code as shown doesn't?

    Because he knows Python? Why don't /you/ try it before asking that question?

    A good place to do "exploratory" programming is Python's interactive
    interpreter. Here's a sample session:

    Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:43)
    [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from pyPdf import PdfFileReader as PFR
    >>> doc = PFR(open("SUA.pdf"))
    >>> text = doc.getPage(3).extractText()
    >>> type(text)
    <type 'unicode'>
    >>> text[:200]
    u'2/16/08 7400.8P Table of Contents - Continued Section Page \xa773.49 New Hampshire (NH) 50 \xa773.50 New Jersey (NJ) 50 \xa773.51 New Mexico (NM) 51 \xa773.52 New York (NY) 56 \xa773.53 North '
    >>> print text[:200].replace(u"\xa7", u"\n")
    2/16/08 7400.8P Table of Contents - Continued Section Page
    73.49 New Hampshire (NH) 50
    73.50 New Jersey (NJ) 50
    73.51 New Mexico (NM) 51
    73.52 New York (NY) 56
    73.53 North

    Peter
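
If the goal is a list of entries rather than one string with newlines, splitting on the marker is an alternative to the replace shown in Peter's session. A sketch in Python 3 syntax; the function name split_sections and the shortened sample text are assumptions for illustration, not taken from the actual document.

```python
def split_sections(text, marker="\xa7"):
    """Split extracted text on the marker and drop empty or blank pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]

# Hypothetical sample modelled on the table of contents in the session above.
sample = ("Contents Section Page \xa773.49 New Hampshire (NH) 50 "
          "\xa773.50 New Jersey (NJ) 50")

for line in split_sections(sample):
    print(line)
```

Joining the result with "\n".join(...) gives roughly the same text as the replace, minus the trailing spaces.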
  • Kurt Peters at Oct 13, 2008 at 1:22 am
    Thanks...

    On a side note, do you really think the function call wouldn't interpret
    the unichr before the function call?
    Kurt

  • Martin v. Löwis at Oct 13, 2008 at 5:17 am

    > On a side note, do you really think the function call wouldn't interpret
    > the unichr before the function call?

    Dennis' main point was not that you can reuse fn (which he suggested
    just as performance improvement), but that you need to assign the result
    of .replace back to textu.

    Regards,
    Martin
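
Kurt's side question can be settled with a quick check: arguments are evaluated before the call, so passing the expression inline and passing a precomputed name give the same result. A small sketch in Python 3 syntax (chr is the Python 3 spelling of unichr); the sample string is made up.

```python
fn = chr(167)    # computed once, as in the original script
text = "x\xa7y"  # hypothetical sample

inline = text.replace(chr(167), "\n")  # chr(167) is evaluated, then passed in
named = text.replace(fn, "\n")         # same argument value, reused

print(inline == named)
```

Either spelling works; what matters is that the return value is assigned to something.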

Discussion Overview
group: python-list
category: python
posted: Oct 11, '08 at 8:05p
active: Oct 18, '08 at 10:47p
posts: 9
users: 6
website: python.org
