I was trying to use pyPdf, following a recipe from the ActiveState
cookbooks, but I cannot get it to work. I am unsure whether it is me or
whether it is because sets are deprecated.

I have placed a PDF in my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use anything; I was just testing with it.

I was using the last script on that page, the one most recently
updated. I am using Python 2.6.

http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")

This is my error.

>>>

Warning (from warnings module):
  File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
    from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 15, in <module>
    print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
  File "C:/Python26/Pdfread", line 6, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
>>>


  • MRAB at Sep 26, 2010 at 11:35 pm

    On 27/09/2010 00:10, flebber wrote:
    > I was trying to use Pypdf following a recipe from the Activestate
    > cookbooks. However I cannot get it to work. Unsure if it is me or it
    > is because sets are deprecated.

    The 'sets' module pre-dates the built-in 'set' class. The warning is
    just to inform you that the module will be removed in due course (it's
    still in Python 2.7, but not Python 3), so you can still use it in
    Python 2.6 and 2.7.
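
On the deprecation warning: a minimal sketch of the built-in types that took over from the 'sets' module. As far as I know, 'set' stands in for sets.Set and 'frozenset' for sets.ImmutableSet; the variable names below are only for illustration and have nothing to do with pyPdf's own code.

```python
# Built-in replacements for the old `sets` module.
mutable = set(["a", "b"])          # plays the role of sets.Set
frozen = frozenset(["a", "b"])     # plays the role of sets.ImmutableSet

mutable.add("c")                   # a set can be changed in place
same_members = frozen == frozenset(["b", "a"])   # order does not matter

# A frozenset is hashable, so it can be used as a dictionary key.
lookup = {frozen: "pair"}
```

The warning itself is harmless here; the IOError further down is what stops the script.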
    You put the file in C:\, but you didn't tell Python where it is. You
    gave just the filename "Components-of-Dot-NET.pdf", and it's looking in
    the current directory, which probably isn't C:\.

    Try providing the full pathname:

    print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
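
To see where a bare file name actually points, something like the following may help. It is only a sketch: describe_path is a made-up helper name, and the file name is the one from the question.

```python
import os

def describe_path(name):
    """Return (absolute_path, exists) for a file name as Python would resolve it."""
    absolute = os.path.abspath(name)   # relative names are joined to os.getcwd()
    return absolute, os.path.exists(absolute)

# A bare file name is resolved against the current working directory,
# which for a script run from an editor is often not C:\.
resolved, found = describe_path("Components-of-Dot-NET.pdf")
```

If found is False, the file is not in the current directory, and passing the full path (as above) is the simplest fix.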
  • W G Sneddon at Sep 26, 2010 at 11:38 pm

    On Sep 26, 7:10 pm, flebber wrote:
    <snip>
    > IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'

    Looks like an issue with finding the file.
    How do you pass the path?
  • Flebber at Sep 27, 2010 at 12:39 am

    On Sep 27, 9:38 am, "w.g.sned... at gmail.com" wrote:
    <snip>
    > Looks like a issue with finding the file.
    > how do you pass the path?

    Okay, thanks. I thought that when I set content here

    def getPDFContent(path):
        content = "C:\Components-of-Dot-NET.pdf"

    I was defining where the file is.

    But yes, I updated the script to the version below and it works; that is,
    the contents are displayed in the interpreter. How do I output to a .txt file?

    import pyPdf

    def getPDFContent(path):
        content = "C:\Components-of-Dot-NET.pdf"
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        for i in range(0, pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

    print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
  • Flebber at Sep 27, 2010 at 2:08 am

    On Sep 27, 10:39 am, flebber wrote:
    <snip>
    > but yeah I updated script to below and it works. That is the contents
    > are displayed to the interpreter. How do I output to a .txt file?

    I have found far more advanced scripts by searching around, but I will have
    to keep trying, as I cannot get an output file or specify the path.

    Edit: very strangely, whilst searching for examples I found my own post,
    just written here, ranking number 5 on Google within 2 hours. Bizarre.

    http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf.aspx

    It replicates our thread as theirs. I was searching Google with "pypdf
    return to txt file".
  • Flebber at Sep 27, 2010 at 2:19 am

    On Sep 27, 12:08 pm, flebber wrote:
    <snip>
    > I have found far more advanced scripts searching around. But will have
    > to keep trying as I cannot get an output file or specify the path.

    Traceback (most recent call last):
      File "C:/Python26/Pdfread", line 16, in <module>
        open('x.txt', 'w').write(content)
    NameError: name 'content' is not defined
    >>>

    That is what I get when I use:

    import pyPdf

    def getPDFContent(path):
        content = "C:\Components-of-Dot-NET.txt"
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        for i in range(0, pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

    print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
    open('x.txt', 'w').write(content)
  • Dave Angel at Sep 27, 2010 at 4:46 am

    On 2:59 PM, flebber wrote:
    <snip>
    > Traceback (most recent call last):
    >   File "C:/Python26/Pdfread", line 16, in <module>
    >     open('x.txt', 'w').write(content)
    > NameError: name 'content' is not defined
    <snip>
    > print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
    > open('x.txt', 'w').write(content)

    There's no global variable content; that name was local to the function, so
    it's lost when the function exits. The function does return the value, but
    you give it to print and don't save it anywhere.

    data = getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

    outfile = open('x.txt', 'w')
    outfile.write(data)
    outfile.close()

    I used a different name to emphasize that this is *not* the same
    variable as content inside the function. In this case, it happens to
    have the same value. And if you used the same name, you could be
    confused about which is which.


    DaveA
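
The same idea can be written with a 'with' block, which closes the file even if the write fails. This is only a sketch: get_text is a hypothetical stand-in for getPDFContent so that it runs without pyPdf or a PDF file, and the temporary-directory location for x.txt is an assumption made for the example.

```python
import os
import tempfile

def get_text():
    # Hypothetical stand-in for getPDFContent(); returns a fixed string
    # so the example runs without pyPdf or a PDF file.
    return u"text extracted from the PDF"

# Keep the return value in a name of its own, then write it out.
data = get_text().encode("ascii", "ignore")

out_path = os.path.join(tempfile.gettempdir(), "x.txt")  # assumed location
with open(out_path, "wb") as outfile:   # the file is closed when the block ends
    outfile.write(data)
```

Binary mode is used because the encoded text is a byte string; in Python 2.6 plain 'w' would also work.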
  • Flebber at Sep 27, 2010 at 2:19 pm

    On Sep 27, 2:46 pm, Dave Angel wrote:
    <snip>
    > I used a different name to emphasize that this is *not* the same
    > variable as content inside the function.

    Thank you, everyone.
  • MRAB at Sep 27, 2010 at 2:49 am

    On 27/09/2010 01:39, flebber wrote:
    <snip>
    > but yeah I updated script to below and it works. That is the contents
    > are displayed to the interpreter. How do I output to a .txt file?

    > import pyPdf
    >
    > def getPDFContent(path):
    >     content = "C:\Components-of-Dot-NET.pdf"

    That simply binds to a local name; 'content' is a local variable in the
    function 'getPDFContent'.

    >     # Load PDF into pyPDF
    >     pdf = pyPdf.PdfFileReader(file(path, "rb"))

    You're opening a file whose path is in 'path'.

    >     # Iterate pages
    >     for i in range(0, pdf.getNumPages()):
    >         # Extract text from page and add to content
    >         content += pdf.getPage(i).extractText() + "\n"

    That appends to 'content'.

    >     # Collapse whitespace

    'content' now contains the text of the PDF, starting with
    r"C:\Components-of-Dot-NET.pdf".

    >     content = " ".join(content.replace(u"\xa0", " ").strip().split())
    >     return content
    >
    > print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

    Outputting to a .txt file is simple: open the file for writing using
    'open', write the string to it, and then close it.
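
Following on from the point that 'content' starts out holding the path string: a sketch of the accumulation step with 'content' starting empty instead. The name join_page_texts and the sample page strings are made up for illustration; they stand in for the pdf.getPage(i).extractText() calls so the example runs without pyPdf.

```python
def join_page_texts(page_texts):
    """Concatenate page texts and collapse runs of whitespace."""
    content = u""                      # start empty, not with the file path
    for text in page_texts:
        content += text + u"\n"
    # Replace non-breaking spaces, then squeeze all whitespace to single spaces.
    return u" ".join(content.replace(u"\xa0", u" ").strip().split())

# Hypothetical page strings standing in for pdf.getPage(i).extractText().
result = join_page_texts([u"First\xa0page  text", u"second   page"])
```

With this change the returned text no longer begins with the file name.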
  • Flebber at Sep 27, 2010 at 2:56 am

    On Sep 27, 12:49 pm, MRAB wrote:
    <snip>
    > Outputting to a .txt file is simple: open the file for writing using
    > 'open', write the string to it, and then close it.

    That's what I was trying to do with

    open('x.txt', 'w').write(content)

    The rest of the script works; it just won't output the text.
  • Tim Roberts at Sep 27, 2010 at 3:12 am

    flebber wrote:
    > okay thanks I thought that when I set content here
    >
    > def getPDFContent(path):
    >     content = "C:\Components-of-Dot-NET.pdf"

    You have a backslash problem here. You need to say:
        content = "C:\\Components-of-Dot-NET.pdf"
    or
        content = "C:/Components-of-Dot-NET.pdf"
    or
        content = r"C:\Components-of-Dot-NET.pdf"
    --
    Tim Roberts, timr at probo.com
    Providenza & Boekelheide, Inc.
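
A small sketch of why the backslash matters. The path "C:\temp" is a made-up example chosen because "\t" is a real escape sequence. The literal in the thread happens to survive because "\C" is not a recognized escape, so Python leaves it in the string unchanged (newer Python versions warn about such literals), but relying on that is fragile.

```python
# "\t" inside a normal string literal is a tab character, not backslash + t.
plain = "C:\temp"        # 6 characters: C, :, TAB, e, m, p
escaped = "C:\\temp"     # 7 characters, with a real backslash
raw = r"C:\temp"         # raw string: backslashes are kept as written
forward = "C:/temp"      # forward slashes avoid the escaping question entirely
```

Here escaped and raw are the same string, while plain silently contains a tab.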

Discussion Overview
group: python-list
categories: python
posted: Sep 26, '10 at 11:10p
active: Sep 27, '10 at 2:19p
posts: 11
users: 5
website: python.org
