FAQ
Hi,

I wanted to extract the meta-data from an encrypted/protected PDF file
and could not find any Python scripts to do this. So, I decided to
write something myself, the result follows.

This demonstration utility requires the `pdftools` files from
<http://www.boddie.org.uk/david/Projects/Python/pdftools/> but the
decryption functions themselves should be usable with other Python PDF
libraries.

Documentation is marginal and all I can say is that worked on the
three PDF files I tested it on... :-)

--Phil.

P.S. The usual Usenet-mangling warning applies--yeah, I know--I should
put it up on a web site somewhere... :-)

#!/usr/bin/python
#
# Decrypt PDF Info
#
# Decrypts PDF files and displays meta-data associated with them. (If
the
# file isn't encrypted the information is displayed as is.)
#
# The results are similar to xpdf's `pdfinfo` utility.
#
# It should be possible to decrypt all of the objects contained
# in the PDF, but this only reads the Document Information Dictionary.
#
# (Note: All the PDF handling is provided by `pdftools`, this just
adds
# the ability to deal with encrypted PDF files.)
#
# Requires:
# + pdftools
# <http://www.boddie.org.uk/david/Projects/Python/pdftools/>
#
# Based on:
# + `pdfdecrypt.pl`
# <http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/pdfdecrypt.pl>
[PDFPL]
#
# Incorporates:
# + RC4 from CipherSaber implementation by Ka-Ping Yee
<ping at lfw.org>
# <http://www.xs4all.nl/~cg/ciphersaber/comp/python.txt>
#
# References:
# + <http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/anon21jul01-pdf-encryption.txt>
[PDFE]
#
# Author:
# follower at myrealbox.com (Standing on *many* shoulders...)
#
import sys
import md5
import struct

from pdftools import PDFdocument

def arcfour(input, key):
"""
Perform the ARCFOUR (RC4) algorithm on a given input list of bytes
with
a key given as a list of bytes, and return the output as a list of
bytes.

(From CipherSaber implementation by Ka-Ping Yee <ping at lfw.org>
<http://www.xs4all.nl/~cg/ciphersaber/comp/python.txt>)
"""
i, j, state = 0, 0, range(256)
for i in range(256):
j = (j + state[i] + key[i % len(key)]) % 256
state[i], state[j] = state[j], state[i]
i, j, output = 0, 0, []
for byte in input:
i = (i + 1) % 256
j = (j + state[i]) % 256
state[i], state[j] = state[j], state[i]
n = (state[i] + state[j]) % 256
output.append(byte ^ state[n])
return output


_passwordPad = [
0x28, 0xbf, 0x4e, 0x5e, 0x4e, 0x75, 0x8a, 0x41,
0x64, 0x00, 0x4e, 0x56, 0xff, 0xfa, 0x01, 0x08,
0x2e, 0x2e, 0x00, 0xb6, 0xd0, 0x68, 0x3e, 0x80,
0x2f, 0x0c, 0xa9, 0xfe, 0x64, 0x53, 0x69, 0x7a]

passwordPad = "".join([chr(b) for b in _passwordPad])

def calculateFileKey(fileId, ownerHash, userHash, permissions,
userPassword = ""):
"""
Calculates the file key for the document as described in
references
(see [PDFE] and [PDFPL]).
"""
md = md5.new()
md.update((userPassword + passwordPad)[:32])
md.update(ownerHash)
md.update(struct.pack("<L", permissions))
md.update(fileId)

fileKey = md.digest()[:5]

return fileKey


def calculateObjectKey(fileKey, objectNumber, generationNumber):
"""
Calculates the key for the object as described in references
(see [PDFE] and [PDFPL]).
"""
md = md5.new()
md.update(fileKey)
md.update(struct.pack("<L", objectNumber)[:3])
md.update(struct.pack("<L", generationNumber)[:2])
objectKey = md.digest()[:10]

return objectKey


class NotEncryptedException(Exception):
"""
The supplied PDF document is not encrypted.
"""


KEY_OWNER_HASH = 'O'
KEY_USER_HASH = 'U'
KEY_PERMISSIONS = 'P'

def getFileKey(doc, userPassword = ""):
"""
Extracts the information required to calculate the file key
from the supplied PDF document.

In most cases `userPassword` can be left empty.
"""
fileId = doc.trailer_dict['ID'][0] # Is the ID always repeated?

try:
encryptDict = doc.dereference(doc.trailer_dict['Encrypt'])
except KeyError:
raise NotEncryptedException
# TODO: Check encryption version is ok. (filter/standard/v/1/r/2)

ownerHash = encryptDict[KEY_OWNER_HASH]
userHash = encryptDict[KEY_USER_HASH]
# `permissions` should be "four-byte integer, LSB first."
permissions = encryptDict[KEY_PERMISSIONS]

fileKey = calculateFileKey(fileId, ownerHash, userHash,
permissions,
userPassword)

# Sanity check user password
assert(decrypt(userHash, fileKey) == passwordPad)

return fileKey


def decrypt(text, key):
"""
Decrypts the supplied object (as a string) with the supplied key.

Returns "plain text" form of object as a string.
"""
return "".join([chr(b)
for b in arcfour(map(ord, text), map(ord, key))])


def showDocumentInfo(doc, fileKey):
"""
Displays the content of the (optionally encrypted) Document
Information
Dictionary for the supplied PDF document.
"""
infoDictRef = doc.trailer_dict['Info']
objectNumber = infoDictRef.obj
generationNumber = infoDictRef.gen

infoDict = doc.dereference(infoDictRef)

objectKey = calculateObjectKey(fileKey, objectNumber,
generationNumber)

for field, encryptedValue in infoDict.iteritems():
if fileKey:
value = decrypt(encryptedValue, objectKey)
else:
value = encryptedValue
print "%s: %s" % (field, value)


if __name__ == "__main__":
try:
filename = sys.argv[1]
except IndexError:
raise SystemExit("Usage %s <filename.pdf>" % sys.argv[0])

doc = PDFdocument(filename)

try:
fileKey = getFileKey(doc)
except NotEncryptedException:
fileKey = ""

showDocumentInfo(doc, fileKey)

Search Discussions

  • Follower at Aug 17, 2004 at 4:13 pm
    Hi,

    I wanted to extract the meta-data from an encrypted/protected PDF file
    and could not find any Python scripts to do this. So, I decided to
    write something myself, the result follows.

    This demonstration utility requires the `pdftools` files from
    <http://www.boddie.org.uk/david/Projects/Python/pdftools/> but the
    decryption functions themselves should be usable with other Python PDF
    libraries.

    Documentation is marginal and all I can say is that worked on the
    three PDF files I tested it on... :-)

    --Phil.

    P.S. The usual Usenet-mangling warning applies--yeah, I know--I should
    put it up on a web site somewhere... :-)

    #!/usr/bin/python
    #
    # Decrypt PDF Info
    #
    # Decrypts PDF files and displays meta-data associated with them. (If
    the
    # file isn't encrypted the information is displayed as is.)
    #
    # The results are similar to xpdf's `pdfinfo` utility.
    #
    # It should be possible to decrypt all of the objects contained
    # in the PDF, but this only reads the Document Information Dictionary.
    #
    # (Note: All the PDF handling is provided by `pdftools`, this just
    adds
    # the ability to deal with encrypted PDF files.)
    #
    # Requires:
    # + pdftools
    # <http://www.boddie.org.uk/david/Projects/Python/pdftools/>
    #
    # Based on:
    # + `pdfdecrypt.pl`
    # <http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/pdfdecrypt.pl>
    [PDFPL]
    #
    # Incorporates:
    # + RC4 from CipherSaber implementation by Ka-Ping Yee
    <ping at lfw.org>
    # <http://www.xs4all.nl/~cg/ciphersaber/comp/python.txt>
    #
    # References:
    # + <http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/anon21jul01-pdf-encryption.txt>
    [PDFE]
    #
    # Author:
    # follower at myrealbox.com (Standing on *many* shoulders...)
    #
    import sys
    import md5
    import struct

    from pdftools import PDFdocument

    def arcfour(input, key):
    """
    Perform the ARCFOUR (RC4) algorithm on a given input list of bytes
    with
    a key given as a list of bytes, and return the output as a list of
    bytes.

    (From CipherSaber implementation by Ka-Ping Yee <ping at lfw.org>
    <http://www.xs4all.nl/~cg/ciphersaber/comp/python.txt>)
    """
    i, j, state = 0, 0, range(256)
    for i in range(256):
    j = (j + state[i] + key[i % len(key)]) % 256
    state[i], state[j] = state[j], state[i]
    i, j, output = 0, 0, []
    for byte in input:
    i = (i + 1) % 256
    j = (j + state[i]) % 256
    state[i], state[j] = state[j], state[i]
    n = (state[i] + state[j]) % 256
    output.append(byte ^ state[n])
    return output


    _passwordPad = [
    0x28, 0xbf, 0x4e, 0x5e, 0x4e, 0x75, 0x8a, 0x41,
    0x64, 0x00, 0x4e, 0x56, 0xff, 0xfa, 0x01, 0x08,
    0x2e, 0x2e, 0x00, 0xb6, 0xd0, 0x68, 0x3e, 0x80,
    0x2f, 0x0c, 0xa9, 0xfe, 0x64, 0x53, 0x69, 0x7a]

    passwordPad = "".join([chr(b) for b in _passwordPad])

    def calculateFileKey(fileId, ownerHash, userHash, permissions,
    userPassword = ""):
    """
    Calculates the file key for the document as described in
    references
    (see [PDFE] and [PDFPL]).
    """
    md = md5.new()
    md.update((userPassword + passwordPad)[:32])
    md.update(ownerHash)
    md.update(struct.pack("<L", permissions))
    md.update(fileId)

    fileKey = md.digest()[:5]

    return fileKey


    def calculateObjectKey(fileKey, objectNumber, generationNumber):
    """
    Calculates the key for the object as described in references
    (see [PDFE] and [PDFPL]).
    """
    md = md5.new()
    md.update(fileKey)
    md.update(struct.pack("<L", objectNumber)[:3])
    md.update(struct.pack("<L", generationNumber)[:2])
    objectKey = md.digest()[:10]

    return objectKey


    class NotEncryptedException(Exception):
    """
    The supplied PDF document is not encrypted.
    """


    KEY_OWNER_HASH = 'O'
    KEY_USER_HASH = 'U'
    KEY_PERMISSIONS = 'P'

    def getFileKey(doc, userPassword = ""):
    """
    Extracts the information required to calculate the file key
    from the supplied PDF document.

    In most cases `userPassword` can be left empty.
    """
    fileId = doc.trailer_dict['ID'][0] # Is the ID always repeated?

    try:
    encryptDict = doc.dereference(doc.trailer_dict['Encrypt'])
    except KeyError:
    raise NotEncryptedException
    # TODO: Check encryption version is ok. (filter/standard/v/1/r/2)

    ownerHash = encryptDict[KEY_OWNER_HASH]
    userHash = encryptDict[KEY_USER_HASH]
    # `permissions` should be "four-byte integer, LSB first."
    permissions = encryptDict[KEY_PERMISSIONS]

    fileKey = calculateFileKey(fileId, ownerHash, userHash,
    permissions,
    userPassword)

    # Sanity check user password
    assert(decrypt(userHash, fileKey) == passwordPad)

    return fileKey


    def decrypt(text, key):
    """
    Decrypts the supplied object (as a string) with the supplied key.

    Returns "plain text" form of object as a string.
    """
    return "".join([chr(b)
    for b in arcfour(map(ord, text), map(ord, key))])


    def showDocumentInfo(doc, fileKey):
    """
    Displays the content of the (optionally encrypted) Document
    Information
    Dictionary for the supplied PDF document.
    """
    infoDictRef = doc.trailer_dict['Info']
    objectNumber = infoDictRef.obj
    generationNumber = infoDictRef.gen

    infoDict = doc.dereference(infoDictRef)

    objectKey = calculateObjectKey(fileKey, objectNumber,
    generationNumber)

    for field, encryptedValue in infoDict.iteritems():
    if fileKey:
    value = decrypt(encryptedValue, objectKey)
    else:
    value = encryptedValue
    print "%s: %s" % (field, value)


    if __name__ == "__main__":
    try:
    filename = sys.argv[1]
    except IndexError:
    raise SystemExit("Usage %s <filename.pdf>" % sys.argv[0])

    doc = PDFdocument(filename)

    try:
    fileKey = getFileKey(doc)
    except NotEncryptedException:
    fileKey = ""

    showDocumentInfo(doc, fileKey)
  • Herb Schilling at Aug 20, 2004 at 1:43 am

    Hi,

    I wanted to extract the meta-data from an encrypted/protected PDF file
    and could not find any Python scripts to do this. So, I decided to
    write something myself, the result follows.

    This demonstration utility requires the `pdftools` files from
    <http://www.boddie.org.uk/david/Projects/Python/pdftools/> but the
    decryption functions themselves should be usable with other Python PDF
    libraries.

    Documentation is marginal and all I can say is that worked on the
    three PDF files I tested it on... :-)

    --Phil.

    P.S. The usual Usenet-mangling warning applies--yeah, I know--I should
    put it up on a web site somewhere... :-)

    #!/usr/bin/python
    #
    # Decrypt PDF Info
    #
    # Decrypts PDF files and displays meta-data associated with them. (If
    the
    # file isn't encrypted the information is displayed as is.)
    #
    # The results are similar to xpdf's `pdfinfo` utility.
    #
    # It should be possible to decrypt all of the objects contained
    # in the PDF, but this only reads the Document Information Dictionary.
    #
    # (Note: All the PDF handling is provided by `pdftools`, this just
    adds
    # the ability to deal with encrypted PDF files.)
    #
    # Requires:
    # + pdftools
    # <http://www.boddie.org.uk/david/Projects/Python/pdftools/>
    #
    # Based on:
    # + `pdfdecrypt.pl`
    # <http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/pdfdecrypt.pl>
    [PDFPL]
    #
    # Incorporates:
    # + RC4 from CipherSaber implementation by Ka-Ping Yee
    <ping at lfw.org>
    # <http://www.xs4all.nl/~cg/ciphersaber/comp/python.txt>
    #
    # References:
    # +
    <http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/anon21jul01-pdf-encryption.txt>
    [PDFE]
    #
    # Author:
    # follower at myrealbox.com (Standing on *many* shoulders...)
    #
    import sys
    import md5
    import struct

    from pdftools import PDFdocument

    def arcfour(input, key):
    """
    Perform the ARCFOUR (RC4) algorithm on a given input list of bytes
    with
    a key given as a list of bytes, and return the output as a list of
    bytes.

    (From CipherSaber implementation by Ka-Ping Yee <ping at lfw.org>
    <http://www.xs4all.nl/~cg/ciphersaber/comp/python.txt>)
    """
    i, j, state = 0, 0, range(256)
    for i in range(256):
    j = (j + state[i] + key[i % len(key)]) % 256
    state[i], state[j] = state[j], state[i]
    i, j, output = 0, 0, []
    for byte in input:
    i = (i + 1) % 256
    j = (j + state[i]) % 256
    state[i], state[j] = state[j], state[i]
    n = (state[i] + state[j]) % 256
    output.append(byte ^ state[n])
    return output


    _passwordPad = [
    0x28, 0xbf, 0x4e, 0x5e, 0x4e, 0x75, 0x8a, 0x41,
    0x64, 0x00, 0x4e, 0x56, 0xff, 0xfa, 0x01, 0x08,
    0x2e, 0x2e, 0x00, 0xb6, 0xd0, 0x68, 0x3e, 0x80,
    0x2f, 0x0c, 0xa9, 0xfe, 0x64, 0x53, 0x69, 0x7a]

    passwordPad = "".join([chr(b) for b in _passwordPad])

    def calculateFileKey(fileId, ownerHash, userHash, permissions,
    userPassword = ""):
    """
    Calculates the file key for the document as described in
    references
    (see [PDFE] and [PDFPL]).
    """
    md = md5.new()
    md.update((userPassword + passwordPad)[:32])
    md.update(ownerHash)
    md.update(struct.pack("<L", permissions))
    md.update(fileId)

    fileKey = md.digest()[:5]

    return fileKey


    def calculateObjectKey(fileKey, objectNumber, generationNumber):
    """
    Calculates the key for the object as described in references
    (see [PDFE] and [PDFPL]).
    """
    md = md5.new()
    md.update(fileKey)
    md.update(struct.pack("<L", objectNumber)[:3])
    md.update(struct.pack("<L", generationNumber)[:2])
    objectKey = md.digest()[:10]

    return objectKey


    class NotEncryptedException(Exception):
    """
    The supplied PDF document is not encrypted.
    """


    KEY_OWNER_HASH = 'O'
    KEY_USER_HASH = 'U'
    KEY_PERMISSIONS = 'P'

    def getFileKey(doc, userPassword = ""):
    """
    Extracts the information required to calculate the file key
    from the supplied PDF document.

    In most cases `userPassword` can be left empty.
    """
    fileId = doc.trailer_dict['ID'][0] # Is the ID always repeated?

    try:
    encryptDict = doc.dereference(doc.trailer_dict['Encrypt'])
    except KeyError:
    raise NotEncryptedException
    # TODO: Check encryption version is ok. (filter/standard/v/1/r/2)

    ownerHash = encryptDict[KEY_OWNER_HASH]
    userHash = encryptDict[KEY_USER_HASH]
    # `permissions` should be "four-byte integer, LSB first."
    permissions = encryptDict[KEY_PERMISSIONS]

    fileKey = calculateFileKey(fileId, ownerHash, userHash,
    permissions,
    userPassword)

    # Sanity check user password
    assert(decrypt(userHash, fileKey) == passwordPad)

    return fileKey


    def decrypt(text, key):
    """
    Decrypts the supplied object (as a string) with the supplied key.

    Returns "plain text" form of object as a string.
    """
    return "".join([chr(b)
    for b in arcfour(map(ord, text), map(ord, key))])


    def showDocumentInfo(doc, fileKey):
    """
    Displays the content of the (optionally encrypted) Document
    Information
    Dictionary for the supplied PDF document.
    """
    infoDictRef = doc.trailer_dict['Info']
    objectNumber = infoDictRef.obj
    generationNumber = infoDictRef.gen

    infoDict = doc.dereference(infoDictRef)

    objectKey = calculateObjectKey(fileKey, objectNumber,
    generationNumber)

    for field, encryptedValue in infoDict.iteritems():
    if fileKey:
    value = decrypt(encryptedValue, objectKey)
    else:
    value = encryptedValue
    print "%s: %s" % (field, value)


    if __name__ == "__main__":
    try:
    filename = sys.argv[1]
    except IndexError:
    raise SystemExit("Usage %s <filename.pdf>" % sys.argv[0])

    doc = PDFdocument(filename)

    try:
    fileKey = getFileKey(doc)
    except NotEncryptedException:
    fileKey = ""

    showDocumentInfo(doc, fileKey)

    --
    Herb Schilling
  • David Boddie at Aug 20, 2004 at 11:50 am
    follower at gmail.com (Follower) wrote in message news:<mailman.1868.1092839116.5135.python-announce-list at python.org>...
    I wanted to extract the meta-data from an encrypted/protected PDF file
    and could not find any Python scripts to do this. So, I decided to
    write something myself, the result follows.

    This demonstration utility requires the `pdftools` files from
    <http://www.boddie.org.uk/david/Projects/Python/pdftools/> but the
    decryption functions themselves should be usable with other Python PDF
    libraries.
    I'm glad that someone found the tools useful. :-)
    Unfortunately, I couldn't find any protected PDF files to test your script
    out on. Can you provide some URLs?
    Documentation is marginal and all I can say is that worked on the
    three PDF files I tested it on... :-)
    Do you want to supply a patch for pdftools to allow these sorts of files
    to be supported transparently?

    David
  • Follower at Aug 21, 2004 at 1:32 am
    davidb at mcs.st-and.ac.uk (David Boddie) wrote in message news:<4de76ee2.0408200350.3ccb9c50 at posting.google.com>...
    Unfortunately, I couldn't find any protected PDF files to test your script
    out on. Can you provide some URLs?
    I found some examples here (protected/encrypted with no user
    password):

    * <http://www.idealsoftware.com/ev_pdf_samples.htm>
    * <http://www.paisano.com/pdf/pdftest.htm>

    It also looks like the following GPL'd tool enables you to encrypt
    without Adobe tools:

    * <http://www.accesspdf.com/pdftk/>
    Do you want to supply a patch for pdftools to allow these sorts of files
    to be supported transparently?
    The code works well enough for me at the moment that my motivation for
    taking it further is a little low at the moment, sorry. :-) I'm not
    sure how straight-forward it would be to add it transparently, as at a
    minimum I think it needs to handle strings and streams differently. I
    will keep your request in mind though...

    Incidentally, have you considered making the Numeric import in
    pdftools optional? I just commented it out and things worked fine for
    me as I wasn't actually accessing page content.

    Thanks for your work with pdftools.

    --Phil.
  • David Boddie at Aug 23, 2004 at 2:27 pm
    follower at gmail.com (Follower) wrote in message news:<3c18c08f.0408201732.6766d375 at posting.google.com>...
    I found some examples here (protected/encrypted with no user
    password):

    * <http://www.idealsoftware.com/ev_pdf_samples.htm>
    * <http://www.paisano.com/pdf/pdftest.htm>
    Thanks. I'll take a look at those sites.
    It also looks like the following GPL'd tool enables you to encrypt
    without Adobe tools:

    * <http://www.accesspdf.com/pdftk/>
    It could be interesting to see exactly what needs to be done to encrypt
    PDF files, but pdftools doesn't support file output at the moment.
    The code works well enough for me at the moment that my motivation for
    taking it further is a little low at the moment, sorry. :-) I'm not
    sure how straight-forward it would be to add it transparently, as at a
    minimum I think it needs to handle strings and streams differently. I
    will keep your request in mind though...
    No problem. I've no great desire to improve it, either. I'm hoping that
    someone will want to take it over and maintain it.
    Incidentally, have you considered making the Numeric import in
    pdftools optional? I just commented it out and things worked fine for
    me as I wasn't actually accessing page content.
    I was trying to extend the module towards support for exporting the
    contents in suitable file formats, and Numeric was used just for
    matrix handling. A streamlined pdftools would just provide a simpler
    wrapper around the objects in the PDF file.
    Thanks for your work with pdftools.
    Thanks for taking it further!

    David

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppython-list @
categoriespython
postedAug 17, '04 at 4:13p
activeAug 23, '04 at 2:27p
posts6
users3
websitepython.org

People

Translate

site design / logo © 2022 Grokbase