Hi,

I am currently working on "Information retrieval from semi-structured
documents", which requires reading data from resumes.

Could anyone tell me whether there is a Python API for reading Word documents?


Thanks and regards,
Shailja

=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you




  • Kushal Kumaran at May 13, 2009 at 12:37 pm

    If you're using Windows, you can use the COM APIs to read Word documents.
    Alternatively, you can drive OpenOffice.org through UNO. You can find
    examples of either by googling.

    --
    kushal
  • Norseman at May 13, 2009 at 8:55 pm

    One problem that I keep getting with OOo and UNO and Python: when asked
    to output a .txt file, it comes out sorta pk-zipped. Same for the .csv files
    it outputs. If you can, I suggest you work with Microsoft's COM. I have
    had better luck there. Not much, but better. I usually get a real .txt file.

    For what it is worth, in OOo I did make some progress by creating a
    macro that writes out the text, setting it to run on EVERY file OOo
    opens, and then closing OOo after the write. Then I batched the OOo file.doc
    process with a:

    files2process.sh #files2process.bat in window$
    ================
    swriter file1.doc
    swriter file2.doc
    .
    .

    not very elegant, but it worked for me.


    To be honest, these days I just give those to a clerk and let them point
    and click until done. Less frustrating. The documentation is bad for both.


    Steve
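
The batch file described above could also be generated from Python. This is only a minimal sketch, assuming that `swriter` is on the PATH and that the write-and-close macro is already set up as Norseman describes. The helper name `build_batch_lines` and the file names are hypothetical.

```python
import os


def build_batch_lines(paths, command="swriter"):
    """Return one '<command> <file>' line per .doc path, in sorted order."""
    doc_paths = sorted(p for p in paths if p.lower().endswith(".doc"))
    return ["%s %s" % (command, p) for p in doc_paths]


if __name__ == "__main__":
    # Hypothetical usage: list the .doc files in the current directory.
    for line in build_batch_lines(os.listdir(".")):
        print(line)
```

File names containing spaces would need quoting before the lines are written to a .sh or .bat file.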
  • Tim Golden at May 13, 2009 at 1:01 pm

    If you haven't already, get hold of the pywin32 extensions:

    http://pywin32.sf.net

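    <code>
    import win32com.client

    # Attach to the document through COM (needs Word and pywin32 installed).
    doc = win32com.client.GetObject ("c:/temp/temp.doc")
    # The whole document body as a single string.
    text = doc.Range ().Text

    </code>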

    Note that this will give you a unicode object with \r line-delimiters.
    You could read para by para if that were more useful:

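    <code>
    import win32com.client

    # Attach to the document through COM (needs Word and pywin32 installed).
    doc = win32com.client.GetObject ("c:/temp/temp.doc")
    # One entry per paragraph in the document.
    lines = [p.Range () for p in doc.Paragraphs]

    </code>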

    TJG
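
Since the text comes back with \r as the delimiter, plain string handling may be enough to get at the paragraphs once the text has been fetched. A minimal Python 3 sketch that does not need Word; `split_paragraphs` and the sample string are hypothetical and not part of pywin32.

```python
def split_paragraphs(text):
    """Split Word-style text on '\\r' and drop empty paragraphs."""
    return [p for p in text.split("\r") if p.strip()]


# Hypothetical sample standing in for doc.Range().Text
sample = "Name: A. Person\r\rSkills: Python\r"
print(split_paragraphs(sample))
```

Dropping empty entries is a design choice: blank paragraphs are common in resumes and rarely carry information, but keep them if the spacing matters for your parsing.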
  • Norseman at May 13, 2009 at 9:00 pm

    I saw this right after responding to Kushal's 5:37 AM post today.

    Thank you for the tip. I'll try these first chance I get.
    Word, swriter, whatever - I'm not partial when it comes to automating.


    Today is: 20090513

    Steve
  • Norseman at May 13, 2009 at 9:48 pm

    Interesting:

    I did try these.

    Doc at once:
    outputs two x'0D' and then the file. Then it appends x'0D' x'0D' x'0A' x'0D'
    x'0A' to the end of the file, even though the source file itself has no EOL.
    (EOL is end-of-line, a.k.a. newline.)

    At the beginning: cr cr            (there are two blank lines)
    At the end:       cr cr lf cr lf   (there is no EOL in the source)
    Any idea what those are about?
    One crlf is probably from Python's "print text", but the other?

    The "lines =" version:
    prepends [u'\r', u'\r', u" to the beginning of the output
    and appends \r"] x'0D' x'0A' to the end, even though there is no EOL in the source.

    That output is understood: u'\r' is the Apple EOL;
    the crlf is probably from "print lines".

    Programmers searching for specifics take note. The output is cooked.
    I don't have any "weird things" in the test file. (no font changes, no
    subscripts, etc) Might be best to take a real good look at a test file
    before assuming anything.

    But, having an idea of what the extras are makes it somewhat easier to
    allow for.


    Steve
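
The guess that print accounts for a trailing crlf can be checked by simulating a Windows text-mode stream. This is a sketch only, written for Python 3 and assuming the redirected output went through the usual \n to \r\n translation. `printed_bytes` is a hypothetical helper, and it does not explain the leading x'0D' bytes.

```python
import io


def printed_bytes(text, newline="\r\n"):
    """Return the bytes that print(text) writes to a text stream
    which translates '\\n' into the given newline sequence."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    print(text, file=stream)   # print appends '\n' after the text
    stream.flush()
    data = raw.getvalue()
    stream.detach()            # keep the BytesIO open when the wrapper goes away
    return data


# Text ending in a '\r' paragraph mark, as reported in the thread.
print(repr(printed_bytes("abc\r")))
```

Text that ends in \r therefore ends in cr cr lf once printed this way, which would account for the first three of the five trailing bytes reported above.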
  • Tim Golden at May 14, 2009 at 2:57 pm

    Not clear what you're doing to get there. This is the (wrapped)
    output from my interpreter, using Word 2003. As you can see, new
    doc: one "\r", nothing more.

    <dump>
    Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32
    bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more
    information.
    >>> import win32com.client
    >>> word = win32com.client.gencache.EnsureDispatch ("Word.Application")
    >>> doc = word.Documents.Add ()
    >>> print repr (doc.Range ().Text)
    u'\r'
    >>>

    </dump>
  • Norseman at May 14, 2009 at 4:23 pm

    The original "do it this way" snippets were:
    <code>
    import win32com.client

    doc = win32com.client.GetObject ("c:/temp/temp.doc")
    text = doc.Range ().Text

    </code>

    Note that this will give you a unicode object with \r line-delimiters.
    You could read para by para if that were more useful:

    <code>
    import win32com.client

    doc = win32com.client.GetObject ("c:/temp/temp.doc")
    lines = [p.Range () for p in doc.Paragraphs]

    </code>


    and I added:

    "print text" after the "text =" line above

    "print lines" after the "lines =" line above

    then ran the file using: python test.py >letmesee
    followed by viewing letmesee in hex



    Steve
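
Redirected print output picks up its own line ending, so an alternative is to look at the bytes of the string directly, much like the repr call in Tim's dump. A small Python 3 sketch; `hex_bytes` is a hypothetical helper and UTF-8 is an assumed encoding.

```python
def hex_bytes(text, encoding="utf-8"):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join("%02X" % b for b in text.encode(encoding))


# A leading and trailing '\r' show up as 0D with nothing added by print.
print(hex_bytes("\rabc\r"))
```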
  • Tim Golden at May 14, 2009 at 8:40 am
    [forwarding back to the list]
    Please reply to the list: I'm not the only person
    who can help, and I might not have the time even
    if I can.

    Shailja Gulati wrote:
    > I have installed win32com but am still not able to run that code, as it is
    > giving this error:
    >
    > File "readDocPython.py", line 1, in ?
    >   import win32com.client
    > File "C:\pywin32-212\com\win32com\__init__.py", line 5, in ?
    >   import win32api, sys, os
    > ImportError: No module named win32api
    >
    > I have even installed win32api.dll in the System folder but
    > still can't do it. Any further help?

    I don't know how you've installed it, but I really don't expect
    to see it in c:\pywin32-212.... Download *and run* the
    .exe installer from here, choosing the one which corresponds
    to your Python installation:

    http://sourceforge.net/project/platformdownload.php?group_id=78018

    Don't try to download the source -- the zip -- and add it to
    sys.path, which is the only thing I can imagine you've done there.


    TJG
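
A quick way to see whether the installer put pywin32 where the running interpreter can find it is to ask the import system directly. A minimal Python 3 sketch; `missing_modules` is a hypothetical helper, and the module names are the ones from the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_modules(["win32api", "win32com"])
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("pywin32 looks importable")
```

Run it with the same interpreter that runs the failing script. An installer built for a different Python version would leave both names missing.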
  • Shailja Gulati at May 14, 2009 at 8:49 am
    Sorry about mailing you directly, Tim. It just happened by mistake.

    Regarding win32api, I am still facing the same ImportError. Could
    anyone please help? I am stuck.

  • Tim Golden at May 14, 2009 at 9:08 am

    Shailja, did you download and run the .exe installer
    from the link given earlier in the thread?

    TJG

Discussion Overview
group: python-list
categories: python
posted: May 13, '09 at 10:58a
active: May 14, '09 at 4:23p
posts: 11
users: 4
website: python.org
