Hello!

I have to convert an HTML document to RTF with Python. I was googling
for an hour and found nothing ;-(
Does anybody have an idea how to convert (under Linux) an HTML or PDF document
to RTF?

Thanks, AXEL


  • Cameron Laird at Dec 14, 2004 at 2:08 pm
    Are you trying to convert one document in particular, or automate the
    process of converting arbitrary HTML documents? What computing host is
    available to you--Win*? Linux? MacOS? Solaris!? Is Word installed?
    OpenOffice? Why have you specified Python?
  • Axel Straschil at Dec 14, 2004 at 3:02 pm
    Hello!

    Sorry Cameron, I was replying, now my follow-up ;-):
    > Are you trying to convert one document in particular, or automate the
    > process of converting arbitrary HTML documents?
    I have a small CMS where the customer has the possibility to view
    certain HTML pages as PDF; the CMS is Python-based. I also thought about
    passing the URL to an external converter script, but found nothing ;-(

    > What computing host is available to you--Win*? Linux? MacOS? Solaris!?
    Linux
    > Is Word installed?
    No.
    > OpenOffice?
    Yes.
    > Why have you specified Python?
    Because I like Python ;-)
    The system behind generating the HTML code is written in Python.

    Thanks,
    AXEL.
  • Cameron Laird at Dec 14, 2004 at 5:08 pm
    In article <slrncru049.bs9.axel at m2.sine>,
    Axel Straschil wrote:
    >> Why have you specified Python?
    > Because I like Python ;-)
    > The system behind generating the HTML code is written in Python.
    That's a fine reason to use Python. It helps me to know, though.

    I do a lot of this sort of thing--automation of conversion between
    different Web display-formats. I don't have a one-line answer for
    the particular one you describe, but it's certainly feasible.

    I'm willing to bet there's an HTML-to-RTF converter available for
    Linux, but I've never needed one (more accurately: I have written my
    own for special purposes--for my situations, it hasn't been
    difficult), so I can't say for sure. My first step would be to
    look for such an application. Failing that, I'd script OpenOffice
    (with Python!) to read the HTML, and SaveAs RTF.

    I list a few PDF-to-RTF converters in <URL:
    http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#RTF >.
    Again, I think there are more, but haven't yet made the time to
    hunt them all down.
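
The "script OpenOffice to read the HTML and save as RTF" idea above can be sketched roughly as follows. This is only a sketch under assumptions: it assumes an office suite whose `soffice` launcher accepts the LibreOffice-style `--headless --convert-to` options (an OpenOffice install from the time of this thread may need UNO scripting instead), and the helper names `build_convert_command` and `convert_html_to_rtf` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(html_path, out_dir, soffice="soffice"):
    """Build the command line for a headless HTML -> RTF conversion.

    Assumption: `soffice` understands the LibreOffice-style
    --headless / --convert-to / --outdir options.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", "rtf",
        "--outdir", str(out_dir),
        str(html_path),
    ]


def convert_html_to_rtf(html_path, out_dir, soffice="soffice", timeout=120):
    """Run the conversion and return the path where the RTF is expected."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError("office launcher not found: %s" % soffice)
    subprocess.run(
        build_convert_command(html_path, out_dir, soffice),
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # starting the office suite can be slow
    )
    # The converter keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(html_path).stem + ".rtf")
```

Starting the office suite for every request is slow, so for a CMS it is probably worth caching the generated RTF files rather than converting on each page view.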
  • Chas Emerick at Dec 15, 2004 at 10:50 pm
    I haven't seen any solid responses come across the wire, and I suspect
    there isn't a product or package that will do exactly what you want.

    <blatent_self_promotion>
    However, our company's product, PDFTextStream does do a phenomenal job
    of extracting text and metadata out of PDF documents. It's crazy-fast,
    has a clean API, and in general gets the job done very nicely. It
    presents two points of compromise from your ideal situation:

    1. It only produces text, so you would have to take the text it
    provides and write it out as an RTF yourself (there are tons of
    packages and tools that do this). Since the RTF format has pretty weak
    formatting capabilities compared to PDF (and even compared to
    HTML+CSS), you'd likely never reproduce the original layout/content of
    the source document anyway.

    2. It is a Java library. You indicated in a later message that you
    were aiming to use a python package if possible just out of personal
    preference. Assuming such a thing does not exist, and you are able to
    introduce a Java component to your project, this would become a
    non-issue.
    </blatent_self_promotion>

    Let me know what your questions are.

    Chas Emerick
    cemerick at snowtide.com
    Snowtide Informatics Systems

    PDFTextStream: fast PDF text extraction for Java apps and Lucene
    http://snowtide.com/home/PDFTextStream/
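
For the "take the text and write it out as RTF yourself" step in point 1, a minimal hand-rolled writer might look like the sketch below. It does not use PDFTextStream or any RTF package; the function names `rtf_escape` and `text_to_rtf`, the default font, and the rule that a blank line starts a new paragraph are all assumptions made for illustration.

```python
def rtf_escape(text):
    """Escape plain text for use inside an RTF document body."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)          # RTF special characters
        elif ch == "\n":
            out.append("\\line ")          # line break inside a paragraph
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: \uN? with N a signed 16-bit UTF-16 code unit,
            # "?" being the fallback for readers without Unicode support.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                n = int.from_bytes(units[i:i + 2], "little")
                out.append("\\u%d?" % (n - 65536 if n > 32767 else n))
    return "".join(out)


def text_to_rtf(text, font="Times New Roman", size_pt=12):
    """Wrap plain text in a minimal RTF document.

    Assumption: paragraphs in the input are separated by blank lines.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "".join(rtf_escape(p) + "\\par\n" for p in paragraphs)
    # \fs takes the font size in half-points, hence size_pt * 2.
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 %s;}}\\f0\\fs%d " % (
        font, size_pt * 2)
    return header + body + "}"
```

Because every non-ASCII character is escaped, the result is pure ASCII and can be written with `open(path, "w", encoding="ascii")`.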


  • Axel Straschil at Dec 16, 2004 at 7:30 am
    Hello!
    > However, our company's product, PDFTextStream does do a phenomenal job of
    > extracting text and metadata out of PDF documents. It's crazy-fast, has a
    > clean API, and in general gets the job done very nicely. It presents two
    > points of compromise from your ideal situation:
    > 1. It only produces text, so you would have to take the text it provides and
    > write it out as an RTF yourself (there are tons of packages and tools that do
    > this). Since the RTF format has pretty weak formatting capabilities compared
    I've got the input source in HTML; the problem is converting from any format
    to RTF. Please give me a hint where the tons of packages are.

    Thanks,
    AXEL.
    --
    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
    "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
    interpreted as described in RFC 2119 [http://ietf.org/rfc/rfc2119.txt]
  • Mike Meyer at Dec 16, 2004 at 6:01 pm

    Axel Straschil <axel at straschil.com> writes:

    > I've got the input source in HTML; the problem is converting from any
    > format to RTF. Please give me a hint where the tons of packages are.
    That's easy. Load the HTML in MS Word, and save it as RTF. Script it
    via COM using the python win32all (I think that's what it's now
    called) package.

    <mike
    --
    Mike Meyer <mwm at mired.org> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
  • Axel Straschil at Dec 16, 2004 at 7:30 pm
    Hello!
    > That's easy. Load the HTML in MS Word, and save it as RTF. Script it
    > via COM using the python win32all (I think that's what it's now
    > called) package.
    As I wrote in my posting and the subject: Linux ;-)
    I could try to do this with OpenOffice, but I'm afraid this will not
    be a performant solution ;-(
    I really spent hours on that; the only thing I found was a
    specification for Rich Text, maybe a good point to start a project ...

    Regards,
    AXEL.
  • Stephen Thorne at Dec 17, 2004 at 1:00 am

    On Thu, 16 Dec 2004 19:30:37 +0000 (UTC), Axel Straschil wrote:
    > As I wrote in my posting and the subject: Linux ;-)
    > I could try to do this with OpenOffice, but I'm afraid this will not
    > be a performant solution ;-(
    > I really spent hours on that; the only thing I found was a
    > specification for Rich Text, maybe a good point to start a project ...
    I've been able to successfully get Konqueror to generate a PDF from an
    HTML file via dcop. It's something along the lines of:
    % dcop konqueror-25827 html-widget1 print 1
    You can launch Konqueror in an Xvfb (X Virtual Framebuffer), then communicate
    via dcop to send commands to the browser (load this URL, print this
    page, etc).

    I've been investigating doing the same feat using JS/XUL/etc in
    Mozilla. It probably is possible. There's lots of documentation about
    the XPCOM API available from http://xulplanet.com/

    As for converting to RTF, someone has already pointed out PyRTF.

    Regards,
    Stephen Thorne
  • Axel Straschil at Dec 17, 2004 at 7:55 am
    Hello!
    > I've been able to successfully get konqueror to generate a pdf from a
    > html file via dcop. It's something along the lines of:
    For that stuff, I'm using htmldoc (http://www.htmldoc.org/).

    Regards,
    AXEL.
  • Stephen Thorne at Dec 18, 2004 at 12:12 am

    On Fri, 17 Dec 2004 07:55:10 +0000 (UTC), Axel Straschil wrote:
    > Hello!
    >> I've been able to successfully get konqueror to generate a pdf from a
    >> html file via dcop. It's something along the lines of:
    > For that stuff, I'm using htmldoc (http://www.htmldoc.org/).
    I found htmldoc and every other open-source, purpose-built HTML->PDF
    converter to be deficient enough to discourage us from using them. For
    our requirements, only web browsers had the quality of rendering
    required.

    Stephen.
  • Mike Meyer at Dec 17, 2004 at 12:28 am

    Axel Straschil <axel at straschil.com> writes:

    > Hello!
    >> That's easy. Load the HTML in MS Word, and save it as RTF. Script it
    >> via COM using the python win32all (I think that's what it's now
    >> called) package.
    > As I wrote in my posting and the subject: Linux ;-)
    > I could try to do this with OpenOffice, but I'm afraid this will not
    > be a performant solution ;-(
    > I really spent hours on that; the only thing I found was a
    > specification for Rich Text, maybe a good point to start a project ...
    Sorry. I forgot the original subject.

    You might take a look at PyRTF in PyPI. It's still in beta,
    though. But coupled with HTMLParser.py, it might be enough to
    get you where you need to go.

    <mike
    --
    Mike Meyer <mwm at mired.org> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
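
To give a feel for the HTMLParser half of this suggestion, here is a rough sketch that walks the HTML with the standard library parser (`html.parser` in Python 3, the successor of HTMLParser.py) and emits RTF control words by hand. It deliberately does not use PyRTF, since its API is not shown in this thread. The class name `HtmlToRtf`, the function `html_to_rtf`, and the small set of supported tags are assumptions for illustration; tables, images, links and CSS are ignored.

```python
import re
from html.parser import HTMLParser


def rtf_escape(text):
    # Escape RTF specials; non-ASCII becomes \uN? (signed UTF-16 units).
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                n = int.from_bytes(units[i:i + 2], "little")
                out.append("\\u%d?" % (n - 65536 if n > 32767 else n))
    return "".join(out)


class HtmlToRtf(HTMLParser):
    # Inline tags mapped to (switch on, switch off) RTF control words.
    INLINE = {
        "b": ("\\b ", "\\b0 "), "strong": ("\\b ", "\\b0 "),
        "i": ("\\i ", "\\i0 "), "em": ("\\i ", "\\i0 "),
        "u": ("\\ul ", "\\ulnone "),
    }
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP = {"script", "style", "title"}  # content that is not body text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag][0])
        elif tag == "br":
            self.parts.append("\\line ")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag][1])
        elif tag in self.BLOCK:
            self.parts.append("\\par\n")  # end of a block ends the paragraph

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse runs of whitespace, roughly as a browser would.
            self.parts.append(rtf_escape(re.sub(r"\s+", " ", data)))


def html_to_rtf(html):
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + "".join(parser.parts) + "}"
```

For example, `html_to_rtf("<p>Hello <b>world</b>!</p>")` produces a document whose body contains `Hello \b world\b0 !\par`. Whether this is enough depends on how much of the HTML layout has to survive; for anything beyond simple text markup, a real RTF library or an office suite is probably the better route.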
  • Axel Straschil at Dec 17, 2004 at 8:03 am
    Hello!
    > You might take a look at PyRTF in PyPI. It's still in beta,
    I think PyRTF would be the right choice, thanks. Just had a short look
    at it.

    Regards,
    AXEL.

Discussion Overview
group: python-list
category: python
posted: Dec 14, '04 at 12:00p
active: Dec 18, '04 at 12:12a
posts: 13
users: 6
website: python.org
