I haven't seen any solid responses come across the wire, and I suspect
there isn't a product or package that will do exactly what you want.
However, our company's product, PDFTextStream, does do a phenomenal job
of extracting text and metadata out of PDF documents. It's crazy-fast,
has a clean API, and in general gets the job done very nicely. It
presents two points of compromise from your ideal situation:
1. It only produces text, so you would have to take the text it
provides and write it out as an RTF yourself (there are tons of
packages and tools that do this). Since the RTF format has pretty weak
formatting capabilities compared to PDF (and even compared to
HTML+CSS), you'd likely never reproduce the original layout/content of
the source document anyway.
2. It is a Java library. You indicated in a later message that you
were aiming to use a Python package if possible just out of personal
preference. Assuming such a thing does not exist, and you are able to
introduce a Java component to your project, this would become a
Let me know what your questions are.
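
For what it's worth, the "write it out as an RTF yourself" step from
point 1 is not much work. Here is a minimal sketch in Python, assuming
you already have the extracted text as a string (from PDFTextStream or
anything else). The function names escape_rtf and text_to_rtf are made
up for illustration and are not part of any library; the output is a
bare single-font document with no attempt at preserving layout.

```python
import struct


def escape_rtf(text):
    """Escape plain text for use in an RTF document body (illustrative helper)."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF control characters.
            out.append("\\" + ch)
        elif ch == "\n":
            # Each newline in the input becomes a paragraph break.
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit one \uN per UTF-16 code unit, where N is a
            # signed 16-bit decimal, followed by '?' as the ASCII fallback.
            raw = ch.encode("utf-16-le")
            for (unit,) in struct.iter_unpack("<H", raw):
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)


def text_to_rtf(text):
    """Wrap escaped text in a minimal RTF header (assumed font: Times New Roman, 12pt)."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + escape_rtf(text) + "}"
```

Writing the result out is then just a matter of saving the string
returned by text_to_rtf(extracted_text) to a .rtf file opened in text
mode with ASCII encoding, since the escaping leaves only ASCII
characters in the output.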
cemerick at snowtide.com
Snowtide Informatics Systems
PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/
Alexander Straschil wrote:
I have to convert an HTML document to rtf with python, was just
for an hour and did find nothing ;-(
Has anybody an Idea how to convert (under Linux) an HTML or Pdf