Hello all,

I'm replying to two threads at once as what I have to say relates to both.

My company recently started an open source project called Aperture
(http://sourceforge.net/projects/aperture), together with the German
DFKI institute. The project is still very much in alpha stage, but I do
believe we already have some code parts that could help people here.

Basically, it's a framework for crawling information sources (file
systems, mail folders, websites, ...) and extracting as much information
from it as possible. Besides full-text extraction, we also put a lot of
effort in extraction and modeling of the metadata occurring in these
sources and document formats. Both parties have some proprietary code
lying on the shelf that is being open sourced and ported to the Aperture
architecture.

Now on to the raised questions:

arnaudbuffet@free.fr wrote:
WordDocument wd = new WordDocument(is);
jwang@dicarta.com wrote:
MS Word - I know that POI exists, but development on the Word portion
seems to have stopped, and there are a lot of nasty looking bugs in
their DB. Since we're involved in dealing with contracts, many of our
Word files are large and complicated. How has everyone's experience
with POI's Word parsing been?
My experience is that the WordDocument class crashes on about 25% of the
documents, i.e. it throws some sort of Exception. I've tested POI
2.5.1-final as well as the current code in CVS, but both produce this
result. I even suspect the output to be 100% the same, but I haven't
verified this.

Another reason I don't like this class is that it operates on an
InputStream and internally creates a POIFSFileSystem which you cannot
access, so that it becomes hard to extract document metadata as well
(for which you need the POIFSFileSystem) without buffering the entire InputStream.
The same applies to TextMining's WordExtractor, which also operates on
top of lower level POI components.

I've recently committed a WordExtractor to Aperture that uses its own
code operating on these lower level POI data structures, which works a
lot better, failing on only 5% of my 300 test docs. I don't pretend to
understand all the internals of the POI APIs, but it Works For Me.

When POI throws an exception, the WordExtractor will revert to applying
a heuristic string extraction algorithm to extract as much
human-readable text as possible from the binary stream, which works
quite well on MS Office files, i.e. the output is reasonably good for
indexing purposes.
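
To give an idea of what such a heuristic fallback can look like, here is a minimal sketch in the style of the Unix "strings" tool: it keeps every run of printable ASCII bytes that reaches a minimum length. This is not the Aperture algorithm, whose details are not given in this thread; the class name HeuristicStrings, the method extract and the minimum run length are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Aperture code.
class HeuristicStrings {

    /** Collects every run of at least minLength printable ASCII bytes. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                // printable ASCII: extend the current run
                current.append((char) c);
            } else {
                // binary byte: close the run and keep it if it is long enough
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        byte[] sample = {1, 2, 'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'd',
                0, 'a', 'b', (byte) 0xFF, 'i', 'n', 'd', 'e', 'x', 3};
        System.out.println(extract(sample, 4));
    }
}
```

A real implementation would also have to deal with 16-bit text, which this sketch ignores.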

Be sure to check out Aperture from CVS, as this code isn't part of the
alpha 1 release. The next official release is expected in a month.

jwang@dicarta.com wrote:
RTF - javax.swing looks fine, we use those classes already.
Swing's RTFEditorKit does indeed work surprisingly well. "Surprisingly"
because in the past I had many issues with it: it typically threw
exceptions on 25-50% of my test documents. Recently I haven't seen a
single one (using Java 1.5.0), so either I am now feeding it an easier
document set or the Swing people have worked on the implementation. If
it is the latter, people using Java 1.4.x may see different results.
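
For readers who have not used it, a minimal sketch of text extraction with Swing's RTFEditorKit could look like the following. The class name RtfText and its method are made up for this example; the Swing classes and the read/getText calls are the standard javax.swing.text API. Checked exceptions are wrapped so the helper is easy to call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper around the standard Swing RTF classes.
class RtfText {

    /** Parses the RTF stream and returns its plain text. */
    static String extract(InputStream in) {
        RTFEditorKit kit = new RTFEditorKit();
        // RTFEditorKit needs a StyledDocument to read into
        DefaultStyledDocument doc = new DefaultStyledDocument();
        try {
            kit.read(in, doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] rtf = "{\\rtf1\\ansi Hello RTF}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(extract(new ByteArrayInputStream(rtf)));
    }
}
```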
Word Perfect - There doesn't seem to be any converters for this format?
I'm actively working on this :) We have some proprietary code that will
become part of Aperture. Right now I cannot say how well it performs in
practice, although we've never had complaints with our proprietary apps.

The code uses a heuristic string extraction algorithm tuned for
WordPerfect documents. This may be an issue, e.g. when you also want to
display the extraction results to end users.

If you're interested: one way you can help me get the most out of it is
by sending me some example WordPerfect documents because I hardly have
those on my hard drive. Fake documents made with very new or old
WordPerfect versions are also most welcome.


Regards,

Chris
http://aduna.biz
--

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Nick Burch at Feb 9, 2006 at 12:29 pm

    On Thu, 9 Feb 2006, Christiaan Fluit wrote:
    My experience is that the WordDocument class crashes on about 25% of the
    documents, i.e. it throws some sort of Exception. I've tested POI
    2.5.1-final as well as the current code in CVS, but both produce this
    result. I even suspect the output to be 100% the same, but I haven't
    verified this.
    You could try using org.apache.poi.hwpf.HWPFDocument, and getting the
    range, then the paragraphs, and grab the text from each paragraph. If
    there's interest, I could probably commit an extractor that does this to
    poi.

    (WordDocument is from the hdf package, which is older and less reliable
    than the current hwpf stuff)
    Another reason I don't like this class is that it operates on an
    InputStream and internally creates a POIFSFileSystem which you cannot
    access, so that it becomes hard to extract document metadata as well
    (for which you need the POIFSFileSystem) without buffering the entire InputStream.
    If you're using HWPFDocument from cvs, then you can create that from a
    POIFSFileSystem.

    Nick

  • Christiaan Fluit at Feb 9, 2006 at 1:13 pm

    Nick Burch wrote:
    You could try using org.apache.poi.hwpf.HWPFDocument, and getting the
    range, then the paragraphs, and grab the text from each paragraph. If
    there's interest, I could probably commit an extractor that does this to
    poi.
    Yes, that's exactly what I'm doing. Having this in POI would benefit me
    a lot though, as I hardly understand the POI basics to be honest (my
    fault, not POI's).

    This is my current code (adapted from Aperture code in CVS):

    import java.io.BufferedReader;
    import java.io.StringReader;
    import java.util.Iterator;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.model.TextPiece;

    // poiFileSystem is an already opened POIFSFileSystem
    HWPFDocument doc = new HWPFDocument(poiFileSystem);
    StringBuffer buffer = new StringBuffer(4096);

    // concatenate the decoded contents of all text pieces
    Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
    while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();

        // the following is derived from
        // http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
            encoding = "UTF-16LE";
        }

        buffer.append(new String(piece.getRawBytes(), encoding));
    }

    // normalize end-of-line characters and remove any lines
    // containing macros
    BufferedReader reader = new BufferedReader(
            new StringReader(buffer.toString()));
    buffer.setLength(0);

    String line;
    while ((line = reader.readLine()) != null) {
        if (line.indexOf("DOCPROPERTY") == -1) {
            buffer.append(line);
            // END_OF_LINE is a constant defined outside this snippet
            buffer.append(END_OF_LINE);
        }
    }

    // fetch the extracted full-text
    String text = buffer.toString();
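
The two steps in the snippet above (decoding each piece with the right charset, then normalizing line ends and dropping DOCPROPERTY lines) can be tried without POI. The following is a minimal, self-contained sketch of just those steps; the class TextPieceSketch and its method names are hypothetical, and the choice of "\n" as the output line separator is an assumption, since END_OF_LINE is not shown in the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone version of the decode and clean-up steps.
class TextPieceSketch {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes raw piece bytes: UTF-16LE for Unicode pieces, Cp1252 otherwise. */
    static String decode(byte[] raw, boolean unicode) {
        return new String(raw, unicode ? StandardCharsets.UTF_16LE : CP1252);
    }

    /** Normalizes \r, \n and \r\n to \n and drops lines containing DOCPROPERTY. */
    static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        try (BufferedReader reader = new BufferedReader(new StringReader(raw))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.contains("DOCPROPERTY")) {
                    out.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode(new byte[] {'H', 0, 'i', 0}, true);
        System.out.print(clean(decoded + "\rDOCPROPERTY Title\r\nSecond\n"));
    }
}
```

BufferedReader.readLine() treats a lone carriage return as a line end, which matters here because Word stores paragraph marks as "\r".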


    Regards,

    Chris
    --

  • Nick Burch at Feb 14, 2006 at 11:03 am

    On Thu, 9 Feb 2006, Christiaan Fluit wrote:
    Yes, that's exactly what I'm doing. Having this in POI would benefit me
    a lot though, as I hardly understand the POI basics to be honest (my
    fault, not POI's).
    OK, that's now in POI (you'll need a scratchpad build from late yesterday
    or today, see http://encore.torchbox.com/poi-cvs-build/ for jars)

    The code is in org.apache.poi.hwpf.extractor.WordExtractor, and it
    supports grabbing all the text at once, or grabbing an array with the
    text of each paragraph.

    If you have any problems/queries/comments on it, then you'll probably get
    a better response on poi-user than here!

    Nick

  • Dmitry Goldenberg at Feb 9, 2006 at 4:10 pm
    Chris,

    Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? And what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, FrameMaker, etc.?

    Thanx,
    - Dmitry

    ________________________________

    From: Christiaan Fluit
    Sent: Thu 2/9/2006 4:09 AM
    To: java-user@lucene.apache.org
    Subject: Re: Word files & Build vs. Buy?



  • Christiaan Fluit at Feb 10, 2006 at 3:05 pm

    Dmitry Goldenberg wrote:
    Awesome stuff. A few questions: is your Excel extractor somehow
    better than POI's? And what do you see as the timeframe for adding
    WordPerfect support? Are you considering supporting any other sources
    such as MS Project, FrameMaker, etc.?
    I just committed a WordPerfectExtractor ;)

    It's based on code developed in-house at Aduna and it seems to work
    quite well on my test collection of WordPerfect documents. Words are
    occasionally split in the middle; I'm still looking into that.

    The test set has a bias for older WordPerfect documents though; I'm
    trying to get my hands on a recent copy of WordPerfect to see if the
    latest format is also supported and to create unit tests for it.

    To interactively test the extractor(s) yourself:

    - check out Aperture from CVS (see
    http://sourceforge.net/cvs/?group_id=150969)
    - run "ant release"
    - go to build\release\bin and execute fileinspector.bat
    - drag any file (WordPerfect or any other format) to see what MIME type
    Aperture thinks it is and to execute the corresponding Extractor, if
    available. The two tabs show the extracted full-text and an RDF dump of
    the metadata. For WordPerfect, only full-text extraction is currently
    supported.

    Our ExcelExtractor is basically nothing more than glue code between POI
    and the rest of our framework, meaning that an application using the
    framework can request an Extractor implementation for
    "application/vnd.ms-excel", feed it an InputStream and get the text and
    metadata back.
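
The "request an Extractor for a MIME type" idea can be pictured as a simple lookup table. The sketch below is not the Aperture API; the names ExtractorRegistrySketch, TextExtractor, register and lookup are all invented here, and it returns only text where the real framework also returns metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of a MIME-type-keyed extractor registry.
class ExtractorRegistrySketch {

    /** Hypothetical extractor contract: stream in, full text out. */
    interface TextExtractor {
        String extract(InputStream in);
    }

    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Returns the extractor for the MIME type, or empty if none is registered. */
    Optional<TextExtractor> lookup(String mimeType) {
        return Optional.ofNullable(byMimeType.get(mimeType));
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();
        // stand-in extractor; a real one would delegate to POI
        registry.register("application/vnd.ms-excel", in -> "spreadsheet text");

        InputStream in = new ByteArrayInputStream(new byte[0]);
        String text = registry.lookup("application/vnd.ms-excel")
                .map(extractor -> extractor.extract(in))
                .orElse("no extractor available");
        System.out.println(text);
    }
}
```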

    The only advantage of our ExcelExtractor over direct use of POI is that,
    when POI throws an Exception on a particular document, it reverts to a
    heuristic string extraction algorithm which is often able to extract
    full-text from a document with reasonable quality, i.e. suited for indexing.
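
The fallback behaviour described here boils down to "try the format-aware parser, and on an exception run the heuristic on the same bytes". A minimal sketch of that control flow, with both extractors passed in as plain functions, might look as follows. FallbackSketch and extractWithFallback are hypothetical names, and catching RuntimeException is an assumption about which failures should trigger the fallback.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical illustration of the parse-then-fallback pattern.
class FallbackSketch {

    /** Runs primary; if it throws, runs fallback on the same input. */
    static String extractWithFallback(byte[] data,
                                      Function<byte[], String> primary,
                                      Function<byte[], String> fallback) {
        try {
            return primary.apply(data);
        } catch (RuntimeException e) {
            // the parser gave up on this document: degrade to the heuristic
            return fallback.apply(data);
        }
    }

    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        String text = extractWithFallback(
                data,
                bytes -> { throw new IllegalStateException("parser failed"); },
                bytes -> new String(bytes, StandardCharsets.US_ASCII));
        System.out.println(text);
    }
}
```

Because the fallback needs to reread the document, the input is held as a byte array here rather than as a one-shot InputStream.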

    We are surely considering supporting more formats. Which ones we will
    work on depends on a number of factors, e.g. availability of open source
    libs for that format, complexity of the file format (we did WordPerfect
    by ourselves), customer demand, code contributions from others, etc. In
    any case, if you need support for format XYZ, you can always send me
    some example files and I'll take a look at how hard it is to add support
    for it.


    Chris
    --


Discussion Overview
group: java-user
categories: lucene
posted: Feb 9, '06 at 12:11p
active: Feb 14, '06 at 11:03a
posts: 6
users: 3
website: lucene.apache.org
