Hi all,

I have a rather nice HTML parser that I got from SourceForge. Does anyone know of any good parsers for PDF and the Microsoft Office suite (.doc, .ppt, .xls, etc.)? Any help would be much appreciated.

Pete Lewis

  • Adriano Labate at May 28, 2003 at 12:03 pm
    The www.textmining.org text extractors work very well for Word and pdf
    documents.
    They use both PDFBox and POI.

    For Excel, using POI directly is very easy. Tell me if you want to see
    code samples.

    I'm looking for a PowerPoint text extractor myself, if you know of one...

    Adriano Labate


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
    For additional commands, e-mail: lucene-user-help@jakarta.apache.org
  • Pete Lewis at May 28, 2003 at 1:02 pm
    Hi Adriano

    Thanks. Code samples would be nice :)

    Will come back if I find something for .ppt.

    Pete

  • Adriano Labate at May 28, 2003 at 1:25 pm
    Pete,

    Here are some samples.

    For Word using Textmining:
    String textContent = new WordExtractor().extractText(inputStream);

    For PDF using Textmining:
    String textContent = new PDFExtractor().extractText(inputStream);

    For Excel using POI:
    (From
    http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=698633)

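    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;

    import org.apache.poi.hssf.usermodel.HSSFCell;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;

    /**
     * Extract text from a Microsoft Excel input stream.
     * @param inputStream the .xls document to read
     * @return The raw text obtained by concatenating all text cells
     *         from top to bottom, left to right.
     * @throws IOException if the stream cannot be read as a workbook
     */
    private static String extractExcelContent(InputStream inputStream)
            throws IOException {
        HSSFWorkbook wb = new HSSFWorkbook(inputStream);
        int nbSheets = wb.getNumberOfSheets();
        StringBuffer content = new StringBuffer(1024);

        for (int i = 0; i < nbSheets; i++) {
            HSSFSheet sheet = wb.getSheetAt(i);
            // getLastRowNum() is the 0-based index of the last row,
            // so the loop below must include it
            int lastRowNum = sheet.getLastRowNum();

            for (int j = 0; j <= lastRowNum; j++) {
                HSSFRow row = sheet.getRow(j);
                if (row == null) // empty row
                    continue;

                boolean isLineFound = false;
                Iterator it = row.cellIterator();
                while (it.hasNext()) {
                    HSSFCell cell = (HSSFCell) it.next();
                    int type = cell.getCellType();

                    // only string cells are kept; numbers and formulas are skipped
                    if (type == HSSFCell.CELL_TYPE_STRING) {
                        content.append(cell.getStringCellValue());
                        content.append(" ");
                        isLineFound = true;
                    }
                }

                if (isLineFound)
                    content.append("\n"); // separate lines/rows
            }
        }

        return content.toString();
    }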

    Adriano


  • Victor Hadianto at May 28, 2003 at 11:12 pm

    Another solution is to use Microsoft Office itself. You can set up a server
    that handles requests to convert Microsoft Office documents. There are many
    ways of doing this, for example using Python to call Office directly and then
    putting your Python script behind a web server.

    Or you can set up a .NET conversion server and call it as a web service,
    among many other interesting techniques.

    victor


  • Pete Lewis at May 29, 2003 at 8:05 am
    Hi Victor

    Thanks.

    In the past I have used the Inso OutsideIn filters and found them very good;
    however, I'd like to come up with a pure Java solution, so if there is a Java
    equivalent to the Inso filters I'd be grateful for any details. Failing that,
    I thought I'd go for individual parsers, initially using the file extension
    to select the correct parser and later adding a file type recogniser for
    files without extensions. Hence my request for good parsers, particularly
    for the most common formats.

    That being said, has anyone come across a PowerPoint parser?

    Pete
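
The selection scheme Pete describes (choose a parser by file extension first, then recognise files without extensions) might be sketched roughly as follows. This is an illustration only, not code from the thread: the class `DocumentTypeGuesser`, its `DocType` enum and all method names are hypothetical. The content check relies on two well-known signatures: PDF files begin with the ASCII bytes `%PDF`, and legacy .doc/.xls/.ppt files all share the OLE2 compound-file header, so sniffing the header alone cannot tell those three apart.

```java
import java.util.Locale;

/** Hypothetical helper (not from the thread): guesses which parser a file needs. */
final class DocumentTypeGuesser {

    enum DocType { PDF, WORD, EXCEL, POWERPOINT, HTML, OLE2_OFFICE, UNKNOWN }

    // Header shared by OLE2 compound files, the container of legacy .doc/.xls/.ppt.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ASCII "%PDF", the start of a PDF file.
    private static final int[] PDF_MAGIC = {0x25, 0x50, 0x44, 0x46};

    /** Step 1: decide from the file extension, case-insensitively. */
    static DocType fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return DocType.UNKNOWN; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "pdf":  return DocType.PDF;
            case "doc":  return DocType.WORD;
            case "xls":  return DocType.EXCEL;
            case "ppt":  return DocType.POWERPOINT;
            case "htm":
            case "html": return DocType.HTML;
            default:     return DocType.UNKNOWN;
        }
    }

    /** Step 2: decide from the first bytes of the file (8 bytes are enough). */
    static DocType fromLeadingBytes(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) return DocType.PDF;
        if (startsWith(head, OLE2_MAGIC)) return DocType.OLE2_OFFICE;
        return DocType.UNKNOWN;
    }

    /** Extension first; fall back to content sniffing only when that fails. */
    static DocType guess(String fileName, byte[] head) {
        DocType byName = fromFileName(fileName);
        return byName != DocType.UNKNOWN ? byName : fromLeadingBytes(head);
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A caller would then map each `DocType` to whichever extractor it has available. An `OLE2_OFFICE` result still needs a second step (for example, trying each Office parser in turn) to find the exact format.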

  • Andrzej Bialecki at May 29, 2003 at 8:08 am

    I'm successfully using Office automation via Jawin (a free Java/COM
    bridge) to convert PPT files. You need to learn a bit about the
    pseudo-object model of PowerPoint to convert the various objects
    properly, but this information can be found at msdn.microsoft.com.

    Obviously I'd love to learn about an alternative, because then I could
    free my clients from dependence on Office... I already use POI to
    convert XLS and DOC files, and it works _very_ well.


    --
    Best regards,
    Andrzej Bialecki

    -------------------------------------------------
    Software Architect, System Integration Specialist
    CEN/ISSS EC Workshop, ECIMF project chair
    EU FP6 E-Commerce Expert/Evaluator
    -------------------------------------------------
    FreeBSD developer (http://www.freebsd.org)




  • Victor Hadianto at May 29, 2003 at 8:27 am

    Hmm, this is a really nice idea; I had never heard of Jawin until now.

    wes



  • Andrzej Bialecki at May 29, 2003 at 8:49 am

    I highly recommend it - it works pretty well, it's stable, mature, and
    most of all free :-) Sure, it has a well-known range of problems, e.g.
    with calls to functions that require structs, but as it happens most of
    the automation interfaces don't use them. I've used it for Java-Windows
    integration on various occasions, solving such "taboo" problems as
    reading/creating Windows shortcuts, file conversion, reading Outlook
    mail etc.

    It also works with DLLs, although this is a bit more involved... It
    uses an extensible marshaller/de-marshaller, so if you know COM pretty
    well you can extend it to handle any conceivable parameter type.

    --
    Best regards,
    Andrzej Bialecki

  • Pete Lewis at May 29, 2003 at 2:14 pm
    Hi guys

    Thanks, Jawin looks really nice :)

    Pete
  • David Warnock at May 29, 2003 at 8:36 am
    Andrzej,

    Another solution for all MS Office formats is to use OpenOffice.org; the
    latest betas have a powerful Java SDK. So, for example, you could script a
    central copy to open MS documents and save them as HTML for parsing in
    Lucene. Or you could save in OpenOffice.org formats (which are zipped XML)
    and throw those at Lucene.

    Dave

    --
    David Warnock, Sundayta Ltd. http://www.sundayta.com
    iDocSys for Document Management. VisibleResults for Fundraising.
    Development and Hosting of Web Applications and Sites.
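
Since the OpenOffice.org formats are zipped XML, the text can be pulled out with nothing but the JDK. The sketch below is an illustration under stated assumptions, not code from the thread: the class `ZippedXmlTextExtractor` and its methods are hypothetical. It assumes the document body lives in a zip entry named `content.xml` (as in OpenOffice.org 1.x files such as .sxw) and that paragraphs and headings use the `text:p` and `text:h` elements. Formatting, tables and metadata are ignored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not from the thread): plain text from a zipped-XML document. */
final class ZippedXmlTextExtractor {

    /** Returns the character data of content.xml, one line per paragraph or heading. */
    static String extractText(InputStream document) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(document)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    return parseContent(zip);
                }
            }
        }
        throw new IOException("no content.xml entry found in archive");
    }

    private static String parseContent(InputStream xml) throws IOException {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: paragraphs and headings are text:p and text:h elements.
                if ("text:p".equals(qName) || "text:h".equals(qName)) {
                    text.append('\n');
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Do not try to load a DTD named in the DOCTYPE; substitute an empty one.
                return new InputSource(new StringReader(""));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IOException("could not parse content.xml", e);
        }
        return text.toString();
    }
}
```

The returned string could then be added to a Lucene document as the contents field, in the same way as the output of any other text extractor.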



  • Andrzej Bialecki at May 29, 2003 at 8:56 am

    Yes, I checked this solution in the past, but (unless something has
    changed drastically) the OpenOffice converters and Java integration are
    tightly coupled with the whole suite, so basically you have to install
    the whole suite (50MB?) just to be able to use the converters. In my
    case (a desktop utility) that would be overkill... However, for
    server-based converters this could make a lot of sense - but then I
    believe you can work directly with the internal OO object model instead
    of XML files.

    And I agree that their Java SDK has almost everything you may want, even
    a nice document bean that allows you to work with a document editor in
    a JComponent.

    --
    Best regards,
    Andrzej Bialecki

  • David Warnock at May 29, 2003 at 11:02 am
    Andrzej,
    Sorry, I am so deep in server mode these days that I forget about
    desktop uses.

    Dave

Discussion Overview
group: java-user
categories: lucene
posted: May 28, '03 at 10:48a
active: May 29, '03 at 2:14p
posts: 13
users: 5
website: lucene.apache.org
