I'm trying to build a plain-text parser for different file formats such as MS
Word, Excel, PowerPoint, Rich Text Format, plain text, HTML, PDF, etc.

I use the well-known libraries PDFBox and POI, plus some parts of AtLeap, and
now I need to support the OpenOffice formats and, more importantly, the MSG
format (the MS Outlook message format).

Does anyone know how I can extract plain text from MSG files as simply as POI
does for Office files? Is there perhaps an open-source library for it, like
the ones for PDF or MS Office files?

I need the plain text because the only approach that seems to work for me is
to extract all the plain text from every single document and then add it to
my Lucene index. This is necessary to get the best excerpt from the
highlighter.
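
One way to organise the "extract plain text from every document, then index it" step described above is a small registry that picks an extractor by file extension. The sketch below is only an illustration: `PlainTextExtractors` and `TextExtractor` are hypothetical names, and the two built-in extractors (plain text, and a crude tag-stripping HTML one) stand in for real PDFBox or POI calls, which would be registered the same way.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PlainTextExtractors {
    /** Turns the raw bytes of one document into plain text (hypothetical interface). */
    interface TextExtractor {
        String extract(byte[] content) throws IOException;
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for one file extension, e.g. "pdf" or "msg". */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Looks up the extractor by file extension and returns the plain text. */
    String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        try {
            return extractor.extract(content);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not parse " + fileName, e);
        }
    }

    /** Registry with two stand-in extractors; real format parsers would be added here. */
    static PlainTextExtractors withDefaults() {
        PlainTextExtractors registry = new PlainTextExtractors();
        // Assumes UTF-8 input; a real parser would detect the encoding.
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Crude tag stripping for illustration only; a real HTML parser is the better choice.
        registry.register("html", bytes -> new String(bytes, StandardCharsets.UTF_8)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim());
        return registry;
    }
}
```

The string returned by `extract` is what would then be stored in the index so that the highlighter has the full text to build excerpts from. Supporting another format, MSG included, would only mean registering one more extractor.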

Thanks

Simon Dietschi
--
View this message in context: http://www.nabble.com/Lucene---FileFormat-t1485959.html#a4024568
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Thanh nguyen at Apr 21, 2006 at 3:36 pm
    Hi all,

    Has anyone used Lucene to index WT10G? Can it index
    WT10G in compressed format (.gz), or do we have to unzip
    it first?
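
On the .gz question: Lucene indexes whatever text the application hands it, so the decompression step belongs to the application rather than to Lucene. A minimal sketch, using only the JDK, of reading a gzip file as a character stream without unpacking it to disk first. `GzipTextSource` is a hypothetical name, and ISO-8859-1 is an assumed encoding, not something the collection guarantees.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

class GzipTextSource {
    /** Wraps a gzip-compressed byte stream in a reader that decompresses on the fly. */
    static BufferedReader reader(InputStream compressed) throws IOException {
        // Assumed encoding: ISO-8859-1 maps every byte to a character, so decoding never fails.
        return new BufferedReader(
                new InputStreamReader(new GZIPInputStream(compressed), StandardCharsets.ISO_8859_1));
    }

    /** Opens a .gz file directly; no unzipped copy is written to disk. */
    static BufferedReader reader(Path gzFile) throws IOException {
        InputStream raw = Files.newInputStream(gzFile);
        try {
            return reader(raw);
        } catch (IOException e) {
            raw.close(); // do not leak the file handle if the gzip header is invalid
            throw e;
        }
    }
}
```

The reader can be consumed line by line to split the stream into individual documents before each one is added to the index.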

    Furthermore, does Lucene support the TREC format? I mean,
    can it read a topic file like "<TOP> <NUM> 1
    <TITLE> abc def </TOP>" and produce a results file
    that we can use with the trec_eval program?
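
The TREC side of this can be handled with a little glue code around the search itself. The sketch below assumes the simplified topic shape quoted in the question (real TREC topic files carry more fields and label text), and writes result lines in the six-column layout trec_eval reads: topic id, the literal `Q0`, document number, rank, score, run tag. `TrecIo`, `Topic`, `parseTopics` and `resultLine` are hypothetical names, and the search that produces the document numbers and scores is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrecIo {
    /** One topic: its number and its title text, which serves as the query. */
    static final class Topic {
        final String number;
        final String title;

        Topic(String number, String title) {
            this.number = number;
            this.title = title;
        }
    }

    // Matches only the simplified shape "<TOP> <NUM> n <TITLE> words </TOP>".
    private static final Pattern TOPIC = Pattern.compile(
            "<TOP>\\s*<NUM>\\s*(\\S+)\\s*<TITLE>\\s*(.*?)\\s*</TOP>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Extracts every topic found in the text of a topic file. */
    static List<Topic> parseTopics(String text) {
        List<Topic> topics = new ArrayList<>();
        Matcher m = TOPIC.matcher(text);
        while (m.find()) {
            topics.add(new Topic(m.group(1), m.group(2)));
        }
        return topics;
    }

    /** Formats one retrieved document as a trec_eval results line. */
    static String resultLine(String topic, String docNo, int rank, double score, String runTag) {
        // Locale.ROOT keeps the decimal separator a dot regardless of the system locale.
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s", topic, docNo, rank, score, runTag);
    }
}
```

Each parsed title would be run as a query, and one `resultLine` written per hit, in rank order, to the results file.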

    Any help will be appreciated,
    Thanh
  • Dmitry Goldenberg at Apr 21, 2006 at 4:14 pm
    Simon,

    I wonder if using Zoe might do the trick - http://guests.evectors.it/zoe/
    Have you tried it?

    - Dmitry

Discussion Overview
group: java-user
categories: lucene
posted: Apr 21, '06 at 11:24a
active: Apr 21, '06 at 4:14p
posts: 3
users: 3
website: lucene.apache.org