Dear all,

I am currently using the Lucene Java 2.3.2 demo to index Microsoft Office 2003
and 2007 documents and PDF files.

It is able to parse files with *.pdf, *.doc, *.xls etc.

But searches do not find anything in the Microsoft Office 2007 files, even
though the demo reports that it is indexing *.docx and the other Office 2007
files.

Does Lucene Java support parsing of the extensions *.docx, *.pptx, *.mpp, i.e.
Microsoft Office 2007 documents?

If it does, what should be changed in the Lucene 2.3.2 demo so that queries
search files with the above-mentioned extensions?

Thanks

Kumar


  • Erick Erickson at Jun 27, 2008 at 1:13 pm
    Lucene doesn't actually support any of the document types. What happens
    is that some program is used to parse the files into an indexable stream
    and that stream is indexed. That used to be POI in the old days.

    I confess I haven't used the latest demo, but I assume that under the
    covers there's some program installed that Microsoft documents are
    pushed through to get indexable tokens. So the real question is
    whether that program handles the documents you're interested in.

    I know this isn't very helpful, but you'll have to dig into this in some
    detail if you really want to index Microsoft documents. If you don't
    need to, then you don't need to waste time on this issue.

    Best
    Erick
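
To make the "parse the file into an indexable stream" step concrete, here is a minimal sketch that pulls plain text out of a .docx using only the JDK. It relies on the fact that a .docx file is a ZIP package whose main text lives in `word/document.xml`, inside `w:t` elements. The class name `DocxText` and the helper `sampleDocx` are made up for this illustration, the sample package is stripped down to a single entry, and headers, footnotes and embedded objects are ignored. It is not what the Lucene demo does, and a maintained parser library would handle far more cases.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxText {
    // Namespace of WordprocessingML elements such as <w:p> and <w:t>.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of word/document.xml, one line per paragraph. */
    static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // Reads to the end of the current ZIP entry (Java 9+).
                    return textOf(zip.readAllBytes());
                }
            }
        }
        throw new IOException("no word/document.xml: not a .docx package?");
    }

    private static String textOf(byte[] documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        NodeList paragraphs = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(documentXml))
            .getElementsByTagNameNS(W_NS, "p");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Concatenate every text run of this paragraph.
            NodeList runs = ((Element) paragraphs.item(i))
                .getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    /**
     * Hypothetical helper: builds a stripped-down in-memory .docx that holds
     * only word/document.xml. Paragraphs must not contain XML special characters.
     */
    static byte[] sampleDocx(String... paragraphs) throws IOException {
        StringBuilder xml = new StringBuilder(
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = sampleDocx("Lucene indexes text", "not file formats");
        System.out.print(extract(new ByteArrayInputStream(docx)));
    }
}
```

The string returned by `extract` is the kind of value that would then be handed to Lucene as the contents field of a document. Lucene itself never needs to know that the text came from a .docx.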
  • Hasan Diwan at Jun 27, 2008 at 8:13 pm
    Kumar:
    Assuming you want to index a pre-parsed document...

    2008/6/27 Erick Erickson <erickerickson@gmail.com>:
    If it supports, what should be done in Lucene demo 2.3.2 to search queries
    on file with above mentioned extensions?

    The new ODF-compatible Office 2007 is not supported by POI. However,
    you could write a JNI wrapper around OpenOffice, which does have this
    support.
    --
    Cheers,
    Hasan Diwan <hasan.diwan@gmail.com>

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
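
The split discussed here, between the older binary Office files and the Office 2007 files, can be detected from the first bytes of a file, which lets an indexer route each file to a suitable parser or skip it loudly instead of indexing nothing useful. The sketch below assumes only two well-known signatures: the OLE2 compound-file header used by the older binary formats and the ZIP local-file header used by the Office 2007 packages. The class and enum names are invented for this example. A ZIP signature alone does not prove that a file is an Office document, since any ZIP archive starts the same way.

```java
import java.util.Arrays;

class OfficeFormatSniffer {
    enum Kind { OLE2_BINARY, ZIP_CONTAINER, UNKNOWN }

    // Signature of an OLE2 compound file (older binary Office formats).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature of a ZIP local file header ("PK\3\4").
    private static final byte[] ZIP = { 'P', 'K', 3, 4 };

    /** Classifies a file by its leading bytes; pass at least the first 8. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2)) return Kind.OLE2_BINARY;
        if (startsWith(head, ZIP)) return Kind.ZIP_CONTAINER;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipLike = { 'P', 'K', 3, 4, 20, 0, 0, 0 };
        byte[] plainText = "Dear all".getBytes();
        System.out.println(sniff(zipLike));
        System.out.println(sniff(plainText));
    }
}
```

In an indexing loop, the result of `sniff` could select between a parser for the binary formats and one for the ZIP-based formats, and `UNKNOWN` files could be logged rather than silently indexed as empty.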
  • Nick Burch at Jun 28, 2008 at 3:08 pm

    On Fri, 27 Jun 2008, Hasan Diwan wrote:
    The new ODF-compatible Office 2007 is not supported by POI.

    Actually, it is, just not the version in trunk. You can download nightly
    builds of the ooxml branch from
    http://encore.torchbox.com/poi-svn-build/OOXML-Branch/

    And there ought to be a formal beta release very soon now.

    Nick


Discussion Overview
group: java-user
categories: lucene
posted: Jun 27, '08 at 11:08a
active: Jun 28, '08 at 3:08p
posts: 4
users: 4
website: lucene.apache.org
