FAQ
Hi,



I have a PDF Parser which uses PDFBox libary to parse PDF documents into
plain text.

I have tried this parser by sending the output directly to the
commandline and it works, I
get the plain text, like I get it with my HTMLParser.

But there is a problem with the indexing, I think:

I can only find the document with lucene with words which occur before
the first dot. With
a word which occurs after the first "." Lucene doesn't find the related
document. So I think
these words are not indexed. I do not use any special analyzer and I
cannot understand
why it does not work.



The indexing with html files work, there I can find words after the
first ".".

Its also possible that the missing indexing is not related to the "."
Character, but to

The first newline (\n). I don't know.



Have you got any ideas for making pdf-index work?



Siegfried Puchbauer

Search Discussions

  • Otis Gospodnetic at Sep 1, 2004 at 11:39 pm
    Siegfried,

    My guess is that the '.' is accidental and there is nothing special
    about '.'. I've used PDFBox+Lucene and it worked well. Are you aware
    of Lucene-specific classes included in the PDFBox distribution? You
    may be able to use those classes, or you could at least look at their
    source, to make sure you're using both PDFBox and Lucene API correctly.

    Also, how long is the extracted text?
    See this:
    http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#maxFieldLength
    Maybe that's the problem.

    Otis


    --- Siegfried Puchbauer wrote:
    Hi,



    I have a PDF Parser which uses PDFBox libary to parse PDF documents
    into
    plain text.

    I have tried this parser by sending the output directly to the
    commandline and it works, I
    get the plain text, like I get it with my HTMLParser.

    But there is a problem with the indexing, I think:

    I can only find the document with lucene with words which occur
    before
    the first dot. With
    a word which occurs after the first "." Lucene doesn't find the
    related
    document. So I think
    these words are not indexed. I do not use any special analyzer and I
    cannot understand
    why it does not work.



    The indexing with html files work, there I can find words after the
    first ".".

    Its also possible that the missing indexing is not related to the "."
    Character, but to

    The first newline (\n). I don't know.



    Have you got any ideas for making pdf-index work?



    Siegfried Puchbauer



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
    For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedSep 1, '04 at 2:50p
activeSep 1, '04 at 11:39p
posts2
users2
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase