Hi,

For indexing PDF files we have to undo word hyphenation. The basic idea
is simply to remove a hyphen when it is followed by a line break and a
lowercase letter. Of course this approach isn't 100% foolproof, but
checking against a dictionary wouldn't be either...

Since we also face this problem when highlighting with HTMLCharStripper
(yes, we do have hyphenation in our HTML docs...), it seems to me that I
have to adjust the JFlex-generated StandardTokenizerImpl.

Is this the right approach, and how would I have to modify this script?

Thanks
Wulf


PS: I see that changes were made in the brand-new 3.1.0 version (we are
using 3.0.3), but as far as I understand none of them are relevant in
this respect.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
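
The heuristic described in the question (drop a hyphen when a line break and a lowercase letter follow) can be sketched on raw extracted text with a regular expression. This is only an illustration under stated assumptions: the class name LineBreakDehyphenator and the method dehyphenate are made up for this example and are not part of Lucene or Solr, and the sketch works on a plain String before any tokenizer sees the text.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene/Solr class.
final class LineBreakDehyphenator {
    // A hyphen, optional trailing spaces/tabs, one line break, optional
    // indentation, and then (checked by lookahead only) a lowercase letter.
    private static final Pattern HYPHEN_AT_LINE_END =
            Pattern.compile("-[ \\t]*\\r?\\n[ \\t]*(?=\\p{Ll})");

    // Removes the hyphen and the line break so both word halves are joined.
    static String dehyphenate(String text) {
        return HYPHEN_AT_LINE_END.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(dehyphenate("undo word hyphen-\nation"));
    }
}
```

As the question already notes, this is not foolproof: a genuine compound such as "well-" followed by "known" on the next line is joined to "wellknown", while "PDF-" followed by "Files" is left alone because the next letter is uppercase. Removing characters also shifts character positions, which matters if the same text is used for highlighting.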


  • Yonik Seeley at Apr 1, 2011 at 4:24 pm
    Solr has a hyphenated word filter you could copy.
    http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html

    On trunk, this has been folded into the analysis module.

    -Yonik
    http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
    25-26, San Francisco
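
To make the idea behind a token-level hyphenation filter concrete, here is a minimal plain-Java sketch that splits on whitespace (so a line break separates "hyphen-" from "ation") and glues a token ending in a hyphen to the token that follows. It is not the Solr HyphenatedWordsFilter source; the class name HyphenJoiner and the method joinHyphenated are hypothetical, and unlike the heuristic in the question this sketch checks neither for a line break nor for a lowercase letter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the Solr implementation.
final class HyphenJoiner {
    static List<String> joinHyphenated(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace produces one empty element
            }
            if (token.length() > 1 && token.endsWith("-")) {
                // Hold back the word part without its trailing hyphen.
                pending.append(token, 0, token.length() - 1);
            } else {
                // Complete the pending word (if any) with this token.
                out.add(pending.append(token).toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            // Input ended right after a hyphen: restore it instead of guessing.
            out.add(pending + "-");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinHyphenated("undo word hyphen-\nation in PDF files"));
    }
}
```

In a real analysis chain the same logic would live in a TokenFilter placed after a whitespace-based tokenizer, which is the arrangement the linked factory documentation describes.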
  • Wulf Berschin at Apr 4, 2011 at 12:21 pm
    Thank you, Yonik, for this hint. (Again, I wasn't aware that Solr
    obviously offers useful extensions for the Lucene indexing process, and
    I wonder why they haven't been added to Lucene itself.)

    Anyway, since the HyphenatedWordsFilter needs newlines in the input, I
    will have to use a Tokenizer other than StandardTokenizer. If I simply
    use the WhitespaceTokenizerFactory (as suggested by
    HyphenatedWordsFilterFactory), I will lose the punctuation handling done
    by StandardTokenizer, right? What will I have to borrow for that? Or do
    I have to extend StandardTokenizerImpl.jflex?

    Wulf
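
The thread does not answer this follow-up. One possible direction, sketched here purely as an assumption, is to keep whitespace tokenization, join hyphenated tokens first, and only then strip punctuation from the edges of each token. The class PunctuationTrimmer and its trim method are hypothetical names for this example; the sketch does not reproduce the StandardTokenizer grammar and only removes leading and trailing characters that are neither letters nor digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical post-processing step for whitespace-separated tokens.
final class PunctuationTrimmer {
    // A leading or trailing run of characters that are neither letters nor digits.
    private static final Pattern EDGE_NON_ALNUM =
            Pattern.compile("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$");

    static List<String> trim(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String cleaned = EDGE_NON_ALNUM.matcher(token).replaceAll("");
            if (!cleaned.isEmpty()) {
                out.add(cleaned); // tokens made only of punctuation are dropped
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trim(Arrays.asList("(yes,", "docs...)", "--")));
    }
}
```

The order of the two steps matters: trimming before joining would remove the trailing hyphen that marks a word as hyphenated, so the join step would no longer recognize it.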



Discussion Overview
group: java-user
category: lucene
posted: Apr 1, '11 at 3:50p
active: Apr 4, '11 at 12:21p
posts: 3
users: 2
website: lucene.apache.org
