Hi,

Is there a Tokenizer in Lucene that tokenizes XML correctly?

That is, one that turns the following XML:
<span>this is <span attr="foo">example</span>text.</span>

into these tokens (or similar):
<span> | this | is | <span attr="foo"> | example | </span> | text. | </span>

Or would I need to write such a Tokenizer myself?

Regards
Christoph Hermann

--
Christoph Hermann
Institut für Informatik
Tel: +49 761-203-8171 Fax: +49 761-203-8162
e-mail: hermann@informatik.uni-freiburg.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Erick Erickson on Oct 15, 2010 at 6:21 pm
    Well, it's hard to say what "correctly" would be. Remove all
    XML? Preserve attributes? Preserve tags? Put the attributes
    and values into fields in the document? My point is that there's
    no obviously "correct" parsing.

    But if you just want to strip out all the <....>, it seems like
    PatternTokenizer might work for you...

    HTH
    Erick
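
    To make the two readings of "correctly" concrete, here is a minimal sketch in plain Java using java.util.regex. It is not Lucene's API: the class XmlTokenSketch and the methods tokenize and textOnly are hypothetical names chosen for illustration, and the regular expression is an assumption about what counts as a token (a whole tag, or a run of characters without whitespace or '<'). It does not handle comments, CDATA sections, or a '>' inside an attribute value, so a real XML parser would be the safer choice for arbitrary input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlTokenSketch {
    // Assumption: a token is either a complete tag (<...>) or a run of
    // characters that contains neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<\\s]+");

    // Reading 1: keep the tags as tokens, as in the question.
    public static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(xml);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Reading 2: drop the tags and keep only the text tokens.
    public static List<String> textOnly(String xml) {
        List<String> words = new ArrayList<>();
        for (String token : tokenize(xml)) {
            if (!token.startsWith("<")) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String xml = "<span>this is <span attr=\"foo\">example</span>text.</span>";
        System.out.println(String.join(" | ", tokenize(xml)));
        System.out.println(String.join(" | ", textOnly(xml)));
    }
}
```

    For the example from the question, tokenize keeps all eight tokens including the tags, while textOnly keeps only this, is, example and text. Wrapping either behaviour in a Lucene Tokenizer is left out here, because the constructor and attribute APIs differ between Lucene versions.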


Discussion Overview
group: java-user
categories: lucene
posted: Oct 15, '10 at 4:16p
active: Oct 15, '10 at 6:21p
posts: 2
users: 2
website: lucene.apache.org
