Hello,

I'm building a search engine for HTML documents, and I've run into an
HTML parsing problem.

The documents are in German. They contain various special characters, and
these characters are written in several different ways, such as "ö",
"&ouml;" and "&#246;". Does anybody know of a parsing engine that handles
all of these notations without problems?

Thanks

b.warzecha

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Erik Hatcher at May 3, 2005 at 12:33 pm

    On May 3, 2005, at 4:35 AM, Bartosch Warzecha wrote:
    > Hello,
    >
    > I'm building a search engine for HTML documents, and I've run into an
    > HTML parsing problem.
    >
    > The documents are in German. They contain various special characters,
    > and these characters are written in several different ways, such as
    > "ö", "&ouml;" and "&#246;". Does anybody know of a parsing engine that
    > handles all of these notations without problems?

    What HTML parser are you using? Those entity references should not
    be seen by your code once resolved by a parser. Try NekoHTML.

    Erik
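
    The point about a parser resolving entity references can be illustrated
    with a small sketch. It does not use NekoHTML; as a stand-in it assumes
    the HTML parser that ships with the JDK (javax.swing.text.html.parser),
    and the class name EntityResolvingExtractor is hypothetical. The text
    callback only ever receives decoded characters, so all three spellings
    of "ö" should reach the indexing code in the same form.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts plain text from HTML using the JDK's
// built-in Swing HTML parser (a stand-in for NekoHTML, not NekoHTML itself).
final class EntityResolvingExtractor {

    /** Returns the text content of the given HTML with entity references resolved. */
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // The parser hands over already-decoded characters here.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<html><body><p>sch\u00f6n</p></body></html>",   // literal character
            "<html><body><p>sch&ouml;n</p></body></html>",   // named entity
            "<html><body><p>sch&#246;n</p></body></html>"    // numeric reference
        };
        for (String sample : samples) {
            System.out.println(extractText(sample));
        }
    }
}
```

    With a parser in front of the indexer, the analyzer never has to know
    which notation the page author chose.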


  • Damian Gajda at May 3, 2005 at 4:13 pm
    > The documents are in German. They contain various special characters,
    > and these characters are written in several different ways, such as
    > "ö", "&ouml;" and "&#246;". Does anybody know of a parsing engine that
    > handles all of these notations without problems?

    I've created a component for parsing HTML entities (special characters).
    It is part of the ObjectLedge project and lives in the components
    subproject. Please feel free to use it. It is licensed under a BSD
    (Apache-like) license. You will need to check out the ledge-components
    CVS module.

    http://objectledge.org/

    You are also welcome to use ObjectLedge as a whole :)

    Regards,
    --
    Damian Gajda
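
    For readers who only need the entity-decoding step, here is a minimal
    sketch of the idea. It is not the ObjectLedge component; the class
    GermanEntityDecoder and its small entity table are hypothetical and
    cover only the German letters plus a few basic entities. It accepts
    named, decimal and hexadecimal references, with or without the trailing
    semicolon, and leaves anything it does not recognise untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical decoder for a small set of HTML entities (illustration only).
final class GermanEntityDecoder {

    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("auml", "\u00e4"); // ä
        NAMED.put("ouml", "\u00f6"); // ö
        NAMED.put("uuml", "\u00fc"); // ü
        NAMED.put("Auml", "\u00c4"); // Ä
        NAMED.put("Ouml", "\u00d6"); // Ö
        NAMED.put("Uuml", "\u00dc"); // Ü
        NAMED.put("szlig", "\u00df"); // ß
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches &name;, &#123; and &#x7B; (the semicolon is optional).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);?");

    /** Replaces every recognised entity reference with its character. */
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append(resolve(m.group(1), m.group()));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    // Returns the decoded text, or the original reference if it is unknown.
    private static String resolve(String body, String original) {
        if (body.charAt(0) == '#') {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            try {
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    return new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Number too large to be a code point: keep the reference as-is.
            }
            return original;
        }
        String replacement = NAMED.get(body);
        return replacement != null ? replacement : original;
    }
}
```

    Calling GermanEntityDecoder.decode on each field before it is handed to
    the analyzer would normalise the text, so a query for "ö" matches all
    three notations. A real HTML parser remains the more robust choice for
    whole documents.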




Discussion Overview
group: java-user @
categories: lucene
posted: May 3, '05 at 8:36a
active: May 3, '05 at 4:13p
posts: 3
users: 3
website: lucene.apache.org
