Hi,

I was hoping it wouldn't come to this:

I've got Unicode in my source HTML, in particular within meta tags,
and it's getting broken by the indexer. Note that I'm not trying to
query on any of this, just store and retrieve document titles with
Unicode characters.

Has anyone else experienced this? I know this is just a demo, but
it's been working really well and I hate to give it up!

Is this easily fixable? I'm a little worried by this comment in
SimpleCharStream.java:

/**
* An implementation of interface CharStream, where the stream is assumed to
* contain only ASCII characters (without unicode processing).
*/

This is likely a show-stopper for me on this parser.

Can anyone recommend the shortest path to another HTML parser
that is unicode friendly?

Thanks for anything.

Fred


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Daniel Naber at Sep 24, 2004 at 7:12 pm

    On Friday 24 September 2004 19:58, Fred Toth wrote:

    > I've got unicode in my source HTML. In particular, within meta tags,
    > and it's getting broken by the indexer. Note that I'm not trying to
    > query on any of this, just store and retrieve document titles with
    > unicode characters.

    Please try again with the code from CVS, Christoph Goller committed a fix
    for this problem (at least I think it was this problem) 1-3 weeks ago.

    Regards
    Daniel

    --
    http://www.danielnaber.de

  • Fred Toth at Sep 25, 2004 at 1:23 am
    Sorry, that didn't cure it.

    Again, anyone want to point me to the quickest replacement
    HTML parser (that's unicode clean)?

    Thanks,

    Fred
  • Karthik N S at Oct 1, 2004 at 8:31 am
    Hi,

    Apologies.

    Can somebody please tell me how to add a constructor to
    'org.apache.lucene.demo.html.HtmlParser.java' that takes a String
    argument, so that the parser strips the HTML tags from that String and
    returns the text without tags?
    Currently 'org.apache.lucene.demo.html.HtmlParser.java' accepts the
    full path of a file and then reads its content to strip the tags.

    Thanks in advance,
    Karthik


  • Wallen at Sep 25, 2004 at 1:49 am
    In org.apache.lucene.demo.HTMLDocument you need to change the input stream
    to use a different encoding. Replace the fis with this:

    fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
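
The charset name passed to InputStreamReader has to match the encoding the file was actually saved in, so "UTF-16" only suits files written as UTF-16. Here is a minimal sketch of the same idea with the charset as a parameter. The class and method names (EncodingAwareReader, openHtml, readAll) are hypothetical helpers for illustration, not part of the Lucene demo, and the sketch assumes the file's encoding is known in advance.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class EncodingAwareReader {
    // Hypothetical helper: decode the bytes of f using the named charset
    // instead of the platform default.
    static Reader openHtml(File f, String charsetName) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charsetName));
    }

    // Hypothetical helper: drain a Reader into a String.
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file, then read it back with a matching charset.
        File tmp = File.createTempFile("title", ".html");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(),
                "<title>Caf\u00e9</title>".getBytes(StandardCharsets.UTF_8));
        try (Reader r = openHtml(tmp, "UTF-8")) {
            System.out.println(readAll(r));
        }
    }
}
```

For UTF-8 input, the call would be openHtml(f, "UTF-8"); whether the demo parser then accepts every decoded character is a separate question.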

  • Fred Toth at Sep 25, 2004 at 3:00 am
    Hi,

    Thanks for the tip, but that didn't work in my case. Presumably
    with this patch, and the changes in CVS, this makes the parser
    work with UTF-16. I can't really tell because the index appears
    now to be completely UTF-16 and I can't search for anything.

    My input is actually UTF-8 anyway, and if I patch all the streams
    to use UTF-8 instead of UTF-16, I get parser errors.

    So I'm stuck.

    Thanks for your help,

    Fred
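
One likely reason the index looked unsearchable: when UTF-8 bytes are decoded as UTF-16, unrelated bytes are paired into different characters, so the stored text no longer matches what a query contains. The sketch below illustrates this with plain JDK calls only. The class name DecodeMismatchDemo and the sample title are assumptions for illustration, and it does not reproduce the parser errors mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatchDemo {
    // Decode raw bytes with the given charset.
    static String decodeAs(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String title = "Caf\u00e9 \u00fcber";
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        // Matching charset: the original title comes back unchanged.
        String right = decodeAs(utf8, StandardCharsets.UTF_8);
        // Mismatched charset: the bytes are read two at a time as UTF-16 units.
        String wrong = decodeAs(utf8, StandardCharsets.UTF_16);

        System.out.println(right.equals(title)); // prints "true"
        System.out.println(wrong.equals(title)); // prints "false"
    }
}
```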
  • Erik Hatcher at Sep 25, 2004 at 3:13 pm
    As for alternative HTML parsers, there are a few notable ones:

    NekoHTML - Nutch uses it

    JTidy - My <index> Ant task in the sandbox uses it

    and HTMLParser

    All of the above are surely far more battle-tested in production than
    Lucene's demo parser, and I'd be surprised if they did not correctly
    handle Unicode.

    Erik

