Hi,

I've been working with the HTML parser demo that comes with
Lucene and I'm trying to understand why it's multi-threaded,
and, more importantly, how to exit gracefully on errors.

I've discovered if I throw an exception in the front-end static
code (main(), etc.), the JVM hangs instead of exiting. Presumably
this is because there are threads hanging around doing something.
But I'm not sure what!

Any pointers? I just want to exit gracefully on an error such as
a missing required meta tag.

Thanks,

Fred
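
A minimal sketch of one way to get a clean exit, not taken from the demo itself. The JVM exits only once every non-daemon thread has finished, so a parser thread that is presumably still blocked can keep the process alive after main() fails. The class GracefulExit and the helpers requireMeta and startDaemon are hypothetical names used for illustration only.

```java
import java.io.IOException;

class GracefulExit {
    /** Hypothetical check: fail if the page lacks a required meta tag. */
    static void requireMeta(String html, String name) throws IOException {
        if (!html.contains("<meta name=\"" + name + "\"")) {
            throw new IOException("required meta tag missing: " + name);
        }
    }

    /** Option 1: run background work on a daemon thread, which cannot keep the JVM alive. */
    static Thread startDaemon(Runnable work) {
        Thread t = new Thread(work, "parser-worker");
        t.setDaemon(true); // must be called before start()
        t.start();
        return t;
    }

    public static void main(String[] args) {
        try {
            requireMeta("<html><head></head></html>", "author");
        } catch (IOException e) {
            System.err.println("Fatal: " + e.getMessage());
            // Option 2: exit explicitly; this ends the JVM even if
            // non-daemon threads are still running.
            System.exit(1);
        }
    }
}
```

Option 1 only helps when you create the thread yourself; for a thread started inside library code, the explicit System.exit(1) in the catch block is the simpler route.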


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Roy-lucene-user at Sep 23, 2004 at 5:06 pm
    Hi Fred,

    We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
    you know, it's just a demo. I think it's threaded to optimize for time, to allow
    the calling thread to grab the title or top message even though it's not done
    parsing the entire HTML document. That's just a guess; I would love to hear
    from others about this. Anyway, since it is a separate thread, a token error
    could kill it and there is no way for the calling thread to know about it.

    We had to create our own HTML parser, since we only cared about grabbing the
    entire text from the HTML document and we also wanted to avoid the extra
    thread. We also do a lot of "SKIP"ping for minimal EOF errors (HTML documents
    in email almost never follow standards). For your HTML needs, you might want
    to check out other JavaCC HTML parsers from the JavaCC web site.

    Roy.
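
To illustrate the point above about a failure in the parser thread going unnoticed by the caller, here is a rough sketch of one way a worker can hand its failure back. It is written for current Java and is not how the demo parser works; WorkerFailureSketch and runAndReport are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

class WorkerFailureSketch {
    /**
     * Runs the work on a separate thread, waits for it to finish, and
     * returns whatever Throwable it died with (null if it succeeded).
     */
    static Throwable runAndReport(Runnable work) throws InterruptedException {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread t = new Thread(() -> {
            try {
                work.run();
            } catch (Throwable e) {
                // Catch Errors too, so a lexer Error is not silently lost.
                failure.set(e);
            }
        }, "parser-worker");
        t.start();
        t.join();
        return failure.get();
    }
}
```

The caller can then inspect the returned Throwable and decide whether to skip the document or abort.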

    On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote:
    > Hi,
    >
    > I've been working with the HTML parser demo that comes with
    > Lucene and I'm trying to understand why it's multi-threaded,
    > and, more importantly, how to exit gracefully on errors.
    >
    > I've discovered if I throw an exception in the front-end static
    > code (main(), etc.), the JVM hangs instead of exiting. Presumably
    > this is because there are threads hanging around doing something.
    > But I'm not sure what!
    >
    > Any pointers? I just want to exit gracefully on an error such as
    > a required meta tag is missing or similar.
    >
    > Thanks,
    >
    > Fred

  • Doug Cutting at Sep 23, 2004 at 5:53 pm

    roy-lucene-user@xemaps.com wrote:
    > We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
    > you know, it's just a demo. I think it's threaded to optimize for time, to allow
    > the calling thread to grab the title or top message even though it's not done
    > parsing the entire HTML document.

    That's almost right. I originally wrote it that way to avoid having to
    ever buffer the entire text of the document. The document is indexed
    while it is parsed. But, as observed, this has lots of problems and was
    probably a bad idea.

    Could someone provide a patch that removes the multi-threading? We'd
    simply use a StringBuffer in HTMLParser.jj to collect the text. Calls
    to pipeOut.write() would be replaced with text.append(). Then have the
    HTMLParser's constructor parse the page before returning, rather than
    spawn a thread, and getReader() would return a StringReader. The public
    API of HTMLParser need not change at all and lots of complex threading
    code would be thrown away. Anyone interested in coding this?

    Doug
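
A minimal sketch of the shape of the single-threaded design described above, under stated assumptions. The real change would be made in the JavaCC grammar HTMLParser.jj; here a trivial tag-skipping loop stands in for the generated parser, and the class name BufferingParserSketch is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class BufferingParserSketch {
    // Collects the document text, taking the place of the pipe.
    private final StringBuffer text = new StringBuffer();

    /** Parses the whole page before returning; no thread is spawned. */
    BufferingParserSketch(Reader in) throws IOException {
        parse(in);
    }

    // Stand-in for the generated parser: keeps characters outside <...> tags.
    private void parse(Reader in) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                text.append((char) c); // where pipeOut.write() used to be
            }
        }
    }

    /** Same accessor as before, now backed by the in-memory buffer. */
    Reader getReader() {
        return new StringReader(text.toString());
    }
}
```

Because parsing finishes inside the constructor, any parse failure surfaces on the calling thread, at the cost of holding the whole text in memory.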

  • Roy-lucene-user at Sep 23, 2004 at 7:47 pm
    While we're on the subject...

    When using the HTMLParser I tend to get a lot of token manager errors that
    basically kill the thread (usually an unexpected EOF). Even if we were to remove
    the multi-threading from the HTMLParser, these token manager errors would pretty
    much kill the calling app, because they are thrown as an Error rather than an
    Exception. Any idea how to get around this?

    Perhaps this question really belongs on the javacc list?

    Roy.
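
One possible workaround, sketched under assumptions: JavaCC-generated lexers report lexical problems with a TokenMgrError class that extends Error, so a caller can catch that specific type at the parse boundary and convert it to a checked exception. The nested TokenMgrError below is a stand-in so the example compiles on its own; ParseFailedException, ParseStep and runGuarded are hypothetical names.

```java
class TokenErrorGuard {
    /** Stand-in for the TokenMgrError class that JavaCC generates. */
    static class TokenMgrError extends Error {
        TokenMgrError(String message) {
            super(message);
        }
    }

    /** Checked exception the calling application can handle normally. */
    static class ParseFailedException extends Exception {
        ParseFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** One unit of parsing work, e.g. a call into the generated parser. */
    interface ParseStep {
        void run();
    }

    /** Runs the step and turns a lexer Error into a checked exception. */
    static void runGuarded(ParseStep step) throws ParseFailedException {
        try {
            step.run();
        } catch (TokenMgrError e) {
            // Catch only the lexer's error type, not every Error,
            // so problems such as OutOfMemoryError still propagate.
            throw new ParseFailedException("lexical error: " + e.getMessage(), e);
        }
    }
}
```

With this in place, a bad document can be logged and skipped instead of taking down the application.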
