FAQ
Three HTML parsers (the Lucene web application demo, the CyberNeko HTML
Parser, and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best?
Can it filter out the tags that MS Word's "Save As HTML" function
generates automatically?

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

  • Sergiu gordea at Feb 1, 2005 at 9:54 am

    Maybe you can try this library:

    http://htmlparser.sourceforge.net/

    I use the following code to get the text from HTML files.
    It has not been tested intensively, but it works.

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.Translate;

    // source, writer, and Utils.notEmptyString come from the surrounding code (not shown).
    Parser parser = new Parser(source.getAbsolutePath());
    NodeIterator iter = parser.elements();
    // Walk the top-level nodes and write out the plain text of each one.
    while (iter.hasMoreNodes()) {
        Node element = (Node) iter.nextNode();
        // Drop the tags, then decode character references such as &amp;
        String text = Translate.decode(element.toPlainTextString());
        if (Utils.notEmptyString(text)) {
            writer.write(text);
        }
    }

    Sergiu
  • Michael Giles at Feb 1, 2005 at 1:50 pm
    When I tested parsers a year or so ago for intensive use in Furl, the
    best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page)
    parser by far was TagSoup ( http://www.tagsoup.info ). It is actively
    maintained and improved and I have never had any problems with it.

    -Mike

  • Chuck Williams at Feb 1, 2005 at 5:18 pm
    I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction.

    I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far.

    Chuck
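The DOM approach described above can be sketched with the JDK alone. This is only an illustration under stated assumptions: it uses the JDK's built-in XML parser as a stand-in for CyberNeko, so it accepts well-formed markup only, whereas CyberNeko is meant to handle real-world HTML. The class and method names (`DomTextExtractor`, `extractText`) are hypothetical and not part of any library mentioned in this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: builds a W3C DOM from well-formed markup and collects its text.
final class DomTextExtractor {

    /** Parses the markup into a DOM tree and returns its text, skipping script and style. */
    static String extractText(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            // Markup that is not well-formed ends up here; CyberNeko would tolerate it.
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    /** Depth-first walk that appends each non-empty text node, separated by one space. */
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (text.length() > 0) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        String name = node.getNodeName();
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return; // content of these elements is not document text
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello <b>DOM</b> world</p></body></html>";
        System.out.println(extractText(html)); // prints "Hello DOM world"
    }
}
```

Because the walk operates on the standard `org.w3c.dom` interfaces, the same `collectText` logic should carry over to a DOM produced by another parser that exposes those interfaces.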
    -----Original Message-----
    From: Jingkang Zhang
    Sent: Tuesday, February 01, 2005 1:15 AM
    To: lucene-user@jakarta.apache.org
    Subject: which HTML parser is better?

    Three HTML parsers(Lucene web application
    demo,CyberNeko HTML Parser,JTidy) are mentioned in
    Lucene FAQ
    1.3.27.Which is the best?Can it filter tags that are
    auto-created by MS-word 'Save As HTML files' function?
  • Karl Koch at Feb 2, 2005 at 11:17 am
    Hello,

    I have been following this thread and have another question.

    Is there a piece of source code (preferably very short and simple, KISS)
    that removes all HTML tags from HTML content? HTML 3.2 would be enough,
    with no frames, CSS, etc.

    I do not need the HTML structure tree or any other structure. I only
    need a facility that reduces the HTML to its underlying text content
    before I index that content as a whole.

    Karl

  • Sergiu gordea at Feb 2, 2005 at 12:49 pm
    Hi Karl,

    I already submitted a piece of code that removes the HTML tags.
    Search for my previous answer in this thread.

    Best,

    Sergiu

  • Karl Koch at Feb 2, 2005 at 2:23 pm
    Hi,

    Yes, but the library you are using is quite big. I was thinking that 5 kB
    of code could actually do that. That SourceForge project does much more
    than I need.

    Karl
  • Sergiu gordea at Feb 2, 2005 at 2:28 pm

    You need just htmlparser.jar (200 kB).
    ... you know ... functionality is strongly correlated with size.

    You can eliminate the HTML tags with three lines of code and a good
    regular expression, but that gives you no guarantee that the text from
    badly formatted HTML files will be extracted correctly...

    Best,

    Sergiu
  • Karl Koch at Feb 2, 2005 at 6:03 pm
    I am in control of the HTML, which means it is well-formed. I use only
    HTML files that I have transformed from XML, and no external HTML (e.g.
    from the web).

    Are there any very short solutions for that?

    Karl
  • Sergiu gordea at Feb 2, 2005 at 6:23 pm

    If you are using only correctly formed HTML pages and you are in control
    of those pages, you can use a regular expression to remove the tags.

    Something like:
    replaceAll("<[^>]*>", "");

    That is the idea behind the operation. If you search on Google you will
    find a more robust regular expression.

    A simple regular expression is a very cheap solution that can cause you
    a lot of problems in the future.

    It's up to you whether to use it.

    Best,

    Sergiu
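The regular-expression idea above can be expanded into a small sketch. It assumes well-formed input under the author's control, as discussed in this thread, and it needs `java.util.regex` (Java 1.4 or later). The class name `RegexTagStripper`, the choice to replace each tag with a space, and the short list of decoded entities are illustrative assumptions, not something taken from the thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips tags with regular expressions. Suitable only for
// well-formed HTML whose attribute values never contain '>'.
final class RegexTagStripper {

    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // A script or style element together with its body.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        String text = COMMENT.matcher(html).replaceAll(" ");
        text = SCRIPT_OR_STYLE.matcher(text).replaceAll(" ");
        // Replace each tag with a space so that "<p>a</p><p>b</p>" yields two words.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; goes last so "&amp;lt;" becomes "&lt;".
        text = text.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Collapse the whitespace runs left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Fish &amp; <b>chips</b></p><!-- note -->"));
        // prints "Fish & chips"
    }
}
```

Replacing tags with a space keeps adjacent block elements from running together, at the cost of splitting a word that has inline markup in the middle. An attribute value that contains `>` defeats the `<[^>]*>` pattern, which is one of the future problems the warning above refers to.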
  • Otis Gospodnetic at Feb 2, 2005 at 7:40 pm
    If you are not married to Java:
    http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm

    Otis

  • Karl Koch at Feb 3, 2005 at 9:59 am
    Unfortunately I am faithful ;-). Just for practical reasons I want to do that
    in a single class or even a single method called by another part of my Java
    application. It should also run on Java 1.1 and it should be small and
    simple. As I said before, I am in control of the HTML and it will be well
    formatted, because I generate it from XML using XSLT.

    Karl
    If you are not married to Java:
    http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm

    Otis

  • Sergiu gordea at Feb 3, 2005 at 10:07 am

    Karl Koch wrote:
    Unfortunately I am faithful ;-). Just for practical reasons I want to do that
    in a single class or even a single method called by another part of my Java
    application. It should also run on Java 1.1 and it should be small and
    simple. As I said before, I am in control of the HTML and it will be well
    formatted, because I generate it from XML using XSLT.

    Why don't you get the data directly from the XML files?
    You can use a SAX parser, ... but I think it will require Java 1.3 or at
    least 1.2.2.

    Best,

    Sergiu
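
The SAX suggestion can be sketched as follows. This is a minimal illustration, assuming a JVM with JAXP available (`javax.xml.parsers`), so it is not a Java 1.1 solution; the class and method names (`XmlTextExtractor`, `extractText`) are invented for the example. The handler ignores all element events and simply appends character data, so no tree is built in memory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlTextExtractor {
    // Streams through the XML and collects only the character data.
    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return text.toString();
    }
}
```

For indexing, the returned string (or the characters as they arrive) could be handed straight to the indexer instead of going through HTML at all.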
  • Luke Shannon at Feb 2, 2005 at 7:54 pm
    In our application I use regular expressions to strip all tags in one
    situation and specific ones in another. Here is sample code for both.

    This strips all HTML 4.0 tags except <p>, <ul>, <br>, <li>, <strong>, <em>,
    <u>:

    html_source = Pattern.compile(
        // \\b after the tag name keeps short names such as B and S from
        // also matching <br> and <strong>
        "</?\\s?(A|ABBR|ACRONYM|ADDRESS|APPLET|AREA|B|BASE|BASEFONT|"
        + "BDO|BIG|BLOCKQUOTE|BODY|BUTTON|CAPTION|CENTER|CITE|CODE|COL|COLGROUP|DD|DEL|"
        + "DFN|DIR|DIV|DL|DT|FIELDSET|FONT|FORM|FRAME|FRAMESET|H1|H2|H3|H4|H5|H6|HEAD|"
        + "HR|HTML|I|IFRAME|IMG|INPUT|INS|ISINDEX|KBD|LABEL|LEGEND|LINK|MAP|MENU|META|"
        + "NOFRAMES|NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|Q|S|SAMP|SCRIPT|SELECT|"
        + "SMALL|SPAN|STRIKE|STYLE|SUB|SUP|TABLE|TBODY|TD|TEXTAREA|TFOOT|TH|THEAD|TITLE|"
        + "TR|TT|VAR)\\b(.|\n)*?\\s?>",
        Pattern.CASE_INSENSITIVE).matcher(html_source).replaceAll("");

    When I want to strip every tag, I use the following pattern with the
    code above:

    String strPattern1 = "<\\s?(.|\n)*?\\s?>";
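
The two regular-expression approaches can be put side by side in a compact form. This is a sketch, not Luke's production code: it needs `java.util.regex` (Java 1.4 or later), the class and method names are made up, and the selective list is shortened to three tags purely for illustration. It uses `[^>]*` for the tag body, which assumes no `>` appears inside attribute values.

```java
import java.util.regex.Pattern;

final class RegexTagStripper {
    // "<", then any run of characters other than ">", then ">".
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    // Opening or closing tags whose name is in a short illustrative list.
    // \b stops "font" from matching a longer name that merely starts with it.
    private static final Pattern SOME_TAGS = Pattern.compile(
            "</?\\s*(?:font|span|div)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Removes every tag.
    static String stripAll(String html) {
        return ANY_TAG.matcher(html).replaceAll("");
    }

    // Removes only the listed tags and leaves all other markup in place.
    static String stripSelected(String html) {
        return SOME_TAGS.matcher(html).replaceAll("");
    }
}
```

Compiling the patterns once as constants avoids recompiling them for every document.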

    HTH

    Luke



    ----- Original Message -----
    From: "sergiu gordea" <gsergiu@ifit.uni-klu.ac.at>
    To: "Lucene Users List" <lucene-user@jakarta.apache.org>
    Sent: Wednesday, February 02, 2005 1:23 PM
    Subject: Re: which HTML parser is better?

    Karl Koch wrote:
    I am in control of the HTML, which means it is well-formatted HTML. I use
    only HTML files which I have transformed from XML. No external HTML (e.g.
    the web).

    Are there any very short solutions for that?
    If you are using only correctly formatted HTML pages and you are in control
    of these pages, you can use a regular expression to remove the tags.

    Something like
    replaceAll("<[^>]*>","");

    This is the idea behind the operation. If you search on Google you
    will find a more robust regular expression.

    Using a simple regular expression is a very cheap solution that
    can cause you a lot of problems in the future.

    It's up to you to use it ....

    Best,

    Sergiu
  • Karl Koch at Feb 3, 2005 at 9:54 am
    Hello Sergiu,

    thank you for your help so far. I appreciate it.

    I am working with Java 1.1 which does not include regular expressions.

    Your turn ;-)
    Karl
  • Sergiu gordea at Feb 3, 2005 at 10:04 am

    Karl Koch wrote:
    Hello Sergiu,

    thank you for your help so far. I appreciate it.

    I am working with Java 1.1 which does not include regular expressions.
    Why are you using Java 1.1? Are you so limited in resources?
    What operating system do you use?
    I assume that you just need to index the HTML files, and you need an
    html2txt conversion.
    If an external converter is a solution for you, you can use
    Runtime.getRuntime().exec(...) to run the converter that will extract the
    information from your HTML files
    and generate a .txt file. Then you can use a reader to index the txt.

    As I told you before, the best solution depends on your constraints
    (time, effort, hardware, performance) and requirements :)
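
The external-converter idea could look roughly like this. It is only a sketch: `my-html2txt` is a hypothetical command name standing in for whatever converter is installed on the device, and the argument order (input file, then output file) is likewise an assumption. `Runtime.getRuntime().exec(String[])` and `Process.waitFor()` are the standard JDK calls.

```java
import java.io.File;
import java.io.IOException;

final class ExternalConverter {
    // Builds the command line; "my-html2txt" is a placeholder, not a real tool.
    static String[] buildCommand(File html, File txt) {
        return new String[] { "my-html2txt", html.getPath(), txt.getPath() };
    }

    // Runs the converter and returns its exit code (0 conventionally means success).
    static int convert(File html, File txt) throws IOException, InterruptedException {
        Process process = Runtime.getRuntime().exec(buildCommand(html, txt));
        return process.waitFor();
    }
}
```

After `convert` returns successfully, the generated .txt file can be opened with a `FileReader` and passed to the indexer.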

    Best,

    Sergiu
  • Karl Koch at Feb 3, 2005 at 10:12 am
    I am using Java 1.1 on a Sharp Zaurus PDA. I have very limited memory.
    I do not think CPU performance is a big issue, though. But I
    have other parts in my application which use quite a lot of memory and
    sometimes run short. I am therefore not looking into solutions which build up
    tag trees etc., but rather for a solution which reads a stream of HTML and
    transforms it into a stream of text.

    I see your point about using an external program. I am, however, not entirely
    sure whether one is available. Also, it would be much simpler to have a 3-5 kB
    solution in Java, perhaps encapsulated in a class which does the job without
    the need for advanced libraries which need 100-200 KB of my internal
    storage.

    I hope this clarifies my situation.

    Cheers,
    Karl
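
A stream-to-stream tag stripper of the kind described here fits in a single small class. The following is a minimal sketch under stated assumptions, not a tested solution for the Zaurus: it assumes well-formed markup in which `<` and `>` only ever delimit tags, it does not decode entities such as `&amp;`, and the class name `TagStripper` is made up. It sticks to `java.io` classes that exist since JDK 1.1 and uses no regular expressions and no collections.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

final class TagStripper {
    // Copies characters from in to out, skipping everything between '<' and '>'.
    // Only one character and one flag are held in memory at a time.
    static void strip(Reader in, Writer out) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
                out.write(' '); // keeps words in adjacent elements apart
            } else if (!inTag) {
                out.write(c);
            }
        }
    }

    // Convenience wrapper for in-memory strings.
    static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new RuntimeException(e.toString());
        }
        return out.toString();
    }
}
```

Each tag is replaced by a single space, so the output contains extra whitespace; for indexing that is harmless because the analyzer splits on whitespace anyway. In practice the `Reader` would be wrapped in a `BufferedReader` to avoid one read call per character.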
    Karl Koch wrote:
    Hello Sergiu,

    thank you for your help so far. I appreciate it.

    I am working with Java 1.1, which does not include regular expressions.

    Why are you using Java 1.1? Are you that limited in resources?
    What operating system do you use?
    I assume that you just need to index the HTML files, so all you need is
    an html2txt conversion.
    If an external converter is an option for you, you can use
    Runtime.getRuntime().exec(...) to run a converter that extracts the
    text from your HTML files and generates a .txt file. Then you can use a
    reader to index the .txt file.

    As I told you before, the best solution depends on your constraints
    (time, effort, hardware, performance) and requirements :)

    Best,

    Sergiu

    Your turn ;-)
    Karl


    Karl Koch wrote:

    I am in control of the HTML, which means it is well-formed HTML. I use
    only HTML files that I have transformed from XML. No external HTML (e.g.
    from the web).

    Are there any very short solutions for that?

    If you are using only correctly formatted HTML pages and you are in control
    of these pages, you can use a regular expression to remove the tags.

    Something like
    replaceAll("<[^>]*>", "");

    This is the idea behind the operation. If you search on Google you
    will find a more robust regular expression.

    Using a simple regular expression is a very cheap solution that
    can cause you a lot of problems in the future.

    It's up to you whether to use it ...

    Best,

    Sergiu

    Karl
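
    A minimal sketch of the regular-expression idea discussed above, assuming
    Java 1.4 or later (String.replaceAll does not exist on Java 1.1). The
    class and method names are made up for illustration.

```java
// Hypothetical helper: removes anything that looks like <...> from a string.
final class RegexTagStripper {
    static String stripTags(String html) {
        // "<", then any run of characters that are not ">", then ">"
        return html.replaceAll("<[^>]*>", "");
    }
}
```

    RegexTagStripper.stripTags("<p>Hello <b>world</b></p>") returns
    "Hello world". The pattern stops at the first ">", even one inside an
    attribute value, so <a title="a > b">x</a> comes out as  b">x  which
    is the kind of future problem the reply warns about.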




    Karl Koch wrote:

    Hi,

    yes, but the library you are using is quite big. I was thinking that a 5 KB
    piece of code could actually do that. That SourceForge project is doing much
    more than that, but I do not need it.

    You need just the htmlparser.jar, 200 KB.
    ... you know ... the functionality is strongly correlated with the size.

    You can use 3 lines of code with a good regular expression to eliminate
    the HTML tags, but this won't give you any guarantee that the text from
    badly formatted HTML files will be correctly extracted...

    Best,

    Sergiu

    Karl






    Hi Karl,

    I already submitted a piece of code that removes the HTML tags.
    Search for my previous answer in this thread.

    Best,

    Sergiu

    Karl Koch wrote:

    Hello,

    I have been following this thread and have another question.

    Is there a piece of source code (preferably very short and simple
    (KISS)) that removes all HTML tags from HTML content? HTML 3.2
    would be enough... also no frames, CSS, etc.

    I do not need the HTML structure tree or any other structure, but I
    need a facility to clean up HTML into its underlying content before
    indexing that content as a whole.

    Karl









    I think that depends on what you want to do. The Lucene demo parser does
    simple mapping of HTML files into Lucene Documents; it does not give you a
    parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the
    same API; will likely become part of Xerces), and so maps an HTML document
    into a full DOM that you can manipulate easily for a wide range of
    purposes. I haven't used JTidy at an API level and so don't know it as
    well -- based on its UI, it appears to be focused primarily on HTML
    validation and error detection/correction.

    I use CyberNeko for a range of operations on HTML documents that go beyond
    indexing them in Lucene, and really like it. It has been robust for me so
    far.

    Chuck




  • Erik Hatcher at Feb 2, 2005 at 1:22 pm

    On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:

    Hello,

    I have been following this thread and have another question.

    Is there a piece of source code (preferably very short and simple
    (KISS)) that removes all HTML tags from HTML content? HTML 3.2
    would be enough... also no frames, CSS, etc.

    I do not need the HTML structure tree or any other structure, but I
    need a facility to clean up HTML into its underlying content before
    indexing that content as a whole.

    The code in the Lucene Sandbox for parsing HTML with JTidy (under
    contributions/ant) for the <index> task does what you ask.

    Erik


  • Kauler, Leto S at Feb 2, 2005 at 11:13 pm
    We index the content from HTML files, and because we only want the "good"
    text and do not care about the structure, well-formedness, etc., we went
    with regular expressions similar to what Luke Shannon offered.

    The only real difference is that we first remove entire blocks of
    (script|style|csimport) and similar, since the contents of those are not
    useful for keyword searching, and afterward just remove all leftover
    HTML tags. I have been meaning to add an expression to extract things
    like alt attribute text from <img>, though.

    --Leto
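
    The two passes described above could look roughly like this. It is a
    sketch under assumptions, not Leto's actual expressions: the patterns and
    the class name are invented for illustration, it needs java.util.regex
    (Java 1.4 or later), and it replaces each removed piece with a space so
    that neighbouring words do not run together.

```java
import java.util.regex.Pattern;

// Hypothetical two-pass cleaner: drop whole blocks first, then leftover tags.
final class TwoPassHtmlCleaner {
    // Pass 1: entire elements whose contents are useless for keyword search.
    private static final Pattern BLOCKS = Pattern.compile(
            "<(script|style|csimport)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Pass 2: any tag that is still left.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Final tidy-up: collapse runs of whitespace.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String clean(String html) {
        String text = BLOCKS.matcher(html).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        return SPACES.matcher(text).replaceAll(" ").trim();
    }
}
```

    Removing the blocks first matters because script bodies often contain
    "<" characters (for example a < b) that the tag pattern would otherwise
    mistake for the start of a tag.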


  • Sergiu gordea at Feb 3, 2005 at 7:08 am
    Another very cheap but robust solution, if you use Linux, is to
    let lynx parse your pages:

    lynx -dump page.html > page.txt

    This strips out all HTML tags, including script, style and csimport, and
    leaves you with a .txt file ready for indexing.

    Best,

    Sergiu
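
    For anyone who wants to call the external converter from Java, a rough
    sketch follows. It assumes lynx is installed and on the PATH; -dump and
    -nolist are standard lynx options, while the class and method names are
    hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical wrapper around an external html-to-text converter.
final class ExternalHtmlToText {
    // "-dump" prints the rendered page to stdout; "-nolist" omits the
    // numbered list of links that lynx would otherwise append.
    static String[] lynxCommand(String htmlPath) {
        return new String[] { "lynx", "-dump", "-nolist", htmlPath };
    }

    // Runs a command and returns what it wrote to standard output.
    // Standard error is not drained here, so a very chatty command could
    // block; acceptable for a sketch, not for production use.
    static String run(String[] command) throws IOException, InterruptedException {
        Process process = Runtime.getRuntime().exec(command);
        StringBuffer out = new StringBuffer();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("command exited with status " + exit);
        }
        return out.toString();
    }
}
```

    The returned text can then be handed to whatever code builds the index,
    e.g. ExternalHtmlToText.run(ExternalHtmlToText.lynxCommand("page.html")).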
  • Bill Tschumy at Feb 3, 2005 at 1:59 am
    No one has yet mentioned using ParserDelegator and ParserCallback that
    are part of HTMLEditorKit in Swing. I have been successfully using
    these classes to parse out the text of an HTML file. You just need to
    extend HTMLEditorKit.ParserCallback and override the various methods
    that are called when different tags are encountered.

    --
    Bill Tschumy
    Otherwise -- Austin, TX
    http://www.otherwise.com
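
    A short sketch of the ParserCallback approach described in this post,
    assuming a JVM that ships Swing. Only handleText is overridden here; the
    class name and the extract helper are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that collects the text chunks reported by the parser.
final class SwingTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuffer text = new StringBuffer();

    // Called by the parser for each run of character data it finds.
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    static String extract(Reader html) throws IOException {
        SwingTextExtractor callback = new SwingTextExtractor();
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(html, callback, true);
        return callback.text.toString();
    }
}
```

    Other callbacks such as handleStartTag or handleSimpleTag can be
    overridden in the same way if attribute values are needed.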


  • Karl Koch at Feb 3, 2005 at 10:06 am
    I apologise in advance if some of what I write here has been said before.
    The last three answers to my question have suggested pattern matching
    solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
    is something I cannot use since I work with Java 1.1 on a PDA.

    I am wondering if somebody knows a piece of simple source code with low
    requirements that runs under these tight constraints.

    Thank you all,
    Karl
  • Sergiu gordea at Feb 3, 2005 at 10:17 am

    Karl Koch wrote:
    I apologise in advance if some of what I write here has been said before.
    The last three answers to my question have suggested pattern matching
    solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
    is something I cannot use since I work with Java 1.1 on a PDA.

    I see,

    In this case you can read your HTML file line by line and then write
    something like this:

    String line;
    int startPos, endPos;
    StringBuffer text = new StringBuffer();
    while ((line = reader.readLine()) != null) {
        // end of the first tag on this line
        startPos = line.indexOf(">");
        // start of the next tag after it
        endPos = line.indexOf("<", startPos + 1);
        if (startPos >= 0 && endPos > startPos)
            text.append(line.substring(startPos + 1, endPos));
    }

    This is just sample code that should work if each line of the HTML file
    holds a single element of the form <tag>text</tag>.
    It can be a starting point for you.

    Hope it helps,

    Best,

    Sergiu
  • Karl Koch at Feb 3, 2005 at 10:20 am
    Thank you, I will do that.
  • Dawid Weiss at Feb 3, 2005 at 10:20 am
    Karl,

    Two things; try experimenting with both:

    1) I would try to write a lexical scanner that strips HTML tags, much
    like the regular expression does. Java lexical scanner packages produce
    nice pure Java classes that seldom use any advanced API, so they should
    work on Java 1.1. They are simple state machines with states encoded in
    integers -- this should work like a charm, and be fast and small.

    2) Write a parser yourself. Given a regular expression, it isn't that
    difficult to do... :)

    D.
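
    A hand-written state machine in the spirit of suggestion 1 might look
    like the sketch below. It is not the output of a scanner generator; the
    class name and states are invented for illustration. It uses only
    java.io.Reader and java.io.Writer, which have been part of the JDK since
    1.1, and processes one character at a time without building a tree. It
    assumes well-formed, controlled HTML: comments, script/style bodies and
    entities such as &amp; are not handled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical streaming tag stripper with states encoded as integers.
final class TagStripper {
    private static final int TEXT = 0;   // outside any tag
    private static final int TAG = 1;    // between '<' and '>'
    private static final int QUOTED = 2; // inside a quoted attribute value

    static void strip(Reader in, Writer out) throws IOException {
        int state = TEXT;
        int quote = 0; // the quote character that opened the current value
        int c;
        while ((c = in.read()) != -1) {
            switch (state) {
            case TEXT:
                if (c == '<') {
                    state = TAG;
                    out.write(' '); // keep words on either side of a tag apart
                } else {
                    out.write(c);
                }
                break;
            case TAG:
                if (c == '>') {
                    state = TEXT;
                } else if (c == '"' || c == '\'') {
                    quote = c;
                    state = QUOTED;
                }
                break;
            default: // QUOTED: ignore everything up to the matching quote
                if (c == quote) {
                    state = TAG;
                }
                break;
            }
        }
    }
}
```

    Because the QUOTED state remembers which quote character opened the
    value, a ">" or a different kind of quote inside an attribute value does
    not end the tag early.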

  • Aurora at Feb 3, 2005 at 4:30 pm
    For all the parser suggestions I think there is one important attribute.
    Some parsers return data only provided that the input HTML is sensible.
    Others are designed to be as flexible and tolerant as they can be. If the
    input is clean and controlled, the former class is sufficient. Even a
    regular expression may be sufficient (I think that's what the original
    poster wants). If you are building a web crawler you need something really
    tolerant.

    I once prototyped a nice, fast parser. Later I had to abandon it
    because it failed to parse about 15% of documents (it had problems
    handling nested quotes like onclick="alert('hi')").
  • Ian Soboroff at Feb 3, 2005 at 8:32 pm
    One which we've been using can be found at:
    http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

    We absolutely need to be able to recover gracefully from malformed
    HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there
    failed this criterion when we started our effort. The above one is
    kind of SAX-y but doesn't fall over at the sight of a real web page
    ;-)

    Ian


  • Karl Koch at Feb 4, 2005 at 10:22 am
    The link does not work.
  • Ian Soboroff at Feb 4, 2005 at 3:37 pm
    Oops. It's in the Google cache and also the Internet Archive Wayback
    machine. I'll drop the original author a note to let him know that
    his links are stale.

    http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

    Ian

Discussion Overview
group: java-user @
categories: lucene
posted: Feb 1, '05 at 9:14a
active: Feb 4, '05 at 3:37p
posts: 29
users: 13
website: lucene.apache.org
