Hello,

JTidy is a very good HTML parser, but it is not optimal for HTML pages
created with Microsoft Office products such as Word, because it returns
Microsoft-specific HTML tags instead of only the text.
Or should I handle HTML pages whose source begins like this

"

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<link rel=File-List href="index-Dateien/filelist.xml">

"

as XML files, using an XML parser instead of an HTML parser?


I think it should be treated as an HTML page because of

"<meta http-equiv=Content-Type content="text/html; charset=windows-1252">"

I would be grateful for any kind of help.



Greetings


Gaston



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Mark Benussi at Dec 4, 2005 at 12:22 am
    I also use JTidy, but not for Lucene parsing. There is no easy way of
    handling this; you simply have to remove all the crappy Microsoft inserts
    as they come.
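
One way to read "remove the Microsoft inserts as they come" is a pre-cleaning pass over the raw HTML before it reaches JTidy or the indexer. The following is only a rough sketch under stated assumptions: `OfficeHtmlCleaner` and `clean` are hypothetical names (they are not part of JTidy or Lucene), and the regular expressions cover only a few patterns that Word typically emits (conditional comments, `<xml>` data islands, namespace-prefixed tags such as `<o:p>`, and `xmlns` attributes). Regex-based stripping is brittle and may need adjusting for real documents.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy or Lucene: strips a few common
// Word/Office-specific constructs from an HTML string using regexes only.
final class OfficeHtmlCleaner {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    private static final Pattern CONDITIONAL_COMMENT =
        Pattern.compile("<!--\\[if[^\\]]*\\]>.*?<!\\[endif\\]-->", FLAGS);

    // Bare markers such as <![if !supportLists]> and <![endif]>
    private static final Pattern BARE_MARKER =
        Pattern.compile("<!\\[(?:if[^\\]]*|endif)\\]>", FLAGS);

    // Embedded <xml> ... </xml> data islands
    private static final Pattern XML_ISLAND =
        Pattern.compile("<xml\\b[^>]*>.*?</xml>", FLAGS);

    // Opening, closing, or self-closing tags with a namespace prefix,
    // e.g. <o:p>, </v:shape>, <o:p/>. Text between such tags is kept.
    private static final Pattern PREFIXED_TAG =
        Pattern.compile("</?[a-z][a-z0-9]*:[a-z][a-z0-9]*\\b[^>]*>", FLAGS);

    // xmlns="..." and xmlns:prefix="..." attributes. Assumption: this
    // sequence does not occur in ordinary text content, since the pattern
    // is applied to the whole string rather than only inside tags.
    private static final Pattern XMLNS_ATTR = Pattern.compile(
        "\\s+xmlns(?::[a-z][a-z0-9]*)?\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s>]+)",
        FLAGS);

    private OfficeHtmlCleaner() {
    }

    // Returns the input with the Office-specific constructs removed.
    static String clean(String html) {
        String out = CONDITIONAL_COMMENT.matcher(html).replaceAll("");
        out = BARE_MARKER.matcher(out).replaceAll("");
        out = XML_ISLAND.matcher(out).replaceAll("");
        out = PREFIXED_TAG.matcher(out).replaceAll("");
        out = XMLNS_ATTR.matcher(out).replaceAll("");
        return out;
    }
}
```

For example, `OfficeHtmlCleaner.clean("<p>Hi<o:p></o:p></p>")` returns `<p>Hi</p>`. The cleaned string can then be handed to whichever HTML parser is already in use to extract the text.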

    -----Original Message-----
    From: Gaston
    Sent: 03 December 2005 13:49
    To: java-user@lucene.apache.org
    Subject: best html parser for html documents generated by microsoft products


Discussion Overview
group: java-user @
categories: lucene
posted: Dec 3, '05 at 1:49p
active: Dec 4, '05 at 12:22a
posts: 2
users: 2
website: lucene.apache.org

2 users in discussion

Gaston: 1 post Mark Benussi: 1 post
