When I use the CyberNeko HTML Parser to parse HTML
files (created by Microsoft Word), if a file
contains HTML built-in entity references (for example:
&nbsp;), the node value may contain unknown characters.

For example, the source HTML:
<DIV>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt
18pt"><SPAN lang=EN-US style="mso-bidi-font-size:
10.5pt"><FONT face="Times New Roman"><FONT
size=3>-rw-r--r--<SPAN style="mso-spacerun:
yes">&nbsp;&nbsp;&nbsp; </SPAN>1 root<SPAN
style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;
</SPAN>root<SPAN style="mso-spacerun:
yes">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</SPAN>50 Jan 21 16:12
_1e.f6<o:p></o:p></FONT></FONT></SPAN></P>
</DIV>

After parsing the HTML:
-rw-r--r--��?1 root���� root���������� 50 Jan 21 16:12
_1e.f6

How can I avoid it?


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Jason Polites at Feb 18, 2005 at 6:17 am
    This is not an unknown character; it is a non-breaking space (Unicode value
    0x00A0).
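
If the non-breaking spaces are not wanted in the extracted text (for example before indexing), one option is to map them to ordinary spaces after parsing. The following is only a minimal sketch using the JDK alone; the class name NbspNormalizer and the method normalize are hypothetical and are not part of NekoHTML or Lucene.

```java
// Hypothetical helper, not part of NekoHTML or Lucene.
final class NbspNormalizer {
    private NbspNormalizer() {}

    /** Replaces every non-breaking space (U+00A0) with a regular space (U+0020). */
    static String normalize(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // Text as it might come out of a parsed node: three non-breaking
        // spaces followed by one regular space.
        String parsed = "-rw-r--r--\u00A0\u00A0\u00A0 1 root";
        System.out.println(normalize(parsed));
    }
}
```

The string keeps its length, so any offsets computed on the parsed text stay valid after normalization.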


    ----- Original Message -----
    From: "Jingkang Zhang" <zjingk@yahoo.com.cn>
    To: <lucene-user@jakarta.apache.org>
    Sent: Friday, February 18, 2005 5:12 PM
    Subject: The problem of using Cyber Neko HTML Parser parse HTML files

  • Jingkang Zhang at Feb 18, 2005 at 6:48 am
    Thank you. But how can I view the correct output? If my
    HTML files use different encodings (such as
    UTF-8, ISO-8859-1, GBK, JIS, etc.), how should I handle
    them?
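
To illustrate why the encoding matters here: the same non-breaking space is stored as different bytes in different encodings, so the bytes have to be decoded with the charset they were written in. The sketch below uses only the JDK and assumes the charset name is already known (for instance from an HTTP header or a meta tag); the class name CharsetDecoding and the method decode are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper, JDK only; the caller is assumed to know the charset name.
final class CharsetDecoding {
    private CharsetDecoding() {}

    /** Decodes raw bytes into a String using the named charset. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "a", non-breaking space, "b" as stored in ISO-8859-1 and in UTF-8.
        byte[] latin1 = {'a', (byte) 0xA0, 'b'};
        byte[] utf8 = {'a', (byte) 0xC2, (byte) 0xA0, 'b'};

        // Decoded with the matching charset, both yield the same three characters.
        System.out.println(decode(latin1, "ISO-8859-1").equals(decode(utf8, "UTF-8")));

        // Decoded with the wrong charset, the byte 0xA0 is malformed UTF-8 and
        // comes out as the replacement character U+FFFD.
        System.out.println(decode(latin1, "UTF-8").indexOf('\uFFFD') >= 0);
    }
}
```

Viewing the result correctly also depends on the console or viewer using a font and output encoding that can represent the decoded characters.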



    --- Jason Polites <jasonpolites@tpg.com.au> wrote:
    This is not an unknown character; it is a non-breaking
    space (Unicode value 0x00A0).



Discussion Overview
group: java-user
categories: lucene
posted: Feb 18, '05 at 6:12a
active: Feb 18, '05 at 6:48a
posts: 3
users: 2
website: lucene.apache.org

2 users in discussion

Jingkang Zhang: 2 posts Jason Polites: 1 post
