Gentlemen,

Regarding this bug report: https://bugs.php.net/bug.php?id=49705

As more developers move away from using regular expressions to parse
HTML and start using DOMDocument, I've noticed that quite a few
stumble over encoding "issues". They're not bugs, because it's
documented (I think) that a document is handled as expected if it is
loaded using ::loadHTMLFile() or if it contains a "content-type" meta
tag that specifies the character encoding.

So far I've suggested a hack that involves adding the meta-tag in
front of the string that contains the HTML. As horrible as it seems,
that does the job!
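
A minimal sketch of the workaround described above, assuming the input
string is UTF-8. The helper name loadHtmlWithCharsetHint() is made up for
illustration and is not part of DOMDocument; only the prepended meta tag
comes from the description above.

```php
<?php
// Hypothetical helper (not part of DOMDocument): loads an HTML string after
// prepending a content-type meta tag that declares its character encoding.
function loadHtmlWithCharsetHint(string $html, string $charset = 'UTF-8'): DOMDocument
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset='
          . htmlspecialchars($charset, ENT_QUOTES) . '">';

    $doc = new DOMDocument();
    // Collect parser warnings about imperfect markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage: a fragment that carries no charset declaration of its own.
$doc  = loadHtmlWithCharsetHint('<p>Café, naïve</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

Note that the hint only takes effect when it appears before the first
non-ASCII byte, which is why it is prepended rather than appended.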

That said, I'm hoping to get enough internals support to add a
parameter to ::loadHTML() that sets / overrides the default character
set when processing the document; when given, any <meta> tags
pertaining to character set encoding should be ignored (AFAIK that's
also the browser's behavior).
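
Until such a parameter exists, the behavior asked for above can be
roughly approximated in userland. The sketch below uses a different
technique than the meta-tag hack: it turns every non-ASCII character into
a numeric entity, so the bytes handed to the parser are plain ASCII and
any charset named in a <meta> tag no longer changes how they are decoded.
The function name loadHtmlAs() is hypothetical, and the code assumes the
mbstring extension is available.

```php
<?php
// Hypothetical helper (not part of DOMDocument) that approximates a
// "charset override" argument for loadHTML().
function loadHtmlAs(string $html, string $charset): DOMDocument
{
    // Encode every character from U+0080 upwards as a numeric entity,
    // interpreting the input bytes in the caller-supplied charset.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], $charset);

    $doc = new DOMDocument();
    // Collect parser warnings about imperfect markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage: UTF-8 bytes with a conflicting meta tag that claims ISO-8859-1.
$conflicting = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
             . '<p>Café</p>';
$overridden     = loadHtmlAs($conflicting, 'UTF-8');
$overriddenText = $overridden->getElementsByTagName('p')->item(0)->textContent;
```

One caveat of this approach: entities are not decoded inside <script>
and <style> elements, so non-ASCII text there would come out as literal
entity references.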

Btw, there's another patch that also introduces a new parameter to
::parseHTML(), which has gone into the 5.4 branch
(https://bugs.php.net/bug.php?id=54037), so it looks like this would
be the second (optional) parameter then.

Thoughts?

--
Tjerk

  • Ferenc Kovacs at Aug 7, 2012 at 9:35 pm

    On Fri, Jun 1, 2012 at 5:57 PM, Tjerk Meesters wrote:

    Thoughts?

    Would be nice.
    Bump.


    --
    Ferenc Kovács
    @Tyr43l - http://tyrael.hu

Discussion Overview
group: php-internals
categories: php
posted: Jun 1, '12 at 3:58p
active: Aug 7, '12 at 9:35p
posts: 2
users: 2
website: php.net
