Hey internals!

In a recent discussion on PHP Roundtable, we talked about the byte
order mark in PHP files. If you create a PHP file with the following:

<?php
header("X-foo: Bar");
echo "Foo!".PHP_EOL;

And save it as UTF-8 with BOM, interesting things happen depending on
the SAPI & configuration.

If you run it from the CLI you get an error:
PHP Warning: Cannot modify header information - headers already sent by (output started at %s:1) in %s on line %d
But it doesn't seem to return the BOM to stdout (but I could be doing
this part wrong). If you run it from `php -S`, and load it in a
browser, the web server returns a code point \u{feff} as the first
code point of the response body.
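
To reproduce this without relying on an editor's save options, the file can be written programmatically. This is a minimal sketch, assuming a writable temp directory; the helper name `writeBomScript` and the file name `bom-demo.php` are illustrative and not from the thread:

```php
<?php
// The three bytes that encode U+FEFF in UTF-8.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: writes the example script with a leading UTF-8 BOM.
function writeBomScript(string $path): void
{
    $source = "<?php\nheader(\"X-foo: Bar\");\necho \"Foo!\".PHP_EOL;\n";
    file_put_contents($path, UTF8_BOM . $source);
}

$path = sys_get_temp_dir() . '/bom-demo.php';
writeBomScript($path);

// The first three bytes on disk are the BOM, ahead of the opening tag.
var_dump(bin2hex(substr(file_get_contents($path), 0, 3))); // string(6) "efbbbf"
```

Running the generated file with `php`, or serving it with `php -S`, should then show the behavior described above.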

BOMs should not be treated as characters and should not be sent to
the output. Is there any reason this should be considered the expected
behavior? If not, I'd like to create an RFC to change it. :)

Thanks,
Sammy Kaye Powers
sammyk.me


  • Stanislav Malyshev at May 31, 2016 at 12:40 am
    Hi!
    > BOMs should not be treated as characters and should not be sent to
    > the output. Is there any reason this should be considered the expected
    > behavior?
    The reason would be PHP does not know where surrounding output ends and
    the code starts, beyond <?php. That means if there is some stuff in the
    file before <?php, it would be output - and it's an intended behavior,
    and so will happen with BOM too. Particular sequence of bytes being BOM
    and whether it is desired or not depends on context, but PHP engine does
    not have this context. Remember that pure HTML page is also a valid PHP
    file.
    --
    Stas Malyshev
    smalyshev@gmail.com
  • Sara Golemon at May 31, 2016 at 1:57 am

    On Mon, May 30, 2016 at 5:40 PM, Stanislav Malyshev wrote:
    >> BOMs should not be treated as characters and should not be sent to
    >> the output. Is there any reason this should be considered the expected
    >> behavior?
    > The reason would be PHP does not know where surrounding output ends and
    > the code starts, beyond <?php. That means if there is some stuff in the
    > file before <?php, it would be output - and it's an intended behavior,
    > and so will happen with BOM too. Particular sequence of bytes being BOM
    > and whether it is desired or not depends on context, but PHP engine does
    > not have this context. Remember that pure HTML page is also a valid PHP
    > file.
    I'm with Sammy on the principle that being able to have a BOM in a
    given file is important to any non-ASCII code development. Though we
    can argue whether that's good or even necessary, I honestly don't know
    how prevalent non-English coding is among PHP developers.

    In fact, the idea of stripping content from a script file isn't
    without precedent. Shebang lines are routinely removed from
    cli/cgi/fpm, and if you want to properly output it, you need to do so
    in a coded echo statement. (The stripping only applies to a literal,
    non-scripting line in the file, not dynamic output).

    So can we apply the same to the BOM? There's the obvious BC danger of
    files which might depend on this behavior (declaring their encoding
    via BOM, which happens to be the same as the script encoding).

    So how about declare statement?

    {U+FEFF}<?php
       declare(strip_bom=true);

    code(); code(); code();

    It's got the advantage of being per-file (a view template might
    actually want the BOM included, while some business logic piece
    doesn't, for example). It's a compile-time strip, so it has no runtime
    cost. It's non-surprising, since it's stated in every file for which
    the BOM strip is intentional.

    -Sara
  • Stanislav Malyshev at May 31, 2016 at 2:18 am
    Hi!
    > In fact, the idea of stripping content from a script file isn't
    > without precedent. Shebang lines are routinely removed from
    > cli/cgi/fpm, and if you want to properly output it, you need to do so
    True, because in the context of CLI we know what is expected - a CLI
    script which can start with #!. It is very unlikely that we'd have a
    template run directly as CLI script and we would have this template
    starting with #! which we want to output. But we lack such context in a
    generic script - namely, the context that would tell us if it's safe to
    drop the BOM.
    > So can we apply the same to the BOM? There's the obvious BC danger of
    > files which might depend on this behavior (declaring their encoding
    > via BOM, which happens to be the same as the script encoding).
    Given that BOM in script files is mostly useless, and BOM in UTF-8 is
    useless and not recommended for use either, I don't see why we need to.

    In general, I don't think BOM is a real issue worth messing with the
    lexer. Surely, from time to time somebody would use weird editor which
    produces BOMs, like editing PHP scripts in Word. Surely, they'd have
    weird effects that would force them to spend 5 minutes googling and
    fixing it. I don't think it is the reason to spend day-persons of our
    collective time to find a fix to this very niche problem and risk
    potential BC issues.

    If it is really becoming an issue, we could probably make the lexer
    treat BOM+<? the same as <?, but I'm not convinced it is a serious
    enough issue.
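
As a rough userland illustration of the `BOM+<?` idea, the rule could be sketched as follows. The actual change would live in the lexer, which this does not attempt to model, and the function name `stripBomBeforeOpenTag` is hypothetical:

```php
<?php
// The three bytes that encode U+FEFF in UTF-8.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical stand-in for the lexer rule: drop a leading BOM only when
// "<?" follows it immediately; any other leading content is left untouched.
function stripBomBeforeOpenTag(string $source): string
{
    if (strncmp($source, UTF8_BOM . '<?', 5) === 0) {
        return substr($source, 3);
    }
    return $source;
}

// BOM directly before the open tag: stripped.
var_dump(stripBomBeforeOpenTag(UTF8_BOM . "<?php echo 1;") === "<?php echo 1;"); // bool(true)

// BOM before template text: kept, since it may be intended output.
var_dump(stripBomBeforeOpenTag(UTF8_BOM . "Hello <?= 1 ?>") === UTF8_BOM . "Hello <?= 1 ?>"); // bool(true)
```
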
    > So how about declare statement?
    >
    > {U+FEFF}<?php
    > declare(strip_bom=true);
    That presumes you know there's BOM in the beginning of your file. If so,
    why don't you just delete it instead of typing a long declare directive?
    If you don't know it, you'd be forced to add it to every (non-template)
    file in your codebase - which sounds a bit excessive.

    --
    Stas Malyshev
    smalyshev@gmail.com
  • Sara Golemon at May 31, 2016 at 3:52 am

    On Mon, May 30, 2016 at 7:18 PM, Stanislav Malyshev wrote:
    >> In fact, the idea of stripping content from a script file isn't
    >> without precedent. Shebang lines are routinely removed from
    >> cli/cgi/fpm, and if you want to properly output it, you need to do so
    > True, because in the context of CLI we know what is expected - a CLI
    > script which can start with #!. It is very unlikely that we'd have a
    > template run directly as CLI script and we would have this template
    > starting with #! which we want to output. But we lack such context in a
    > generic script - namely, the context that would tell us if it's safe to
    > drop the BOM.
    That was the idea of the declare(), to provide that context, since it
    can't be reliably inferred.
    >> So can we apply the same to the BOM? There's the obvious BC danger of
    >> files which might depend on this behavior (declaring their encoding
    >> via BOM, which happens to be the same as the script encoding).
    > Given that BOM in script files is mostly useless, and BOM in UTF-8 is
    > useless and not recommended for use either, I don't see why we need to.
    >
    > In general, I don't think BOM is a real issue worth messing with the
    > lexer. Surely, from time to time somebody would use weird editor which
    > produces BOMs, like editing PHP scripts in Word. Surely, they'd have
    > weird effects that would force them to spend 5 minutes googling and
    > fixing it. I don't think it is the reason to spend day-persons of our
    > collective time to find a fix to this very niche problem and risk
    > potential BC issues.
    Agreed it's niche, and agreed that it's mostly the editor's fault for
    putting the BOM in place to begin with. Disagree on the value of the
    time that would be needed to provide some sort of benefit.

    I will say though, that you're almost certainly right that it's not a
    significant problem (if it's one at all), and I'd want to hear from
    people who encounter this on a regular basis for which there isn't a
    much simpler fix available (such as disabling BOM emission in their
    editor of choice).
    > If it is really becoming an issue, we could probably make the lexer
    > treat BOM+<? the same as <?, but I'm not convinced it is a serious
    > enough issue.
    That's probably a reasonable compromise on the context issue. It
    provides a clean escape hatch for intentional BOMs by echoing those
    bytes from script, even if it is magic behavior which is generally to
    be avoided.
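
The escape hatch mentioned here could look like the following sketch, assuming the lexer dropped a leading BOM that sits directly before `<?`. The helper `emitWithBom` is hypothetical:

```php
<?php
// Hypothetical helper: a script that really wants a BOM in its output
// writes the three UTF-8 BOM bytes explicitly instead of relying on
// bytes that precede the open tag.
function emitWithBom(string $body): void
{
    echo "\xEF\xBB\xBF", $body;
}

emitWithBom("name,city\n");
```
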
    > That presumes you know there's BOM in the beginning of your file. If so,
    > why don't you just delete it instead of typing a long declare directive?
    Dunno. I just like to argue.

    -Sara
  • Andreas Heigl at May 31, 2016 at 7:42 am
    Hi All.

    As the BOM is only relevant on UTF-16 and UTF-32 encoded files and
    UTF-8-encoded files are strongly discouraged from having one[1] - (Use
    of a BOM is neither required nor recommended for UTF-8) there are two
    questions that arise IMO.

    1. Does PHP support files encoded in UTF-16 or UTF-32? If so, we need to
    handle the BOM somehow. If not, is that a requirement?

    2. Wouldn't it be an easier approach to have a userland lib that scans
    files for a BOM and raises a warning? Like an add-on to
    php-cs-fixer or something like that? Especially the UTF-8 BOM
    (\xEF\xBB\xBF) right at the start of a file would be easy to spot.
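
A userland check along the lines of point 2 might look like this. It is only a sketch: the function names `hasUtf8Bom` and `findBomFiles` are made up for illustration, it looks only at `.php` files, and it checks only for the UTF-8 BOM:

```php
<?php
// The three bytes that encode U+FEFF in UTF-8.
const UTF8_BOM = "\xEF\xBB\xBF";

// Returns true when the file's first three bytes are the UTF-8 BOM.
function hasUtf8Bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $head = fread($handle, 3);
    fclose($handle);
    return $head === UTF8_BOM;
}

// Recursively collects the .php files under $dir that start with a BOM.
function findBomFiles(string $dir): array
{
    $found = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if ($file->isFile()
            && $file->getExtension() === 'php'
            && hasUtf8Bom($file->getPathname())
        ) {
            $found[] = $file->getPathname();
        }
    }
    sort($found);
    return $found;
}

// CLI usage (assumed invocation): php find-bom.php path/to/project
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    foreach (findBomFiles($argv[1]) as $path) {
        fwrite(STDERR, "warning: UTF-8 BOM at start of $path\n");
    }
}
```

Reading only the first three bytes keeps the scan cheap even on large trees, which is what makes a lint-style check a plausible alternative to changing the engine.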

    Just my 0.02€

    Cheers

    Andreas

    [1] www.unicode.org/versions/Unicode5.0.0/ch02.pdf#page=30


    --
                                                                   ,,,
                                                                  (o o)
    +---------------------------------------------------------ooO-(_)-Ooo-+
    Andreas Heigl |
    mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
    http://andreas.heigl.org http://hei.gl/wiFKy7 |
    +---------------------------------------------------------------------+
    http://hei.gl/root-ca |
    +---------------------------------------------------------------------+
  • Derick Rethans at May 31, 2016 at 10:11 am

    On Mon, 30 May 2016, Stanislav Malyshev wrote:

    > If it is really becoming an issue, we could probably make the lexer
    > treat BOM+<? the same as <?, but I'm not convinced it is a serious
    > enough issue.
    That would break the case when somebody is trying to serve/generate
    a file which starts with a BOM though....

    cheers,
    Derick

    --
    http://derickrethans.nl | http://xdebug.org
    Like Xdebug? Consider a donation: http://xdebug.org/donate.php
    twitter: @derickr and @xdebug
    Posted with an email client that doesn't mangle email: alpine
  • Ángel González at May 31, 2016 at 2:06 am

    On 31/05/16 02:33, Sammy Kaye Powers wrote:
    > BOMs should not be treated as characters and should not be sent to
    > the output. Is there any reason this should be considered the expected
    > behavior? If not, I'd like to create an RFC to change it. :)
    What about
    «Hello Foo!
    Today is <?= date("d F Y") ?>» ?

    If there's a BOM, should it be sent?
  • Andrea Faulds at May 31, 2016 at 11:52 am
    Hi Sammy,

    Sammy Kaye Powers wrote:
    > If you create a PHP file with the following:
    >
    > <?php
    > header("X-foo: Bar");
    > echo "Foo!".PHP_EOL;
    >
    > And save it as UTF-8 with BOM, interesting things happen depending on
    > the SAPI & configuration.
    >
    > If you run it from the CLI you get an error:
    > PHP Warning: Cannot modify header information - headers already sent by (output started at %s:1) in %s on line %d
    > But it doesn't seem to return the BOM to stdout (but I could be doing
    > this part wrong). If you run it from `php -S`, and load it in a
    > browser, the web server returns a code point \u{feff} as the first
    > code point of the response body.
    >
    > BOMs should not be treated as characters and should not be sent to
    > the output. Is there any reason this should be considered the expected
    > behavior? If not, I'd like to create an RFC to change it. :)
    I suspect that this part of the Zend Engine is much-neglected, but PHP
    actually can detect the BOM, and strip it from the output, if you have
    zend.multibyte turned on:

    https://github.com/php/php-src/blob/3b0a6dfeb2896fb204db48d11364c09942b1ad01/Zend/zend_language_scanner.l#L292

    I haven't tried this myself, though.

    Thanks.
    --
    Andrea Faulds
    https://ajf.me/

Discussion Overview
group: php-internals
categories: php
posted: May 31, '16 at 12:33a
active: May 31, '16 at 11:52a
posts: 9
users: 7
website: php.net
