FAQ
hi,

First, we are working on providing updated numbers using 5.6 vs
5.6+patch, to have an actual baseline for reference. Wordpress,
Symfony and Drupal will be used.

Also, Anthony posted something on IRC which summarizes very well what
has been said in the other threads, be it recently or in the previous
vote to merge the 64bit patch in 5.x.

I will simply copy it here. I seriously hope everyone will read it
carefully and vote with all the facts in mind, keeping the big picture
and the long term in view. Thanks.

+++ Anthony's post +++

This thread has been pointed out to me by a few people. As the
originator of this patch and concept I feel that I should clarify a
few points.

# Rationale

The reason that I originally started this patch was to clean up and
standardize the underlying types. This is to introduce predictability,
portability and type sanity into the engine and entire cphp
implementation.

## Rationale for Int64:

Without this patch, the size of integers (longs) varies based on which
compiler you use. This means that even for identical target
architectures, behavior can change with respect to userland code.
Refactoring this allows for consistent sizes that the programmer can
rely upon, making integer width predictable across platforms.

And ideally this comes at no cost for most implementations: on most
64-bit platforms integers (longs) are already 64 bits wide, so there
is no memory overhead, and performance stays the same as well.
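
To make the width issue concrete, here is a minimal sketch (illustrative
only, not part of the patch) comparing plain long with a fixed-width type:

     #include <stdio.h>
     #include <stdint.h>

     int main(void)
     {
         /* On LP64 (Linux, OS X) long is 8 bytes; on LLP64 (64-bit
          * Windows) it is 4 bytes, even on identical hardware. */
         printf("sizeof(long)    = %zu\n", sizeof(long));
         /* int64_t is 8 bytes on every conforming platform. */
         printf("sizeof(int64_t) = %zu\n", sizeof(int64_t));
         return 0;
     }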

## Rationale for size_t (string lengths):

This has significant advantages. There are some costs to doing it, but
they are not as significant as they may appear on the surface. Let's
dive into it:

### It's The Correct Data Type

The C89 spec indicates in 3.3.3.4 (
http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
the size_t type was created specifically for usage in this context. It
is always 100% guaranteed to be able to hold the bounds of every
possible array element. Strings in C are simply char arrays.
Therefore, the correct data type to use for string sizes (which really
are just an offset qualifier) is size_t.

Additionally, calloc, malloc, etc all expect parameters of type size_t
for exactly this reason.

Another good reference on it: http://www.viva64.com/en/a/0050/
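
As a tiny illustration (a sketch, not engine code): the C standard
library already speaks size_t at every boundary, so storing a length in
an int forces conversions at each call:

     #include <stdlib.h>
     #include <string.h>

     int main(void)
     {
         const char *src = "hello";
         size_t len = strlen(src);      /* strlen returns size_t */
         char *copy = malloc(len + 1);  /* malloc expects size_t */
         if (copy != NULL) {
             memcpy(copy, src, len + 1);
             free(copy);
         }
         return 0;
     }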

### It's The Secure Data Type

size_t (and ptrdiff_t) are the only C89 types that are 100% guaranteed
to be able to hold the size of any possible object that the compiler
will support. Other types will vary depending on the data model that
the compiler supports, as the spec only defines minimum widths.

This is so important that CERT issued a coding standard for it:
INT01-C ( https://www.securecoding.cert.org/confluence/display/seccode/INT01-C.+Use+rsize_t+or+size_t+for+all+integer+values+representing+the+size+of+an+object
).

One of the reasons is that it's difficult to do overflow checks in a
portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
In there, they recommend using the C99 uintptr_t type, but suggest
using size_t for platforms that don't have uintptr_t support (and
since we target C89 for the engine, that's out).

Apple's Secure Coding Guide's section on Avoiding Integer Overflows
and Underflows says the same thing:
https://developer.apple.com/library/mac/documentation/security/conceptual/securecodingguide/Articles/BufferOverflows.html
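
For illustration, here is one C89-portable way to guard a size_t
addition against wrap-around, in the spirit of INT01-C (a sketch;
(size_t)-1 is used for the maximum value because SIZE_MAX is C99):

     #include <stddef.h>

     /* Returns 1 and stores a + b in *out if it fits, 0 on overflow. */
     static int size_add_ok(size_t a, size_t b, size_t *out)
     {
         if (a > (size_t)-1 - b)
             return 0; /* a + b would wrap around */
         *out = a + b;
         return 1;
     }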

### About Long Strings

The fact that changing to size_t allows strings (and arrays) to be >
4gb is a side-effect. A welcome one, but a side effect none the less.
The primary reason to use it is that it's the correct data type, and
gives you the most safety and security.

# Response To Concerns Mentioned

I'll respond here to some of the concerns mentioned in this thread:

## size_t uses more memory and will result in more CPU cache misses,
which will result in worse performance

Well, size_t will use more memory. No doubt about that.

But the performance side is more nuanced. And as several benchmarks in
this thread indicate, there isn't a practical difference. Heck, the
benchmarks on Windows show an improvement in some cases.

And there is a reason for that. Since a pointer is a 64-bit data type
and an int is a 32-bit data type, any time you add the two, extra CPU
cycles are needed for the conversion. This can be clearly seen by
analyzing a simple malloc call with an int vs. a size_t parameter.
Here's the diff:

     < movl $5, -12(%rbp)
     < movl -12(%rbp), %eax
     < cltq
     ---
     > movq $5, -16(%rbp)
     > movq -16(%rbp), %rax

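For reference, the kind of source behind a diff like that is simply the
following (assuming x86-64 with gcc-style output; the exact instructions
vary by compiler):

     #include <stdlib.h>

     int main(void)
     {
         int    n32 = 5; /* needs movl plus cltq (sign-extension) */
         size_t n64 = 5; /* already register-width: movq only */
         free(malloc(n32));
         free(malloc(n64));
         return 0;
     }
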
Now, a cache miss is much more expensive than a cast, but we don't
have proof that cache misses will actually occur.

In fact, in the benchmarks, the worst difference is 2%, which is
hardly significant (as indicated by several people here). But also
notice that in both benchmarks (those done by Microsoft and those done
by Dmitry), some specific tests actually executed **faster** with the
size_t transforms (namely Hello World, Wordpress, etc.). So even that
2% figure is not really the full story.

We'll come back to the memory thing in a bit.

## Macro Renames and ZPP Changes

This was my idea, and I don't think the reasoning behind it has been
properly explained yet, so let me do that here.

### ZPP Changes

The ZPP changes are critical. The reason is that varargs casts an
arbitrary block of memory to a type and then writes to it. So existing
code that does zpp("s", str, &int_len) would wind up with a buffer
overflow, because zpp would be trying to write a 64-bit value into a
32-bit container. The other 32 bits would fall off the end, into who
knows what. At BEST this can result in a segfault. At worst, memory
corruption and MASSIVE security vulnerabilities.

Also note that the compiler *can't* and actively doesn't catch these
types of errors. That means it is largely luck and testing that will
surface them.

So I chose to break BC and rename the ZPP symbols, because that WILL
error and provide the developer with a meaningful indication that an
improper data type was provided. I considered a fatal error on an
invalid type a better way of telling the developer "HEY, THIS NEEDS TO
BE CHANGED ASAP" than just letting them hit random segfaults at
runtime.

If there is a way to get around this by giving the compiler more
information, then let's do it. But to just leave the types there, and
leave it to chance whether a buffer overflow occurs, is dangerous.
Which is why I made the call that the ZPP types **needed** to be
changed.
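
To see why the compiler can't help here, consider this stand-alone
sketch of the hazard (the hypothetical get_len stands in for zpp; all
names are illustrative):

     #include <stdarg.h>
     #include <stdint.h>

     /* The callee assumes it was handed a pointer to 64-bit storage. */
     static void get_len(const char *fmt, ...)
     {
         va_list ap;
         int64_t *len;
         (void)fmt;
         va_start(ap, fmt);
         len = va_arg(ap, int64_t *);
         *len = 42; /* writes 8 bytes, whatever the caller allocated */
         va_end(ap);
     }

     int main(void)
     {
         int32_t len;        /* caller still provides a 32-bit container */
         get_len("s", &len); /* 4 bytes past &len are clobbered: undefined
                                behavior the compiler cannot diagnose */
         return 0;
     }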

### Macro Renames

The reason for the rename is largely the same as for the ZPP changes.
The severity of not changing is lower, since the compiler will warn
and do an implicit cast for you, but the risk is still there. This is
less critical, but the rename better indicates to the developer what
needs to change to properly support the new system.

## Memory Overhead

This is definitely a concern. There is a potential to double the
amount of memory that PHP takes. Which on the surface looks enormous.
And if we stop at the surface, we definitely shouldn't do it!

But as we look deeper, we see that in actuality the difference is not
double. In fact, most data structures, as identified by Dmitry
himself, only increase by between 6% (zend_op_array) and 50%
(zend_string). So that "double" figure quickly drops.
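
Roughly speaking, the growth comes from the length field doubling in
width. A simplified sketch (not the actual engine layout):

     #include <stddef.h>

     struct str_old { unsigned int refcount; unsigned int len; char val[1]; };
     struct str_new { unsigned int refcount; size_t       len; char val[1]; };
     /* On LP64, sizeof(struct str_old) == 12 but sizeof(struct str_new)
      * == 24, because size_t forces 8-byte alignment and padding. */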

But that's at the structure level. Let's look at what actually happens
in practice. Dmitry himself also provides these answers. The average
memory increase is 8% for Wordpress, and 6% for ZF1.

Let's put that 8% in context. Wordpress used 12MB, and now it uses
13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it
uses 31MB. Still not overly significant.

Don't get me wrong, it's still more. And more is bad. But it's not
nearly as bad as it's being made out to be.

To put this into context, 5.4 saved up to 50% memory from 5.3
(depending on benchmark). 8 << 50.

Now, I'm not saying that memory should be thrown around willy-nilly.
But given the rationale that I gave above, I think the benefits of
sanity, portability and security clearly are significant enough for
the relatively small cost in memory.

Cheers,
--
Pierre

@pierrejoye | http://www.libgd.org

  • Nikita Popov at May 14, 2014 at 6:48 pm

    On Wed, May 14, 2014 at 6:39 PM, Pierre Joye wrote:

    ## Rationale for size_t (string lengths):

    This has significant advantages. There are some costs to doing it, but
    they are not as significant as they may appear on the surface. Let's
    dive into it:

    ### It's The Correct Data Type

    The C89 spec indicates in 3.3.3.4 (
    http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
    the size_t type was created specifically for usage in this context. It
    is always, 100% guaranteed to be able to hold the bounds of every
    possible array element. Strings in C are simply char arrays.
    Therefore, the correct data type to use for string sizes (which really
    are just an offset qualifier) is size_t.
    I don't have a copy of the C standard handy, so I'll quote what C++
    has to say on size_t: "The type size_t is an implementation-defined
    unsigned integer type that is large enough to contain the size in
    bytes of any object."
    So a size_t type can contain the size of *any* object. If we want to
    support *any* object, then we must use the size_t type, that's correct.
    However, we may reasonably choose to disallow creating objects of certain
    types (e.g. strings) with sizes that are larger than a uint32_t, to save
    memory. If we chose to implement such a restriction, the uint32_t type
    would provide us with the *exact* same guarantees as size_t does - it is
    large enough to contain the size of any object that we allow to be created.
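
    As a sketch of that argument (alloc_capped is a hypothetical helper,
    not engine code): once every allocation is capped, a uint32_t length
    is provably sufficient:

        #include <stdint.h>
        #include <stdlib.h>

        /* Refuse to create objects larger than 4GB - 1; any length that
         * passes this check fits in a uint32_t. */
        static void *alloc_capped(size_t n)
        {
            if (n > UINT32_MAX)
                return NULL;
            return malloc(n);
        }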

    ### It's The Secure Data Type
    size_t (and ptrdiff_t) are the only C89 types that are 100% guaranteed
    to be able to hold the size of any possible object that the compiler
    will support. Other types will vary depending on the data model that
    the compiler supports, as the spec only defines minimum widths.

    This is so important that CERT issued a coding standard for it:
    INT01-C (
    https://www.securecoding.cert.org/confluence/display/seccode/INT01-C.+Use+rsize_t+or+size_t+for+all+integer+values+representing+the+size+of+an+object
    ).

    One of the reasons is that it's difficult to do overflow checks in a
    portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
    In there, they recommend using the C99 uintptr_t type, but suggest
    using size_t for platforms that don't have uintptr_t support (and
    since we target C89 for the engine, that's out).
    Overflow checks are an issue that is independent of the chosen size
    storage type. They always need to occur. The addition of two size_t
    values can overflow in just the same way as the addition of two
    uint32_t values. There is no difference in security between using
    size_t and uint32_t. The security issue lies in the absence of proper
    overflow checks (or in the presence of incorrect overflow checks).
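
    In code form (a minimal demonstration): unsigned addition wraps at
    any width, so it is the check, not the width, that provides safety:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t a = UINT32_MAX; /* 2^32 - 1 */
            size_t   b = (size_t)-1; /* 2^64 - 1 on LP64 */
            printf("%u\n",  a + 1u); /* wraps to 0 */
            printf("%zu\n", b + 1u); /* wraps to 0 as well */
            return 0;
        }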

    If anybody can come up with an example where usage of uint32_t over size_t
    (in a context where we ensured that allocations do not surpass that size)
    causes security or other safety issues, please speak up. I wasn't able to
    find anything in that direction.

    ### ZPP Changes
    The ZPP changes are critical. The reason is that varargs is casting an
    arbitrary block of memory to a type, and then writing to it. So
    existing code that does zpp("s", str, &int_len) would wind up with a
    buffer overflow. Because zpp would be trying to write a 64 bit value
    to a 32 bit container. The other 32 bits would fall off the end, into
    who knows what. At BEST this can result in a segfault. At worst,
    memory corruption and MASSIVE security vulnerabilities.

    Also note that the compiler *can't* and actively doesn't catch these
    types of errors. That means that it's largely luck and testing that
    will lead to it.

    So, I chose to break BC and rename the ZPP symbols. Because that WILL
    error, and provide the developer with a meaningful indication that an
    improper data type was provided. As I considered a fatal error that an
    invalid type was supplied was a better way of identifying to the
    developer that "HEY, THIS NEEDS TO BE CHANGED ASAP" than just letting
    them hit random segfaults at runtime.

    If there is a way to get around this by giving the compiler more
    information, then do it. But to just leave the types there, and leave
    it to chance if a buffer overflow occurs, is dangerous. Which is why I
    made the call that the ZPP types **needed** to be changed.
    Afaik Johannes has written a Clang-based static analyzer that detects type
    mismatches between the zpp format string and the passed arguments (gcov
    also seems to have some kind of analyzer for that). That seems like a good
    way to find incorrect declarations without having to rename everything -
    and this also covers checks for all types that weren't changed :)

    Nikita
  • Stas Malyshev at May 15, 2014 at 10:03 pm
    Hi!
    ### It's The Correct Data Type

    The C89 spec indicates in 3.3.3.4 (
    http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
    the size_t type was created specifically for usage in this context. It
    is always, 100% guaranteed to be able to hold the bounds of every
    possible array element. Strings in C are simply char arrays.
    Here is my problem with it - we don't need a type that can hold the
    bounds of every possible array element. It's like buying a house that
    could hold all your relatives, acquaintances, friends and everyone you
    have ever met, in case they all decided to come stay with you at once.
    Too expensive and very impractical. We're using unified string sizes
    now, and 99% of the strings we're using never even reach the limits of
    16 bits, let alone come close to the limits of int. Carrying around a
    64-bit value to store that is just waste. We'd be storing megabytes of
    zeroes without any use. Whatever theoretical reasons there are for a
    generic application to use size_t, they can hardly be applied to a
    case as specialized and supposedly highly optimized as a language
    engine. It's a fine argument in the generic case, but we do not have
    the generic case here.
    ### It's The Secure Data Type
    "Security" is quickly becoming a thought-terminating cliche, both in
    programming and outside. "It's necessary for security", ergo we should
    pay whatever it takes for it and not question it. It's not right and we
    should not fall into this trap. We can and should question it, and
    evaluate the costs carefully.

    In most cases where we deal with possible overflows, 64-bit value can be
    overflown just as 32-bit can. Most integer overflows we've had in PHP
    were demonstrable on both 32-bit and 64-bit platforms.
    size_t (and ptrdiff_t) are the only C89 types that are 100% guaranteed
    to be able to hold the size of any possible object that the compiler
    will support. Other types will vary depending on the data model that
    the compiler supports, as the spec only defines minimum widths.
    Again, as I pointed out, it is fine in theory, but in practice we a)
    don't need that and b) it doesn't actually add much security-wise, as
    these values can still be overflown with the same ease in most cases
    (e.g., see the recently fixed array size bug - same overflow
    regardless of the length variable's size). Yes, we should be careful
    about mixing int with size types, and that's why I welcome the
    introduction of size types to emphasize this. However, making them
    64-bit is a different question for me.
    This is very good generic advice for writing generic functions like
    the copy() example there. However, again, we don't have the generic
    case here. We can have our copy()'s and emalloc()'s work with size_t.
    However, when we are talking about zval, it's a bit of a different
    story.
    One of the reasons is that it's difficult to do overflow checks in a
    portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
    I agree, doing it right may be tricky. However, we already have
    primitives for doing it that work and are optimized for common
    platforms. It is a solved problem. So bringing it as an argument does
    not make much sense - we don't need an additional effort to do this
    difficult thing, because we've already done it. Of course, these checks
    have to be actually used - but they would have to be used in 64-bit case
    too!
    ### About Long Strings

    The fact that changing to size_t allows strings (and arrays) to be >
    4gb is a side-effect. A welcome one, but a side effect none the less.
    The primary reason to use it is that it's the correct data type, and
    gives you the most safety and security.
    Here I must disagree again, as I see inflating the string length
    variable as a most unwelcome side effect. We may yet find a way to
    tolerate and work around it, but by itself it is nothing but a drag
    for everybody except the 0.0001% of PHP developers who actually find
    it necessary to stuff 4GB strings into PHP.
    But that's at the structure level. Let's look at what actually happens
    in practice. Dmitry himself also provides these answers. The average
    memory increase is 8% for Wordpress, and 6% for ZF1.

    Let's put that 8% in context. Wordpress used 12MB, and now it uses
    13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it
    uses 31MB. Still not overly significant.
    I think it is pretty significant. If we could reduce memory usage by
    6-8%, would we consider it a win? I think we would. Thus, we should
    consider the same increase a loss. However, the bigger loss may be in
    inflating the sizes of frequently-used structures like zend_string.

    I think we should look very closely at how we can reduce the memory
    impact and not just dismiss it as insignificant. I like the idea of
    the patch, and the cleanup of the types and 64-bit support has been
    long overdue. However, I would hate to pay for that by dragging
    literally megabytes of zeroes around for no purpose but to satisfy an
    abstract requirement written for the generic case.
    --
    Stanislav Malyshev, Software Architect
    SugarCRM: http://www.sugarcrm.com/
    (408)454-6900 ext. 227
  • Pierre Joye at May 16, 2014 at 3:31 am
    Hi Stas,
    On Fri, May 16, 2014 at 12:03 AM, Stas Malyshev wrote:
    Hi!
    ### It's The Correct Data Type

    The C89 spec indicates in 3.3.3.4 (
    http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
    the size_t type was created specifically for usage in this context. It
    is always, 100% guaranteed to be able to hold the bounds of every
    possible array element. Strings in C are simply char arrays.
    Here is my problem with it - we don't need a type that allows to hold
    the bounds of every possible array element. It's like buying a house
    that could contain all your relatives, acquaintances, friends and people
    that you have ever met if they would decide to come to you to stay all
    at once.
    Too expensive and very impractical. We're using unified string
    sizes now, and 99% of the strings we're using never even reach limits of
    16 bits, let alone come close to limits of int. Carrying around a 64-bit
    value to store that is just waste. We'd be just storing megabytes of
    zeroes without any use. Whatever theoretical reasons there are for
    generic application to use that, they hardly can be applied to very
    specialized and supposed to be highly optimized case as the language
    engine is. It's a fine argument in the generic case, but we do not have
    the generic case here.

    I wonder whether people actually read this and the other replies
    correctly, but let me say it again:

    It is a side effect, not a goal, at all.
    ### It's The Secure Data Type
    "Security" is quickly becoming a thought-terminating cliche, both in
    programming and outside. "It's necessary for security", ergo we should
    pay whatever it takes for it and not question it. It's not right and we
    should not fall into this trap. We can and should question it, and
    evaluate the costs carefully.

    In most cases where we deal with possible overflows, 64-bit value can be
    overflown just as 32-bit can. Most integer overflows we've had in PHP
    were demonstrable on both 32-bit and 64-bit platforms.
    I disagree here, even more so lately.
    This is a very good generic advice for writing generic functions like
    copy() example there. However, again, we don't have generic case here.
    We can have our copy()'s and emalloc()'s work with size_t. However, when
    we talking about zval, it's a bit different story.
    I disagree here too, and I am really not alone; if we keep an open
    mind, it has been shown that this is the way to go. I am not sure what
    else I can say, as we have already answered this point numerous times.
    One of the reasons is that it's difficult to do overflow checks in a
    portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
    I agree, doing it right may be tricky. However, we already have
    primitives for doing it that work and are optimized for common
    platforms. It is a solved problem. So bringing it as an argument does
    not make much sense - we don't need an additional effort to do this
    difficult thing, because we've already done it. Of course, these checks
    have to be actually used - but they would have to be used in 64-bit case
    too!
    My point exactly: we did not do this job, not even remotely.
    ### About Long Strings

    The fact that changing to size_t allows strings (and arrays) to be >
    4gb is a side-effect. A welcome one, but a side effect none the less.
    The primary reason to use it is that it's the correct data type, and
    gives you the most safety and security.
    Here I must disagree again, as I see inflating the string length
    variable as the most unwelcome side effect. Which we may yet find a way
    to tolerate it and work around it, but by itself it is nothing but a
    drag for anybody but 0.0001% of PHP developers who actually finds it
    necessary to stuff 4G strings into PHP.
    I have to sigh here. Again, we do not care about 4GB strings; it is a side effect.

    But that's at the structure level. Let's look at what actually happens
    in practice. Dmitry himself also provides these answers. The average
    memory increase is 8% for Wordpress, and 6% for ZF1.
    These numbers are wrong; I showed you, and the list, other numbers,
    which are half of what Dmitry shows. And that is without investigating
    phpng more deeply, which offers more room for improvement and
    optimization than 5.x, which we use as the base for tests,
    compatibility and performance comparisons. How hard is it to
    understand that we cannot use phpng as the base as of now?
    Let's put that 8% in context. Wordpress used 12MB, and now it uses
    13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it
    uses 31MB. Still not overly significant.
    I think it is pretty significant. If we could reduce memory usage by
    6-8%, would we consider it a win? I think we would. Thus, we should
    consider the same increase a loss. However, the bigger loss may be in
    inflating the sizes of frequently-used structures like zend_string.
    It is not 8%. Again.
    I think we should look very closely at how we can reduce the memory
    impact and not just dismiss it as insignificant. I like the idea of the
    patch, and the cleanup of the types and 64-bit support has been long
    overdue. However, I would hate to pay for that by dragging literally
    megabytes of zeroes around for no purpose but to satisfy an abstract
    requirement written for generic case.
    I agree here. Even at 4%, we can improve it more. And we will.

    However, it seems that some readers take wrong numbers as fact and
    prototypes as final versions, and then decide to shut down a long
    overdue change based on these temporary numbers, which will change
    anyway before phpng gets anywhere near alpha. I am somehow out of
    arguments to explain that :)

    Cheers,
    --
    Pierre

    @pierrejoye | http://www.libgd.org
  • Ulf Wendel at May 16, 2014 at 5:26 am

    On 16.05.2014 05:31, Pierre Joye wrote:
    Hi Stas,
    On Fri, May 16, 2014 at 12:03 AM, Stas Malyshev wrote:
    Hi!
    ### It's The Correct Data Type

    The C89 spec indicates in 3.3.3.4 (
    http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
    the size_t type was created specifically for usage in this context. It
    is always, 100% guaranteed to be able to hold the bounds of every
    possible array element. Strings in C are simply char arrays.
    Here is my problem with it - we don't need a type that allows to hold
    the bounds of every possible array element. It's like buying a house
    that could contain all your relatives, acquaintances, friends and people
    that you have ever met if they would decide to come to you to stay all
    at once.
    Too expensive and very impractical. We're using unified string
    sizes now, and 99% of the strings we're using never even reach limits of
    16 bits, let alone come close to limits of int. Carrying around a 64-bit
    value to store that is just waste. We'd be just storing megabytes of
    zeroes without any use. Whatever theoretical reasons there are for
    generic application to use that, they hardly can be applied to very
    specialized and supposed to be highly optimized case as the language
    engine is. It's a fine argument in the generic case, but we do not have
    the generic case here.

    I wonder whether people actually read this and the other replies
    correctly, but let me say it again:

    It is a side effect, not a goal, at all.
    Assuming unified behaviour is the #1 goal, the lowest common denominator
    could be used instead of the largest. The latter obviously has
    undesired side effects.

    Ulf
  • Pierre Joye at May 16, 2014 at 5:30 am

    On Fri, May 16, 2014 at 7:25 AM, Ulf Wendel wrote:

    Assuming unified behaviour is the #1 goal, the lowest common denominator
    could be used instead of the largest. The latter obviously has
    undesired side effects.
    Nice rhetoric, but sadly missing some of the key points:

    - for class/variable names and other language keywords, we can keep
    what we have (see my other replies for the reasons)
    - we can always limit the maximum size of a storage (hash, zval,
    buffer, etc.) at the implementation level
       . we already do that
       . it still gives us the benefits of using a clean, safe and
    standard implementation (see other replies for a more detailed
    explanation of these points)
    - the lowest common denominator, platform-wide, is what we do with
    this patch

    --
    Pierre

    @pierrejoye | http://www.libgd.org
  • Ulf Wendel at May 16, 2014 at 7:41 am

    On 16.05.2014 07:30, Pierre Joye wrote:
    - the lowest common denominator, platform-wide, is what we do with this patch
    Aha. https://wiki.php.net/rfc/size_t_and_int64_next lists three 64bit
    platforms:

    int - 2x 32bit, 1x 64bit -> lowest common -> 32bit
    long - 2x 64bit, 1x 32bit -> lowest common -> 32bit

    And, the RFC proposes:

    s/int/size_t -> 64bit
    s/long/int64_t -> 64bit

    Ulf
  • Dmitry Stogov at May 16, 2014 at 7:50 am
    Pierre,

    1) if you reduce the size of all language entities to 32-bit, then
    what would this patch provide except extending strings to have a
    64-bit length? Yet you always say that is only a side effect (I'm
    not talking about the IS_LONG part).

    2) In phpng we use zend_string* for class, function and other names, so
    extending zend_string would extend all of them anyway.

    Thanks. Dmitry.

    On Fri, May 16, 2014 at 9:30 AM, Pierre Joye wrote:
    On Fri, May 16, 2014 at 7:25 AM, Ulf Wendel wrote:

    Assuming unified behaviour is the #1 goal, the lowest common denominator
    could be used instead of the largest. The latter obviously has
    undesired side effects.
    Nice rhetoric but sadly missing some of the key points:

    - for class/variables and other language keywords, we can keep what we
    have (see my other replies for the reasons)
    - we can always limit the maximum size of a storage (hash, zval,
    buffer, etc) at the implementation
    . we already do that
    . it still gives us the benefits of using a clean, safe and
    standard implementation (see other replies for a more detailed
    explanation about these points)
    - the lowest common denominator, platform-wide, is what we do with
    this patch

    --
    Pierre

    @pierrejoye | http://www.libgd.org

  • Pierre Joye at May 16, 2014 at 8:59 am

    On Fri, May 16, 2014 at 9:50 AM, Dmitry Stogov wrote:
    Pierre,

    1) if you reduce size off all language entities to 32-bit, then what this
    patch would provide except of extending string to have 64-bit length?
    but you always talk - it's a side effect (I'm not talking about IS_LONG
    part).
    You keep using the same arguments in a loop, and I keep giving the
    same answers; I am not sure that will lead us anywhere. Anyway, let
    me try one last time.

    Being able to have strings of up to 2^64-1 bytes (the -1 because
    zend_string uses unsigned char) is by far not the goal; the goal is
    what is very well described in the links we posted, the answers we
    gave, etc.
    2) In phpng we use zend_string* for class, function and other names, so
    extending zend_string would extend all of them anyway.
    Besides the fact that phpng is still a huge prototype and many things
    may still change in the coming months or year, choices being made now
    are not written in stone. And as I said numerous times already, I
    have not had the time to analyze phpng more deeply yet, being busy
    fixing it so that it at least builds and does not crash on the
    supported platforms.

    Cheers,
    --
    Pierre

    @pierrejoye | http://www.libgd.org
  • Yasuo Ohgaki at Jul 5, 2014 at 10:12 pm

    On Thu, May 15, 2014 at 1:39 AM, Pierre Joye wrote:

    First, we are working on providing updated numbers using 5.6 vs
    5.6+patch, to have an actual baseline for reference. Wordpress,
    Symfony and Drupal will be used.

    [snip: the rest of Pierre's message, including Anthony's full post,
    is quoted above]
    Nice post.
    Rather than the short-term outcome, we should consider the next
    decade for the new major release.

    The memory overhead is reasonable to accept.
    The performance impact is reasonable to accept.
    And most importantly, a cleaner, consistent, safer, more modern and
    extensible code base is a must-have for the next decade. IMHO.

    I hope everyone agrees to the change.

    Regards,

    --
    Yasuo Ohgaki
    yohgaki@ohgaki.net
