Hi,

I'm grappling with a design flaw I just uncovered in stream filters, and
need some advice on how best to fix it. The problem has existed since the
introduction of stream filters, and has three parts. Two of them can
probably be fixed safely in PHP 5.2+, but I think the third may require an
internal redesign of stream filters, and so would probably have to wait
for PHP 5.3+, even though it is a clear bugfix (Ilia, your opinion is
appreciated on this).

The first part of the bug that I encountered is best described here:
http://bugs.php.net/bug.php?id=46026. However, it is a deeper problem
than this, as the attempt to cache data is dangerous any time a stream
filter is attached to a stream. I should also note that the patch in
this bug contains feature additions that would have to wait for PHP 5.3.

I ran into this problem because I was trying to use stream filters to
read in a bz2-compressed file within a zip archive in the phar
extension. This was failing, and I first tracked the problem down to an
attempt by php_stream_filter_append to read in a bunch of data and cache
it, which caused more stuff to be passed into the bz2 decompress filter
than it could handle, making it barf. After fixing this, I ran into
the problem described in the bug above: php_stream_fill_read_buffer was
doing the same thing when I tried to read the data. I had requested 176
decompressed bytes, so php_stream_read passed 176 bytes into the
decompress filter. Only 144 of those bytes were actually bz2-compressed
data, and the filter barfed upon trying to decompress the remainder
(same as bug #46026, found differently).

You can probably tell from my explanation that this is an
extraordinarily complex problem. There are three interrelated problems here:

1) The bz2 (and zlib) stream filters should stop trying to decompress
when they reach the stream end, regardless of how many bytes they are
told to decompress (easy to fix; a userland sketch of this behavior
follows the list)
2) It is never safe to cache read data when a read stream filter is
appended, as there is no safe way to determine in advance how much of
the stream can be safely filtered (would be easy to fix if it weren't
for #3)
3) There is no clear way to request that a certain number of filtered
bytes be returned from a stream, versus how many unfiltered bytes should
be passed into the filter (very hard to fix without a design change)
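
To make #1 concrete, here is a userland sketch of the desired behavior.
This is illustrative only: the real fix belongs in the C filters in
ext/bz2 and ext/zlib, the filter name is invented, and a NUL byte stands
in for libbz2/zlib reporting stream end:

<?php
// Userland model of fix #1: once the end of valid input is reached,
// trailing bytes are consumed silently instead of triggering an error.
// 'demo.stopatend' is an invented name; the NUL byte stands in for
// BZ_STREAM_END / Z_STREAM_END from the underlying library.
class StopAtEndFilter extends php_user_filter
{
    private $done = false;

    public function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            if (!$this->done) {
                $end = strpos($bucket->data, "\0");
                if ($end !== false) {
                    $bucket->data = substr($bucket->data, 0, $end);
                    $this->done = true; // end of valid input reached
                }
                stream_bucket_append($out, $bucket);
            }
            // once $done, extra input is swallowed without error
        }
        return PSFS_PASS_ON;
    }
}
stream_filter_register('demo.stopatend', 'StopAtEndFilter');
?>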

I need some advice on #3 from the original designers of stream filters
and streams, as well as from anyone who has dealt with this kind of
problem in other contexts. Should we expect stream filters to always
stop filtering once they reach the end of valid input? Even then, the
amount of data consumed and produced need not match the amount
requested. A clear example: suppose we requested only 170 bytes. The
complete compressed data, 144 bytes, would be passed in, and
bz2.decompress would decompress all of it to 176 bytes. 170 of those
bytes would be returned from php_stream_read, and 6 would have to be
placed in a cache for future reads. Thus, there would need to be some
way of marking the cache as valid because of this logic path:

<?php
$a = fopen('blah.zip', 'rb');
fseek($a, 132); // fills the read buffer with unfiltered data
stream_filter_append($a, 'bzip2.decompress'); // clears the read buffer cache
$b = fread($a, 170); // leaves 6 filtered bytes in the read buffer cache
fseek($a, 3, SEEK_CUR); // should seek within the filtered read buffer cache
stream_filter_append($a, 'zlib.inflate');
?>

The question is what should happen when we append the second filter
'zlib.inflate' to filter the filtered data? If we clear the read buffer
as we did in the first case, it will result in lost data. So, let's
assume we preserve the read buffer. Then, if we perform:

<?php
$c = fread($a, 7);
?>

and assume the remaining 3 bytes expand to 8 bytes, how should the read
buffer cache be handled? Should the first 3 bytes still be the filtered
bzip2 decompressed data, and the last 3 replaced with the 8 bytes of
decompressed zlib data?

Basically, I am wondering if perhaps we need to implement a read buffer
cache for each stream filter. This could solve our problem, I think.
The data would be stored like so:

stream: 170 bytes of unfiltered data, and a pointer to byte 145 as the
next byte for php_stream_read()
bzip2.decompress filter: 176 bytes of decompressed bzip2 data, and a
pointer to byte 171 as the next byte for php_stream_read()
zlib.inflate filter: 8 bytes of decompressed zlib data, and a pointer to
byte 8 as the next byte for php_stream_read()

This way, we would essentially have a stack of stream data. If the zlib
filter were then removed, we could "back up" to the bzip2 filter, and so
on. This would allow proper read-cache filling and remove the weird
ambiguities that are apparent in a filtered stream. I don't think we
would need to worry about backwards compatibility here, as the most
common use case would be unaffected by this change, and the use case it
would fix has never actually worked.
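
As a rough userland model of what I have in mind (all names are
invented, and the real implementation would live in the C streams
layer):

<?php
// Rough model of the proposed design: each entry in the filter chain
// keeps its own cached data plus a pointer to the next byte that
// php_stream_read() would hand out, so removing a filter "pops" back
// to the intact layer below it.
class FilterLayer
{
    public $label;
    public $buffer; // filtered data cached at this layer
    public $next;   // next byte for php_stream_read()

    public function __construct($label, $buffer, $next)
    {
        $this->label  = $label;
        $this->buffer = $buffer;
        $this->next   = $next;
    }
}

// Placeholder buffers mirror the byte counts from the example above.
$layers = array(
    new FilterLayer('stream',           str_repeat('.', 170), 145),
    new FilterLayer('bzip2.decompress', str_repeat('.', 176), 171),
    new FilterLayer('zlib.inflate',     str_repeat('.', 8),   8),
);

// Removing the zlib filter pops its layer; the bzip2 layer's buffer
// and pointer are untouched, so reads "back up" to that data.
array_pop($layers);
$top = end($layers); // bzip2.decompress again
?>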

I haven't got a patch for this yet, but it would be easy to do if the
logic is sound.

Thanks,
Greg


  • Marcus Boerger at Oct 11, 2008 at 7:00 pm
    Hello Gregory,

    Saturday, October 11, 2008, 7:45:09 PM, you wrote:
    [Greg's original message quoted in full; trimmed]



    The filters need an input state and an output state. And they need to
    respect the fact that the number of requested bytes does not necessarily
    mean reading the same amount of data on the other end. In fact, the
    output side generally does not know whether less, the same amount, or
    more input is to be read. But this can be dealt with inside the filter.
    However, the filters should always report the number of input bytes and
    the number of output bytes independently. Regarding states, we would
    want to know whether reaching EOD on the input side means reaching EOD
    on the output side prior to the requested amount, at the requested
    amount, or not at all yet (more data available).
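
    For illustration, the userland filter API already separates these two
    counts: filter() reports the input bytes it took via $consumed, while
    the output size is simply whatever ends up in the out brigade. A
    minimal sketch ('demo.hexify' is an invented name):

    <?php
    // $consumed counts input bytes taken; the buckets appended to $out
    // carry however many output bytes were produced -- here twice as
    // many, since bin2hex() doubles the data.
    class HexifyFilter extends php_user_filter
    {
        public function filter($in, $out, &$consumed, $closing)
        {
            while ($bucket = stream_bucket_make_writeable($in)) {
                $consumed += $bucket->datalen;          // input-side count
                $bucket->data = bin2hex($bucket->data); // output is 2x input
                stream_bucket_append($out, $bucket);
            }
            return PSFS_PASS_ON;
        }
    }
    stream_filter_register('demo.hexify', 'HexifyFilter');
    ?>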

    Best regards,
    Marcus
  • Gregory Beaver at Oct 11, 2008 at 7:30 pm

    Marcus Boerger wrote:
    [Marcus's reply quoted above; trimmed]
    Hi,

    All of this makes sense, but how to implement it? Currently, there is
    no clear way to answer questions like:

    1) how much data did the filter consume, and how much is left over?
    2) how much data is left in the original stream prior to filtering?

    Perhaps this could be implemented by overloading ftell() so that it can
    accept a filter resource as well as a stream resource? Also, a new
    function could be implemented:

    stream_filter_datasize(resource $filter) // I'm open to better names

    that would either return the total size of filtered data available, or
    false if this is indeterminate. This would allow:

    <?php
    $startoffile = 123;
    $a = fopen('blah.zip', 'rb');
    fseek($a, $startoffile);
    $filter = stream_filter_append($a, 'bzip2.decompress');
    $b = fread($a, 170);
    echo ftell($a), ", ", ftell($filter), ", ",
        stream_filter_datasize($filter), "\n";
    // outputs "145, 170, 176"
    ?>

    Greg
  • Ilia Alshanetsky at Oct 12, 2008 at 3:22 am
    Greg,

    First off, great job on the analysis of the problem. As far as I can
    tell (correct me if I am wrong), the problem lies in extraordinarily
    complex use of stream filters, with >2 stacking,
    such as phar://zip://bz2:// (for example). Since I'd wager that this
    is not a common use case, I would prefer to refrain from any major
    changes for this in the 5.2 release. Whatever simple solutions we can
    implement, such as the fix for #1 (which appears to have already been
    committed), are more than welcome; however, my call is that #2 and #3
    wait for 5.3, unless a simple and reliable solution is found.

    On 11-Oct-08, at 1:45 PM, Gregory Beaver wrote:

    [Greg's original message quoted in full; trimmed]
    Ilia Alshanetsky
  • Greg Beaver at Oct 12, 2008 at 4:35 am

    Ilia Alshanetsky wrote:
    [Ilia's reply quoted above; trimmed]
    Hi Ilia,

    Actually, I ran into the problem when using a single stream filter on a
    file that contained uncompressed data after the compressed data. When
    filling the read buffer, PHP tries to load in an arbitrary number of
    bytes. The bz2 and zlib filters were failing because they would
    continue trying to process data after the underlying library (libbz2
    or zlib) had reported that it was done, triggering an error. The
    bug in bz2/zlib was fixed today, but the problem remains: PHP is
    passing more data into stream filters than it should, as it is never
    safe to assume that this is an OK thing to do. The chained-filter
    problem is an example I doubt anyone has actually tried. The same
    problem exists if one appends a filter to decompress some data and then
    uses fseek on the filtered data before all of the data has been
    decompressed (which is actually a typical use case, and one I ran into
    today; a minimal sketch follows). To be clear, although the example was
    inside the phar extension, it uses a file-based stream and
    stream_filter_append() with the bzip2.decompress stream filter to
    decompress the contents of a file within a zip archive; no use of the
    zip:// stream is made inside the phar extension.
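
    A minimal sketch of that typical case (file name and offsets are
    invented):

    <?php
    // Seeking within filtered data while decompression is still in
    // progress: the read cache holds raw bytes, so the seek result is
    // not the filtered position one would expect.
    $fp = fopen('archive.zip', 'rb');
    fseek($fp, 132); // seek to the start of the compressed entry
    stream_filter_append($fp, 'bzip2.decompress', STREAM_FILTER_READ);
    $first = fread($fp, 64);  // only part of the entry decompressed so far
    fseek($fp, 16, SEEK_CUR); // should move within *filtered* data
    ?>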

    However, I think it is safe to assume nobody has tried this, as we would
    have had a bug report before I ran into it today. So I'm fine with
    fixing it in 5.3, unless this new information makes you think it is a
    critical fix for 5.2, since the feature is essentially broken.

    Greg
  • Lukas Kahwe Smith at Nov 4, 2008 at 9:52 pm
    Hi,

    Sorry about the top post, since I am CC'ing Wez and Sara again with
    the tiny hope of a reaction.
    It seems the streams layer is effectively unmaintained. Greg seems
    to have gotten his hands into it a bit, but for such a critical
    piece it would be good to have at least 1-2 more people
    familiar with it. Reviewing this issue is a good start.

    Here is hoping ..

    regards,
    Lukas
    On 11.10.2008, at 19:45, Gregory Beaver wrote:

    [Greg's original message quoted in full; trimmed]
    Lukas Kahwe Smith
    mls@pooteeweet.org
  • Wez Furlong at Nov 4, 2008 at 10:01 pm
    I'm too busy to review anything anytime soon; got lots on for the next
    couple of months here at work.

    --Wez.
    On Nov 4, 2008, at 4:51 PM, Lukas Kahwe Smith wrote:

    [Lukas's message, including Greg's original, quoted in full; trimmed]

  • Lukas Kahwe Smith at Nov 10, 2008 at 12:09 pm

    On 11.10.2008, at 19:45, Gregory Beaver wrote:

    The first part of the bug that I encountered is best described here:
    http://bugs.php.net/bug.php?id=46026. However, it is a deeper problem
    than this, as the attempt to cache data is dangerous any time a stream
    filter is attached to a stream. I should also note that the patch in
    this bug contains feature additions that would have to wait for PHP 5.3.

    Can you file a bug report for the second part?
    I'm not really sure how to proceed; I guess we need someone to step up
    to help Wez/Sara with maintaining the streams layer.

    regards,
    Lukas Kahwe Smith
    mls@pooteeweet.org
  • Arnaud Le Blanc at Nov 17, 2008 at 10:35 am
    Hi,
    On Saturday 11 October 2008 19:45:09 Gregory Beaver wrote:
    [Greg's original message quoted in full; trimmed]
    The problem is mainly being able to filter a given number of bytes,
    starting at a given position, and to know when that number of bytes has
    been passed to the filter.

    I would propose a new argument to stream_filter_append:

    stream_filter_append(stream, filter_name[, max_input_bytes])

    So that at most max_input_bytes are passed to the filter. To know when
    that many bytes have been passed to the filter, I would propose making
    the stream act as a slice of the original stream: it returns EOF once
    max_input_bytes have been passed to the filter. Removing the filter
    clears the EOF flag and allows reading from the stream again.
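
    Until something like this exists (the extra argument above is only a
    proposal), the slice behavior can be approximated in userland by
    copying exactly max_input_bytes into a temporary stream and attaching
    the filter there (read_filtered_slice is an invented helper):

    <?php
    // Copy exactly $max_input_bytes from the source into a temp stream,
    // attach the filter to the temp stream, and read it to EOF -- the
    // EOF of the temp stream is the slice boundary.
    function read_filtered_slice($src, $max_input_bytes, $filter_name)
    {
        $slice = fopen('php://temp', 'r+b');
        stream_copy_to_stream($src, $slice, $max_input_bytes);
        rewind($slice);
        stream_filter_append($slice, $filter_name, STREAM_FILTER_READ);
        $data = stream_get_contents($slice);
        fclose($slice);
        return $data;
    }

    // e.g.: $data = read_filtered_slice($fp, 144, 'bzip2.decompress');
    ?>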

    Your proposal of a read buffer cache for each filter would help with
    that, and would make filters more robust in some use cases.

    Regards,

    Arnaud
