Hi Nuno, all,

I didn't test it, but yeah that should fix the # problem. :-) BTW, I also
had other ideas about checking for <?, <%, <script>, etc. tags in the inline
HTML scanning part, so the largest chunk of HTML is always grabbed (I'll
send the patch in the future; didn't modify anything yet, and it's not
related to the subject anyway :-)).
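
For illustration, the tag check described above might look roughly like this in plain C. It is only a sketch, not the scanner's actual code: `is_open_tag` and `inline_html_end` are made-up names, and treating any `<script` prefix as a candidate is a simplification (the real rules also depend on ini settings such as short_open_tag and asp_tags).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: p points at a '<'. Returns 1 if the bytes at p look
 * like the start of <?, <% or <script (case-insensitive), else 0. */
static int is_open_tag(const char *p, const char *limit)
{
    static const char script[] = "<script";
    size_t i, n = sizeof(script) - 1;

    if (limit - p < 2) return 0;
    if (p[1] == '?' || p[1] == '%') return 1;
    if ((size_t)(limit - p) < n) return 0;
    for (i = 1; i < n; i++) {
        if (tolower((unsigned char)p[i]) != script[i]) return 0;
    }
    return 1;
}

/* Hypothetical helper: returns the end of the largest inline HTML chunk
 * starting at p, i.e. the first candidate opening tag, or limit if none. */
static const char *inline_html_end(const char *p, const char *limit)
{
    assert(p <= limit);
    while (p < limit) {
        p = memchr(p, '<', (size_t)(limit - p));
        if (p == NULL) return limit;        /* no more '<': all HTML */
        if (is_open_tag(p, limit)) return p;
        p++;                                /* ordinary '<', keep going */
    }
    return limit;
}
```

Skipping ordinary `<` characters instead of stopping at each one is what makes the chunk as large as possible, at the cost of the extra comparison per `<`.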

Still wondering about the behavior of re2c at EOF being different than
Flex -- can't re2c have an addition/enhancement that simply keeps track of
the rule that *would have* matched before hitting EOF (e.g. YYCURSOR >=
YYLIMIT) and then jump to it when doing the YYFILL check?

Another thing that isn't working is the warning about /* Unterminated
comments... (never seen). The optimization for comment parsing I was going
to do (along with the above HTML stuff) would also work around that -- not
using re2c rules, but a manual scan or zend_memnstr() for the closing */.
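
A manual scan for the comment terminator could be sketched as follows. This is an assumption-laden illustration rather than the planned patch: `find_comment_end` is a hypothetical name, and it uses plain memchr() where the engine would presumably use zend_memnstr().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: start points just past the opening slash-star.
 * Returns a pointer just past the closing star-slash, or NULL when the
 * input ends first (the case that should trigger the warning). */
static const char *find_comment_end(const char *start, const char *limit)
{
    const char *p = start;

    assert(start <= limit);
    while (p < limit) {
        p = memchr(p, '*', (size_t)(limit - p));
        if (p == NULL) return NULL;                  /* no '*' left */
        if (p + 1 < limit && p[1] == '/') return p + 2;
        p++;                                         /* lone '*', keep going */
    }
    return NULL;
}
```

Because the NULL case is handled by ordinary code instead of a scanner rule, the caller can emit the unterminated-comment warning itself before returning the token.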

Like I said in the comments for Bug #45372, if the last thing at the end of
a file is matched by a variable-length rule, it will not be returned,
because of

#define YYFILL(n) { if (YYCURSOR >= YYLIMIT) return 0; }
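
To make the effect concrete, here is a small hand-written loop that imitates the shape of the problem. It is not re2c output; `Scanner`, `scan`, `count_tokens` and the `keep_pending` flag are all hypothetical names invented for this sketch. With `keep_pending` set to 0 the end-of-input check behaves like the macro above; with 1 it returns the rule that would have matched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { T_EOF = 0, T_IDENT = 1, T_OTHER = 2 };

typedef struct {
    const char *cursor;  /* plays the role of YYCURSOR */
    const char *limit;   /* plays the role of YYLIMIT */
    const char *text;    /* start of the current token */
} Scanner;

static int is_ident(char c)
{
    return (c >= 'a' && c <= 'z') || c == '_';
}

/* Scan one token and return its type, or T_EOF at end of input. */
static int scan(Scanner *s, int keep_pending)
{
    if (s->cursor >= s->limit) return T_EOF;
    s->text = s->cursor;

    if (!is_ident(*s->cursor)) {   /* fixed-length rule: one character */
        s->cursor++;
        return T_OTHER;
    }
    /* Variable-length rule [a-z_]+ : needs one character of lookahead. */
    for (;;) {
        s->cursor++;
        if (s->cursor >= s->limit)           /* the YYFILL-style check */
            return keep_pending ? T_IDENT : T_EOF;
        if (!is_ident(*s->cursor))
            return T_IDENT;
    }
}

/* Count the tokens returned before T_EOF. */
static int count_tokens(const char *src, int keep_pending)
{
    Scanner s;
    int n = 0;

    assert(src != NULL);
    s.cursor = s.text = src;
    s.limit = src + strlen(src);
    while (scan(&s, keep_pending) != T_EOF) n++;
    return n;
}
```

In this sketch, count_tokens("function test", 0) gives 2 because the trailing identifier is swallowed, while keep_pending = 1 gives 3. For "function test()" both modes give 5, since the last token is matched by a fixed-length rule.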

I put the ? in the subject line because I'm not sure how important this
really is, but it just seems broken to me (though it usually only shows up
with invalid code), and I couldn't think of a workaround with my limited
knowledge of re2c (I think it would need to be changed internally). Some
things this affects: 1) the tokenizer extension -- the last token won't be
returned (if variable length, of course); 2) highlighting (if someone is
trying to "see" an unclosed string error, for example? PHP highlighting on
forums...); 3) parse errors, which can differ from before if the parser
gets one fewer token. For example:

$foo = "Unclosed string<newline> // Different error line number; I think
the space before the quote is the last token returned (even w/o newline)

function test // *Nothing* after test, used to say expecting '(', now says
expecting T_STRING; again, space before test is the last token

function test()
OR
array

at the end of a file still work the same because ")" and "array" are
fixed-length matches... Or, add a few newlines at the end and they won't be
counted, etc.

It's been a while since I checked out the details, so I can't recall at the
moment if there are more serious examples. Also not sure if some of this is
affecting the ini scanner (see Bug #45384), as I haven't really looked at its
code.

What are everyone's thoughts...?


- Matt


----- Original Message -----
From: "Nuno Lopes"
Sent: Tuesday, July 08, 2008
nlopess Tue Jul 8 15:16:35 2008 UTC

Modified files: (Branch: PHP_5_3)
/ZendEngine2 zend_language_scanner.l
Log:
now really fix once and for all the #-style comments.
also remove some duplicated code in <?, <%, <%= handlers. this also has
the side-effect of producing better bytecodes in some special cases
http://cvs.php.net/viewvc.cgi/ZendEngine2/zend_language_scanner.l?r1=1.131.2.11.2.13.2.21&r2=1.131.2.11.2.13.2.22&diff_format=u


  • Nuno Lopes at Jul 9, 2008 at 11:21 pm

    > I didn't test it, but yeah that should fix the # problem. :-) BTW, I also
    > had other ideas about checking for <?, <%, <script>, etc. tags in the
    > inline HTML scanning part, so the largest chunk of HTML is always grabbed
    > (I'll send the patch in the future; didn't modify anything yet, and it's
    > not related to the subject anyway :-)).

    My code doesn't find the optimal largest chunk of inline HTML, but almost.
    It just gives up when it finds a potential tag. It can be made optimal
    easily, at some expense. I don't know if it's beneficial or not.

    > Still wondering about the behavior of re2c at EOF being different than
    > Flex -- can't re2c have an addition/enhancement that simply keeps track of
    > the rule that *would have* matched before hitting EOF (e.g. YYCURSOR >=
    > YYLIMIT) and then jump to it when doing the YYFILL check?

    Yes, this is horrible. I'm also afraid there might be some other corner
    cases where we are returning EOF when we shouldn't. This behaviour can be
    worked around with the state feature, though.

    > Another thing that isn't working is the warning about /* Unterminated
    > comments... (never seen). The optimization for comment parsing I was
    > going to do (along with the above HTML stuff) would also work around
    > that -- not using re2c rules, but a manual scan or zend_memnstr() for the
    > closing */.

    Ok, please file a bug report and assign it to me, or go ahead and fix it
    yourself :-)

    > Like I said in the comments for Bug #45372, if the last thing at the end
    > of a file is matched by a variable length rule, it will not be returned.
    > Because of
    >
    > #define YYFILL(n) { if (YYCURSOR >= YYLIMIT) return 0; }
    >
    > I put the ? in the subject line because I'm not sure how important this
    > really is, but it just seems broken to me (though it's usually with
    > invalid code), and I couldn't think of a workaround with my limited
    > knowledge of re2c (though I think it would need to be changed
    > internally). Some things this affects are 1) the tokenizer extension --
    > the last token won't be returned (if variable length, of course);
    > 2) highlighting (if someone is trying to "see" an unclosed string error,
    > for example? PHP highlighting on forums...), and parse errors can be
    > different than previously if the parser gets one less token, for example:

    I'm not much worried about input errors, although I agree the current
    approach isn't the best one. As I said, this can be worked around with the
    states thing (IIRC).

    > It's been awhile since I checked out the details, so I can't recall at the
    > moment if there are more serious examples. Also not sure if some of this
    > is affecting the ini scanner (see Bug #45384), as I haven't really look
    > at its code.

    The ini scanner is a bit broken, yes.. :/

    Maybe Marcus can help us here? Maybe add some new feature to re2c or help in
    implementing some workaround?

    Nuno
  • Lukas Kahwe Smith at Jul 22, 2008 at 9:44 pm

    What's the status here?

    regards,
    Lukas Kahwe Smith
    mls@pooteeweet.org

Discussion Overview
group: php-internals
category: php
posted: Jul 8, '08 at 4:50p
active: Jul 22, '08 at 9:44p
posts: 3
users: 3
website: php.net
