FAQ
See the commit message in the attached patch.

Note that the lines above this are clearly wrong, as LOC may not be set
now when the first node should match locale. It also means that a
synthetic start node could match with all of: locale, with default
semantics, with unicode semantics, and (soon) in just ascii; instead of
these being mutually exclusive as in any other node. I need to think
about how that can happen.

Search Discussions

  • Demerphq at Nov 10, 2010 at 10:27 am

    On 10 November 2010 06:24, karl williamson wrote:
    See the commit message in the attached patch.

    Note that the lines above this are clearly wrong, as LOC may not be set now
    when the first node should match locale.  It also means that a synthetic
    start node could match with all of: locale, with default semantics, with
    unicode semantics, and (soon) in just ascii; instead of these being mutually
    exclusive as in any other node.  I need to think about how that can happen.
    Ilya is on record as saying that the code involved here is probably
    buggy and could do with being reworked.

    There are also comments saying that the logic is wrong.

    I always thought the startclass stuff could be reworked. The concept
    is sound I think, but the implementation is deeply cryptic.

    Part of the problem is that the optimiser does way way too much. It
    does optree manipulation (peephole optimiser basically, so things like
    EXACT merging, various next-op optimisations to skip unneeded opcodes,
    the trie optimisation, etc), as well as analysis for startclass
    optimisation, and longest-required string optimisation at the same
    time.

    The reason I think the one routine does so much (some of it was split
    out by me into subs at least) is that the rules for traversing the
    compiled program are not trivial, and therefore it is much safer in
    many respects to shoe-horn new logic into the old code.

    I have long thought that if we changed the parsing process so the
    first pass built a true parsetree of the regex and then we would do an
    optimisation pass, and then a final transcribing pass then we would
    overall have much cleaner code, and would have much much more options
    in terms of optimization, and I think we would find it would make
    proving some of our logic correct much easier.

    Yves




    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Karl williamson at Nov 11, 2010 at 7:57 pm

    demerphq wrote:
    On 10 November 2010 06:24, karl williamson wrote:
    See the commit message in the attached patch.

    Note that the lines above this are clearly wrong, as LOC may not be set now
    when the first node should match locale. It also means that a synthetic
    start node could match with all of: locale, with default semantics, with
    unicode semantics, and (soon) in just ascii; instead of these being mutually
    exclusive as in any other node. I need to think about how that can happen.
    Ilya is on record as saying that the code involved here is probably
    buggy and could do with being reworked.
    Actually this is a bug I introduced by not understanding the
    implications of adding the /l modifier.

    It is mind-boggling to me to realize that given that change and the
    future ones for /u and /a, that a bit in the start class ANYOF bitmap
    could match two or three different code points with three different folds.

    To illustrate, consider /(?l:\xC0)|(?u:\xC0)|(?a:\xC0)(?d:\xC0)/i.

    If the optimizer created a start node for this, the bit for C0 would
    have the following meanings:
    1) whatever it does in the current locale. In an 8859-7 locale, it
    would be GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS. I don't know
    what it folds to in that locale, but it's not E0.

    2) LATIN CAPITAL LETTER A WITH GRAVE, which folds to E0

    3) Just C0, another nameless non-ASCII character, and it folds to itself.
    There are also comments saying that the logic is wrong.

    I always thought the startclass stuff could be reworked. The concept
    is sound I think, but the implementation is deeply cryptic.

    Part of the problem is that the optimiser does way way too much. It
    does optree manipulation (peephole optimiser basically, so things like
    EXACT merging, various next-op optimisations to skip unneeded opcodes,
    the trie optimisation, etc), as well as analysis for startclass
    optimisation, and longest-required string optimisation at the same
    time.

    The reason I think the one routine does so much (some of it was split
    out by me into subs at least)
    Thank you

    is that the rules for traversing the
    compiled program are not trivial, and therefore it is much safer in
    many respects to shoe-horn new logic into the old code.

    I have long thought that if we changed the parsing process so the
    first pass built a true parsetree of the regex and then we would do an
    optimisation pass, and then a final transcribing pass then we would
    overall have much cleaner code, and would have much much more options
    in terms of optimization, and I think we would find it would make
    proving some of our logic correct much easier.
    Sounds good. When do you start? :)
    Yves


  • Demerphq at Nov 12, 2010 at 10:15 am

    On 11 November 2010 20:57, karl williamson wrote:
    demerphq wrote:
    On 10 November 2010 06:24, karl williamson <public@khwilliamson.com>
    wrote:
    See the commit message in the attached patch.

    Note that the lines above this are clearly wrong, as LOC may not be set
    now
    when the first node should match locale.  It also means that a synthetic
    start node could match with all of: locale, with default semantics, with
    unicode semantics, and (soon) in just ascii; instead of these being
    mutually
    exclusive as in any other node.  I need to think about how that can
    happen.
    Ilya is on record as saying that the code involved here is probably
    buggy and could do with being reworked.
    Actually this is a bug I introduced by not understanding the implications of
    adding the /l modifier.

    It is mind-boggling to me to realize that given that change and the future
    ones for /u and /a, that a bit in the start class ANYOF bitmap could match
    two or three different code points with three different folds.
    What you say there rings alarm bells for me. "a bit in the start class
    ANYOF bitmap could match two or three different code points with three
    different folds" seems to me to suggest that you would not store
    whatever codepoints could match that is both the folded and unfolder,
    but instead only the original. That doesnt make sense to me.

    The ANYOF bitmap can be "overly large", it doesnt have to be correct
    in the sense that false positives are generally speaking acceptable,
    as they will be rejected later. The intention here is to reduce the
    number of times the main engine is used, so false-positives arent
    really a big deal, they would just mean the optimization didn't help
    us for one or two positions, and we had to pay a cost we would pay
    anyway.

    So one bit in the bitmap should correspond to one char. There should
    not be any folding during the processing of an ANYOF start-class, it
    should be sufficient to do character by character lookup in the bitmap
    for codepoints 0-255.
    To illustrate, consider /(?l:\xC0)|(?u:\xC0)|(?a:\xC0)(?d:\xC0)/i.

    If the optimizer created a start node for this, the bit for C0 would have
    the following meanings:
    1) whatever it does in the current locale.  In an 8859-7 locale, it would be
    GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS.  I don't know what it
    folds to in that locale, but it's not E0.
    Lets forget the locale stuff. It should be deferred to run time, so
    its a whole different kettle of fish.
    2) LATIN CAPITAL LETTER A WITH GRAVE, which folds to E0
    Ok, so we turn on bit C0 and E0 in the bit map.
    3) Just C0, another nameless non-ASCII character, and it folds to itself.
    So the bitmap would contain C0 and E0. Whats the problem here again? :-)

    Also, there is another point here, currently we dont do startclass
    optimisations for folded data. So why are you worrying about this?
    I have long thought that if we changed the parsing process so the
    first pass built a true parsetree of the regex and then we would do an
    optimisation pass, and then a final transcribing pass then we would
    overall have much cleaner code, and would have much much more options
    in terms of optimization, and I think we would find it would make
    proving some of our logic correct much easier.
    Sounds good.  When do you start? :)
    Christmas. :-)

    Yves


    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Karl williamson at Nov 12, 2010 at 5:26 pm

    demerphq wrote:
    On 11 November 2010 20:57, karl williamson wrote:
    demerphq wrote:
    On 10 November 2010 06:24, karl williamson <public@khwilliamson.com>
    wrote:
    See the commit message in the attached patch.

    Note that the lines above this are clearly wrong, as LOC may not be set
    now
    when the first node should match locale. It also means that a synthetic
    start node could match with all of: locale, with default semantics, with
    unicode semantics, and (soon) in just ascii; instead of these being
    mutually
    exclusive as in any other node. I need to think about how that can
    happen.
    Ilya is on record as saying that the code involved here is probably
    buggy and could do with being reworked.
    Actually this is a bug I introduced by not understanding the implications of
    adding the /l modifier.

    It is mind-boggling to me to realize that given that change and the future
    ones for /u and /a, that a bit in the start class ANYOF bitmap could match
    two or three different code points with three different folds.
    What you say there rings alarm bells for me. "a bit in the start class
    ANYOF bitmap could match two or three different code points with three
    different folds" seems to me to suggest that you would not store
    whatever codepoints could match that is both the folded and unfolder,
    but instead only the original. That doesnt make sense to me.
    I think my language imprecision confused you. In other words I
    misspoke. And I wasn't saying it wasn't workable, just that it was mind
    boggling to me. Better would be to say that a bit in the start class
    could mean three different characters, all with the same code point.
    And further, their folds could be different bits. There is no problem
    with that; its just surprising to me.

    In the case of the example, C0 would be stored, plus all its folds
    knowable at compile time, in this case E0. A flag would be set to
    indicate a run-time fold because of the locale and that would also be
    tested.
    [snip]
    Also, there is another point here, currently we dont do startclass
    optimisations for folded data. So why are you worrying about this?
    I was unaware of this and it doesn't seem true, unless I misunderstand
    you. There is a start class generated for /i regexes. And the bug
    report was that the optimizer dropped it incorrectly. What I'm
    proposing to do is each time we set a bit in the bitmap to also set its
    fold if applicable. That way, for non-locale nodes, the bitmap has
    compiled in all the necessary folding information, so the current
    runtime fold tests can be removed, again except for locale. This is all
    under the radar of the optimizer.

    We don't currently set the bitmap for \p properties (nor can we because
    of user-defined ones) so these all have to be done via the runtime
    swash, as currently. But we can set (or clear) flags to indicate that
    there is nothing that can possibly match in the swash that wasn't in the
    bitmap, avoiding unnecessary work.
  • Father Chrysostomos at Nov 11, 2010 at 5:37 pm

    Karl Williamson wrote:
    From ac4079311807c26fc9201b8927e0ee4d2eda046b Mon Sep 17 00:00:00 2001
    From: Karl Williamson <public@khwilliamson.com>
    Date: Tue, 9 Nov 2010 22:05:44 -0700
    Subject: [PATCH] PATCH: [perl #78994]: optimizer loses folding
    Thank you. Applied as 8951c46.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupperl5-porters @
categoriesperl
postedNov 10, '10 at 5:24a
activeNov 12, '10 at 5:26p
posts6
users3
websiteperl.org

People

Translate

site design / logo © 2021 Grokbase