I have mentioned in earlier posts the upcoming need to go
beyond the single-char pattern modifiers /msixpodualgcer. (Some
examples include being able to override /i definitions, supporting
user-defined Unicode private-use properties, allowing one to say globally
that \b really should be \b{wb}, and others.)

I'm here proposing a syntax for doing this. An example would be
   /(?mi{long-modifier}u: ... )/

A long modifier is simply anything enclosed in {} between the '(?' and
the ':'. Each such modifier would have its own pair of braces. This is
currently illegal syntax. I do not see the need to accept long modifiers
anywhere except in the infix (?:...) notation, and hence propose
explicitly not to accept them elsewhere at this time. We could expand at
some point to accept long modifiers in the (?...) notation if there is
demand. But I'm not sure I would ever want the postfix notation to allow
long modifiers.

The syntax of what's enclosed in the {} is not specified now, except
that anything within wouldn't break current parsing of the pattern as a
whole. Hence probably braces would have to be balanced, etc.
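
As a rough illustration of the balanced-brace requirement, here is a minimal sketch of how the flag portion of such a group might be split into short and long modifiers. The helper name split_modifiers is hypothetical and nothing like it exists in Perl; it also assumes each brace pair stands as its own modifier, which is only one possible reading of the proposal.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): split the flag portion of
# "(?FLAGS: ... )" into single-letter and {long} modifiers.
sub split_modifiers {
    my ($flags) = @_;
    my @mods;
    my $pos = 0;
    my $len = length $flags;
    while ($pos < $len) {
        my $ch = substr($flags, $pos, 1);
        if ($ch eq '{') {
            # Scan forward to the matching close brace, tracking nesting depth.
            my ($depth, $end) = (0, $pos);
            while ($end < $len) {
                my $c = substr($flags, $end, 1);
                $depth++ if $c eq '{';
                $depth-- if $c eq '}';
                last if $depth == 0;
                $end++;
            }
            die "Unbalanced braces in '$flags'\n" if $depth != 0;
            push @mods, substr($flags, $pos, $end - $pos + 1);
            $pos = $end + 1;
        }
        elsif ($ch =~ /\A[a-z^-]\z/) {
            # A short modifier, or the existing '^' and '-' flag punctuation.
            push @mods, $ch;
            $pos++;
        }
        else {
            die "Unexpected character '$ch' in '$flags'\n";
        }
    }
    return @mods;
}

print join(' | ', split_modifiers('mi{long-modifier}u')), "\n";
```

Nested braces such as i{a{b}c} stay inside one long modifier, while an unclosed brace is reported as an error rather than swallowing the rest of the pattern.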

Long modifiers at this time would be essentially for other pragmas to
fill in, and not for users. We wouldn't document what they are, but
obviously any stringification of the pattern would show them. At some
point in the future, after we are comfortable with this, we could
add/document some intended to be user-specifiable.

It might be that the pragmas that generate long modifiers would be
marked experimental at first so that this all could be removed.
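
There is already a precedent for a pragma supplying modifiers: since Perl 5.14, use re '/flags' sets default short modifiers for a lexical scope, and the stringified pattern shows them. The sketch below uses only that existing mechanism; long modifiers themselves are not implemented, so this merely suggests how a pragma-generated one might surface.

```perl
use strict;
use warnings;

my ($scoped, $unscoped);
{
    use re '/i';            # lexically scoped default flag (Perl 5.14+)
    $scoped = qr/foo/;
}
$unscoped = qr/foo/;        # compiled outside the pragma's scope

# The stringified pattern carries the flag the pragma supplied.
print "scoped:   $scoped\n";
print "unscoped: $unscoped\n";
print "FOO matches the scoped pattern\n"   if 'FOO' =~ $scoped;
print "FOO matches the unscoped pattern\n" if 'FOO' =~ $unscoped;
```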

One behavior that has been requested is to make the (?[...]) behavior
work on regular [...] classes (that is, the extra syntax checks, etc., but
not the set operations). One could say something like

   use re 'X[]';

and within its scope, any regex that uses a regular bracketed character
class would get the extended rules. This would be implemented via a
long modifier. (X[] for extended bracketed class is something I just
pulled out of the air, and I'm sure there are much better ways of
spelling it.)
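
For readers who have not used it, this is what the existing (?[...]) construct looks like. Note that the example shows the set operations, which are exactly the part the request above would leave out; the proposed use re 'X[]' does not exist, and the stricter parsing it would borrow is not demonstrated here.

```perl
use strict;
use warnings;
# The construct was experimental before Perl 5.36; silence the warning there.
no if $] >= 5.018 && $] < 5.036, warnings => 'experimental::regex_sets';

# Lowercase ASCII consonants, written as a set subtraction inside (?[ ... ]).
my $consonant = qr/\A(?[ [a-z] - [aeiou] ])\z/;

for my $ch (qw(b e z)) {
    printf "%s: %s\n", $ch, ($ch =~ $consonant ? 'consonant' : 'not a consonant');
}
```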


  • Tom Christiansen at Aug 27, 2014 at 11:41 am
    Karl Williamson wrote
        on Tue, 26 Aug 2014 18:12:19 MDT:
    I have mentioned in earlier posts about the upcoming need for going
    beyond the single-char pattern modifiers /msixpodualgcer. (Some
    examples include being able to override /i definitions, for user-defined
    Unicode private-use properties, for allowing one to globally say that \b
    really should be \b{wb}, and others.)
    I'm here proposing a syntax for doing this. An example would be
    /(?mi{long-modifier}u: ... )/

    I have one main question, and a few random thoughts.

    My main question is:

         Are these long-modifiers always, never, or sometimes
         considered "modifiers or modifiers"?

    Specifically, I wonder whether that is
         (A) to be considered four modifiers,
         (B) or is it three modifiers,
      or (C) might it perhaps be either one of those depending on the modifier in question.

    Four modifiers:

         m
         i
         {long-modifier}
         u

    Three modifiers:

         m
         i{long-modifier}
         u

    In other words, is the {long-modifier} bit something that can
    stand on its own, or can it only occur following a particular
    short modifier?

    The third possibility, C above, is that some of them might be
    a subtype of short modifiers acting adverbially on the short
    modifier itself, but others might be completely stand-alone.

    I could probably find points in favor of all three possible
    interpretations.

    The i flag seems especially um "affinitive" to the modifier-of-a-modifier
    way of looking at things. It might even admit multiple simultaneous ones:

         i{turkic}
         i{uca=1}
         i{turkic, uca=1}

    Although now I wonder about how subtraction would work. Hm.

         (?flags-flags: ... )
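
    For reference, this is how the existing short flags behave on either side of the minus today. The sketch uses only current Perl syntax and says nothing about how a long modifier would be subtracted.

    ```perl
    use strict;
    use warnings;

    # (?i: ... ) turns /i on for the enclosed group only.
    my $re = qr/(?i:foo)bar/;
    print "FOObar: ", ('FOObar' =~ $re ? 'match' : 'no match'), "\n";
    print "FOOBAR: ", ('FOOBAR' =~ $re ? 'match' : 'no match'), "\n";

    # With /i on the whole pattern, (?-i: ... ) switches it back off locally.
    my $re2 = qr/foo(?-i:bar)/i;
    print "FOObar: ", ('FOObar' =~ $re2 ? 'match' : 'no match'), "\n";
    print "FOOBAR: ", ('FOOBAR' =~ $re2 ? 'match' : 'no match'), "\n";
    ```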

    But the s&m flags also might like adverbial modifiers of their own,
    something that makes them think not about \n but about \R instead.
    Then again, that might be a long-modifier that would reasonably
    apply to both s&m.

    I also see the point of having some of these be basically internal
    markings triggered by 'use re' pragma variants instead of being things
    that the user gets at. So maybe something like

         use re '/{linebreak=unicode}'

    Or some such to make all the s&m stuff treat any \R grapheme as it was
    previously treating (or not treating) \n.
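
    To make the gap concrete, here is a small sketch of how the anchors and the dot currently key on \n alone while \R recognizes the wider set of linebreaks. It uses only existing syntax; the '/{linebreak=unicode}' spelling above is a suggestion and is not implemented.

    ```perl
    use strict;
    use warnings;

    my $crlf = "line1\r\nline2";
    # The /m anchors key on \n only, so the \r before it blocks 'line1$'.
    print 'dollar under /m: ', ($crlf =~ /^line1$/m ? 'match' : 'no match'), "\n";
    # \R matches a generic linebreak, including the two-character \r\n.
    print '\R on CRLF:      ', ($crlf =~ /^line1\Rline2$/ ? 'match' : 'no match'), "\n";

    my $ls = "a\x{2028}b";   # U+2028 LINE SEPARATOR
    # Without /s the dot refuses only \n, so it crosses U+2028.
    print 'dot over U+2028: ', ($ls =~ /^a.b$/ ? 'match' : 'no match'), "\n";
    print '\R over U+2028:  ', ($ls =~ /^a\Rb$/ ? 'match' : 'no match'), "\n";
    ```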

    Sometimes I think I might prefer

         (?break{word}: \bfoo\b)

    Over

         \b{word}foo\b{word}

    I don't know whether this is where to sneak in tr18's level-3 tailoring
    bits. Possibly so, possibly not. I'm looking at things like their
    own examples of \T{locale_id}...\E, or \X{es-u-co-trad} or \b{w}.

         If both tailored and default regular expressions are supported, then
         a number of different mechanism are affected. There are two main
         alternatives for control of tailored support:

            * coarse-grained support: the whole regular expression (or the
              whole script in which the regular expression occurs) can be
              marked as being tailored.
            * fine-grained support: any part of the regular expression can be
              marked in some way as being tailored.

         For example, fine-grained support could use some syntax such as the
         following to indicate tailoring to a locale within a certain range.
         Locale (or language) IDs should use the syntax from locale identifier
         definition in [UTS35], Section 3. Identifiers . Note that the locale id
         of "root" or "und" indicates the root locale, such as in the CLDR root
         collation. \T{<locale_id>}..\E

         ---

         For example, an implementation could interpret \X{es-u-co-trad} as
         matching a collation grapheme cluster for a traditional Spanish
         ordering, or use a switch to change the meaning of \X during some
         span of the regular expression.

         ---

         For example, an implementation could interpret \b{x:...} as matching the
         word break positions according to the locale information in CLDR [UTS35]
         (which are tailorings of word break positions in [UAX29]).

         Thus it could interpret

      \b{w:und} or \b{w} as matching a root word break
      \b{w:ja} as matching a Japanese word break
      \b{l:ja} as matching a Japanese line break

         Alternatively, it could use a switch to change the meaning of \b and \B
         during some span of the regular expression.

    More random thoughts...

    Here are the sorts of mnemonics I use in my own head when I use the
    one-letter modifiers (which are variously pattern flags, match-operator
    flags, or substitute-operator flags):

        [Note that the s&m mods' interpretations were banged into my head
         by Larry through some not inconsiderable effort on his part,
         because they weren't really fitting into existing holes that well.]

         m multiline(d)
         s singleline(d)
         i insensitive
         x expand(ed)
         p preserve(d)
         o onetime
         d dual(istic)
         u unicode
         a ascii
         l locale
         g global
         c continue(d)
         e evaluat(ed)
         r return(ed)

    I've just noticed that grammatically, those are basically all "noun
    modifiers", whether as adjectives or attributive nouns or participial
    adjectives. That is, they all fit into the <BLAH> slot in

         This is a <BLAH> match.

    The reason this is interesting is that I seem to recall the perl6 folks
    calling these adverbs, not adjectives. I guess you can just plop on
    an -ly for most of those to adverb them, which works fine with globally
    but somewhat (English-)dubiously for singlelinèdly.

    For the operator flags, I can see how those are adverbs, since they
    apply to the verbing (matching, substituting) operation.

         while this string globally matches the pattern....

    But with the pattern-compilation flags, they modify the compiled
    pattern itself, not how the match operator uses that pattern to
    perform its duties. But this has always been a (mild, minor) confusion,
    since the syntactic slot after

         m/.../abcdefgȝhijklmnopqrstþuvwxyz
         s/.../abcdefgȝhijklmnopqrstþuvwxyz
         qr/../abcdefgȝhijklmnopqrstþuvwxyz

    accepts single letters, not caring "when" they apply.

    --tom
  • Karl Williamson at Aug 28, 2014 at 4:44 am

    On 08/27/2014 05:41 AM, Tom Christiansen wrote:
    Karl Williamson <public@khwilliamson.com> wrote
    on Tue, 26 Aug 2014 18:12:19 MDT:
    I have mentioned in earlier posts about the upcoming need for going
    beyond the single-char pattern modifiers /msixpodualgcer. (Some
    examples include being able to override /i definitions, for user-defined
    Unicode private-use properties, for allowing one to globally say that \b
    really should be \b{wb}, and others.)
    I'm here proposing a syntax for doing this. An example would be
    /(?mi{long-modifier}u: ... )/
    I have one main question, and a few random thoughts.

    My main question is:

    Are these long-modifiers always, never, or sometimes
    considered "modifiers or modifiers"?
    I had to read this several times to grok it, as it contains a typo. For
    those of you who don't understand it, the 'or' should be 'of'.
    Specifically, I wonder whether that is
    (A) to be considered four modifiers,
    I intended it to be (A).
    (B) or is it three modifiers,
    or (C) might it perhaps be either one of those depending on the modifier in question.

    Four modifiers:

    m
    i
    {long-modifier}
    u

    Three modifiers:

    m
    i{long-modifier}
    u

    In other words, is the {long-modifier} bit something that can
    stand on its own, or can it only occur following a particular
    short modifier?

    The third possibility, C above, is that some of them might be
    a subtype of short modifiers acting adverbially on the short
    modifier itself, but others might be completely stand-alone.

    I could probably find points in favor of all three possible
    interpretations.

    The i flag seems especially um "affinitive" to the modifier-of-a-modifier
    way of looking at things. It might even admit multiple simultaneous ones:

    i{turkic}
    i{uca=1}
    i{turkic, uca=1}
    My intent was to get a syntax that was currently illegal, and did the
    job at hand. I can see the use-case for modifying modifiers, but what I
    was proposing wasn't that. Better suggestions welcome, or we could use
    [] for this use-case instead of {} if and when we implement that use-case.
    Although now I wonder about how subtraction would work. Hm.

    (?flags-flags: ... )

    But the s&m flags also might like adverbial modifiers of their own,
    something that makes them think not about \n but about \R instead.
    Then again, that might be a long-modifier that would reasonably
    apply to both s&m.
    My mother always told me to stay away from s&m ;)

    Some modifiers already are illegal after a minus. Most of the new ones
    would be too.
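
    For instance, the character-set modifiers such as /u and /a are rejected after a minus in current Perl. A minimal sketch that probes this at run time follows; the helper name flags_compile is made up for the illustration.

    ```perl
    use strict;
    use warnings;

    # Return true if "(?FLAGS:foo)" compiles. The pattern is built at run
    # time so a rejected flag raises a catchable exception.
    sub flags_compile {
        my ($flags) = @_;
        my $pattern = "(?${flags}:foo)";
        return eval { my $re = qr/$pattern/; 1 } ? 1 : 0;
    }

    for my $flags (qw(i-x -i -u -a)) {
        printf "(?%s:foo) %s\n", $flags,
            flags_compile($flags) ? 'compiles' : 'is rejected';
    }
    ```
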
    I also see the point of having some of these be basically internal
    markings triggered by 'use re' pragma variants instead of being things
    that the user gets at. So maybe something like

    use re '/{linebreak=unicode}'

    Or some such to make all the s&m stuff treat any \R grapheme as it was
    previously treating (or not treating) \n.
    Exactly. That is my proposal, to make all the known long modifiers only
    be generated by a pragma. That way, if something goes awry we can
    change or remove them without worrying about back compat. After
    gaining field experience, we could relax this. Perhaps the pragma could
    even generate the modifier to look like

    {experimental:linebreak=unicode}

    so that someone trying to bypass the pragma would certainly be
    forewarned of the inadvisability of doing so.
    Sometimes I think I might prefer

    (?break{word}: \bfoo\b)

    Over

    \b{word}foo\b{word}
    Yes, but that can wait until we gain experience. For 5.22, I would
    propose that you'd have to say

       use re '/\b=wb'

    or some such, to get the effect of changing \b behavior. I think the
    Unicode definition will be preferable in general to the current one, so
    I would think almost all code that cares would want to only use it, and
    not have different ones scattered around, except rarely, so I don't see
    the use-case for specifying which break you want on a per-regex basis.
    (Also, I am now favoring 'wb' over 'word' because the former is an
    official unicode name and would be less likely to be misinterpreted as
    our current \b.)
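
    To illustrate why the two definitions differ, here is a small comparison. It relies on the per-assertion \b{wb} spelling that Perl has supported since 5.22, not on the pragma form suggested above, which is shown only as a possibility.

    ```perl
    use v5.22;      # \b{wb} first appeared in Perl 5.22
    use warnings;

    my $text = "don't";
    # Classic \b treats the apostrophe as a non-word character,
    # so it reports a boundary right after the "n".
    say 'n\b     : ', ($text =~ /n\b/     ? 'boundary' : 'no boundary');
    # \b{wb} follows the Unicode word-break rules, which keep "don't" whole.
    say 'n\b{wb} : ', ($text =~ /n\b{wb}/ ? 'boundary' : 'no boundary');
    ```
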
    I don't know whether this is where to sneak in tr18's level-3 tailoring
    bits. Possibly so, possibly not. I'm looking at things like their
    own examples of \T{locale_id}...\E, or \X{es-u-co-trad} or \b{w}.

    If both tailored and default regular expressions are supported, then
    a number of different mechanism are affected. There are two main
    alternatives for control of tailored support:

    * coarse-grained support: the whole regular expression (or the
    whole script in which the regular expression occurs) can be
    marked as being tailored.
    * fine-grained support: any part of the regular expression can be
    marked in some way as being tailored.

    For example, fine-grained support could use some syntax such as the
    following to indicate tailoring to a locale within a certain range.
    Locale (or language) IDs should use the syntax from locale identifier
    definition in [UTS35], Section 3. Identifiers . Note that the locale id
    of "root" or "und" indicates the root locale, such as in the CLDR root
    collation. \T{<locale_id>}..\E

    ---

    For example, an implementation could interpret \X{es-u-co-trad} as
    matching a collation grapheme cluster for a traditional Spanish
    ordering, or use a switch to change the meaning of \X during some
    span of the regular expression.

    ---

    For example, an implementation could interpret \b{x:...} as matching the
    word break positions according to the locale information in CLDR [UTS35]
    (which are tailorings of word break positions in [UAX29]).

    Thus it could interpret

    \b{w:und} or \b{w} as matching a root word break
    \b{w:ja} as matching a Japanese word break
    \b{l:ja} as matching a Japanese line break

    Alternatively, it could use a switch to change the meaning of \b and \B
    during some span of the regular expression.
    I hadn't thought about tailoring, and am glad you are. That's for the
    future, but if an extensible syntax can be found now, so much the better.
    More random thoughts...

    Here are the sorts of mnemonics I use in my own head when I use the
    one-letter modifiers (which are variously pattern flags, match-operator
    flags, or substitute-operator flags):

    [Note that the s&m mods' interpretation were banged into my head
    by Larry through some not inconsiderable effort on his part,
    because they weren't really fitting into existing holes that well.]

    m multiline(d)
    s singleline(d)
    i insensitive
    x expand(ed)
    p preserve(d)
    o onetime
    d dual(istic)
    u unicode
    a ascii
    l locale
    g global
    c continue(d)
    e evaluat(ed)
    r return(ed)

    I've just noticed that grammatically, those are basically all "noun
    modifiers", whether as adjectives or attributive nouns or participial
    adjectives. That is, they all fit into the <BLAH> slot in

    This is a <BLAH> match.

    The reason this is interesting is that I seem to recall the perl6 folks
    calling these adverbs, not adjectives. I guess you can just plop on
    an -ly for most of those to adverb them, which works fine with globally
    but somewhat (English-)dubiously for singlelinèdly.

    For the operator flags, I can see how those are adverbs, since they
    apply to the verbing (matching, substituting) operation.

    while this string globally matches the pattern....
    I'm not sure there is really a use-case for having different
    interpretations scattered around.

    But with the pattern-compilation flags, they modify the compiled
    pattern itself, not how the match operator uses that pattern to
    perform its duties. But this has always been a (mild, minor) confusion,
    since the syntactic slot after

    m/.../abcdefgȝhijklmnopqrstþuvwxyz
    s/.../abcdefgȝhijklmnopqrstþuvwxyz
    qr/../abcdefgȝhijklmnopqrstþuvwxyz

    accepts single letters, not caring "when" they apply.

    --tom

Discussion Overview
group: perl5-porters @
categories: perl
posted: Aug 27, '14 at 12:12a
active: Aug 28, '14 at 4:44a
posts: 3
users: 2
website: perl.org
