I've been thinking about the oft-expressed issue here concerning e.g.,
making \d match only 0-9, and now have a concrete proposal.

I'm proposing a '/a' regex modifier that would restrict matches of \d,
\s, \w, and [:posix:] to characters in the ASCII character set. This
would be true even on utf8-encoded patterns and targets.

I'm leaning against having this affect case insensitive matching, thus
'"\N{LATIN SMALL LETTER LONG S}" =~ /s/ai' would still be true.

The modifier would be added automatically when a regex is compiled in
the scope of something like 'use re "ascii"'. It could also be
explicitly stated in a (?a...) construct. It could not be expressed as
a /suffix in 5.14, unless Jesse changes his mind.

There are 3 features I think that this could interact with, i.e., what
happens if a regex is compiled in the scope of any combination of:
    use re 'ascii'
    use bytes
    use locale
    use feature 'unicode_strings'

1) bytes. I don't think that there is any conflict, as all ascii chars
are single bytes anyway.

2) locale. I believe locale should have precedence.

3) unicode_strings. I believe ascii should have precedence.

If you have an inquiring mind, the current implementation is that bytes
has highest precedence, followed by locale, then unicode_strings. I
propose to insert ascii between locale and unicode_strings in the
pecking order.
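
A minimal sketch of the proposed matching behaviour, assuming a perl in which
the /a modifier and the (?a:...) construct are available (they later shipped
in Perl 5.14). The variable names are illustrative only:

```perl
use strict;
use warnings;
use 5.014;    # assumption: /a and (?a:...) need perl 5.14 or later

my $arabic_six = "\x{666}";    # ARABIC-INDIC DIGIT SIX, a non-ASCII digit

# Without /a, \d follows Unicode rules and accepts the Arabic-Indic digit.
my $unicode_rules = $arabic_six =~ /^\d$/ ? 1 : 0;

# With /a, or inside (?a:...), \d is restricted to 0-9.
my $suffix_a = $arabic_six =~ /^\d$/a     ? 1 : 0;
my $inline_a = $arabic_six =~ /^(?a:\d)$/ ? 1 : 0;

# ASCII digits are unaffected by the modifier.
my $ascii_six = "6" =~ /^\d$/a ? 1 : 0;

print "unicode=$unicode_rules suffix=$suffix_a inline=$inline_a ascii=$ascii_six\n";
```

Only the first match should succeed for the non-ASCII digit; the ASCII digit
matches either way.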


  • Ævar Arnfjörð Bjarmason at Sep 21, 2010 at 8:23 pm

    On Tue, Sep 21, 2010 at 18:27, karl williamson wrote:
    I've been thinking about the oft-expressed issue here concerning e.g.,
    making \d match only 0-9, and now have a concrete proposal.
    I very much like where this is going. As Yves and I have discussed on
    list before, the brevity of \d doesn't really fit with what it does.

    Most people that use it probably want to match some ASCII number as
    part of input validation. But if you're working with Unicode it'll
    match everything Unicode thinks is a number, which unless you're
    writing some Unicode library or something is rarely what you actually
    want. It really should be spelled "\N{A UNICODE NUMBER YES REALLY}"
    instead :)

    Anyway, it's too late to change that now (and Larry doesn't want
    to). But providing this via a flag like /a would be a great workaround
    at this point. I like it.
    I'm proposing a '/a' regex modifier that would restrict matches of \d, \s,
    \w, and [:posix:] to characters in the ASCII character set.    This would be
    true even on utf8-encoded patterns and targets.
    Excellent! Finally I can write:

    $str =~ /^src=(?<id>\d+)$/a;

    Instead of:

    $str =~ /^src=(?<id>[0-9]+)$/;
    I'm leaning against having this affect case insensitive matching, thus
    '"\N{LATIN SMALL LETTER LONG S}" =~ /s/ai' would still be true.
    That sounds odd. Do I understand you correctly that these would pass
    (pseudocode):

    # match funny digits under unicode
    ok("src=<some non-[0-9] unicode numbers here>" =~ /^src=\d+$/);

    # But now with /a (using !~, because a leading ! binds tighter than =~)
    ok("src=<some non-[0-9] unicode numbers here>" !~ /^src=\d+$/a);

    But that this would too:

    # eek, we wanted HTML SRC= and now we have some funny "ſ"
    # character in our database
    ok("\N{LATIN SMALL LETTER LONG S}rc=123" =~ /^SRC=\d+$/ai);

    Surely the main advantage of this feature is that you can write
    validation patterns for something like a database and make sure your
    /\d+/ only matches ASCII numbers (which this does).

    Folding \N{LATIN SMALL LETTER LONG S} into "s" seems to go against the
    spirit of that, since then you suddenly have to hurt your brain with
    Unicode all over again. When you really just want to match ASCII, but
    against UTF-8 strings.
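
    For comparison, a sketch of how this concern plays out on a perl that
    has the modifier. It assumes Perl 5.14 or later, where /a leaves case
    folding on Unicode rules and a doubled /aa additionally forbids
    case-insensitive matches between ASCII and non-ASCII characters. The
    /aa form is not part of the proposal as posted, and the variable names
    are illustrative:

    ```perl
    use strict;
    use warnings;
    use 5.014;    # assumption: /a and /aa need perl 5.14 or later

    # LATIN SMALL LETTER LONG S followed by "rc=123"
    my $long_s_input = "\x{17F}rc=123";

    # /ai: \d is ASCII-only, but /i still folds the long s to "s".
    my $matches_with_a = $long_s_input =~ /^SRC=\d+$/ai ? 1 : 0;

    # /aai: ASCII letters in the pattern only match ASCII letters.
    my $matches_with_aa = $long_s_input =~ /^SRC=\d+$/aai ? 1 : 0;

    print "ai=$matches_with_a aai=$matches_with_aa\n";
    ```

    The first match is expected to succeed and the second to fail, which is
    the distinction being argued about here.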
  • Reverend Chip at Sep 21, 2010 at 8:32 pm

    On 9/21/2010 1:23 PM, Ævar Arnfjörð Bjarmason wrote:
    On Tue, Sep 21, 2010 at 18:27, karl williamson wrote:
    I've been thinking about the oft-expressed issue here concerning e.g.,
    making \d match only 0-9, and now have a concrete proposal.
    I very much like where this is going. As Yves/I have discussed on list
    before the brevity of \d doesn't really fit with what it does.
    So //a would be a combination of C<no locale> and the brand new
    behavior of matching even Unicode strings with non-Unicode character
    classes.
    I like.
  • Hv at Sep 22, 2010 at 12:13 pm
    karl williamson wrote:
    :I've been thinking about the oft-expressed issue here concerning e.g.,
    :making \d match only 0-9, and now have a concrete proposal.
    :
    :I'm proposing a '/a' regex modifier that would restrict matches of \d,
    :\s, \w, and [:posix:] to characters in the ASCII character set. This
    :would be true even on utf8-encoded patterns and targets.
    :
    :I'm leaning against having this affect case insensitive matching, thus
    :'"\N{LATIN SMALL LETTER LONG S}" =~ /s/ai' would still be true.

    [...]

    I'd love to see this - I would hope that this would allow greater accuracy,
    clarity and speed when matching patterns over data known or intended to be
    pure ASCII.

    With the sort of uses I have in mind for this, I would *not* want the /i
    behaviour you describe, both for the type of reason described by Avar
    ($str =~ /^src=/ai), and for the speed benefits that I imagine a simple
    ASCII match could achieve - even if those speed benefits are not realised
    in the initial implementation, the behaviour we decide on and document now
    will determine whether we have the option to realise them later.

    Hugo
  • Karl williamson at Sep 22, 2010 at 1:22 pm

    hv@crypt.org wrote:
    karl williamson wrote:
    :I've been thinking about the oft-expressed issue here concerning e.g.,
    :making \d match only 0-9, and now have a concrete proposal.
    :
    :I'm proposing a '/a' regex modifier that would restrict matches of \d,
    :\s, \w, and [:posix:] to characters in the ASCII character set. This
    :would be true even on utf8-encoded patterns and targets.
    :
    :I'm leaning against having this affect case insensitive matching, thus
    :'"\N{LATIN SMALL LETTER LONG S}" =~ /s/ai' would still be true.

    [...]

    I'd love to see this - I would hope that this would allow greater accuracy,
    clarity and speed when matching patterns over data known or intended to be
    pure ASCII.

    With the sort of uses I have in mind for this, I would *not* want the /i
    behaviour you describe, both for the type of reason described by Avar
    ($str =~ /^src=/ai), and for the speed benefits that I imagine a simple
    ASCII match could achieve - even if those speed benefits are not realised
    in the initial implementation, the behaviour we decide on and document now
    will determine whether we have the option to realise them later.

    Hugo
    Doesn't "use bytes" already do this?
  • Hv at Sep 22, 2010 at 11:40 pm
    karl williamson wrote:
    :Doesn't "use bytes" already do this?

    Hmm, maybe, taking advantage of utf8 having bit value 0x80 set on each
    byte for codepoints above 0x7f. I've never actually used it - it's one
    of those things that are lumped together in the "you don't want to do
    this" bundle in my mind, probably precisely because it requires knowing
    more about utf8 encoding than is properly healthy.

    Of course the pragma works at a different level to a regexp option - you'd
    need the latter, for example, to encapsulate an exact pattern in a qr{}
    object that can then be used outside the context in which it was created.

    Maybe it's worth asking the other way around too: what led you to "lean"
    towards having /ai do Unicode-aware case-insensitivity? It seems like
    a decidedly non-obvious choice to me ...

    Hugo
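
    The point about qr{} can be sketched as follows, assuming a perl with
    both the /a modifier and the lexically scoped use re '/a' pragma
    (Perl 5.14 or later). The variable names are made up for illustration:

    ```perl
    use strict;
    use warnings;
    use 5.014;    # assumption: /a and use re '/a' need perl 5.14 or later

    my $arabic_six = "\x{666}";    # ARABIC-INDIC DIGIT SIX

    my $ascii_digits;
    {
        use re '/a';                  # default modifier for this block only
        $ascii_digits = qr/^\d+$/;    # compiled with /a applied
    }

    # Outside the block, newly compiled patterns are back to Unicode rules ...
    my $fresh_pattern = $arabic_six =~ /^\d+$/ ? 1 : 0;

    # ... but the qr{} object keeps the rules it was compiled with.
    my $stored_pattern = $arabic_six =~ $ascii_digits ? 1 : 0;

    print "fresh=$fresh_pattern stored=$stored_pattern\n";
    ```

    A pragma governs where a pattern is compiled; a modifier travels with the
    compiled pattern, so the two are complementary rather than interchangeable.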
  • Karl williamson at Sep 23, 2010 at 5:02 am

    hv@crypt.org wrote:
    Maybe it's worth asking the other way around too: what led you to "lean"
    towards having /ai do Unicode-aware case-insensitivity? It seems like
    a decidedly non-obvious choice to me ...
    The concern has been that e.g. \d can match many more things than one
    thinks, and can lead to security holes. I've never heard a concern that
    matching a cyrillic capital M case insensitively with a cyrillic lower
    case m is a problem, for example.

    "use bytes" has now been deprecated; I'm not sure why. I haven't tried
    doing so, but I suspect that, if there aren't bugs, that setting it at
    the beginning of the program would lead to the behavior you desire, the
    old way with nothing ever getting set to utf8. Some of the
    documentation even uses the terms byte vs character semantics to
    describe this dichotomy, so 'bytes' is a reasonable term, with some
    precedent. If this behavior is what people want, maybe we should relook
    at "use bytes", and dust it off to make sure that it works properly this
    way, and un-deprecate it, for this kind of usage, documenting it as
    such, rather than as a way to look at utf8 strings byte-by-byte. For
    that, we have utf8::encode and decode anyway.
  • Dave Mitchell at Sep 23, 2010 at 10:18 am

    On Wed, Sep 22, 2010 at 11:02:41PM -0600, karl williamson wrote:
    "use bytes" has now been deprecated; I'm not sure why.
    Because it breaks encapsulation and gives direct access to how perl
    happens at a particular moment to be storing strings internally.

    --
    The Enterprise successfully ferries an alien VIP from one place to another
    without serious incident.
    -- Things That Never Happen in "Star Trek" #7
  • Chip Salzenberg at Sep 23, 2010 at 8:55 pm

    On Thu, Sep 23, 2010 at 3:18 AM, Dave Mitchell wrote:
    On Wed, Sep 22, 2010 at 11:02:41PM -0600, karl williamson wrote:
    "use bytes" has now been deprecated; I'm not sure why.
    Because it breaks encapsulation and gives direct access to how perl
    happens at a particular moment to be storing strings internally.
    Then it should have been deprecated only after Perl stopped acting
    differently depending on whether that was the case. And given that
    sysread and syswrite aren't being deprecated, that would be "never".
    *sigh*
  • Chip Salzenberg at Sep 25, 2010 at 2:11 am

    On Thu, Sep 23, 2010 at 1:55 PM, Chip Salzenberg wrote:
    On Thu, Sep 23, 2010 at 3:18 AM, Dave Mitchell wrote:
    On Wed, Sep 22, 2010 at 11:02:41PM -0600, karl williamson wrote:
    "use bytes" has now been deprecated; I'm not sure why.
    Because it breaks encapsulation and gives direct access to how perl
    happens at a particular moment to be storing strings internally.
    Then it should have been deprecated only after Perl stopped acting
    differently depending on whether that was the case.  And given that
    sysread and syswrite aren't being deprecated, that would be "never".
    *sigh*
    I've since been clued in that sysread and syswrite actually are
    capable of dealing in characters, which is sad. I obviously haven't
    been keeping down with current events. I would say "up" but that
    isn't the relevant direction. In any case, while it'd be nice if Perl
    userspace could be 100% unaware of encoding, that's about as likely as
    it being 100% unaware of platform ... which is to say, not at all.
  • Zefram at Sep 27, 2010 at 4:07 pm

    karl williamson wrote:
    Doesn't "use bytes" already do this?
    "use bytes" screws up the matching of any non-ASCII character, in addition
    to restricting \d et al to match only ASCII characters. "use bytes"
    screws up Perl's string model, is an abomination, and ought to die.

    -zefram
  • Ben Morrow at Sep 27, 2010 at 5:50 pm

    Quoth zefram@fysh.org (Zefram):
    karl williamson wrote:
    Doesn't "use bytes" already do this?
    "use bytes" screws up the matching of any non-ASCII character, in addition
    to restricting \d et al to match only ASCII characters. "use bytes"
    screws up Perl's string model, is an abomination, and ought to die.
    What exactly is wrong with '"use bytes" screws up Perl's string model,
    is an abomination, and ought to be fixed'? That is: in the scope of 'use
    bytes', string operations will *down*grade strings rather than upgrading
    them (and do so properly, discarding information, probably with a
    warning, if necessary).

    Ben
  • Eric Brine at Sep 27, 2010 at 9:12 pm

    On Mon, Sep 27, 2010 at 1:50 PM, Ben Morrow wrote:

    Quoth zefram@fysh.org (Zefram):
    karl williamson wrote:
    Doesn't "use bytes" already do this?
    "use bytes" screws up the matching of any non-ASCII character, in addition
    to restricting \d et al to match only ASCII characters. "use bytes"
    screws up Perl's string model, is an abomination, and ought to die.
    What exactly is wrong with '"use bytes" screws up Perl's string model,
    is an abomination, and ought to be fixed'? That is: in the scope of 'use
    bytes', string operations will *down*grade strings rather than upgrading
    them (and do so properly, discarding information, probably with a
    warning, if necessary).
    That's not what it does.
    perl -MDevel::Peek -le"my $x = chr(0xE9); utf8::upgrade($x); utf8::downgrade($x); print length($x);"
    1
    perl -MDevel::Peek -le"my $x = chr(0xE9); utf8::upgrade($x); use bytes; print length($x);"
    2
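
    The same comparison as a standalone script, for anyone whose shell
    mangles the one-liners. This is only a sketch of the behaviour shown
    above; the variable names are illustrative:

    ```perl
    use strict;
    use warnings;

    my $x = chr(0xE9);    # e-acute, a single character
    utf8::upgrade($x);    # same character, now stored internally as two bytes

    # Ordinary character semantics: still one character.
    my $char_length = length($x);

    # Under "use bytes", length() reports the internal byte count instead.
    my $byte_length = do { use bytes; length($x) };

    # After downgrading, the internal form is one byte again.
    utf8::downgrade($x);
    my $downgraded_length = do { use bytes; length($x) };

    print "$char_length $byte_length $downgraded_length\n";    # prints "1 2 1"
    ```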
  • Ben Morrow at Sep 27, 2010 at 10:26 pm

    Quoth ikegami@adaelis.com (Eric Brine):
    On Mon, Sep 27, 2010 at 1:50 PM, Ben Morrow wrote:
    Quoth zefram@fysh.org (Zefram):
    karl williamson wrote:
    Doesn't "use bytes" already do this?
    "use bytes" screws up the matching of any non-ASCII character, in addition
    to restricting \d et al to match only ASCII characters. "use bytes"
    screws up Perl's string model, is an abomination, and ought to die.
    What exactly is wrong with '"use bytes" screws up Perl's string model,
    is an abomination, and ought to be fixed'? That is: in the scope of 'use
    bytes', string operations will *down*grade strings rather than upgrading
    them (and do so properly, discarding information, probably with a
    warning, if necessary).
    That's not what it does.
    I know. I'm suggesting that that is what it *ought* to do, since I
    believe that behaviour would often be useful, and is closer to the
    original intent than the current broken behaviour.

    Never mind. This thread is about /a, which is something I would like to
    see, so I'd rather not distract from that further.

    Ben
  • Karl williamson at Sep 29, 2010 at 8:51 pm

    Ben Morrow wrote:
    Quoth ikegami@adaelis.com (Eric Brine):
    On Mon, Sep 27, 2010 at 1:50 PM, Ben Morrow wrote:
    Quoth zefram@fysh.org (Zefram):
    karl williamson wrote:
    Doesn't "use bytes" already do this?
    "use bytes" screws up the matching of any non-ASCII character, in addition
    to restricting \d et al to match only ASCII characters. "use bytes"
    screws up Perl's string model, is an abomination, and ought to die.
    What exactly is wrong with '"use bytes" screws up Perl's string model,
    is an abomination, and ought to be fixed'? That is: in the scope of 'use
    bytes', string operations will *down*grade strings rather than upgrading
    them (and do so properly, discarding information, probably with a
    warning, if necessary).
    That's not what it does.
    I know. I'm suggesting that that is what it *ought* to do, since I
    believe that behaviour would often be useful, and is closer to the
    original intent than the current broken behaviour.

    Never mind. This thread is about /a, which is something I would like to
    see, so I'd rather not distract from that further.

    Ben
    This tangent came about because I suggested that what it appeared to me
    that some people wanted was really already there in 'use bytes'. And, I
    still think I'm right, provided its scope is the whole program, and
    the program calls nothing that creates utf8.

    In other words, I think that a solution would be if there were a version
    of use bytes that when confronted with something marked utf8, attempted
    to downgrade it, and if not possible did something like throw a fatal
    error. Perhaps this is close to what Ben was saying. I don't think
    that a pragma that did this would be screwing up the string model, etc.
    It would be saying, "I want a Perl just like the Perl that worked in
    the dear old days."

    So I'd like to explore that a little further. I presume that the
    semantics of use bytes couldn't be changed, even though it's deprecated,
    for backwards compatibility. But there could be a different pragma, or
    something like 'use bytes "good_old_days"' to get the new semantics.

    Looking at the guts of utf8-encoded strings can be done with
    utf8::encode(), but it would be better to have a non-destructive version
    with a better name as to what it means.

    If people really want /a to not deal with utf8 semantics at all, I think
    that this proposed pragma functionality would perhaps be a better way to go.
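
    One existing way to look at the encoded form without altering the
    original string is to take a copy through the core Encode module. This
    is a sketch of that approach, not the new facility suggested above:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode_utf8);

    my $text = "\x{E9}\x{666}";    # e-acute plus ARABIC-INDIC DIGIT SIX

    # encode_utf8 returns a new octet string and leaves $text as it was,
    # unlike utf8::encode, which converts its argument in place.
    my $octets = encode_utf8($text);

    my $text_length  = length($text);      # still 2 characters
    my $octet_length = length($octets);    # 4 UTF-8 octets
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;

    print "$text_length $octet_length $hex\n";    # prints "2 4 C3 A9 D9 A6"
    ```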
  • Ben Morrow at Sep 29, 2010 at 9:20 pm

    Quoth public@khwilliamson.com (karl williamson):
    Ben Morrow wrote:
    Quoth ikegami@adaelis.com (Eric Brine):
    On Mon, Sep 27, 2010 at 1:50 PM, Ben Morrow wrote:
    Quoth zefram@fysh.org (Zefram):
    karl williamson wrote:
    Doesn't "use bytes" already do this?
    "use bytes" screws up the matching of any non-ASCII character, in addition
    to restricting \d et al to match only ASCII characters. "use bytes"
    screws up Perl's string model, is an abomination, and ought to die.
    What exactly is wrong with '"use bytes" screws up Perl's string model,
    is an abomination, and ought to be fixed'? That is: in the scope of 'use
    bytes', string operations will *down*grade strings rather than upgrading
    them (and do so properly, discarding information, probably with a
    warning, if necessary).
    That's not what it does.
    I know. I'm suggesting that that is what it *ought* to do, since I
    believe that behaviour would often be useful, and is closer to the
    original intent than the current broken behaviour.

    Never mind. This thread is about /a, which is something I would like to
    see, so I'd rather not distract from that further.
    This tangent came about because I suggested that what it appeared to me
    that some people wanted was really already there in 'use bytes'. And, I
    still think I'm right, provided its scope is the whole program, and
    the program calls nothing that creates utf8.

    In other words, I think that a solution would be if there were a version
    of use bytes that when confronted with something marked utf8, attempted
    to downgrade it, and if not possible did something like throw a fatal
    error. Perhaps this is close to what Ben was saying.
    Yes, exactly. I think I'd rather see a warning, along the same lines as
    the current 'Wide character in print' warning, for the same reasons. I
    wouldn't object to an exception, though.
    I don't think
    that a pragma that did this would be screwing up the string model, etc.
    It would be saying, "I want a Perl just like the Perl that worked in
    the dear old days."
    use 5.005; # :)
    So I'd like to explore that a little further. I presume that the
    semantics of use bytes couldn't be changed, even though it's deprecated,
    for backwards compatibility.
    IMHO if this were done carefully it could reasonably be considered a bug
    fix. 'use bytes'' current behaviour in the face of SvUTF8 strings is
    unreasonable, and probably never what the user meant. Fixing it to
    downgrade(-or-die) would, I suspect, fix more bugs than it caused.

    Ben
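
    The downgrade-or-die idea can be approximated per string today. A rough
    sketch, where downgrade_or_die is a hypothetical helper name and not an
    existing pragma or core function:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return a downgraded copy of the string, or die
    # if it holds code points above 0xFF.
    sub downgrade_or_die {
        my ($string) = @_;
        # The true second argument makes utf8::downgrade return false on
        # failure instead of dying with its own message.
        utf8::downgrade($string, 1)
            or die "Wide character cannot be downgraded\n";
        return $string;
    }

    my $latin1 = "\x{E9}";
    utf8::upgrade($latin1);                  # force the utf8 internal form
    my $narrow = downgrade_or_die($latin1);
    my $narrow_is_utf8 = utf8::is_utf8($narrow) ? 1 : 0;

    # U+0666 does not fit in a byte, so the helper dies and eval yields undef.
    my $wide_ok = eval { downgrade_or_die("\x{666}"); 1 } ? 1 : 0;

    print "narrow_is_utf8=$narrow_is_utf8 wide_ok=$wide_ok\n";
    ```

    Swapping the die for a warn would give the warning behaviour instead of
    the exception.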
  • Zefram at Sep 27, 2010 at 10:25 pm

    Ben Morrow wrote:
    What exactly is wrong with '"use bytes" screws up Perl's string model,
    is an abomination, and ought to be fixed'?
    It can't be fixed. Its fundamental purpose is inherently broken.
    It is beyond redemption.
    That is: in the scope of 'use
    bytes', string operations will *down*grade strings rather than upgrading
    them
    That operation bears no resemblance to what "use bytes" currently does,
    and should therefore not have the same name. Furthermore, once we've
    fixed the representation dependence, implicitly downgrading strings will
    be a no-op. Your proposed pragma doesn't sound useful.

    -zefram
  • Jesse Vincent at Sep 28, 2010 at 2:07 pm

    On Tue 21.Sep'10 at 12:27:50 -0600, karl williamson wrote:
    I've been thinking about the oft-expressed issue here concerning
    e.g., making \d match only 0-9, and now have a concrete proposal.

    I'm proposing a '/a' regex modifier that would restrict matches of
    \d, \s, \w, and [:posix:] to characters in the ASCII character set.
    This would be true even on utf8-encoded patterns and targets.
    Forgive me for being late to the party, but I'd love to see this.
  • PetaMem R&D at Oct 26, 2010 at 6:49 pm
    Hi there,
    On Tue, Sep 21, 2010 at 8:27 PM, karl williamson wrote:

    I've been thinking about the oft-expressed issue here concerning e.g.,
    making \d match only 0-9, and now have a concrete proposal.
    I remembered vaguely this thread and concluded that \d - does not match only
    [0-9], but instead the whole bunch of digits unicode has to offer. So I
    tried:

    ----------

    use 5.010;
    use strict;
    use warnings;
    use utf8;

    my $number = '፩፫፯'; # tried here Ethiopic, Khmer, Lao, Tibetan etc.

    if ($number =~ m{\d}xms) {
        print "number ($number) is composed of digits.\n";
    }

    ---------

    Nothing. Nada. Nüscht. Niente.

    Perl 5.12.2 that is. So I must really be missing the point in this whole
    /a regex modifier discussion, because to me, the original claim (\d
    matches not only [0-9]) seems not to be true.


    --
    regards,
    Marcel @ PetaMem R&D
  • Zefram at Oct 26, 2010 at 7:30 pm

    PetaMem R&D wrote:
    I remembered vaguely this thread and concluded that \d - does not match only
    [0-9], but instead the whole bunch of digits unicode has to offer. So I
    tried:
    Non-ASCII string literals don't work the way you expect, pretty much
    regardless of what you expect. Try expressing it all in ASCII:

    my $number = "\x{666}"; # Arabic-Indic digit six

    This is matched by /\d/ on all Perl versions from 5.7.1 onwards:

    $ perl5.13.6 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    digit
    $ perl5.12.2 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    digit
    $ perl5.10.1 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    digit
    $ perl5.8.9 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    digit
    $ perl5.8.0 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    digit
    $ perl5.7.1 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    digit
    $ perl5.7.0 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    no
    $ perl5.6.2 -lwe 'print "\x{666}" =~ /\d/ ? "digit" : "no"'
    no

    -zefram
  • PetaMem R&D at Oct 26, 2010 at 8:50 pm

    On Tue, Oct 26, 2010 at 9:30 PM, Zefram wrote:
    Non-ASCII string literals don't work the way you expect, pretty much
    regardless of what you expect. Try expressing it all in ASCII:
    I see. In fact, it seems some digits are matched while some are not.
    E.g. in 5.12.2 the aforementioned Ethiopic digits are not, while
    Devanagari digits are matched.

    So regardless of the /a discussion - which has all my blessing - I would
    like to state a corollary proposal for the \d semantics:

    Make it work, i.e. really DO match all possible digits. Else "won't work
    the way you expect, regardless of what you expect" could be construed as
    a bug.

    --
    regards,
    Marcel @ PetaMem R&D
  • Tom Christiansen at Oct 26, 2010 at 9:03 pm

    I see. In fact, it seems some digits are matched while some are not.
    Yep.
    E.g. in 5.12.2 the aforementioned ethiopian are not, while devanagari are
    matched.
    Yep.
    So regardless of the /a discussion - which has all my blessing - I would
    like to state a corollary proposal for the \d semantics:
    Make it work. i.e. really DO match all possible digits. Else "won't
    work the way you expect, regardless of what you expect" could be
    construed a bug.
    It's not our fault.

    % unichars '\pN' '\D' 'NAME =~ /DIGIT/' | wc -l
    91

    --tom
  • John at Oct 26, 2010 at 10:27 pm

    On 26/10/2010 22:02, Tom Christiansen wrote:
    Make it work. i.e. really DO match all possible digits. Else "won't
    work the way you expect, regardless of what you expect" could be
    construed a bug.
    It's not our fault.

    % unichars '\pN' '\D' 'NAME =~ /DIGIT/' | wc -l
    91

    --tom
    So should \d without the /a modifier match only decimal digits we can get
    a value for? IIRC there are some numeric characters with very strange
    values.
  • Tom Christiansen at Oct 26, 2010 at 11:01 pm

    Make it work. i.e. really DO match all possible digits. Else "won't
    work the way you expect, regardless of what you expect" could be
    construed a bug.
    It's not our fault.

    % unichars '\pN' '\D' 'NAME =~ /DIGIT/' | wc -l
    91
    So should \d without the /a modifier match only decimal digits we can get
    a value for? IIRC there are some numeric characters with very strange
    values.
    \d exactly maps to \p{Nd}. That's all. Whether something has a
    numeric value or not doesn't matter.

    Please play around with the programs I sent; they will help you
    get a feel for these things.

    --tom
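
    A small sketch of that distinction, using escapes rather than literal
    characters so the source stays ASCII. The variable names are
    illustrative:

    ```perl
    use strict;
    use warnings;

    my $devanagari_one = "\x{967}";     # DEVANAGARI DIGIT ONE, category Nd
    my $ethiopic_one   = "\x{1369}";    # ETHIOPIC DIGIT ONE, category No

    # \d and \p{Nd} agree: the Devanagari digit is a decimal digit.
    my $devanagari_d  = $devanagari_one =~ /^\d$/     ? 1 : 0;
    my $devanagari_nd = $devanagari_one =~ /^\p{Nd}$/ ? 1 : 0;

    # The Ethiopic digit is a number (\p{N}) but not a decimal digit (\d).
    my $ethiopic_d = $ethiopic_one =~ /^\d$/     ? 1 : 0;
    my $ethiopic_n = $ethiopic_one =~ /^\p{N}$/  ? 1 : 0;

    print "devanagari: d=$devanagari_d Nd=$devanagari_nd\n";
    print "ethiopic:   d=$ethiopic_d N=$ethiopic_n\n";
    ```

    This is consistent with the earlier report that Ethiopic digits are not
    matched by \d while Devanagari digits are.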

    % uniprops -a 1
    U+0031 ‹1› \N{ DIGIT ONE }:
    \w \d \pN \p{Nd}
    AHex ASCII_Hex_Digit All Any Alnum ASCII Assigned Common Zyyy
    Decimal_Number Digit Nd N Gr_Base Grapheme_Base Graph GrBase Hex
    XDigit Hex_Digit ID_Continue IDC Number PerlWord PosixAlnum
    PosixDigit PosixGraph PosixPrint Print Word XID_Continue XIDC
    Age:1.1 Block=Basic_Latin Bidi_Class:EN Bidi_Class=European_Number
    Bidi_Class:European_Number Bc=EN Block:ASCII Block:Basic_Latin
    Blk=ASCII Canonical_Combining_Class:0
    Canonical_Combining_Class=Not_Reordered
    Canonical_Combining_Class:Not_Reordered Ccc=NR
    Canonical_Combining_Class:NR Script=Common
    Decomposition_Type:None Dt=None East_Asian_Width:Na
    East_Asian_Width=Narrow East_Asian_Width:Narrow Ea=Na
    Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
    Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
    Hangul_Syllable_Type=Not_Applicable
    Hangul_Syllable_Type:Not_Applicable Hst=NA
    Joining_Group:No_Joining_Group Jg=NoJoiningGroup
    Joining_Type:Non_Joining Jt=U Joining_Type:U
    Joining_Type=Non_Joining Line_Break:NU Line_Break=Numeric
    Line_Break:Numeric Lb=NU Numeric_Type:De Numeric_Type=Decimal
    Numeric_Type:Decimal Nt=De Numeric_Value:1 Nv=1 Present_In:1.1
    Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1
    Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
    Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0
    Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Common Sc=Zyyy
    Script:Zyyy Sentence_Break:NU Sentence_Break=Numeric
    Sentence_Break:Numeric SB=NU Word_Break:NU Word_Break=Numeric
    Word_Break:Numeric WB=NU

    % unichars -a '\p{Numeric_Value=1}' '\D'
    ¹ 185 0000B9 SUPERSCRIPT ONE
    ౹ 3193 000C79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR
    ౼ 3196 000C7C TELUGU FRACTION DIGIT ONE FOR EVEN POWERS OF FOUR
    ፩ 4969 001369 ETHIOPIC DIGIT ONE
    ៱ 6129 0017F1 KHMER SYMBOL LEK ATTAK MUOY
    ₁ 8321 002081 SUBSCRIPT ONE
    ⅟ 8543 00215F FRACTION NUMERATOR ONE
    Ⅰ 8544 002160 ROMAN NUMERAL ONE
    ⅰ 8560 002170 SMALL ROMAN NUMERAL ONE
    ① 9312 002460 CIRCLED DIGIT ONE
    ⑴ 9332 002474 PARENTHESIZED DIGIT ONE
    ⒈ 9352 002488 DIGIT ONE FULL STOP
    ⓵ 9461 0024F5 DOUBLE CIRCLED DIGIT ONE
    ❶ 10102 002776 DINGBAT NEGATIVE CIRCLED DIGIT ONE
    ➀ 10112 002780 DINGBAT CIRCLED SANS-SERIF DIGIT ONE
    ➊ 10122 00278A DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
    ㆒ 12690 003192 IDEOGRAPHIC ANNOTATION ONE MARK
    ㈠ 12832 003220 PARENTHESIZED IDEOGRAPH ONE
    ㊀ 12928 003280 CIRCLED IDEOGRAPH ONE
    ꛦ 42726 00A6E6 BAMUM LETTER MO
    𐄇 65799 010107 AEGEAN NUMBER ONE
    𐅂 65858 010142 GREEK ACROPHONIC ATTIC ONE DRACHMA
    𐅘 65880 010158 GREEK ACROPHONIC HERAEUM ONE PLETHRON
    𐅙 65881 010159 GREEK ACROPHONIC THESPIAN ONE
    𐅚 65882 01015A GREEK ACROPHONIC HERMIONIAN ONE
    𐌠 66336 010320 OLD ITALIC NUMERAL ONE
    𐏑 66513 0103D1 OLD PERSIAN NUMBER ONE
    𐡘 67672 010858 IMPERIAL ARAMAIC NUMBER ONE
    𐤖 67862 010916 PHOENICIAN NUMBER ONE
    𐩀 68160 010A40 KHAROSHTHI DIGIT ONE
    𐩽 68221 010A7D OLD SOUTH ARABIAN NUMBER ONE
    𐭘 68440 010B58 INSCRIPTIONAL PARTHIAN NUMBER ONE
    𐭸 68472 010B78 INSCRIPTIONAL PAHLAVI NUMBER ONE
    𐹠 69216 010E60 RUMI DIGIT ONE
    𒐕 74773 012415 CUNEIFORM NUMERIC SIGN ONE GESH2
    𒐞 74782 01241E CUNEIFORM NUMERIC SIGN ONE GESHU
    𒐬 74796 01242C CUNEIFORM NUMERIC SIGN ONE SHARU
    𒐴 74804 012434 CUNEIFORM NUMERIC SIGN ONE BURU
    𒑏 74831 01244F CUNEIFORM NUMERIC SIGN ONE BAN2
    𒑘 74840 012458 CUNEIFORM NUMERIC SIGN ONE ESHE3
    𝍠 119648 01D360 COUNTING ROD UNIT DIGIT ONE
    🄂 127234 01F102 DIGIT ONE COMMA
  • Karl williamson at Oct 27, 2010 at 2:43 am
    \d matches only characters that can safely be assumed to have decimal
    positional values. For example, 123 means 1 hundred + twenty + three.

    This is so they can safely be used together in strings. Other digit
    characters that match \p{Nt=Di} mean a digit from 0-9, but a string of
    them together may not mean what most people would think a string of them
    means.

    Further, starting in Unicode 6.0, no character may be classified as \d
    (which is also written as any of: \p{Nd}, \p{Gc=Nd}, or \p{Nt=De})
    unless it is in a block of consecutive characters starting with the one
    that means 0. This means that programs can safely find the code points
    of the neighbors of an input \d.

    I intend to add the function Unicode::UCD::num() in 5.14 which will
    safely return the numeric value of the input string, or undef if no safe
    value is available. A single character input string would be safe if it
    had any numeric value. A string of multiple characters would be
    considered safe only if all are neighbor \d's of each other in the same
    script.
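
    A rough sketch of how the planned function might be used, assuming it
    ships as described above (num() can be imported from Unicode::UCD in
    Perl 5.14 and later). The show() helper and the sample strings are
    illustrative choices, not part of the proposal:

    ```perl
    use strict;
    use warnings;
    use Unicode::UCD qw(num);

    # Illustrative helper (not part of Unicode::UCD): render undef readably.
    sub show { my ($v) = @_; return defined $v ? $v : 'undef' }

    my $ascii  = num("123");                     # three ASCII digits
    my $arabic = num("\x{661}\x{662}\x{663}");   # ARABIC-INDIC DIGIT ONE, TWO, THREE
    my $mixed  = num("1\x{662}3");               # ASCII and Arabic-Indic digits mixed
    my $roman  = num("\x{2160}");                # ROMAN NUMERAL ONE: not a \d

    print "ascii:  ", show($ascii),  "\n";
    print "arabic: ", show($arabic), "\n";
    print "mixed:  ", show($mixed),  "\n";
    print "roman:  ", show($roman),  "\n";
    ```

    Under the rules stated above, the two single-script strings should come
    back as 123, the mixed string as undef, and the lone Roman numeral as 1,
    since a single character is safe if it has any numeric value.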
  • PetaMem R&D at Oct 27, 2010 at 7:46 am

    On Wed, Oct 27, 2010 at 4:42 AM, karl williamson wrote:

    \d matches only characters that can safely be assumed to have decimal
    positional values. For example, 123 means 1 hundred + twenty + three.
    Great, let's talk about semantics. So I assume you can tell me what this
    number is: ꯰0꩔꘤
    It's perfectly matched by m{\A\d+\z}xms

    This is so they can safely be used together in strings. Other digit
    characters that match \p{Nt=Di} mean a digit from 0-9, but a string of them
    together may not mean what most people would think a string of them means.
    That's true, but my argument (see above) is that the situation ALREADY is
    that \d also matches characters that, in combination (a concatenated string
    of just these characters), have NO decimal positional value. Proof by
    contradiction.

    Further, starting in Unicode 6.0, no character may be classified as \d
    (which is also written as any of: \p{Nd}, \p{Gc=Nd}, or \p{Nt=De}) unless it
    is in a block of consecutive characters starting with the one that means 0.
    This means that programs can safely find the code points of the neighbors
    of an input \d.
    I hope I do not understand this. Because if I do... care to bring some
    examples?

    I intend to add the function Unicode::UCD::num() in 5.14 which will safely
    return the numeric value of the input string, or undef if no safe value is
    available. A single character input string would be safe if it had any
    numeric value. A string of multiple characters would be considered safe
    only if all are neighbor \d's of each other in the same script.
    Ah. Which would mean, the example I have given above would return undef from
    that function. Indeed, that would be most welcome.

    --
    regards,
    Marcel @ PetaMem R&D
  • Demerphq at Oct 27, 2010 at 11:18 am

    On 27 October 2010 09:46, PetaMem R&D wrote:
    On Wed, Oct 27, 2010 at 4:42 AM, karl williamson wrote:

    \d matches only characters that can safely be assumed to have decimal
    positional values.  For example, 123 means 1 hundred + twenty + three.
    Great, let's talk about semantics. So I assume you can tell me what this
    number is: ꯰0꩔꘤
    It's perfectly matched by m{\A\d+\z}xms
    I regret to say that it is not helpful posting examples like that.
    Please post code that includes the code point documented.

    "\x{1234}"

    is vastly preferable to pasting that codepoint into an email to make a point.

    Anyway, the rules for how we match digits under Unicode are not
    determined by us, so complaints should be addressed to the Unicode
    Consortium and not to us. We have in the past deviated from the expected
    behaviour of Unicode in the name of "expediency" and DWIM, and it caused
    a lot of problems, so the only way we will change this behaviour is to
    make ourselves /more/ compliant, not less.

    This is so they can safely be used together in strings.  Other digit
    characters that match \p{Nt=Di} mean a digit from 0-9, but a string of them
    together may not mean what most people would think a string of them means.
    That's true, but my argument (see above) is that the situation ALREADY is
    that \d also matches characters that, in combination (a concatenated string
    of just these characters), have NO decimal positional value. Proof by
    contradiction.

    Further, starting in Unicode 6.0, no character may be classified as \d
    (which is also written as any of: \p{Nd}, \p{Gc=Nd}, or \p{Nt=De}) unless it
    is in a block of consecutive characters starting with the one that means 0.
    This means that programs can safely find the code points of the neighbors
    of an input \d.
    I hope I do not understand this. Because if I do... care to bring some
    examples?
    I'm not following your misunderstanding.

    Basically what they mean is that ord("9")-ord("0") should always equal 9:

    $ perl -le'print ord("9")-ord("0")'
    9

    Or in other words, knowing the numeric value of any single digit in a
    given script, and its codepoint, one should also know all the other
    codepoints for digits in that script.

    So knowing that codepoint 54 is numerically equivalent to the number 6
    allows us to know that codepoint 48 is the ASCII character for 0, and
    that 57 is the codepoint for 9.
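
    The same arithmetic can be sketched for a non-ASCII script. This assumes
    Unicode::UCD::num() is available (Perl 5.14 and later); zero_of() is a
    hypothetical helper written for this illustration, and it relies on the
    contiguous-block guarantee described above:

    ```perl
    use strict;
    use warnings;
    use Unicode::UCD qw(num);

    # Hypothetical helper: given one character that matches \d, return the
    # code point of the zero digit in the same run of ten digits.
    # Returns undef for anything that is not a single \d character.
    sub zero_of {
        my ($digit) = @_;
        return unless $digit =~ /\A\d\z/;
        return ord($digit) - num($digit);
    }

    my $ascii_zero  = zero_of("6");          # code point 54, value 6
    my $arabic_zero = zero_of("\x{666}");    # ARABIC-INDIC DIGIT SIX

    printf "ASCII zero:         U+%04X, nine: U+%04X\n",
        $ascii_zero, $ascii_zero + 9;
    printf "Arabic-Indic zero:  U+%04X, nine: U+%04X\n",
        $arabic_zero, $arabic_zero + 9;
    ```

    Subtracting the numeric value from the code point lands on the zero of
    the block, and adding 9 to that lands on the nine.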


    I intend to add the function Unicode::UCD::num() in 5.14 which will safely
    return the numeric value of the input string, or undef if no safe value is
    available.  A single character input string would be safe if it had any
    numeric value.  A string of multiple characters would be considered safe
    only if all are neighbor \d's of each other in the same script.
    Ah. Which would mean, the example I have given above would return undef from
    that function. Indeed, that would be most welcome.

    --
    regards,
    Marcel @ PetaMem R&D


    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Tom Christiansen at Oct 26, 2010 at 8:58 pm

    I remembered vaguely this thread and concluded that \d - does not match only
    [0-9], but instead the whole bunch of digits unicode has to offer. So I
    tried:
    Non-ASCII string literals don't work the way you expect, pretty much
    regardless of what you expect. Try expressing it all in ASCII:
    my $number = "\x{666}"; # Arabic-Indic digit six
    That isn't the issue. See my other mail.

    --tom
  • Tom Christiansen at Oct 26, 2010 at 8:49 pm

    On Tue, Sep 21, 2010 at 8:27 PM, karl williamson wrote:

    I've been thinking about the oft-expressed issue here concerning e.g.,
    making \d match only 0-9, and now have a concrete proposal.
    I remembered vaguely this thread and concluded that \d - does not match only
    [0-9], but instead the whole bunch of digits unicode has to offer. So I
    tried:
    ----------
    use 5.010;
    use strict;
    use warnings;
    use utf8;
    my $number = '፩፫፯'; # tried here ethiopic, khmer, lao, tibetan etc.
    if ($number =~ m{\d}xms) {
        print "number ($number) is composed of digits.\n";
    }
    Nothing. Nada. Nüscht. Niente.
    Perl 5.12.2 that is. So I must really be missing the point in this whole /a
    regex modifier discussion, because to me, the original claim (\d matches
    not only [0-9]) seems not to be true.
    What's going on is that you're using code point U+1369 above. First, let's
    make sure that it seems like it should match:

    % uninames 1369
    ፩ 4969 1369 ETHIOPIC DIGIT ONE

    Ok, but the name isn't what determines matchability.
    You have to check out its properties:

    % uniprops U+1369
    U+1369 ‹፩› \N{ ETHIOPIC DIGIT ONE }:
    \pN \p{No}
    All Any Assigned InEthiopic Ethiopic Is_Ethiopic Ethi N No Gr_Base
    Grapheme_Base Graph GrBase ID_Continue IDC Number Other_Number
    Print XID_Continue XIDC

    As you see, that has, amongst other things, the \p{Number} and
    \p{Other_Number} properties.

    But \d isn't the same as \pN or \p{No}.

    Watch what properties an ASCII DIGIT ZERO has:

    % uniprops "DIGIT ZERO"
    U+0030 ‹0› \N{ DIGIT ZERO }:
    \w \d \pN \p{Nd}
    AHex ASCII_Hex_Digit All Any Alnum ASCII Assigned Common Zyyy
    Decimal_Number Digit Nd N Gr_Base Grapheme_Base Graph GrBase
    Hex XDigit Hex_Digit ID_Continue IDC Number PerlWord PosixAlnum
    PosixDigit PosixGraph PosixPrint Print Word XID_Continue XIDC

    A \d *is* the same as \p{Nd}; that is, as \p{Decimal_Number}.

    There are lots of non-ASCII decimal numbers:

    % unichars '\d' '\P{ASCII}' | wc -l
    401
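
    The distinction can also be checked without the helper scripts. This is
    a minimal sketch using only core regex features; the three sample
    characters and the yn() helper are choices made for this illustration:

    ```perl
    use strict;
    use warnings;

    # Sample characters, written as ASCII escapes so the source encoding
    # cannot interfere.
    my %sample = (
        'DIGIT ZERO'             => "0",
        'ARABIC-INDIC DIGIT SIX' => "\x{666}",
        'ETHIOPIC DIGIT ONE'     => "\x{1369}",
    );

    # Illustrative helper: turn a match result into "yes" or "no".
    sub yn { return $_[0] ? 'yes' : 'no' }

    for my $name (sort keys %sample) {
        my $ch = $sample{$name};
        printf "%-24s \\d:%-3s \\p{Nd}:%-3s \\pN:%-3s\n",
            $name,
            yn($ch =~ /\A\d\z/),
            yn($ch =~ /\A\p{Nd}\z/),
            yn($ch =~ /\A\pN\z/);
    }
    ```

    The \d and \p{Nd} columns should always agree, while ETHIOPIC DIGIT ONE
    should show up only under \pN.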

    I enclose a tarball with all three of the handy little scripts I used
    above. I think you'll find they're far more useful than I've shown here.

    --tom
  • Paul LeoNerd Evans at Nov 2, 2010 at 11:32 pm

    On Tue, Sep 21, 2010 at 12:27:50PM -0600, karl williamson wrote:
    I'm proposing a '/a' regex modifier that would restrict matches of
    \d, \s, \w, and [:posix:] to characters in the ASCII character set.
    This would be true even on utf8-encoded patterns and targets.
    Forgive my extreme lateness, but isn't this what we debated for months
    previously?

    Doesn't it break this?

    $ perl -E '"123"=~m/\d/and say"digit"'

    --
    Paul "LeoNerd" Evans

    leonerd@leonerd.org.uk
    ICQ# 4135350 | Registered Linux# 179460
    http://www.leonerd.org.uk/
  • Eric Brine at Nov 3, 2010 at 2:12 am

    On Tue, Nov 2, 2010 at 7:32 PM, Paul LeoNerd Evans wrote:
    On Tue, Sep 21, 2010 at 12:27:50PM -0600, karl williamson wrote:
    I'm proposing a '/a' regex modifier that would restrict matches of
    \d, \s, \w, and [:posix:] to characters in the ASCII character set.
    This would be true even on utf8-encoded patterns and targets.
    Forgive my extreme lateness, but isn't this what we debated for months
    previously?

    Doesn't it break this?

    $ perl -E '"123"=~m/\d/and say"digit"'

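    Karl is aware of that. "It could not be expressed as a /suffix in 5.14"
    (only as /(?a:...)/). Today the "a" right after the closing slash in that
    one-liner starts the word "and"; with an /a suffix it would be read as a
    modifier, so existing code like this would break. A deprecation cycle
    would be needed.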
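
    For comparison, a sketch of the inline form, assuming a Perl in which the
    modifier was implemented along the lines proposed in this thread (5.14 or
    later). The variable names are illustrative only:

    ```perl
    use strict;
    use warnings;
    use 5.014;   # (?a:...) needs a Perl that implements the modifier

    my $arabic_one = "\x{661}";   # ARABIC-INDIC DIGIT ONE

    # Plain \d follows Unicode rules on this string.
    my $plain_matches = $arabic_one =~ /\A\d\z/      ? 1 : 0;

    # Inside (?a:...), \d is restricted to ASCII 0-9.
    my $ascii_matches = $arabic_one =~ /\A(?a:\d)\z/ ? 1 : 0;
    my $ascii_digit   = "1"         =~ /\A(?a:\d)\z/ ? 1 : 0;

    print "plain \\d on U+0661:   $plain_matches\n";
    print "(?a:\\d) on U+0661:    $ascii_matches\n";
    print "(?a:\\d) on ASCII 1:   $ascii_digit\n";
    ```

    Because the modifier sits inside the pattern, nothing follows the closing
    delimiter, so the parsing problem with a trailing "and" does not arise.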

Discussion Overview
group: perl5-porters
categories: perl
posted: Sep 21, '10 at 6:28p
active: Nov 3, '10 at 2:12a
posts: 31
users: 14
website: perl.org
