It's not explicitly specified whether insignificant whitespace is allowed in
\c[...], \x[...], etc.

Std.pm allows e.g.

"\x[ 41 , 42 , 43 ]"

For convenience - especially with long charnames - it should be possible
to write

"\c[
SPACE, # blafasel
LATIN SMALL LETTER A, # some comment
COMBINING DOT BELOW, # thisandthat
]"

Helmut Wollmersdorfer
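
A minimal sketch of the proposed reading of \c[...], written in Python rather than Perl 6 so that it can be run as-is. It assumes that `#` starts a comment running to the end of the line, that commas separate names, and that whitespace around a name is insignificant. The function name `expand_charnames` is hypothetical, and this is not how STD.pm or Rakudo implement the escape.

```python
import unicodedata

def expand_charnames(body: str) -> str:
    # 'body' is the text between the brackets of a \c[...] escape.
    # Assumption: '#' starts a comment that runs to the end of the line.
    lines = [line.split("#", 1)[0] for line in body.splitlines()]
    text = " ".join(lines)
    chars = []
    for item in text.split(","):
        # Collapse leading, trailing and repeated whitespace.
        name = " ".join(item.split())
        if not name:
            continue  # tolerate empty items such as a trailing comma
        # unicodedata.lookup raises KeyError for an unknown name.
        chars.append(unicodedata.lookup(name))
    return "".join(chars)

example = """
SPACE, # blafasel
LATIN SMALL LETTER A, # some comment
COMBINING DOT BELOW, # thisandthat
"""
result = expand_charnames(example)
```

Here `result` is a three-character string: a space, "a", and U+0323. Silently skipping empty items is a design choice of this sketch; a stricter parser could reject them instead.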


  • Larry Wall at Apr 28, 2009 at 5:09 pm

    On Mon, Apr 27, 2009 at 11:04:03AM +0200, Helmut Wollmersdorfer wrote:
    > It's not explicitly specified whether insignificant whitespace is allowed in
    > \c[...], \x[...], etc.
    >
    > Std.pm allows e.g.
    >
    > "\x[ 41 , 42 , 43 ]"
    >
    > For convenience - especially with long charnames - it should be possible
    > to write
    >
    > "\c[
    > SPACE, # blafasel
    > LATIN SMALL LETTER A, # some comment
    > COMBINING DOT BELOW, # thisandthat
    > ]"

    Does anyone know offhand whether the Unicode Consortium has an explicit
    policy against use of punctuation in a charname? So far they only
    seem to use hyphen and parens, but I wonder to what extent we can
    depend on that...

    In any case, STD doesn't currently try to check the string in \c[...]
    for correctness. It just scans for the closing bracket. We will
    certainly need to refine this, and the suggested approach is certainly
    a possible outcome, if we decide it's sufficiently unambiguous.

    Larry
  • Mark J. Reed at Apr 28, 2009 at 5:28 pm

    On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall wrote:
    > Does anyone know offhand whether the Unicode Consortium has an explicit
    > policy against use of punctuation in a charname?  So far they only
    > seem to use hyphen and parens, but I wonder to what extent we can
    > depend on that...

    According to the 5.0.0 standard, section 4.8:

    "Unicode character names contain only uppercase Latin letters A
    through Z, digits, space, and hyphen-minus."

    So it seems the notes in parentheses are not considered part of the char name.

    --
    Mark J. Reed <markjreed@gmail.com>
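
If the rule quoted from section 4.8 holds, a comma or `#` can never occur inside a name, so both are safe to use as separators. A minimal sketch of checking that, in Python rather than Perl 6. It assumes Python's bundled `unicodedata` tables, which track a newer Unicode version than 5.0.0; the helper names and the default scan limit are hypothetical.

```python
import re
import unicodedata

# Character set quoted from section 4.8:
# uppercase A-Z, digits, space, and hyphen-minus.
NAME_RE = re.compile(r"[A-Z0-9 -]+")

def is_strict_charname(name: str) -> bool:
    # True if 'name' uses only the characters allowed by the quoted rule.
    return NAME_RE.fullmatch(name) is not None

def names_breaking_rule(limit: int = 0x3000) -> list:
    # Scan code points below 'limit' and collect any name outside the rule.
    bad = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if name and not is_strict_charname(name):
            bad.append((cp, name))
    return bad
```

Control characters have no formal name in these tables, so `unicodedata.name` falls back to the empty default and the scan skips them.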
  • Patrick R. Michaud at Apr 28, 2009 at 6:27 pm

    On Tue, Apr 28, 2009 at 01:28:40PM -0400, Mark J. Reed wrote:
    > On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall wrote:
    > > Does anyone know offhand whether the Unicode Consortium has an explicit
    > > policy against use of punctuation in a charname?  So far they only
    > > seem to use hyphen and parens, but I wonder to what extent we can
    > > depend on that...
    >
    > According to the 5.0.0 standard, section 4.8:
    >
    > "Unicode character names contain only uppercase Latin letters A
    > through Z, digits, space, and hyphen-minus."
    >
    > So it seems the notes in parentheses are not considered part of the char name.

    Countering this, though:

    * The XML schema for the "Unicode Character Database in XML" [1]
    seems to allow parens in the character name property:

    character-name = xsd:string { pattern="([A-Z0-9 #\-\(\)]*)|(<control>)" }

    * The Unicode character name database [2] has parens in the
    name property field for many characters

    000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

    * ICU doesn't seem to recognize the versions of the name without
    the parens (or if it does, I haven't been able to figure out the
    correct incantations to make it do so).

    Of course, it's very possible that I'm misreading the Unicode
    specifications, and the note that Mark cites would seem to be
    very explicit. But thus far in playing with this I've seen
    more indications that the parens are allowed or even required
    than I've seen that indicate they're excluded.

    Pm

    [1] http://www.unicode.org/reports/tr42/tr42-3.html#N66310
    [2] http://unicode.org/Public/UNIDATA/UnicodeData.txt
  • Mark J. Reed at Apr 28, 2009 at 7:08 pm

    On Tue, Apr 28, 2009 at 2:27 PM, Patrick R. Michaud wrote:
    > > According to the 5.0.0 standard, section 4.8:
    > >
    > > "Unicode character names contain only uppercase Latin letters A
    > > through Z, digits, space, and hyphen-minus."
    > >
    > > So it seems the notes in parentheses are not considered part of the char name.
    >
    > Countering this, though:
    >
    > * The XML schema for the "Unicode Character Database in XML" [1]
    > seems to allow parens in the character name property:
    >
    > character-name = xsd:string { pattern="([A-Z0-9 #\-\(\)]*)|(<control>)" }

    Also '#', though I see no character names containing that symbol.

    But all the parentheses I see in the list of character names are
    surrounding lowercase letters, which are explicitly disallowed not
    only in the spec I quoted, but in the XML schema definition you quote
    above, e.g.

    00C6 LATIN CAPITAL LETTER AE (ash)

    > * The Unicode character name database [2] has parens in the
    > name property field for many characters
    >
    > 000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

    That's not the name property field. The Unicode character name is
    field 1 ("<control>", in this case). The field whose value is "LINE
    FEED (LF)" is the Unicode_1_Name field, which for control characters
    supplies the ISO 6429 name.

    --
    Mark J. Reed <markjreed@gmail.com>
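
The field layout described above can be checked by splitting the quoted UnicodeData.txt line on semicolons. A minimal sketch in Python, assuming the 15-field layout in which field 1 is the name and field 10 is Unicode_1_Name. The function name and the dictionary keys are hypothetical.

```python
def parse_unicodedata_line(line: str) -> dict:
    # Fields are separated by ';' and counted from 0.
    fields = line.rstrip("\n").split(";")
    return {
        "code_point": int(fields[0], 16),  # field 0: hexadecimal code point
        "name": fields[1],                 # field 1: the character name
        "general_category": fields[2],     # field 2: e.g. "Cc" for controls
        "unicode_1_name": fields[10],      # field 10: Unicode_1_Name
    }

record = parse_unicodedata_line(
    "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"
)
```

For this line, `record["name"]` is "<control>" and `record["unicode_1_name"]` is "LINE FEED (LF)".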
  • Patrick R. Michaud at Apr 28, 2009 at 7:57 pm

    On Tue, Apr 28, 2009 at 03:08:05PM -0400, Mark J. Reed wrote:
    > On Tue, Apr 28, 2009 at 2:27 PM, Patrick R. Michaud wrote:
    > > * The Unicode character name database [2] has parens in the
    > > name property field for many characters
    > >
    > > 000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
    >
    > That's not the name property field. The Unicode character name is
    > field 1 ("<control>", in this case). The field whose value is "LINE
    > FEED (LF)" is the Unicode_1_Name field, which for control characters
    > supplies the ISO 6429 name.

    Ah, thanks for the excellent clarification.

    Returning to the original question: Would this then mean
    that we don't provide a way to specify U+000A and other control
    characters using a "name" inside of \c[...]?

    Or (more likely) does it mean that the names we accept inside
    of the \c[...] are more than just the strict
    "Unicode character names" listed above--i.e., the Unicode_1_Name
    field and other related aliases (whatever those might be)?

    Pm
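
One way to picture the second option is a lookup that tries the strict character name first and then falls back to a table of extra names. A minimal sketch in Python, not a proposal for the actual Perl 6 semantics. The table `EXTRA_NAMES`, its two entries (taken from the Unicode_1_Name field), and the function name are hypothetical.

```python
import unicodedata

# Hypothetical fallback table; the entries are illustrative only.
EXTRA_NAMES = {
    "LINE FEED (LF)": "\n",
    "CARRIAGE RETURN (CR)": "\r",
}

def resolve_charname(name: str) -> str:
    # First try the name as Python's Unicode tables know it.
    try:
        return unicodedata.lookup(name)
    except KeyError:
        pass
    # Then fall back to the extra names, e.g. Unicode_1_Name values.
    try:
        return EXTRA_NAMES[name]
    except KeyError:
        raise KeyError(f"unknown character name: {name!r}") from None
```

Accepting names with parentheses this way would mean the accepted set is no longer limited to the character set of strict names, which matters if punctuation is also used for separators or comments inside the brackets.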
  • Patrick R. Michaud at Apr 28, 2009 at 6:35 pm

    On Tue, Apr 28, 2009 at 07:22:18AM -0700, Larry Wall wrote:
    > On Mon, Apr 27, 2009 at 11:04:03AM +0200, Helmut Wollmersdorfer wrote:
    > > Std.pm allows e.g.
    > >
    > > "\x[ 41 , 42 , 43 ]"
    > >
    > > For convenience - especially with long charnames - it should be possible
    > > to write
    > >
    > > "\c[
    > > SPACE, # blafasel
    > > LATIN SMALL LETTER A, # some comment
    > > COMBINING DOT BELOW, # thisandthat
    > > ]"
    >
    > In any case, STD doesn't currently try to check the string in \c[...]
    > for correctness. It just scans for the closing bracket. We will
    > certainly need to refine this, and the suggested approach is certainly
    > a possible outcome, if we decide it's sufficiently unambiguous.

    FWIW, Rakudo and PGE now allow spaces inside the brackets, although they
    don't understand the # ... comments yet.

    Pm
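
A minimal sketch of the behaviour described here, using the \x[...] form: spaces around each item are accepted, but a `#` comment is not understood and is rejected. It is written in Python and is not the Rakudo or PGE code. The function name `expand_hex_list` is hypothetical.

```python
def expand_hex_list(body: str) -> str:
    # 'body' is the text between the brackets of a \x[...] escape.
    chars = []
    for item in body.split(","):
        digits = item.strip()  # spaces around an item are insignificant
        if not digits:
            raise ValueError("empty item in hex list")
        # int(..., 16) raises ValueError on anything that is not a hex
        # number, which includes an item followed by a '#' comment.
        chars.append(chr(int(digits, 16)))
    return "".join(chars)

abc = expand_hex_list(" 41 , 42 , 43 ")
```

With the input from the thread, `abc` is the string "ABC".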

Discussion Overview
group: perl6-language
categories: perl
posted: Apr 27, '09 at 9:06a
active: Apr 28, '09 at 7:57p
posts: 7
users: 4
website: perl6.org
