  • Gregory Ewing at Nov 17, 2013 at 3:41 am


    The type system looks very interesting!


    It's just a pity they based the syntax on C rather
    than something more enlightened. (Why do people
    keep doing that when they design languages?)


    --
    Greg
  • Chris Angelico at Nov 17, 2013 at 4:10 am

    On Sun, Nov 17, 2013 at 2:41 PM, Gregory Ewing wrote:
    Neal Becker wrote:

    The type system looks very interesting!

    It's just a pity they based the syntax on C rather
    than something more enlightened. (Why do people
    keep doing that when they design languages?)

    Because in many ways it's an excellent syntactic structure, and - more
    importantly - it's one that's familiar to a huge number of
    programmers. That's pretty valuable.


    ChrisA
  • Cameron Simpson at Nov 17, 2013 at 4:44 am

    On 17Nov2013 15:10, Chris Angelico wrote:
    On Sun, Nov 17, 2013 at 2:41 PM, Gregory Ewing
    wrote:
    Neal Becker wrote:
    The type system looks very interesting!

    It's just a pity they based the syntax on C rather
    than something more enlightened. (Why do people
    keep doing that when they design languages?)
    Because in many ways it's an excellent syntactic structure, and - more
    importantly - it's one that's familiar to a huge number of
    programmers. That's pretty valuable.

    Indeed. If your core innovation is the type system (for example),
    why be _gratuitously_ different in areas where your language semantics
    are conventional?


    And of course your default syntax will come from what you're
    comfortable with unless the syntax is something you're rebelling
    against.
    --
    Cameron Simpson <cs@zip.com.au>


    If you can keep your head while all those about you are losing theirs,
    perhaps you don't understand the situation.
             - Paul Wilson <paul_wilson.dbs@dbsnotes.dbsoftware.com>
  • Steven D'Aprano at Nov 17, 2013 at 5:48 am

    On Sun, 17 Nov 2013 16:41:07 +1300, Gregory Ewing wrote:


    Neal Becker wrote:
    The type system looks very interesting!

    It's just a pity they based the syntax on C rather than something more
    enlightened. (Why do people keep doing that when they design languages?)



    When the only tool you've used is a hammer, every tool you design ends up
    looking like a hammer.




    --
    Steven
  • Jkn at Nov 17, 2013 at 8:34 am
    Hi Stephen


    On Sunday, 17 November 2013 05:48:58 UTC, Steven D'Aprano wrote:

    [...]
    It's just a pity they based the syntax on C rather than something more
    enlightened. (Why do people keep doing that when they design languages?)

    When the only tool you've used is a hammer, every tool you design ends up
    looking like a hammer.

    true, and yet ... if [I] were to design a hammer, would you be justified in assuming that that is the only tool I know about?


         J^n
  • Mark Lawrence at Nov 17, 2013 at 12:41 pm

    On 17/11/2013 03:41, Gregory Ewing wrote:
    Neal Becker wrote:
    The type system looks very interesting!

    It's just a pity they based the syntax on C rather
    than something more enlightened. (Why do people
    keep doing that when they design languages?)

    As a rule of thumb people don't like change? This obviously assumes
    that language designers are people :)


    --
    Python is the second best programming language in the world.
    But the best has yet to be invented. Christian Tismer


    Mark Lawrence
  • Gregory Ewing at Nov 17, 2013 at 10:33 pm

    Mark Lawrence wrote:


    As a rule of thumb people don't like change? This obviously assumes
    that language designers are people :)

    That's probably true (on both counts).


    I guess this means we need to encourage more
    Pythoneers to become language designers!


    --
    Greg
  • Tim Daneliuk at Nov 17, 2013 at 10:48 pm

    On 11/17/2013 04:33 PM, Gregory Ewing wrote:
    Mark Lawrence wrote:
    As a rule of thumb people don't like change? This obviously assumes that language designers are people :)
    That's probably true (on both counts).

    I guess this means we need to encourage more
    Pythoneers to become language designers!

    Ahem, I already commented on this in some detail:


         https://mail.python.org/pipermail/python-list/2004-September/241055.html


    --
    ----------------------------------------------------------------------------
    Tim Daneliuk tundra at tundraware.com
    PGP Key: http://www.tundraware.com/PGP/
  • Mark Lawrence at Nov 18, 2013 at 11:51 pm

    On 17/11/2013 22:48, Tim Daneliuk wrote:
    On 11/17/2013 04:33 PM, Gregory Ewing wrote:
    Mark Lawrence wrote:
    As a rule of thumb people don't like change? This obviously assumes
    that language designers are people :)
    That's probably true (on both counts).

    I guess this means we need to encourage more
    Pythoneers to become language designers!
    Ahem, I already commented on this in some detail:


    https://mail.python.org/pipermail/python-list/2004-September/241055.html

    Fantastic, very promising indeed. I know it needs bringing up to date,
    but to make it fly can I safely assume that we'll be seeing a PEP fairly
    shortly?


    As an aside, I noticed that the previous message was "negative stride
    list slices", why do I have a strong sense of deja vu?


    I refuse to mention another message that I noticed whilst browsing, on
    the grounds that I don't want to be accused of multiple manslaughter by
    way of causing heart attacks :)


    --
    Python is the second best programming language in the world.
    But the best has yet to be invented. Christian Tismer


    Mark Lawrence
  • Tim Daneliuk at Nov 19, 2013 at 12:31 am

    On 11/18/2013 05:51 PM, Mark Lawrence wrote:
    can I safely assume that we'll be seeing a PEP fairly shortly?



    For Immediate Press Release:




    We at TundraWare are now entering our 10th year of debate in the YAPDL
    design as to what ought to be a statement and what ought to be a function.
    The Statementists are currently winning 3 bouts to 2 over the
    Functionists but there is much more gnashing of teeth and wringing of
    hands to come. We remain true to the original vision of the language as
    an unwanted appendage to Python which will promote fractionalisation and
    thus improve opportunity for future billings.


    We are also contemplating an offshoot language that melds the best of Java
    into YAPDL. Known as JAPDL ("Jah.piddle") it is targeted particularly
    to Rastafari programmers worldwide. The primary contribution of JAPDL
    to the language arts is the replacement of the GIL (Global Interpreter Lock)
    with the much simpler, DR (Dread Lock).
    ----------------------------------------------------------------------------
    Tim Daneliuk tundra at tundraware.com
    PGP Key: http://www.tundraware.com/PGP/
  • Chris Angelico at Nov 18, 2013 at 12:42 am

    On Mon, Nov 18, 2013 at 9:33 AM, Gregory Ewing wrote:
    Mark Lawrence wrote:
    As a rule of thumb people don't like change? This obviously assumes that
    language designers are people :)

    That's probably true (on both counts).

    I guess this means we need to encourage more
    Pythoneers to become language designers!

    Easy! Just make Python really bad in every way except syntax. Then
    people will be constantly thinking "If only Python were more X and
    less Y... great syntax but the language sucks in so many ways!" and
    they'll borrow the syntax into their new languages.


    If you're setting out to create a new language, you probably want it
    to be "Foo, except X" for some Foo and X. So you'll keep everything
    about Foo that doesn't conflict with your changes. I would expect to
    see Python-like syntax in a language that's designed to be "Python,
    except compilable to C for performance"... and whaddayaknow, Cython
    fits that description. Thing is, Python is just so much better than
    (C, C#, JavaScript, Java) that there's hardly as much impetus to
    create a new language.


    ChrisA
  • Rick Johnson at Nov 18, 2013 at 12:18 am

    On Saturday, November 16, 2013 9:41:07 PM UTC-6, Gregory Ewing wrote:
    The type system looks very interesting!

    Indeed.


    I went to the site assuming this would be another language
    that i would never like, however, after a few minutes
    reading the tour, i could not stop!


    I read through the entire tour with excitement, all the while
    actually yelling; "yes" and sometimes even "yes, yes, YES"


    But not only is the language interesting, the web site
    itself is phenomenal! This is a fine example of twenty first
    century design at work.


    I've always found the Python web site to be a cluttered
    mess, but ceylon-lang.org is just the opposite! A clean and
    simplistic web site with integrated console fiddling --
    heck, they even took the time to place a button near every
    example!


    Some of the aspects of Ceylon's syntax i find interesting are:


         Instead of using single, double, and triple quotes to
         basically represent the same literals ceylon decided to
         implement each uniquely. Also, back-tick interpolation
         and Unicode embedding is much more elegant!


         The use of a post-fix question mark to denote a
         declared Type that can optionally be null.


         The ceylon designers ACTUALLY understand what the
         word "variable" means!


         Immutable attributes, yes, yes, YES!


         The multiplication operator can ONLY be used on
         numerics. Goodbye subtle bug!


         Explicit "return" required in methods/functions!


         No "default initialization to null"


         No omitting braces in control structures
         (Consistency is the key!!!)


         The assert statement is much more useful than
         Python's


         The "tagging" of iterable types using regexp
         inspired syntax "*" and "+" is an interesting idea


         Conditional logic is both concise and explicit using
         "exists" and "nonempty" over the implicit "if value:"


         Range objects are vastly superior to Python's lowly
         range() func.


         Comprehensions are ordered more logically than
         Python IMO, since i want to know where i'm looking
         BEFORE i find out what will be the return value




             Ceylon: [for (p in people) p.name]
             Python: [p.name for p in people]
             Ruby: people.collect{|p| p.name}


             Ceylon: for (i in 0..100) if (i%3==0) i
             Python: [i for i in range(100) if i%3==0]
             Ruby: (1..10).select{|x| x%3==0}


             Funny thing is, out of all three languages,
             Ruby's syntax is linear and therefore
             easiest to read. Ruby is the language i
             WANT to love but i can't :( due to too many
             inconsistencies. But this example shines!
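
    For readers who want to run the Python side of the comparison, here is a
    minimal sketch. The `Person` record and the sample names are made up for
    illustration. Note that `range(100)` stops at 99, so matching a range that
    includes 100 needs `range(101)`.

```python
from collections import namedtuple

# Hypothetical record type and sample data, used only for illustration.
Person = namedtuple("Person", ["name"])
people = [Person("Ada"), Person("Grace"), Person("Linus")]

# Python puts the result expression first, then the source, then the filter.
names = [p.name for p in people]
multiples_of_three = [i for i in range(101) if i % 3 == 0]

# The same first comprehension written as an explicit loop, which reads in
# "source first, result last" order.
names_loop = []
for p in people:
    names_loop.append(p.name)

print(names)
print(multiples_of_three[:5])
```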

    It's just a pity they based the syntax on C rather
    than something more enlightened. (Why do people
    keep doing that when they design languages?)

    What do you have in mind?


    Please elaborate because we could use a good intelligent
    conversation, instead of rampant troll posts.
  • Gregory Ewing at Nov 18, 2013 at 6:45 am

    Rick Johnson wrote:
    The multiplication operator can ONLY be used on
    numerics.

    I'm not convinced about that part. I notice that
    subtraction, multiplication and division are bundled
    into a single interface Numeric, but there is a
    separate one called Summable for addition --
    apparently so that they could use + for string
    concatenation.


    This seems to be a case of one rule for the language
    designers and a different one for everyone else.
    If it's okay for '+' to be used on something that's
    not a number, why not '*'?


    --
    Greg
  • Chris Angelico at Nov 18, 2013 at 6:56 am

    On Mon, Nov 18, 2013 at 5:45 PM, Gregory Ewing wrote:
    Rick Johnson wrote:
    The multiplication operator can ONLY be used on
    numerics.

    I'm not convinced about that part. I notice that
    subtraction, multiplication and division are bundled
    into a single interface Numeric, but there is a
    separate one called Summable for addition --
    apparently so that they could use + for string
    concatenation.

    This seems to be a case of one rule for the language
    designers and a different one for everyone else.
    If it's okay for '+' to be used on something that's
    not a number, why not '*'?

    That's something Java did (using + for strings, but not supporting
    operator overloading for custom classes, so you can't make your own
    string-like or number-like class and use + with it), and IMO it's one
    of the language's annoying flaws. Give people the power to use
    whatever operator they choose in whatever way they choose, and accept
    that occasionally you'll get less-than-stellar usage. It's a cost that
    you pay happily when you let people name their own functions; why not
    give the same freedom for operators?
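
    As a point of comparison, a minimal sketch of that freedom in Python,
    where any class can define `+` and `*` through `__add__` and `__mul__`.
    The `Vec2` class is hypothetical and exists only for this example.

```python
class Vec2:
    """Hypothetical 2-D vector, used only to illustrate operator overloading."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        # v + w adds component by component.
        return Vec2(self.x + other.x, self.y + other.y)

    def __mul__(self, k):
        # v * k scales both components by a number.
        return Vec2(self.x * k, self.y * k)

    def __eq__(self, other):
        return isinstance(other, Vec2) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Vec2({self.x}, {self.y})"


print(Vec2(1, 2) + Vec2(3, 4))
print(Vec2(1, 2) * 3)
# The built-in str type uses the same two operators for non-numeric purposes.
print("ab" + "cd", "ab" * 3)
```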


    ChrisA
  • Wxjmfauth at Nov 18, 2013 at 9:44 am
    character
    Satisfied Interfaces: Comparable<Character>, Enumerable<Character>, Ordinal<Other>
    A 32-bit Unicode character.
    Satisfied Interfaces: Category, Cloneable<List<Element>>, Collection<Element>,
    Comparable<String>, Correspondence<Integer,Element>, Iterable<Element,Null>,
    List<Character>, Ranged<Integer,String>, Summable<String>




    string
    Satisfied Interfaces: Category, Cloneable<List<Element>>, Collection<Element>,
    Comparable<String>, Correspondence<Integer,Element>, Iterable<Element,Null>,
    List<Character>, Ranged<Integer,String>, Summable<String>
    A string of characters. Each character in the string is a 32-bit Unicode
    character. The internal UTF-16 encoding is hidden from clients.
    A string is a Category of its Characters, and of its substrings:




    Clean. Far, far away from a unicode handling which may require
    18 bytes (!) more to encode a non-ASCII n-char string than an
    ASCII n-char string.
    (With performances following expectedly "globally" the same logic)

    >>> import sys
    >>> sys.getsizeof('a')
    26
    >>> sys.getsizeof('\U0001d11e')
    44
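
    For context, a minimal sketch of where a size difference like this comes
    from. It assumes CPython 3.3 or later, where the flexible string
    representation (PEP 393) stores a string at 1, 2, or 4 bytes per character
    depending on its widest character. `bytes_per_char` is a hypothetical
    helper. Absolute sizes vary by version and platform, so only the
    per-character growth is measured.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Marginal storage cost in bytes of one more copy of `ch`.

    Comparing two long strings cancels out the fixed per-object header,
    which differs between Python versions and platforms.
    Assumes CPython 3.3+; other implementations may behave differently.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = {
    "ASCII 'a'": "a",
    "Latin-1 U+00E9": "\xe9",
    "BMP U+20AC": "\u20ac",
    "astral U+1D11E": "\U0001d11e",
}

for label, ch in samples.items():
    # Prints the size of the one-character string and the growth per character.
    print(label, sys.getsizeof(ch), bytes_per_char(ch))
```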




    jmf
  • Mark Lawrence at Nov 18, 2013 at 9:56 am

    On 18/11/2013 09:44, wxjmfauth at gmail.com wrote:
    character
    Satisfied Interfaces: Comparable<Character>, Enumerable<Character>, Ordinal<Other>
    A 32-bit Unicode character.
    Satisfied Interfaces: Category, Cloneable<List<Element>>, Collection<Element>,
    Comparable<String>, Correspondence<Integer,Element>, Iterable<Element,Null>,
    List<Character>, Ranged<Integer,String>, Summable<String>


    string
    Satisfied Interfaces: Category, Cloneable<List<Element>>, Collection<Element>,
    Comparable<String>, Correspondence<Integer,Element>, Iterable<Element,Null>,
    List<Character>, Ranged<Integer,String>, Summable<String>
    A string of characters. Each character in the string is a 32-bit Unicode
    character. The internal UTF-16 encoding is hidden from clients.
    A string is a Category of its Characters, and of its substrings:


    Clean. Far, far away from a unicode handling which may require
    18 bytes (!) more to encode a non-ASCII n-char string than an
    ASCII n-char string.
    (With performances following expectedly "globally" the same logic)
    sys.getsizeof('a')
    26
    sys.getsizeof('\U0001d11e')
    44


    jmf

    In [3]: sys.getsizeof(1)
    Out[3]: 14


    What a disaster, 13 bytes wasted storing 1. I'll just rush off to the
    bug tracker and raise an issue to get the entire CPython core rewritten
    before Armageddon strikes.


    --
    Python is the second best programming language in the world.
    But the best has yet to be invented. Christian Tismer


    Mark Lawrence
  • Chris Angelico at Nov 18, 2013 at 10:04 am

    On Mon, Nov 18, 2013 at 8:44 PM, wrote:
    string
    Satisfied Interfaces: Category, Cloneable<List<Element>>, Collection<Element>,
    Comparable<String>, Correspondence<Integer,Element>, Iterable<Element,Null>,
    List<Character>, Ranged<Integer,String>, Summable<String>
    A string of characters. Each character in the string is a 32-bit Unicode
    character. The internal UTF-16 encoding is hidden from clients.
    A string is a Category of its Characters, and of its substrings:

    I'm trying to figure this out. Reading the docs hasn't answered this.
    If each character in a string is a 32-bit Unicode character, and (as
    can be seen in the examples) string indexing and slicing are
    supported, then does string indexing mean counting from the beginning
    to see if there were any surrogate pairs?


    ChrisA
  • Ian Kelly at Nov 18, 2013 at 12:29 pm

    On Nov 18, 2013 3:06 AM, "Chris Angelico" wrote:
    I'm trying to figure this out. Reading the docs hasn't answered this.
    If each character in a string is a 32-bit Unicode character, and (as
    can be seen in the examples) string indexing and slicing are
    supported, then does string indexing mean counting from the beginning
    to see if there were any surrogate pairs?

    The string reference says:


    """Since a String has an underlying UTF-16 encoding, certain operations are
    expensive, requiring iteration of the characters of the string. In
    particular, size requires iteration of the whole string, and get(), span(),
    and segment() require iteration from the beginning of the string to the
    given index."""


    The get and span operations appear to be equivalent to indexing and slicing.
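
    To make the cost concrete, here is a minimal sketch in Python (not Ceylon)
    of indexing by code point over UTF-16 code units. It is not Ceylon's actual
    implementation; `utf16_units` and `code_point_at` are hypothetical helpers.
    Because a character may occupy one or two code units, the lookup has to
    scan from the start of the string.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_point_at(units, index):
    """Return the code point at position `index`, scanning from the start: O(n)."""
    i = 0      # position in code units
    seen = 0   # code points passed so far
    while i < len(units):
        unit = units[i]
        # A high surrogate followed by a low surrogate encodes one character.
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if seen == index:
            if is_pair:
                return 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            return unit
        i += 2 if is_pair else 1
        seen += 1
    raise IndexError(index)


units = utf16_units("a\U0001d11eb")
print(len(units))                    # 4 code units for 3 characters
print(hex(code_point_at(units, 1)))  # 0x1d11e
```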
  • Chris Angelico at Nov 18, 2013 at 12:36 pm

    On Mon, Nov 18, 2013 at 11:29 PM, Ian Kelly wrote:
    On Nov 18, 2013 3:06 AM, "Chris Angelico" wrote:

    I'm trying to figure this out. Reading the docs hasn't answered this.
    If each character in a string is a 32-bit Unicode character, and (as
    can be seen in the examples) string indexing and slicing are
    supported, then does string indexing mean counting from the beginning
    to see if there were any surrogate pairs?
    The string reference says:

    """Since a String has an underlying UTF-16 encoding, certain operations are
    expensive, requiring iteration of the characters of the string. In
    particular, size requires iteration of the whole string, and get(), span(),
    and segment() require iteration from the beginning of the string to the
    given index."""

    The get and span operations appear to be equivalent to indexing and slicing.

    Right, that's what I was looking for and didn't find. (I was searching
    the one-page reference manual rather than reading in detail.) So, yes,
    they're O(n) operations. Thanks for hunting that down.


    ChrisA
  • Piet van Oostrum at Nov 18, 2013 at 2:31 pm

    Chris Angelico <rosuav@gmail.com> writes:

    On Mon, Nov 18, 2013 at 11:29 PM, Ian Kelly wrote:
    On Nov 18, 2013 3:06 AM, "Chris Angelico" wrote:

    I'm trying to figure this out. Reading the docs hasn't answered this.
    If each character in a string is a 32-bit Unicode character, and (as
    can be seen in the examples) string indexing and slicing are
    supported, then does string indexing mean counting from the beginning
    to see if there were any surrogate pairs?
    The string reference says:

    """Since a String has an underlying UTF-16 encoding, certain operations are
    expensive, requiring iteration of the characters of the string. In
    particular, size requires iteration of the whole string, and get(), span(),
    and segment() require iteration from the beginning of the string to the
    given index."""

    The get and span operations appear to be equivalent to indexing and slicing.
    Right, that's what I was looking for and didn't find. (I was searching
    the one-page reference manual rather than reading in detail.) So, yes,
    they're O(n) operations. Thanks for hunting that down.

    ChrisA

    It would be so much better to use the Flexible String Representation.
    --
    Piet van Oostrum <piet@vanoostrum.org>
    WWW: http://pietvanoostrum.com/
    PGP key: [8DAE142BE17999C4]
  • Mark Lawrence at Nov 18, 2013 at 3:06 pm

    On 18/11/2013 14:31, Piet van Oostrum wrote:
    Chris Angelico <rosuav@gmail.com> writes:
    On Mon, Nov 18, 2013 at 11:29 PM, Ian Kelly wrote:
    On Nov 18, 2013 3:06 AM, "Chris Angelico" wrote:

    I'm trying to figure this out. Reading the docs hasn't answered this.
    If each character in a string is a 32-bit Unicode character, and (as
    can be seen in the examples) string indexing and slicing are
    supported, then does string indexing mean counting from the beginning
    to see if there were any surrogate pairs?
    The string reference says:

    """Since a String has an underlying UTF-16 encoding, certain operations are
    expensive, requiring iteration of the characters of the string. In
    particular, size requires iteration of the whole string, and get(), span(),
    and segment() require iteration from the beginning of the string to the
    given index."""

    The get and span operations appear to be equivalent to indexing and slicing.
    Right, that's what I was looking for and didn't find. (I was searching
    the one-page reference manual rather than reading in detail.) So, yes,
    they're O(n) operations. Thanks for hunting that down.

    ChrisA
    It would be so much better to use the Flexible String Representation.

    I agree but approximately 0.0000000142857% of the world population
    disagrees.


    --
    Python is the second best programming language in the world.
    But the best has yet to be invented. Christian Tismer


    Mark Lawrence
  • Steven D'Aprano at Nov 18, 2013 at 1:31 pm

    On Mon, 18 Nov 2013 21:04:41 +1100, Chris Angelico wrote:

    On Mon, Nov 18, 2013 at 8:44 PM, wrote:
    string
    Satisfied Interfaces: Category, Cloneable<List<Element>>,
    Collection<Element>, Comparable<String>,
    Correspondence<Integer,Element>, Iterable<Element,Null>,
    List<Character>, Ranged<Integer,String>, Summable<String> A string of
    characters. Each character in the string is a 32-bit Unicode character.
    The internal UTF-16 encoding is hidden from clients. A string is a
    Category of its Characters, and of its substrings:
    I'm trying to figure this out. Reading the docs hasn't answered this. If
    each character in a string is a 32-bit Unicode character, and (as can be
    seen in the examples) string indexing and slicing are supported, then
    does string indexing mean counting from the beginning to see if there
    were any surrogate pairs?

    I can't figure out what that means, since it contradicts itself. First it
    says *every* character is 32-bits (presumably UTF-32), then it says that
    internally it uses UTF-16. At least one of these statements is wrong.
    (They could both be wrong, but they can't both be right.)


    Unless they have done something *really* clever, the language designers
    lose a hundred million points for screwing up text strings. There is
    *absolutely no excuse* for a new, modern language with no backwards
    compatibility concerns to choose one of the three bad choices:


    * choose UTF-16 or UTF-8, and have O(n) primitive string operations (like
    Haskell and, apparently, Ceylon);


    * or UTF-16 without support for the supplementary planes (which makes it
    virtually UCS-2), like Javascript;


    * choose UTF-32, and use two or four times as much memory as needed.




    --
    Steven
  • Chris Angelico at Nov 18, 2013 at 1:39 pm

    On Tue, Nov 19, 2013 at 12:31 AM, Steven D'Aprano wrote:
    Unless they have done something *really* clever, the language designers
    lose a hundred million points for screwing up text strings. There is
    *absolutely no excuse* for a new, modern language with no backwards
    compatibility concerns to choose one of the three bad choices:

    Yeah, but this compiles to JS, so it does have that backward compat
    issue - unless it's going to represent a Ceylon string as something
    other than a JS string (maybe an array of integers??), which would
    probably cost even more.


    You're absolutely right, except in the premise that Ceylon is a new
    and unshackled language. At least this way, if anyone actually
    implements Ceylon directly in the browser, it can use something
    smarter as its backend, without impacting code in any way (other than
    performance). I'd much rather they go for O(n) string primitives than
    maintaining the user-visible UTF-16 bug.


    ChrisA
  • Steven D'Aprano at Nov 18, 2013 at 2:30 pm

    On Mon, 18 Nov 2013 13:31:33 +0000, Steven D'Aprano wrote:

    On Mon, 18 Nov 2013 21:04:41 +1100, Chris Angelico wrote:
    On Mon, Nov 18, 2013 at 8:44 PM, wrote:
    string
    Satisfied Interfaces: Category, Cloneable<List<Element>>,
    Collection<Element>, Comparable<String>,
    Correspondence<Integer,Element>, Iterable<Element,Null>,
    List<Character>, Ranged<Integer,String>, Summable<String> A string of
    characters. Each character in the string is a 32-bit Unicode
    character. The internal UTF-16 encoding is hidden from clients. A
    string is a Category of its Characters, and of its substrings:
    I'm trying to figure this out. Reading the docs hasn't answered this.
    If each character in a string is a 32-bit Unicode character, and (as
    can be seen in the examples) string indexing and slicing are supported,
    then does string indexing mean counting from the beginning to see if
    there were any surrogate pairs?
    I can't figure out what that means, since it contradicts itself. First
    it says *every* character is 32-bits (presumably UTF-32), then it says
    that internally it uses UTF-16. At least one of these statements is
    wrong. (They could both be wrong, but they can't both be right.)

    Mystery solved: characters are only 32-bits in isolation, when plucked
    out of a string.


    http://ceylon-lang.org/documentation/tour/language-module/#characters_and_character_strings


    Ceylon strings are arrays of UTF-16 characters. However, the language
    supports characters in the Supplementary Multilingual Plane by having
    primitive string operations walk the string a code point at a time. When
    you extract a character out of the string, Ceylon gives you four bytes.
    Presumably, if you do something like this:


    # Python syntax, not Ceylon
    mystring = "a\U0010FFFF"
    c = mystring[0]
    d = mystring[1]


    c will consist of bytes 0000 0061 and d will consist of the surrogate
    pair DBFF DFFF (the UTF-16BE encoding of code point U+10FFFF, modulo
    big-endian versus little-endian). Or possibly the UTF-32 encoding, 0010 FFFF.
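
    The byte values quoted here can be checked from Python 3.5 or later with
    the standard codecs. This only confirms the encodings themselves, not what
    Ceylon does internally.

```python
mystring = "a\U0010FFFF"
c = mystring[0]
d = mystring[1]

# UTF-16BE: "a" is one code unit, U+10FFFF is a surrogate pair.
print(c.encode("utf-16-be").hex())  # 0061
print(d.encode("utf-16-be").hex())  # dbffdfff

# UTF-32BE: every code point is four bytes.
print(c.encode("utf-32-be").hex())  # 00000061
print(d.encode("utf-32-be").hex())  # 0010ffff
```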


    I suppose that's not terrible, except for the O(n) string operations
    which is just dumb. Yes, it's better than buggy, broken strings. But
    still dumb, because those aren't the only choices. For example, for the
    sake of an extra two bytes at the start of each string, they could store
    a flag and a length:


    - one bit to flag whether the string contained any surrogate pairs or
    not; if not, string ops could assume two-bytes per char and be O(1), if
    the flag was set it could fall back to the slower technique;


    - 15 bits for a length.


    15 bits give you a maximum length of 32767. There are ways around that.
    E.g. a length of 0 through 32766 means exactly what it says; a length of
    32767 means that the next two bytes are part of the length too, giving
    you a maximum of 4294967295 characters per string. That's an 8GB string.
    Surely big enough for anyone :-)


    That gives you O(1) length for *any* string, and O(1) indexing operations
    for those that are entirely in the BMP, which will be most strings for
    most people. It's not 1970 anymore, it's time for strings to be treated
    more seriously and not just as dumb arrays of char. Even back in the
    1970s Pascal had a length byte. It astonishes me that hardly any low-
    level language follows their lead.
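
    A rough sketch of the idea in Python, under stated assumptions: the class
    name `FlaggedUtf16String` is made up, and the flag and length are kept as
    ordinary attributes rather than packed into a two-byte header. It shows the
    behaviour described above, namely O(1) length for any string, O(1) indexing
    when no surrogate pairs are present, and a linear walk otherwise.

```python
import struct


class FlaggedUtf16String:
    """Toy string: UTF-16 code units plus a cached length and a surrogate flag."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        self.units = struct.unpack("<%dH" % (len(data) // 2), data)
        self.length = len(text)  # number of code points, computed once
        # More code units than code points means at least one surrogate pair.
        self.has_surrogates = len(self.units) != self.length

    def __len__(self):
        return self.length  # O(1) for any string

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        if not self.has_surrogates:
            return chr(self.units[index])  # fast path: one unit per character
        # Slow path: walk the code units, skipping two for each surrogate pair.
        i = 0
        for _ in range(index):
            i += 2 if 0xD800 <= self.units[i] <= 0xDBFF else 1
        unit = self.units[i]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[i + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)


bmp = FlaggedUtf16String("h\xe9llo")
astral = FlaggedUtf16String("a\U0001d11eb")
print(bmp.has_surrogates, len(bmp), bmp[1])
print(astral.has_surrogates, len(astral), astral[2])
```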






    --
    Steven
  • Dave Angel at Nov 18, 2013 at 8:37 pm

    On 18 Nov 2013 14:30:54 GMT, Steven D'Aprano wrote:
    - 15 bits for a length.
    15 bits give you a maximum length of 32767. There are ways around that.
    E.g. a length of 0 through 32766 means exactly what it says; a length of
    32767 means that the next two bytes are part of the length too, giving
    you a maximum of 4294967295 characters per string. That's an 8GB string.
    Surely big enough for anyone :-)

    If you use nearly all of the possible 2 byte values then adding 2
    more bytes won't give you anywhere near 4 billion characters.
    You're perhaps thinking of bringing in four more bytes when the
    length exceeds 32k.
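
    The arithmetic behind this correction, as a short sketch. It assumes that
    the 15-bit value 32767 acts purely as an escape marker, so the extended
    length lives entirely in the extra bytes. The variable names are
    illustrative.

```python
escape_marker = 2**15 - 1         # 32767, the largest 15-bit value
with_two_more_bytes = 2**16 - 1   # 65535, the most a 16-bit extension can hold
with_four_more_bytes = 2**32 - 1  # 4294967295, the figure quoted in the thread

print(escape_marker, with_two_more_bytes, with_four_more_bytes)

# At two bytes per UTF-16 code unit, about 2**32 characters is about 8 GiB.
print((with_four_more_bytes + 1) * 2 // 2**30)
```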


    --
    DaveA
  • Chris Angelico at Nov 18, 2013 at 11:25 pm

    On Tue, Nov 19, 2013 at 1:30 AM, Steven D'Aprano wrote:
    I suppose that's not terrible, except for the O(n) string operations
    which is just dumb. Yes, it's better than buggy, broken strings. But
    still dumb, because those aren't the only choices. For example, for the
    sake of an extra two bytes at the start of each string, they could store
    a flag and a length:

    True, but I suspect that _any_ variance from JS strings would have
    significant impact on the performance of everything that crosses the
    boundary. If anything, I'd be looking at a permanent 32-bit shim on
    the string (rather than the 16-or-32-bit that you describe, or the
    16-or-48-bit that Dave clarifies your theory as needing); that would
    allow strings up to 2GB (31 bits of pure binary length), and exceeding
    that could just raise a RuntimeError. Then, passing any string to a JS
    method would simply mean trimming off the first two code units.


    But the problem is also with strings coming back from JS. Every time
    you get something crossing from JS to Ceylon, you have to walk it,
    count up its length, and see if it has any surrogates (and somehow
    deal with mismatched surrogates). Every string, even if all you're
    going to do is give it straight back to JS in the next line of code.
    Potentially quite expensive, and surprisingly so - as opposed to
    simply saying "string indexing can be slow on large strings", which
    puts the cost against a visible line of code.


    ChrisA
  • Steven D'Aprano at Nov 19, 2013 at 2:13 am

    On Tue, 19 Nov 2013 10:25:00 +1100, Chris Angelico wrote:


    But the problem is also with strings coming back from JS.

    Just because you call it a "string" in Ceylon, doesn't mean you have to
    use the native Javascript string type unchanged.


    Since the Ceylon compiler controls what Javascript operations get called
    (the user never writes any Javascript directly), the compiler can tell
    which operations potentially add surrogates. Since strings are immutable
    in Ceylon, a slice of a BMP-only string is also BMP-only; concatenating
    two BMP-only strings gives a BMP-only string. I expect that uppercasing
    or lowercasing such strings will also keep the same invariant, but if
    not, well, you already have to walk the string to convert it, walking it
    again should be no more expensive.
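
    The slicing and concatenation invariants are easy to check in Python. This
    is a minimal sketch; `is_bmp_only` is a hypothetical helper and the sample
    strings are arbitrary. Case conversion is left out, since the text above
    only expects it to preserve the invariant.

```python
def is_bmp_only(text):
    """True if every code point in `text` lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


a = "hello "
b = "w\xf6rld \u20ac"
joined = a + b

print(is_bmp_only(a), is_bmp_only(b))  # both inputs are BMP-only
print(is_bmp_only(joined))             # so is their concatenation
print(is_bmp_only(joined[3:9]))        # and any slice of it
print(is_bmp_only("a\U0001d11e"))      # an astral character breaks the property
```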


    The point is not that my off-the-top-of-my-head pseudo-implementation was
    optimal in all details, but that *text strings* should be decent data
    structures with smarts, not dumb arrays of variable-width characters. If
    that means avoiding dumb-array-of-char naive implementations, and writing
    your own, that's part of the compiler writer's job.


    Python strings can include null bytes, unlike C, even when built on top
    of C. They know their length, unlike C, even when built on top of C. Just
    because the native Java and Javascript string types don't do these
    things, doesn't mean that they can't be done in Javascript.



    - as opposed to simply saying "string
    indexing can be slow on large strings", which puts the cost against a
    visible line of code.

    For all we know, Ceylon already does something like this, but merely
    doesn't advertise the fact that while it *can* be slow, it can *also* be
    fast. It's an implementation detail, perhaps, much like string
    concatenation in Python officially requires building a new string, but in
    CPython sometimes it can append to the original string.




    Still, given that Pike and Python have already solved this problem, and
    have O(1) string indexing operations and length for any Unicode string,
    SMP and BMP, it is a major disappointment that Ceylon doesn't.






    --
    Steven
  • Chris Angelico at Nov 19, 2013 at 2:54 am

    On Tue, Nov 19, 2013 at 1:13 PM, Steven D'Aprano wrote:
    On Tue, 19 Nov 2013 10:25:00 +1100, Chris Angelico wrote:

    But the problem is also with strings coming back from JS.
    Just because you call it a "string" in Ceylon, doesn't mean you have to
    use the native Javascript string type unchanged.

    Indeed not, but there are going to be many MANY cases where a JS
    string has to become a Ceylon string and vice versa - a lot more often
    than CPython drops to C. For instance, suppose you run your Ceylon
    code inside a web browser. Pick up pretty much any piece of JavaScript
    code from any web page - how much string manipulation does it do, and
    how much does it call on various DOM methods? In CPython, only a small
    number of Python functions will end up dropping to C APIs to do their
    work (and most of those will have to do some manipulation along the
    way somewhere - eg chances are print()/sys.stdout.write() will
    eventually have to encode its output to 8-bit before passing it to
    some byte-oriented underlying stream, so the actual representation of
    a Python string doesn't matter); in browser-based work, that is
    inverted.


    However, Ceylon can actually be implemented on multiple backends (Java
    and JavaScript listed). It's fully possible that an
    "application-oriented" backend might use Pike-strings internally,
    while a "browser-oriented" backend could still use the underlying
    string representation. The questions are entirely of performance,
    since it's been guaranteed already to have the same semantics.


    I would really like to see JavaScript replaced in web browsers, since
    the ECMAScript folks have stated explicitly (in response to a question
    from me) that UTF-16 representation *must* stay, for backward compat.
    JS is a reasonable language - it's not terrible - but it has a number
    of glaring flaws. Ceylon could potentially be implemented in browsers,
    using Pike-strings internally, and then someone could write a
    JavaScript engine that compiles to Ceylon (complete with
    bug-compatibility stupid-code that encodes all strings UTF-16 before
    indexing into them). It would be an overall improvement, methinks.


    ChrisA
  • Chris Angelico at Nov 19, 2013 at 2:56 am

    On Tue, Nov 19, 2013 at 1:13 PM, Steven D'Aprano wrote:
    Still, given that Pike and Python have already solved this problem, and
    have O(1) string indexing operations and length for any Unicode string,
    SMP and BMP, it is a major disappointment that Ceylon doesn't.

    And of course, the part that's "solved" here is not the O(1) string
    indexing, but getting UTF-32 semantics with less memory usage
    than UTF-16. It's easy to get perfect indexing semantics if you don't
    mind how much space your strings take up.
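
A rough way to see the space saving on CPython 3.3 or later, which stores each string at 1, 2 or 4 bytes per character depending on its widest character (PEP 393). The helper name `bytes_per_char` is made up here, and `sys.getsizeof` figures are an implementation detail of CPython, so treat this as a sketch and not a portable measurement.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of `ch` by n characters."""
    # Subtracting the one-character size cancels the fixed object header.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


for label, ch in [("ASCII", "a"), ("BMP", "\u20ac"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

On such an interpreter the three lines should report 1, 2 and 4 bytes respectively, while indexing stays O(1) in every case.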


    ChrisA
  • Steven D'Aprano at Nov 19, 2013 at 2:29 am

    On Mon, 18 Nov 2013 15:37:12 -0500, Dave Angel wrote:


    If you use nearly all of the possible 2 byte values then adding 2 more
    bytes won't give you anywhere near 4 billion characters. You're
    perhaps thinking of bringing in four more bytes when the length exceeds
    32k.

    Yep, I screwed up. Thanks for the correction.




    --
    Steven
  • Wxjmfauth at Nov 19, 2013 at 9:10 am

    On Monday, 18 November 2013 14:31:33 UTC+1, Steven D'Aprano wrote:

    ... choose one of the three bad choices: ...



    * choose UTF-16 or UTF-8, and have O(n) primitive string operations (like

    Haskell and, apparently, Ceylon);



    * or UTF-16 without support for the supplementary planes (which makes it

    virtually UCS-2), like Javascript;



    * choose UTF-32, and use two or four times as much memory as needed.
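
The trade-off in the quoted list can be checked directly in Python. This is a small sketch with made-up helper names (`encoded_sizes`, `utf16_length`); it only measures sizes and says nothing about how any particular language stores its strings.

```python
def encoded_sizes(text):
    """Return the code point count and the size in bytes under each encoding form."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),   # -le: no byte-order mark
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


def utf16_length(text):
    """Number of UTF-16 code units, which is what a UTF-16 based length reports."""
    return len(text.encode("utf-16-le")) // 2


sample = "a\u20ac\U0001F600"   # one ASCII, one BMP, one supplementary-plane character
print(encoded_sizes(sample), utf16_length(sample))
```

Only UTF-32 has a fixed width per code point; in UTF-8 and UTF-16 the position of the n-th character depends on everything before it, which is where the O(n) indexing comes from.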


    Nothing can beat the coding schemes endorsed by Unicode.


    They are all working on the smallest possible entity
    level (Unicode Transformation *Units*) with a unique
    set of these entities.


    To not forget. A set of characters is an artificial
    construction and by nature it can not follow the
    logic of a more "natural" set, eg. integers.


    jmf
  • Bob Martin at Nov 20, 2013 at 8:19 am

    in 710625 20131119 091055 wxjmfauth at gmail.com wrote:
    On Monday, 18 November 2013 14:31:33 UTC+1, Steven D'Aprano wrote:

    ... choose one of the three bad choices: ...



    * choose UTF-16 or UTF-8, and have O(n) primitive string operations (like

    Haskell and, apparently, Ceylon);



    * or UTF-16 without support for the supplementary planes (which makes it

    virtually UCS-2), like Javascript;



    * choose UTF-32, and use two or four times as much memory as needed.

    Nothing can beat the coding schemes endorsed by Unicode.

    They are all working on the smallest possible entity
    level (Unicode Transformation *Units*) with a unique
    set of these entities.

    To not forget.

    Is that an egg-corn?
  • Rick Johnson at Nov 19, 2013 at 3:33 am
    I've never *really* been crazy about the plus operator
    concatenating strings anyhow, however, the semantics of "+"
    seem to navigate the "perilous waters of intuition" far
    better than "*".


         Addition of numeric types is well defined in maths:
         Take N input values and *reduce* them into a single
         value that represents the mathematical summation of
         all inputs.


         HOWEVER,


         Addition of strings (concatenation) requires
         interpreting the statement as a more simplistic
         "joining" process of : take N inputs and join them
         together in a *linear fashion* until they become a
         single value.


    As you might already know the latter is a far less elegant
    process, although the transformation remains "sane". Even
    though in the first case: with "numeric addition", the
    individual inputs are *sacrificed* to the god of maths; and
    in the second case: for "string addition", the individual
    inputs are *retained*; BOTH implementations of the plus
    operator expose a CONSISTENT interface -- and like any good
    interface the dirty details are hidden from the caller!


         INTERFACES ARE THE KEY TO CODE INTEGRITY and LONGEVITY!


    HOWEVER, HOWEVER. O:-)


    There is an inconsistency when applying the "*" operator
    between numerics and strings. In the case of numerics the
    rules are widely understood and quite logical, HOWEVER, in
    the case of "string products", not only are rules missing,
    any attempt to create a rule is illogical, AND, we've broken
    the consistency of the "*" interface!


         py> "a" * "4"
         'aaaa'


    Okay, that makes sense, but what about:


         py> "a" * "aaaa"


    That will haunt your nightmares!


    But even the previous example, whilst quite logical, is
    violating the "contract of transformations" and can ONLY
    result in subtle bugs, language designer woes, and endless
    threads on Python-ideas that result in no elegant solution.


         THERE EXISTS NO PATH TO ELEGANCE VIA GLUTTONY


    Operator overloading must be restricted. Same goes for
    syntactic sugars. You can only do SO much with a sugar
    before it mutates into a salt.


         TOO MUCH OF A GOOD THING... well, ya know!
  • Steven D'Aprano at Nov 19, 2013 at 7:00 am

    On Mon, 18 Nov 2013 19:33:01 -0800, Rick Johnson wrote:


    I've never *really* been crazy about the plus operator concatenating
    strings anyhow, however, the semantics of "+" seem to navigate the
    "perilous waters of intuition" far better than "*".

    Addition of numeric types is well defined in maths: Take N input
    values and *reduce* them into a single value that represents the
    mathematical summation of all inputs.

    Which sum would that be?


    Addition of vectors, matrices, quaternions, tensors, something else?


    Do you perhaps mean the Whitney Sum?
    http://mathworld.wolfram.com/WhitneySum.html


    Ah, no, you're talking about addition of Real numbered values, where
    nothing can *possibly* go wrong:


         py> 0.1 + 0.1 + 0.1 == 0.3
         False


    Hmmm. Oh well, at least we know that adding 1 to a number is guaranteed
    to make it bigger:


         py> 1e16 + 1 > 1e16
         False


    Surely though, the order you do the addition doesn't matter:


         py> 1.5 + (1.3 + 1.9) == (1.5 + 1.3) + 1.9
         False




    Dammit maths, why do you hate us so???




    So, explain to me again, what is the *precise* connection between the
    mathematical definition of addition, as we learn about in school, and
    what computers do?
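
For contrast with the float examples above, here is a short sketch using the standard library's exact rationals, where the school-maths rules do hold. The variable names are only for illustration.

```python
from fractions import Fraction
import math

# Binary floats only approximate 0.1, so the textbook identity fails...
float_ok = (0.1 + 0.1 + 0.1 == 0.3)

# ...while exact rationals behave the way school arithmetic says they should.
tenth = Fraction(1, 10)
exact_ok = (tenth + tenth + tenth == Fraction(3, 10))

# Associativity also holds for exact rationals.
a, b, c = Fraction("1.5"), Fraction("1.3"), Fraction("1.9")
assoc_ok = (a + (b + c) == (a + b) + c)

# math.fsum keeps exact partial sums, so it avoids accumulated rounding error.
fsum_ok = (math.fsum([0.1] * 10) == 1.0)

print(float_ok, exact_ok, assoc_ok, fsum_ok)
```

The connection between "+" and mathematical addition depends on the type: it is exact for `Fraction` and `int`, and only approximate for `float`.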



    HOWEVER,

    Addition of strings (concatenation) requires interpreting the
    statement as a more simplistic "joining" process of : take N inputs
    and join them together in a *linear fashion* until they become a
    single value.



    Ah, you mean like addition in base-1, otherwise known as the unary number
    system, also known as a tally.


    So if you want to add (decimal) 3 and 5 using base-1, we would write:

    ||| + |||||

    and concatenating the tallies together gives:

    ||||||||
    which if I'm not mistaken makes 8 in decimal.
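
The tally argument is easy to check in Python. The helper names `to_tally` and `from_tally` are invented for this sketch.

```python
def to_tally(n):
    """Write a non-negative integer in base 1, one stroke per unit."""
    return "|" * n


def from_tally(tally):
    """Read a tally back as an integer: just count the strokes."""
    return len(tally)


total = to_tally(3) + to_tally(5)   # string concatenation doubles as addition
print(total, from_tally(total))
```

In unary, string concatenation and numeric addition are the same operation, which is the point being made.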



    There is an inconsistency when applying the "*" operator between
    numerics and strings. In the case of numerics the rules are widely
    understood and quite logical, HOWEVER, in the case of "string products",
    not only are rules missing, any attempt to create a rule is illogical,
    AND, we've broken the consistency of the "*" interface!

    A foolish consistency is the hobgoblin of little minds.


    Just because you can't define a sensible meaning for str * str doesn't
    mean you can't define a sensible meaning for str * int.



    py> "a" * "4"
    'aaaa'

    Okay, that makes sense, but what about:

    py> "a" * "aaaa"

    That will haunt your nightmares!

    You're easily terrified if you have nightmares about that. I can't
    imagine what you would do if faced with the M-combinator applied to
    itself.





    But even the previous example, whilst quite logical, is violating the
    "contract of transformations"

    What contract of transformations?






    --
    Steven
  • Chris Angelico at Nov 19, 2013 at 7:18 am

    On Tue, Nov 19, 2013 at 6:00 PM, Steven D'Aprano wrote:
    py> "a" * "4"
    'aaaa'

    Okay, that makes sense, but what about:

    py> "a" * "aaaa"

    That will haunt your nightmares!
    You're easily terrified if you have nightmares about that. I can't
    imagine what you would do if faced with the M-combinator applied to
    itself.

    Not to mention that he has to construct his own nightmares. This not
    being PHP, it's unlikely to work quite the way he thinks it does:

    "a" * "4"
    Traceback (most recent call last):
       File "<pyshell#51>", line 1, in <module>
         "a" * "4"
    TypeError: can't multiply sequence by non-int of type 'str'


    Unless he has some strange Python interpreter that coalesces
    integer-like strings to integers, of course, in which case I
    completely understand why he's having nightmares.
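
A short sketch of what stock Python actually allows here, for anyone following along: `str * int` is sequence repetition, and `str * str` is rejected. The variable names are only for illustration.

```python
repeated = "a" * 4              # sequence repetition: str * int is defined

error = None
try:
    "a" * "4"                   # str * str is not
except TypeError as exc:
    error = exc                 # keep a reference; `exc` is cleared after the block

converted = "a" * int("4")      # an explicit conversion makes the intent clear

print(repeated, type(error).__name__, converted)
```

Python never coerces the string "4" to the integer 4 on its own, so the ambiguous case simply does not arise.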


    ChrisA
  • Gregory Ewing at Nov 20, 2013 at 5:25 am

    Steven D'Aprano wrote:
    Which sum would that be?

    Addition of vectors, matrices, quaternions, tensors, something else?

    Considering vectors, multiplying a vector by a scalar
    can be thought of as putting n copies of the vector
    together nose-to-tail.


    That's not very much different from putting n copies
    of a string one after another.
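
The analogy can be written down as a small Python sketch: for both vectors and strings, multiplying by n gives the same result as combining n copies with "+". The `Vec` class and `repeat_by_adding` helper are hypothetical, written only for this illustration.

```python
from functools import reduce
import operator


class Vec:
    """A tiny 2-D vector with component-wise addition and scalar multiplication."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        return Vec(self.x + other.x, self.y + other.y)

    def __mul__(self, n):
        return Vec(self.x * n, self.y * n)

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)


def repeat_by_adding(item, n):
    """Combine n copies of item with +, for n >= 1."""
    return reduce(operator.add, [item] * n)


print(repeat_by_adding(Vec(1, 2), 3) == Vec(1, 2) * 3)
print(repeat_by_adding("ab", 3) == "ab" * 3)
```

In both cases "*" with an integer is defined as repeated "+", whatever "+" happens to mean for the type.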


    --
    Greg
  • Steven D'Aprano at Nov 18, 2013 at 2:56 pm

    On Wed, 13 Nov 2013 14:33:27 -0500, Neal Becker wrote:


    http://ceylon-lang.org/documentation/1.0/introduction/



    I must say there are a few questionable design choices, in my opinion,
    but I am absolutely in love with the following two features:




    1) variables are constant by default;


    2) the fat arrow operator.




    By default, "variables" can only be assigned to once, and then not re-
    bound:


    String bye = "Adios"; //a value
    bye = "Adeu"; //compile error


    variable Integer count = 0; //a variable
    count = 1; //allowed




    (I'm not sure how tedious typing "variable" will get, or whether it will
    encourage a more functional-programming approach. But I think that's a
    very exciting idea and kudos to the Ceylon developers for running with
    it!)




    Values can be recalculated every time they are used, sort of like mini-
    functions, or thunks:


    String name { return firstName + " " + lastName; }


    Since this is so common in Ceylon, they have syntactic sugar for it, the
    fat arrow:


    String name => firstName + " " + lastName;
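
The closest thing in today's Python is probably a property, which is likewise recomputed on every access. A minimal sketch follows; the `Person` class is made up here, and this is an analogy, not a claim about how Ceylon implements getters.

```python
class Person:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    @property
    def name(self):
        # Recomputed on every access, like the Ceylon getter above.
        return self.first_name + " " + self.last_name


p = Person("Ada", "Lovelace")
print(p.name)
p.last_name = "King"
print(p.name)   # reflects the change, because nothing was cached
```

The difference is that Python needs a class to hang the property on, whereas the Ceylon form works for any value.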




    If Python steals this notation, we could finally bring an end to the
    arguments about early binding and late binding of default arguments:




    def my_function(a=[early, binding, happens, once],
                     b=>[late, binding, happens, every, time]
                     ):
         ...




    Want!


    These two features alone may force me to give Ceylon a try.
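
For reference, this is how the early-binding behaviour of default arguments and the usual late-binding workaround look in current Python. The function names `early` and `late` are only for illustration; the `=>` spelling above remains hypothetical syntax.

```python
def early(item, bucket=[]):
    # The default list is created once, when the def statement runs,
    # so every call that omits `bucket` shares the same list.
    bucket.append(item)
    return bucket


def late(item, bucket=None):
    # The usual workaround: a sentinel default, with the real default
    # built inside the body on every call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


print(early(1), early(2))   # both names refer to the same shared list
print(late(1), late(2))     # two independent lists
```

A late-binding default operator would let the second form be written in the signature itself, without the sentinel.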








    --
    Steven

Discussion Overview
group: python-list @
categories: python
posted: Nov 13, '13 at 7:33p
active: Nov 20, '13 at 8:19a
posts: 38
users: 15
website: python.org
