Hi folks,

As of this moment, please get my explicit permission before making ANY
commit to blead. We've polished off all known RC-blocker bugs and I
hope to release 5.12.0 RC1 tomorrow, April 20th.

If all goes reasonably well, I intend to release Perl 5.12.0 on
Thursday, April 28, 2011.

In a few moments, I'll push up Module::CoreList, perlhist and perldelta
changes which are nearly final. Spotting errors today will earn you a
shiny virtual gold star. Spotting errors tomorrow will make me cry just
a little bit.

Best,
Jesse


  • Jesse Vincent at Apr 19, 2011 at 2:38 pm

    On Wed, Apr 20, 2011 at 12:33:20AM +1000, Jesse Vincent wrote:
    > Hi folks,
    >
    > As of this moment, please get my explicit permission before making ANY
    > commit to blead. We've polished off all known RC-blocker bugs and I
    > hope to release 5.12.0 RC1 tomorrow, April 20th.

    Gah. 5.14.0 RC1.

    > If all goes reasonably well, I intend to release Perl 5.12.0 on
    > Thursday, April 28, 2011.

    Gah. 5.14.0.

    They tell me that Larry was the last Pumpking to release two "major"
    versions of Perl 5. I think there's a small part of me that doesn't want
    to break that record.

    -J
  • Nicholas Clark at Apr 19, 2011 at 2:45 pm

    On Tue, Apr 19, 2011 at 10:38:19AM -0400, Jesse Vincent wrote:

    > They tell me that Larry was the last Pumpking to release two "major"
    > versions of Perl 5. I think there's a small part of me that doesn't want
    > to break that record.

    This is the mysterious "they" department that many people believe exists
    (in all sorts of volunteer organisations) and does all the work behind the
    scenes? The answer to "but who do you think is going to do this?"

    Because "they" got it wrong, as it was Sarathy. :-)

    Nicholas Clark
  • Nicholas Clark at Apr 19, 2011 at 2:41 pm

    On Wed, Apr 20, 2011 at 12:33:20AM +1000, Jesse Vincent wrote:
    > Hi folks,
    >
    > As of this moment, please get my explicit permission before making ANY
    > commit to blead. We've polished off all known RC-blocker bugs and I
    > hope to release 5.12.0 RC1 tomorrow, April 20th.
    >
    > If all goes reasonably well, I intend to release Perl 5.12.0 on
    > Thursday, April 28, 2011.

    I don't know how Easter pans out in other countries, but it happens that
    between the 20th and the 28th in the UK both Good Friday and Easter Monday
    are holidays. Hence if you were hoping for testing by people at work, that's
    a 40% drop for any country with 2 days of holiday, and 20% for any with one.

    *Additionally*, in the UK, there is (I believe) the first ever pair of 4
    day weekends, as the following Friday and Monday are holidays. The result
    is that a lot of people are taking the intervening 3 work days off, getting
    11 days holiday for the price of 3. So there's probably >40% drop in the UK.

    [And as a digression I'm guessing that the pair of 4 day weekends is likely
    to trigger various edge case and "can't happen" bugs in software - I know
    that my ex-employer had to fix one piece of code to cope, as it made
    previously reasonable calculations about the "worst case" number of working
    days in a calendar interval]

    Nicholas Clark
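
The "11 days holiday for the price of 3" arithmetic above can be checked with a short sketch. The holiday dates are not spelled out in the thread; the set below is an assumption based on the 2011 UK calendar (Good Friday 22 April, Easter Monday 25 April, an extra holiday on Friday 29 April, and the early May holiday on Monday 2 May), and `working_days` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Assumed 2011 UK public holidays around Easter (not listed in the thread).
UK_HOLIDAYS_2011 = {
    date(2011, 4, 22),  # Good Friday
    date(2011, 4, 25),  # Easter Monday
    date(2011, 4, 29),  # extra Friday holiday
    date(2011, 5, 2),   # early May holiday
}


def working_days(start, end, holidays):
    """Return the weekdays in [start, end] that are not public holidays."""
    days = []
    current = start
    while current <= end:
        # weekday() is 0 for Monday through 6 for Sunday.
        if current.weekday() < 5 and current not in holidays:
            days.append(current)
        current += timedelta(days=1)
    return days


start, end = date(2011, 4, 22), date(2011, 5, 2)
total_days = (end - start).days + 1
leave_needed = working_days(start, end, UK_HOLIDAYS_2011)
print(total_days, "calendar days off for", len(leave_needed), "days of leave")
```

Code that hard-codes a "worst case" number of working days per interval is the kind of assumption a calendar like this can break.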
  • Jesse Vincent at Apr 19, 2011 at 2:53 pm

    On Tue, Apr 19, 2011 at 03:41:43PM +0100, Nicholas Clark wrote:
    > On Wed, Apr 20, 2011 at 12:33:20AM +1000, Jesse Vincent wrote:
    > > Hi folks,
    > >
    > > As of this moment, please get my explicit permission before making ANY
    > > commit to blead. We've polished off all known RC-blocker bugs and I
    > > hope to release 5.12.0 RC1 tomorrow, April 20th.
    > >
    > > If all goes reasonably well, I intend to release Perl 5.12.0 on
    > > Thursday, April 28, 2011.
    >
    > I don't know how Easter pans out in other countries, but it happens that
    > between the 20th and the 28th in the UK both Good Friday and Easter Monday
    > are holidays. Hence if you were hoping for testing by people at work, that's
    > a 40% drop for any country with 2 days of holiday, and 20% for any with one.

    We'll play it slightly by ear. I'm quite interested in what the
    cpan-regression smoker thinks of 5.14.0 RC1.

    We've seen a LOT more involvement from the community during 5.13,
    which makes me quite happy and gives me some possibly-misplaced
    confidence in this release.

    While I don't think I've stated it explicitly (sorry), it's my intent
    to release 5.14.1 a month after 5.14.0 and 5.14.2 3 months after that,
    as we did for 5.12.

    -Jesse
  • Tom Christiansen at Apr 19, 2011 at 2:52 pm

    > In a few moments, I'll push up Module::CoreList, perlhist and perldelta
    > changes which are nearly final. Spotting errors today will earn you a
    > shiny virtual gold star. Spotting errors tomorrow will make me cry just
    > a little bit.

    Whose today, and whose tomorrow? :)

    There is an encoding error in perldelta:

    This line has had its individual UTF-8 bytes each re-encoded as UTF-8:

    D. Hedden, Jesse Vincent, Jim Cromie, Jirka Hruška, John Peacock,

    I'm pretty certain that that is supposed to be:

    D. Hedden, Jesse Vincent, Jim Cromie, Jirka Hruška, John Peacock,

    --tom
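
The kind of damage described above, UTF-8 bytes misread as single-byte characters and then encoded as UTF-8 again, can be reproduced and undone in a few lines. This is a minimal sketch: it assumes the intermediate misreading was Latin-1, which the thread does not state, and `double_encode` and `repair` are hypothetical helper names.

```python
def double_encode(text: str) -> str:
    """Simulate the bug: read UTF-8 bytes as if each byte were a Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Reverse the bug, assuming Latin-1 was the wrongly applied encoding."""
    return text.encode("latin-1").decode("utf-8")


good = "Jirka Hru\u0161ka"      # U+0161 is LATIN SMALL LETTER S WITH CARON
broken = double_encode(good)    # the two UTF-8 bytes C5 A1 become two characters
print(broken)
print(repair(broken))
```

`repair` raises an exception on text that was never double-encoded and contains characters outside Latin-1, so it is only safe to apply to lines already known to be damaged.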
  • Jesse Vincent at Apr 19, 2011 at 2:56 pm

    On Tue 19.Apr'11 at 8:52:42 -0600, Tom Christiansen wrote:
    > > In a few moments, I'll push up Module::CoreList, perlhist and perldelta
    > > changes which are nearly final. Spotting errors today will earn you a
    > > shiny virtual gold star. Spotting errors tomorrow will make me cry just
    > > a little bit.
    >
    > Whose today, and whose tomorrow? :)

    Spotting errors in the next 12 hours gets you a gold star.

    Spotting errors later gets you the satisfaction of seeing me cry.

    Better?

    > There is an encoding error in perldelta:

    Thanks. Fixed.

    > This line has had its individual UTF-8 bytes each re-encoded as UTF-8:
    >
    > D. Hedden, Jesse Vincent, Jim Cromie, Jirka Hruška, John Peacock,

    I think you've just solved the encoding issues I have with mutt in
    screen, as the line above rendered _correctly_ there.

    > I'm pretty certain that that is supposed to be:
    >
    > D. Hedden, Jesse Vincent, Jim Cromie, Jirka Hruška, John Peacock,
    >
    > --tom
  • Tom Christiansen at Apr 19, 2011 at 3:35 pm

    Jesse Vincent wrote on Wed, 20 Apr 2011 00:56:31 +1000:

    > > Whose today, and whose tomorrow? :)
    >
    > Spotting errors in the next 12 hours gets you a gold star.
    > Spotting errors later gets you the satisfaction of seeing me cry.
    >
    > I think you've just solved the encoding issues I have with mutt in
    > screen, as the line above rendered _correctly_ there.

    Oh good.

    Now, may I please interest you in sorting those names using a proper
    UCA sort instead of a naïve code-point sort? I kinda feel that we
    in Perl should be able to sort Unicode properly. :(

    Codepoint sort                UCA sort, first-last          UCA sort, last-first
    ----------------------------- ----------------------------- --------------------------
    A. Sinan Unur                 Aaron Crane                   Gisle Aas
    Aaron Crane                   Abhijit Menon-Sen             Abigail
    Abhijit Menon-Sen             Abigail                       Peter John Acklam
    Abigail                       Ævar Arnfjörð Bjarmason       Alexander Alekseev
    Alastair Douglas              Alastair Douglas              Zsbán Ambrus
    Alex Davies                   Alexander Alekseev            Arkturuz
    Alex Vandiver                 Alexander Hartmaier           Andy Armstrong
    Alexander Alekseev            Alexandr Ciornii              Arvan
    Alexander Hartmaier           Alex Davies                   Renee Baecker
    Alexandr Ciornii              Alex Vandiver                 Charles Bailey
    Ali Polatel                   Ali Polatel                   Robin Barker
    Allen Smith                   Allen Smith                   Sullivan Beck
    Andreas König                 Andreas König                 Larwan Berke
    Andrew Rodland                Andrew Rodland                Craig A. Berry
    Andy Armstrong                Andy Armstrong                Ævar Arnfjörð Bjarmason
    Andy Dougherty                Andy Dougherty                Philippe Bruhat (BooK)
    Aristotle Pagaltzis           Aristotle Pagaltzis           Bram
    Arkturuz                      Arkturuz                      H.Merijn Brand
    Arvan                         Arvan                         Michael Breen
    Ben Morrow                    A. Sinan Unur                 Eric Brine
    Bo Lindbergh                  Ben Morrow                    Leon Brocard
    Boris Ratner                  Bo Lindbergh                  Tim Bunce
    Brad Gilbert                  Boris Ratner                  David Caldwell
    Bram                          Brad Gilbert                  David Cantrell
    brian d foy                   Bram                          Nuno Carvalho
    Brian Phillips                brian d foy                   Tom Christiansen
    Casey West                    Brian Phillips                chromatic
    Charles Bailey                Casey West                    Father Chrysostomos
    Chas. Owens                   Charles Bailey                Alexandr Ciornii
    Chip Salzenberg               Chas. Owens                   Nicholas Clark
    Chris 'BinGOs' Williams       Chip Salzenberg               Nick Cleaton
    chromatic                     Chris 'BinGOs' Williams       Tony Cook
    Craig A. Berry                chromatic                     Aaron Crane
    Curtis Jewell                 Craig A. Berry                Jim Cromie
    Dagfinn Ilmari Mannsåker      Curtis Jewell                 Dan Dascalescu
    Dan Dascalescu                Dagfinn Ilmari Mannsåker      Alex Davies
    Dave Rolsky                   Dan Dascalescu                Andy Dougherty
    David Caldwell                Dave Rolsky                   Alastair Douglas
    David Cantrell                David Caldwell                Jan Dubois
    David Golden                  David Cantrell                Paul Evans
    David Leadbeater              David Golden                  Salvador Fandiño
    David Mitchell                David Leadbeater              Franz Fasching
    David Wheeler                 David Mitchell                Michael Fig
    Eric Brine                    David Wheeler                 Shlomi Fish
    Father Chrysostomos           Eric Brine                    brian d foy
    Fingle Nark                   Father Chrysostomos           Goro Fuji
    Florian Ragwitz               Fingle Nark                   Piotr Fusik
    Frank Wiegand                 Florian Ragwitz               Salvador Ortiz Garcia
    Franz Fasching                Frank Wiegand                 Rafael Garcia-Suarez
    Gene Sullivan                 Franz Fasching                Brad Gilbert
    George Greer                  Gene Sullivan                 David Golden
    Gerard Goossen                George Greer                  Ian Goodacre
    Gisle Aas                     Gerard Goossen                Gerard Goossen
    Goro Fuji                     Gisle Aas                     Paul Green
    Grant McLean                  Goro Fuji                     George Greer
    gregor herrmann               Grant McLean                  Jay Hannah
    H.Merijn Brand                gregor herrmann               Alexander Hartmaier
    Hongwen Qiu                   H.Merijn Brand                Steve Hay
    Hugo van der Sanden           Hongwen Qiu                   Jerry D. Hedden
    Ian Goodacre                  Hugo van der Sanden           Maik Hentsche
    James E Keenan                Ian Goodacre                  gregor herrmann
    James Mastros                 James E Keenan                Rob Hoelz
    Jan Dubois                    James Mastros                 Peter J. Holzer
    Jay Hannah                    Jan Dubois                    Jirka Hruška
    Jerry D. Hedden               Jay Hannah                    Tom Hukins
    Jesse Vincent                 Jerry D. Hedden               Wolfram Humann
    Jim Cromie                    Jesse Vincent                 Marvin Humphrey
    Jirka Hruška                  Jim Cromie                    Curtis Jewell
    John Peacock                  Jirka Hruška                  Matt Johnson
    Joshua ben Jore               John Peacock                  Paul Johnson
    Joshua Pritikin               Joshua ben Jore               Nick Johnston
    Karl Williamson               Joshua Pritikin               Joshua ben Jore
    Kevin Ryde                    Karl Williamson               Nicolas Kaiser
    kmx                           Kevin Ryde                    James E Keenan
    Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯           kmx                           Mike Kelly
    Larwan Berke                  Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯           kmx
    Leon Brocard                  Larwan Berke                  Andreas König
    Leon Timmermans               Leon Brocard                  Vadim Konovalov
    Lubomir Rintel                Leon Timmermans               David Leadbeater
    Lukas Mai                     Lubomir Rintel                Moritz Lenz
    Maik Hentsche                 Lukas Mai                     Bo Lindbergh
    Marty Pauley                  Maik Hentsche                 Vernon Lyon
    Marvin Humphrey               Marty Pauley                  Lukas Mai
    Matt Johnson                  Marvin Humphrey               Max Maischein
    Matt S Trout                  Matt Johnson                  Walt Mankowski
    Max Maischein                 Matt S Trout                  Dagfinn Ilmari Mannsåker
    Michael Breen                 Max Maischein                 Paul Marquess
    Michael Fig                   Michael Breen                 Peter Martini
    Michael G Schwern             Michael Fig                   James Mastros
    Michael Parker                Michael G Schwern             Grant McLean
    Michael Stevens               Michael Parker                Tye McQueen
    Michael Witten                Michael Stevens               Abhijit Menon-Sen
    Mike Kelly                    Michael Witten                David Mitchell
    Moritz Lenz                   Mike Kelly                    Tatsuhiko Miyagawa
    Nicholas Clark                Moritz Lenz                   Richard Möhn
    Nick Cleaton                  Nicholas Clark                Ben Morrow
    Nick Johnston                 Nick Cleaton                  Steffen Müller
    Nicolas Kaiser                Nick Johnston                 Fingle Nark
    Niko Tyni                     Nicolas Kaiser                Yves Orton
    Noirin Shirley                Niko Tyni                     Chas. Owens
    Nuno Carvalho                 Noirin Shirley                Aristotle Pagaltzis
    Paul Evans                    Nuno Carvalho                 Michael Parker
    Paul Green                    Paul Evans                    Marty Pauley
    Paul Johnson                  Paul Green                    John Peacock
    Paul Marquess                 Paul Johnson                  Steve Peters
    Peter J. Holzer               Paul Marquess                 Brian Phillips
    Peter John Acklam             Peter J. Holzer               Vincent Pit
    Peter Martini                 Peter John Acklam             Ali Polatel
    Philippe Bruhat (BooK)        Peter Martini                 Joshua Pritikin
    Piotr Fusik                   Philippe Bruhat (BooK)        Hongwen Qiu
    Rafael Garcia-Suarez          Piotr Fusik                   Florian Ragwitz
    Rainer Tammer                 Rafael Garcia-Suarez          Boris Ratner
    Reini Urban                   Rainer Tammer                 Slaven Rezic
    Renee Baecker                 Reini Urban                   Todd Rinaldo
    Ricardo Signes                Renee Baecker                 Lubomir Rintel
    Richard Möhn                  Ricardo Signes                Andrew Rodland
    Richard Soderberg             Richard Möhn                  Dave Rolsky
    Rob Hoelz                     Richard Soderberg             Kevin Ryde
    Robin Barker                  Rob Hoelz                     Chip Salzenberg
    Ruslan Zakirov                Robin Barker                  Hugo van der Sanden
    Salvador Fandiño              Ruslan Zakirov                Steven Schubiger
    Salvador Ortiz Garcia         Salvador Fandiño              Michael G Schwern
    Shlomi Fish                   Salvador Ortiz Garcia         Noirin Shirley
    Sinan Unur                    Shlomi Fish                   Ricardo Signes
    Sisyphus                      Sinan Unur                    Sisyphus
    Slaven Rezic                  Sisyphus                      Allen Smith
    Steffen Müller                Slaven Rezic                  Richard Soderberg
    Steve Hay                     Steffen Müller                Michael Stevens
    Steve Peters                  Steve Hay                     Gene Sullivan
    Steven Schubiger              Steven Schubiger              Rainer Tammer
    Sullivan Beck                 Steve Peters                  Leon Timmermans
    Tatsuhiko Miyagawa            Sullivan Beck                 Matt S Trout
    Tim Bunce                     Tatsuhiko Miyagawa            Niko Tyni
    Todd Rinaldo                  Tim Bunce                     Sinan Unur
    Tom Christiansen              Todd Rinaldo                  A. Sinan Unur
    Tom Hukins                    Tom Christiansen              Reini Urban
    Tony Cook                     Tom Hukins                    Alex Vandiver
    Tye McQueen                   Tony Cook                     Jesse Vincent
    Vadim Konovalov               Tye McQueen                   Casey West
    Vernon Lyon                   Vadim Konovalov               David Wheeler
    Vincent Pit                   Vernon Lyon                   Frank Wiegand
    Walt Mankowski                Vincent Pit                   Chris 'BinGOs' Williams
    Wolfram Humann                Walt Mankowski                Karl Williamson
    Yves Orton                    Wolfram Humann                Michael Witten
    Zefram                        Yves Orton                    Ruslan Zakirov
    Zsbán Ambrus                  Zefram                        Zefram
    Ævar Arnfjörð Bjarmason       Zsbán Ambrus                  Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯

    Now arguably I don't know that some of the Asian names aren't big-endian. And
    I imagine that Salvador Ortiz Garcia is a double-barrelled Iberian
    matronymic + patronymic combo so should *not* sort right next to Rafael
    Garcia-Suarez (Ortiz should probably be the start of the sort element).

    But even still, don't both those second two columns look a whole lot
    better than the first one does?

    thanks,

    --tom
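
The difference between the first two columns above can be illustrated without a collation library. The sketch below is not the UCA and not what Unicode::Collate does internally; it is a crude stand-in that ignores case, accents, spaces and punctuation at the first level, with a small hand-written expansion table (`_EXPANSIONS`, an assumption of this sketch) for letters that have no canonical decomposition. `rough_key` is a hypothetical name.

```python
import unicodedata

# Hand-picked expansions for letters with no canonical decomposition.
# This table is an assumption of the sketch, not data taken from the UCA.
_EXPANSIONS = {"æ": "ae", "ð": "d", "ø": "o", "þ": "th"}


def rough_key(name):
    """Approximate a first-level collation key: letters and digits only."""
    folded = unicodedata.normalize("NFC", name).casefold()
    expanded = "".join(_EXPANSIONS.get(ch, ch) for ch in folded)
    decomposed = unicodedata.normalize("NFD", expanded)
    # isalnum() is False for combining marks, spaces and punctuation,
    # so all of those are ignored at this level.
    primary = "".join(ch for ch in decomposed if ch.isalnum())
    # The original string breaks ties so the result is deterministic.
    return (primary, name)


names = [
    "Zefram", "Ævar Arnfjörð Bjarmason", "brian d foy", "Brian Phillips",
    "Abigail", "Alex Davies", "Alexander Alekseev",
]
print(sorted(names))                  # plain code-point order
print(sorted(names, key=rough_key))   # rough collation order
```

On this sample the rough key reproduces the relative order of the second column: the Æ name moves up among the A names, "Alexander Alekseev" precedes "Alex Davies" because the space is ignored, and "brian d foy" sorts next to "Brian Phillips". For real work a proper UCA implementation remains the right tool.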
  • Jesse Vincent at Apr 19, 2011 at 3:24 pm

    On Tue 19.Apr'11 at 9:21:17 -0600, Tom Christiansen wrote:
    > Jesse Vincent <jesse@fsck.com> wrote on Wed, 20 Apr 2011 00:56:31 +1000:
    > > > Whose today, and whose tomorrow? :)
    > >
    > > Spotting errors in the next 12 hours gets you a gold star.
    > > Spotting errors later gets you the satisfaction of seeing me cry.
    > >
    > > I think you've just solved the encoding issues I have with mutt in
    > > screen, as the line above rendered _correctly_ there.
    >
    > Oh good.
    >
    > Now, may I please interest you in sorting those names using a proper
    > UCA sort instead of a naïve code-point sort? I kinda feel that we
    > in Perl should be able to sort Unicode properly. :(

    I'll take a patch to do the UCA sort first-last, especially if it comes
    with a patch to Porting/release_manager_guide.pod to make sure that it's
    done that way in the future.
  • Tom Christiansen at Apr 19, 2011 at 3:41 pm
    Jesse Vincent wrote
    on Wed, 20 Apr 2011 01:24:46 +1000:

    > I'll take a patch to do the UCA sort first-last, especially if it comes
    > with a patch to Porting/release_manager_guide.pod to make sure that it's
    > done that way in the future.

    Here's your perldelta patch. It sorts using the standard UCA sort from
    Unicode::Collate, *and* it uses Unicode::LineBreak to wrap with print
    columns set to 72. Notice how it looks "right" on the screen now, even with
    diacriticals and East_Asian_Width=Wide characters being counted correctly.

    I'll send the other under separate cover. I have these tchrist-standard
    scripts ucsort and unifmt that do all that work for me, but I'm still
    working on the one-liner for your release_manager_guide entry.

    thanks,

    --tom
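
Wrapping to 72 print columns, as described above, means counting display width rather than characters. The sketch below illustrates the idea with the standard library only; it is not Unicode::LineBreak and does not implement the Unicode line breaking algorithm. `display_width` and `wrap_names` are hypothetical helpers, and unlike the patch that follows, this version never breaks inside a name.

```python
import unicodedata


def display_width(text):
    """Count print columns: wide characters take two, combining marks none."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # a combining mark sits on top of the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def wrap_names(names, columns=72):
    """Greedily pack 'Name,' items into lines of at most `columns` columns."""
    lines, current = [], ""
    for index, name in enumerate(names):
        piece = name + ("," if index < len(names) - 1 else ".")
        candidate = piece if not current else current + " " + piece
        if current and display_width(candidate) > columns:
            lines.append(current)
            current = piece
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


sample = ["Aaron Crane", "Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯", "Zefram"]
for line in wrap_names(sample, columns=30):
    print(line)
```

With `len()` instead of `display_width`, a line containing the three wide characters would be judged three columns shorter than it appears on screen.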

    =====snip====

    --- pod/perldelta.pod 2011-04-19 08:58:13.000000000 -0600
    +++ /tmp/perldelta.pod 2011-04-19 09:36:14.000000000 -0600
    @@ -4472,38 +4472,38 @@
    community of users and developers. The following people are known to
    have contributed the improvements that became Perl 5.14.0:

    -A. Sinan Unur, Aaron Crane, Abhijit Menon-Sen, Abigail, Alastair Douglas,
    -Alex Davies, Alex Vandiver, Alexander Alekseev, Alexander Hartmaier,
    -Alexandr Ciornii, Ali Polatel, Allen Smith, Andreas König, Andrew
    -Rodland, Andy Armstrong, Andy Dougherty, Aristotle Pagaltzis, Arkturuz,
    -Arvan, Ben Morrow, Bo Lindbergh, Boris Ratner, Brad Gilbert, Bram,
    -brian d foy, Brian Phillips, Casey West, Charles Bailey, Chas. Owens,
    -Chip Salzenberg, Chris 'BinGOs' Williams, chromatic, Craig A. Berry,
    -Curtis Jewell, Dagfinn Ilmari Mannsåker, Dan Dascalescu, Dave Rolsky,
    -David Caldwell, David Cantrell, David Golden, David Leadbeater, David
    -Mitchell, David Wheeler, Eric Brine, Father Chrysostomos, Fingle
    -Nark, Florian Ragwitz, Frank Wiegand, Franz Fasching, Gene Sullivan,
    -George Greer, Gerard Goossen, Gisle Aas, Goro Fuji, Grant McLean,
    -gregor herrmann, H.Merijn Brand, Hongwen Qiu, Hugo van der Sanden, Ian
    -Goodacre, James E Keenan, James Mastros, Jan Dubois, Jay Hannah, Jerry
    -D. Hedden, Jesse Vincent, Jim Cromie, Jirka Hruška, John Peacock,
    -Joshua ben Jore, Joshua Pritikin, Karl Williamson, Kevin Ryde, kmx,
    -Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯, Larwan Berke, Leon Brocard, Leon
    -Timmermans, Lubomir Rintel, Lukas Mai, Maik Hentsche, Marty Pauley,
    -Marvin Humphrey, Matt Johnson, Matt S Trout, Max Maischein, Michael
    -Breen, Michael Fig, Michael G Schwern, Michael Parker, Michael Stevens,
    -Michael Witten, Mike Kelly, Moritz Lenz, Nicholas Clark, Nick Cleaton,
    -Nick Johnston, Nicolas Kaiser, Niko Tyni, Noirin Shirley, Nuno Carvalho,
    -Paul Evans, Paul Green, Paul Johnson, Paul Marquess, Peter J. Holzer,
    -Peter John Acklam, Peter Martini, Philippe Bruhat (BooK), Piotr Fusik,
    -Rafael Garcia-Suarez, Rainer Tammer, Reini Urban, Renee Baecker, Ricardo
    -Signes, Richard Möhn, Richard Soderberg, Rob Hoelz, Robin Barker,
    -Ruslan Zakirov, Salvador Fandiño, Salvador Ortiz Garcia, Shlomi Fish,
    -Sinan Unur, Sisyphus, Slaven Rezic, Steffen Müller, Steve Hay, Steve
    -Peters, Steven Schubiger, Sullivan Beck, Tatsuhiko Miyagawa, Tim Bunce,
    -Todd Rinaldo, Tom Christiansen, Tom Hukins, Tony Cook, Tye McQueen, Vadim
    -Konovalov, Vernon Lyon, Vincent Pit, Walt Mankowski, Wolfram Humann,
    -Yves Orton, Zefram, Zsbán Ambrus and Ævar Arnfjörð Bjarmason.
    +Aaron Crane, Abhijit Menon-Sen, Abigail, Ævar Arnfjörð Bjarmason,
    +Alastair Douglas, Alexander Alekseev, Alexander Hartmaier, Alexandr
    +Ciornii, Alex Davies, Alex Vandiver, Ali Polatel, Allen Smith, Andreas
    +König, Andrew Rodland, Andy Armstrong, Andy Dougherty, Aristotle
    +Pagaltzis, Arkturuz, Arvan, A. Sinan Unur, Ben Morrow, Bo Lindbergh,
    +Boris Ratner, Brad Gilbert, Bram, brian d foy, Brian Phillips, Casey
    +West, Charles Bailey, Chas. Owens, Chip Salzenberg, Chris 'BinGOs'
    +Williams, chromatic, Craig A. Berry, Curtis Jewell, Dagfinn Ilmari
    +Mannsåker, Dan Dascalescu, Dave Rolsky, David Caldwell, David Cantrell,
    +David Golden, David Leadbeater, David Mitchell, David Wheeler, Eric
    +Brine, Father Chrysostomos, Fingle Nark, Florian Ragwitz, Frank Wiegand,
    +Franz Fasching, Gene Sullivan, George Greer, Gerard Goossen, Gisle Aas,
    +Goro Fuji, Grant McLean, gregor herrmann, H.Merijn Brand, Hongwen Qiu,
    +Hugo van der Sanden, Ian Goodacre, James E Keenan, James Mastros, Jan
    +Dubois, Jay Hannah, Jerry D. Hedden, Jesse Vincent, Jim Cromie, Jirka
    +Hruška, John Peacock, Joshua ben Jore, Joshua Pritikin, Karl Williamson,
    +Kevin Ryde, kmx, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯, Larwan Berke, Leon Brocard, Leon
    +Timmermans, Lubomir Rintel, Lukas Mai, Maik Hentsche, Marty Pauley,
    +Marvin Humphrey, Matt Johnson, Matt S Trout, Max Maischein, Michael
    +Breen, Michael Fig, Michael G Schwern, Michael Parker, Michael Stevens,
    +Michael Witten, Mike Kelly, Moritz Lenz, Nicholas Clark, Nick Cleaton,
    +Nick Johnston, Nicolas Kaiser, Niko Tyni, Noirin Shirley, Nuno Carvalho,
    +Paul Evans, Paul Green, Paul Johnson, Paul Marquess, Peter J. Holzer,
    +Peter John Acklam, Peter Martini, Philippe Bruhat (BooK), Piotr Fusik,
    +Rafael Garcia-Suarez, Rainer Tammer, Reini Urban, Renee Baecker, Ricardo
    +Signes, Richard Möhn, Richard Soderberg, Rob Hoelz, Robin Barker, Ruslan
    +Zakirov, Salvador Fandiño, Salvador Ortiz Garcia, Shlomi Fish, Sinan
    +Unur, Sisyphus, Slaven Rezic, Steffen Müller, Steve Hay, Steven
    +Schubiger, Steve Peters, Sullivan Beck, Tatsuhiko Miyagawa, Tim Bunce,
    +Todd Rinaldo, Tom Christiansen, Tom Hukins, Tony Cook, Tye McQueen,
    +Vadim Konovalov, Vernon Lyon, Vincent Pit, Walt Mankowski, Wolfram
    +Humann, Yves Orton, Zefram, and Zsbán Ambrus.

    This is woefully incomplete as it's automatically generated from version
    control history. In particular, it doesn't include the names of the
  • Jesse Vincent at Apr 19, 2011 at 3:44 pm

    On Tue 19.Apr'11 at 9:41:16 -0600, Tom Christiansen wrote:
    > Jesse Vincent <jesse@fsck.com> wrote
    > on Wed, 20 Apr 2011 01:24:46 +1000:
    > > I'll take a patch to do the UCA sort first-last, especially if it comes
    > > with a patch to Porting/release_manager_guide.pod to make sure that it's
    > > done that way in the future.
    >
    > Here's your perldelta patch. It sorts using the standard UCA sort from
    > Unicode::Collate, *and* it uses Unicode::LineBreak to wrap with print
    > columns set to 72. Notice how it looks "right" on the screen now, even with
    > diacriticals and East_Asian_Width=Wide characters being counted correctly.
    >
    > I'll send the other under separate cover. I have these tchrist-standard
    > scripts ucsort and unifmt that do all that work for me, but I'm still
    > working on the one-liner for your release_manager_guide entry.

    Thanks.

    I don't actually mind having a tool or script in Porting for this.
  • Tom Christiansen at Apr 19, 2011 at 3:48 pm

    > > I'll send the other under separate cover. I have these tchrist-standard
    > > scripts ucsort and unifmt that do all that work for me, but I'm still
    > > working on the one-liner for your release_manager_guide entry.
    >
    > Thanks.
    > I don't actually mind having a tool or script in Porting for this.

    Oh. That's *much* easier then. Moment.

    --tom
  • Tom Christiansen at Apr 19, 2011 at 7:03 pm
    If you can read this, you are using a MIME enabled tool, so you
    may not be presented with the proper table of contents for this
    package. It contains:

    % mhlist last +drafts
    msg part type/subtype size description
    1066 multipart/mixed 18K
    1 text/plain 876 letter from tchrist
    2 text/plain 3448 patch to how_to_write_a_perldelta
    3 text/plain 5065 unifmt script
    4 text/plain 8666 ucsort script

    Jesse Vincent <jesse@fsck.com> wrote on Wed, 20 Apr 2011 01:44:11 +1000:

    > I don't actually mind having a tool or script in Porting for this.

    Here you go.

    -rw-r--r-- 1 tchrist wheel 3448 Apr 19 12:55 porting.patch
    -rwxr-xr-x 1 tchrist wheel 8666 Apr 19 12:55 ucsort.pl
    -rwxr-xr-x 1 tchrist wheel 5074 Apr 19 12:55 unifmt.pl

    --tom
  • Vadrer at Apr 19, 2011 at 4:31 pm

    On Tue, 2011-04-19 at 09:21 -0600, Tom Christiansen wrote:
    > Jesse Vincent <jesse@fsck.com> wrote on Wed, 20 Apr 2011 00:56:31 +1000:
    > > > Whose today, and whose tomorrow? :)
    > >
    > > Spotting errors in the next 12 hours gets you a gold star.
    > > Spotting errors later gets you the satisfaction of seeing me cry.
    > >
    > > I think you've just solved the encoding issues I have with mutt in
    > > screen, as the line above rendered _correctly_ there.
    >
    > Oh good.
    >
    > Now, may I please interest you in sorting those names using a proper
    > UCA sort instead of a naïve code-point sort? I kinda feel that we
    > in Perl should be able to sort Unicode properly. :(

    I am sorry to have maybe naive speculation,
    but this UCA sort is just counterintuitive to me.
    Do I understand correctly that in this sort the letters "o" and "ö" are in
    the same place?

    In the languages that I am aware of, "ö" is different from "o", and usually at
    the very end of the alphabet - such as in Estonian, but also in other
    languages, I believe.

    ..... Andreas König
    ..... Vadim Konovalov

    These two should be in a different order, because "o" comes sooner in the
    alphabet.

    Vadim.
  • Zefram at Apr 19, 2011 at 4:39 pm

    vadrer wrote:
    > I am sorry to have maybe naive speculation,
    > but this UCA sort is just counterintuitive to me.

    Letter sorting is very language-dependent. Things that Unicode models
    as Latin letters with diacritics will sort as the base letter in some
    languages, but sort as completely distinct letters (often right at the
    end of the alphabet) in others. Given the wide variety of cultures from
    which our contributors hail, it is entirely impossible for us to come
    up with a sorting that will make every name look like it's correctly
    sorted from the point of view of its native culture.

    In similar vein, we can't mechanically pick out from each name which part
    ought to be used as the primary sort key. That's culturally dependent
    too. In this case the information could in principle be gathered, but
    it would have to be manually maintained, and I wouldn't fancy the job
    of researching it for historical contributors.

    In the light of these issues, I'm entirely satisfied with the first-last
    codepoint lexicographic sorting that is used in AUTHORS and perldeltas.

    -zefram
  • Vadrer at Apr 19, 2011 at 5:13 pm

    On Tue, 2011-04-19 at 17:38 +0100, Zefram wrote:
    > vadrer wrote:
    > > I am sorry to have maybe naive speculation,
    > > but this UCA sort is just counterintuitive to me.
    >
    > Letter sorting is very language-dependent. Things that Unicode models
    > as Latin letters with diacritics will sort as the base letter in some
    > languages, but sort as completely distinct letters (often right at the
    > end of the alphabet) in others. Given the wide variety of cultures from
    > which our contributors hail, it is entirely impossible for us to come
    > up with a sorting that will make every name look like it's correctly
    > sorted from the point of view of its native culture.

    Indeed, let it be as it currently is, no problem.

    It just surprises me to know that some languages in real life sort this
    way.

    Thanks.
    Vadim.
  • Tom Christiansen at Apr 19, 2011 at 5:57 pm

    > It just surprises me to know that some languages in real life sort
    > this way.

    And it would surprise any English speaker to see "ä" as a different
    letter from "a", let alone as "ae". Or to see "aa" follow "z", which is
    the most insanely bizarre thing. To English speakers, an a is an a, no
    matter its adornment or decoration.

    Also, consider Ævar Arnfjörð Bjarmason. The second name should
    clearly (to an English speaker) sort the same as "arnfjord". And
    in the UCA it does, despite there being no decomposition that maps
    d to ð. Indeed, that's what it does in Icelandic, too. Although
    all is not how you might imagine. In Icelandic, you get this:

    IS: a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö

    But in English and Dutch or even the German phonebook -- and indeed with
    the default UCA -- those letters sort in this order:

    UCA: a á æ b d ð e é f g h i í j k l m n o ó ö p r s t u ú v x y ý þ

    Whereas a brain-dead code-point sort gives this nonsense:

    DUH: a b d e f g h i j k l m n o p r s t u v x y á æ é í ð ó ö ú ý þ

    Yet in Estonian they do this:

    ET: a á æ b d ð e é f g h i í j k l m n o ó p r s t u ú v ö x y ý þ

    There is very good reason to use the UCA for all text sorts. If you
    want to use locale sort on top of it, that's a different matter, but
    code-point sorts are a sure sign that a computer hasn't figured out
    how to deal with bits that mean text.

    Perl is not that stupid, and we should not present it in so poor a light.

    --tom
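
The "DUH" line above is what any plain string sort produces, and a language-specific alphabet can be modelled as a ranking table. The sketch below shows both with the standard library; the per-letter ranking in `tailored_key` (a hypothetical helper) is a toy stand-in for a real locale tailoring on top of the UCA, not how Unicode::Collate::Locale works.

```python
import unicodedata

# The Icelandic order quoted in the message above.
ICELANDIC = unicodedata.normalize(
    "NFC",
    "a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö",
).split()


def tailored_key(alphabet):
    """Build a sort key that ranks each character by its place in `alphabet`."""
    rank = {letter: position for position, letter in enumerate(alphabet)}

    def key(word):
        # Characters outside the alphabet all sort after the known letters.
        return [rank.get(ch, len(rank)) for ch in unicodedata.normalize("NFC", word)]

    return key


by_code_point = sorted(ICELANDIC)
by_icelandic = sorted(by_code_point, key=tailored_key(ICELANDIC))
print("DUH:", " ".join(by_code_point))
print("IS: ", " ".join(by_icelandic))
```

The first line printed matches the code-point order shown in the message, with every accented letter pushed after "y"; the second restores the Icelandic order.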
  • Vadrer at Apr 19, 2011 at 7:27 pm

    On Tue, 2011-04-19 at 11:57 -0600, Tom Christiansen wrote:
    > > It just surprises me to know that some languages in real life sort
    > > this way.
    >
    > And it would surprise any English speaker to see "ä" as a different
    > letter from "a", let alone as "ae".

    IMO it should not surprise an English reader to see "a" and "ä" as
    different...

    > Or to see "aa" follow "z", which is
    > the most insanely bizarre thing. To English speakers, an a is an a, no
    > matter its adornment or decoration.
    >
    > Also, consider Ævar Arnfjörð Bjarmason. The second name should
    > clearly (to an English speaker) sort the same as "arnfjord". And
    > in the UCA it does, despite there being no decomposition that maps
    > d to ð. Indeed, that's what it does in Icelandic, too. Although
    > all is not how you might imagine. In Icelandic, you get this:
    >
    > IS: a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö

    I am not familiar with Icelandic, but my Estonian knowledge says to me
    that this list is not totally fair - see below on what I mean.

    > But in English and Dutch or even the German phonebook -- and indeed with
    > the default UCA -- those letters sort in this order:
    >
    > UCA: a á æ b d ð e é f g h i í j k l m n o ó ö p r s t u ú v x y ý þ
    >
    > Whereas a brain-dead code-point sort gives this nonsense:
    >
    > DUH: a b d e f g h i j k l m n o p r s t u v x y á æ é í ð ó ö ú ý þ
    >
    > Yet in Estonian they do this:
    >
    > ET: a á æ b d ð e é f g h i í j k l m n o ó p r s t u ú v ö x y ý þ

    In the Estonian alphabet, the order is this -

    A, B, D, E, G, H, I, J, K, L, M, N, O, P, R, S, T, U, V, Õ, Ä, Ö, Ü
    http://en.wikipedia.org/wiki/Estonian_alphabet

    (I learned Estonian in school)

    Estonian does not have the letters á æ ð é í ó ú ý þ.
    It just doesn't.

    What you cited is not an alphabet, but rather a set of letters sorted by some
    agreed algorithm.

    So, having this set of letters means that you will not match paper
    indices or dictionaries.

    And while you paid attention to accented characters, there are also
    languages where letters that look the same as English ones are different and
    take their own place in sorting, indexing, etc - the most obvious examples are
    Greek and Russian (my native, but maybe you've noticed already :) )

    These completely fall out of such an attempt to sort.

    > There is very good reason to use the UCA for all text sorts. If you
    > want to use locale sort on top of it, that's a different matter, but
    > code-point sorts are a sure sign that a computer hasn't figured out
    > how to deal with bits that mean text.
    >
    > Perl is not that stupid, and we should not present it in so poor a light.

    Yes, I am not objecting to that,
    I am o-kay,
    just saying this is not intuitive.

    Moreover, this discussion becomes not Perl-related, so I am not arguing
    anymore :)

    Regards,
    Vadim.
  • Tom Christiansen at Apr 19, 2011 at 8:28 pm

    vadrer wrote on Tue, 19 Apr 2011 22:37:47 -0000:

    > > And it would surprise any English speaker to see "ä" as a different
    > > letter from "a", let alone as "ae".
    >
    > IMO it should not surprise an English reader to see "a" and "ä" as
    > different...

    They are different glyphs, but I quite assure you that they are perceived
    to be the same letters. The second is an "a" with a diaeresis sitting
    atop it, just as "ö" is an "o" with a diaeresis, ø is an "o" with a slash
    through it, and "ï" is an "i" with a diaeresis.

    That does not stop them from being a's, o's, and i's, which is how they sort.

    That's why in an English-language dictionary, "coöperate" falls
    after "coon" but before "coot". It also falls before "cooperite",
    for that matter, because "coöperate" and "cooperite" first differ
    by letter at the 7th position in each word, at which point "a"
    precedes "o".

    Check the order of headwords in the OED if you don't believe me.

    You'll also find there that "stød" follows "stocky" and precedes
    "stodge". That's just how we order things.

    It also happens to be the way the untailored UCA sorts things:

    % ucsort words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
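
    The same comparison can be sketched in plain Perl with the core
    Unicode::Collate module. This is only an illustration: the word list is
    a subset of the one above, and the variable names are made up for the
    example. It contrasts Perl's built-in code-point sort with the default
    (untailored) UCA order.

```perl
use strict;
use warnings;
use utf8;                      # the word list below contains literal UTF-8
use Unicode::Collate;

binmode(STDOUT, ":encoding(UTF-8)");

# Illustrative subset of the word list used in the post.
my @words = qw(coot stodge cooperite stød coöperate stocky coon);

# Perl's built-in sort compares code points, so "ö" and "ø" sort
# after every unaccented ASCII letter.
my @by_codepoint = sort @words;

# The default collator compares base letters first and only then
# looks at diacritics.
my $collator = Unicode::Collate->new;
my @by_uca   = $collator->sort(@words);

print "code point: @by_codepoint\n";
print "UCA:        @by_uca\n";
```

    With the code-point sort, "coöperate" and "stød" are pushed behind
    their unaccented neighbours; with the collator they fall where the
    ucsort output above puts them.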

    And that word list comes out the same in English and in the language systems
    we're arguably the closest to:

    % ucsort --locale en words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
    % ucsort --locale nl words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
    % ucsort --locale de words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
    % ucsort --locale fr words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
    % ucsort --locale es words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
    % ucsort --locale pt words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
    % ucsort --locale it words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck

    See? They all order those the same as English does. Although the
    German phonebook is admittedly different than the German dictionary,
    for reasons already explained:

    % ucsort --locale de words | unifmt -100
    coon coöperate cooperite coot correspondence cozen roäc road roam rozzer stocky stød stodge stuck
    % ucsort --locale de__phonebook words | unifmt -100
    coöperate coon cooperite coot correspondence cozen road roäc roam rozzer stocky stød stodge stuck

    Showing why an English speaker would not find things in a German phonebook. :)

    Sure, it works out differently in Nordic and Eastern European languages:

    % ucsort --locale hu words | unifmt -100
    coon cooperite coot coöperate correspondence cozen roäc road roam rozzer stocky stød stodge stuck

    % ucsort --locale et words | unifmt -100
    coon cooperite coot correspondence cozen coöperate road roam rozzer roäc stocky stød stodge stuck

    % ucsort --locale is words | unifmt -100
    coon cooperite coot correspondence cozen coöperate road roam rozzer roäc stocky stodge stuck stød
    % ucsort --locale da words | unifmt -100
    coon cooperite coot correspondence cozen coöperate road roam rozzer roäc stocky stodge stuck stød
    % ucsort --locale nn words | unifmt -100
    coon cooperite coot correspondence cozen coöperate road roam rozzer roäc stocky stodge stuck stød

    But if you think an English speaker wouldn't be surprised to see
    "coöperate" slung way down the alphabet to follow "cozen", or "stød"
    placed anywhere other than amongst the other "sto-" words, then you
    don't know the same English speakers as I know. To us, these are the
    same letters, not different letters, because we do not hold a letter
    with a diacritic to be a wholly different letter.

    Honest.

    And *that's* why Unicode offers a tailorable multilevel sort for text.

    --tom
  • Abigail at Apr 19, 2011 at 8:18 pm

    On Tue, Apr 19, 2011 at 11:57:17AM -0600, Tom Christiansen wrote:
    It just surprises me to know that some languages in real life sort
    this way.
    And it would surprise any English speaker to see "ä" as a different
    letter from "a", let alone as "ae". Or to see "aa" follow "z", which is
    the most insanely bizarre thing. To English speakers, an a is an a, no
    matter its adornment or decoration.

    Also, consider Ævar Arnfjörð Bjarmason. The second name should
    clearly (to an English speaker) sort the same as "arnfjord". And
    in the UCA it does, despite there being no decomposition that maps
    d to ð. Indeed, that's what it does in Icelandic, too. Although
    all is not how you might imagine. In Icelandic, you get this:

    IS: a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö

    But in English and Dutch or even the German phonebook -- and indeed with
    the default UCA -- those letters sort in this order:

    UCA: a á æ b d ð e é f g h i í j k l m n o ó ö p r s t u ú v x y ý þ
    That's not true for Dutch phonebooks, and not true for dictionaries
    either. Dutch doesn't sort accented letters different from non-accented
    letters. é does *not* sort after e. Nor before. It sorts *as*
    e. After all, in Dutch, accents do not change the letters - they change
    pronunciation (and sometimes, not even that - nor does the reverse
    hold). And for phonebooks, i sorts between h and j, unless followed by
    j, then it sorts as y. (Words like 'bijectie' aren't used as a name,
    so I've no idea how they sort in a phonebook).


    Having said that, I don't care at all how names are sorted in perldelta.



    Abigail
  • Tom Christiansen at Apr 19, 2011 at 8:30 pm

    That's not true for Dutch phonebooks, and not true for
    dictionaries either. Dutch doesn't sort accented letters
    different from non-accented letters. é does *not* sort after e.
    Nor before. It sorts *as* e. After all, in Dutch, accents do not
    change the letters - they change pronunciation (and sometimes,
    not even that - nor does the reverse hold).
    Yes, exactly; it's the same as English in that particular regard.
    Accents don't make something a different letter, nor change its
    sort position.

    I included Dutch in my previous message.

    --tom
  • Konovalov, Vadim (Vadim)** CTR ** at Apr 20, 2011 at 6:31 am

    From: Tom Christiansen
    To: Abigail
    That's not true for Dutch phonebooks, and not true for
    dictionaries either. Dutch doesn't sort accented letters
    different from non-accented letters. é does *not* sort after e.
    Nor before. It sorts *as* e. After all, in Dutch, accents do not
    change the letters - they change pronunciation (and sometimes,
    not even that - nor does the reverse hold).
    Yes, exactly; it's the same as English in that particular regard.
    Accents don't make something a different letter, nor change its
    sort position.
    this is the point - in some languages accented letters do not change
    the letter (only its pronunciation), in other languages they do change
    the letter.
    (thanks for these examples, now I will know better :) )

    I have another example of two-dotted letter :)

    In Russian, there is a letter "Ё" that is sorted immediately after "Е".
    (it is absolutely different compared to French same looking letter)

    But Russian usage of this letter in real life is a bit in the middle -
    while these two letters are strictly distinguished, the two
    dots in "Ё" are often omitted, which makes reading less
    convenient, etc.

    In Estonian, all letters with two dots above are at the end of alphabet,
    and are different letters.

    have a nice day,
    Vadim.
  • Zsbán Ambrus at Apr 20, 2011 at 8:17 am

    On Tue, Apr 19, 2011 at 7:57 PM, Tom Christiansen wrote:
    And it would surprise any English speaker to see "ä" as a different
    letter from "a", let alone as "ae".  Or to see "aa" follow "z", which is
    the most insanely bizarre thing.  To English speakers, an a is an a, no
    matter its adornment or decoration.
    There's a significant difference between "ä" and "å" here.

    The "ä" is sorted in both places depending on the language: German
    sorts it mixed with "a", and that is also how it appears in a Hungarian
    list of names even though "ä" is never used in Hungarian, whereas
    Swedish and Estonian sort it after "z".

    On the other hand, I believe "å" is sorted after "z" in all languages.
    It's certainly sorted that way in at least Swedish and Danish. Thus,
    I always expect it to be sorted after "z" no matter what kind of sort
    you're using. Similarly, I'd always expect "ø" to be sorted between
    "z" and "å". Some go as far and say that it's best to pretend that
    "å" is not an "a" with an accent but a separate base letter like "þ"
    is.

    Now the more tricky question is where you should sort "æ". Which ones
    of the following are used in at least one sorting order?
    (1) equivalent to "a"
    (2) equivalent to "ae"
    (3) between "ae" and "af"
    (4) between "a" and "b"
    (5) between "z" and "ø".

    Ambrus
  • Tom Christiansen at Apr 20, 2011 at 12:53 pm
    Zsbán Ambrus <ambrus@math.bme.hu> wrote
    on Wed, 20 Apr 2011 10:17:18 +0200:
    On the other hand, I believe "å" is sorted after "z" in all languages.
    No, it isn't. It is sorted as "a" in English.
    It's certainly sorted that way in at least Swedish and Danish. Thus,
    I always expect it to be sorted after "z" no matter what kind of sort
    you're using. Similarly, I'd always expect "ø" to be sorted between
    "z" and "å". Some go as far and say that it's best to pretend that
    "å" is not an "a" with an accent but a separate base letter like "þ"
    is.
    Similarly, "ø" is sorted as "o" in English. Here is a list of
    head words from the current OED, with the year the word was first
    documented in written English on the left:

    1943 stockpiling, n.
    1508 stock still | stock-...
    1972 stock-take | stockta...
    1794 stock-ˌtaker, n.
    1858 stock-ˌtaking, n.
    1808 stock-work, n.
    a1400 stocky, adj.
    1954 stød, n.
    1825 stodge, n.
    1674 stodge, v.
    1847 stodge-full, adj.
    1905 stodger, n.
    1823 stodgy, adj.

    Similarly, "Å" is also treated as an "a" with a diacritic,
    *which does not alter its sort order*. Here is part of
    the OED's list of initialisms under "a". Notice again
    where things sort:

    ATM n.
    A.T.S. n.
    ATV n.
    A.T.V.
    Å.U. n.
    A.U. n.
    AUEW n.
    AV n.
    A.V. n.
    A.V.H. n.
    A.V.O. n.
    A.V.M. n.
    A.W.O.L. n.
    A.W.U. n.
    AZT n.

    Diacritics do not affect sort order in English.

    --tom
  • Tom Christiansen at Apr 19, 2011 at 6:05 pm

    In the light of these issues, I'm entirely satisfied with the first-last
    codepoint lexicographic sorting that is used in AUTHORS and perldeltas.
    I'm not. Codepoint sorting is always worse than UCA sorting. Codepoint
    sorting is an embarrassment.

    --tom
  • Zsbán Ambrus at Apr 20, 2011 at 8:44 am

    On Tue, Apr 19, 2011 at 6:38 PM, Zefram wrote:
    In similar vein, we can't mechanically pick out from each name which part
    ought to be used as the primary sort key.  That's culturally dependent
    too.  In this case the information could in principle be gathered, but
    it would have to be manually maintained, and I wouldn't fancy the job
    of researching it for historical contributors.
    Indeed. Not only is it culturally dependent, but it also cannot be
    determined even if you know the culture, just as Tom Christiansen
    alluded to above. The two problems are determining whether the name is
    written as last-first or first-last or does not have a first name part
    at all, and being able to tell whether there is more than one last
    name.

    Hungarian names are particularly troublesome because people will write
    them randomly in last-first and first-last form. Luckily you can
    usually make good guesses because you can freely download the list of
    all currently existing first names and the list of the most frequent
    last names[1]. However, even armed with these sometimes you can't
    quite be sure because you can meet
    (1) new first names that were not yet in the list when you downloaded it,
    (2) non-official nickname variants of first names, pennames that are
    not real first names, or even straight misspellings,
    (3) a name with more than one last name but the first name omitted,
    (4) a name where both the first and last names are existing first
    names, but the actual first name is more common as the last name --
    this is rare because people will consciously avoid such names but you
    can't completely exclude them


    Zsbán Ambrus (of which Zsbán is the family name)


    [1] "http://tinyurl.com/3tmxc2e"; "http://tinyurl.com/3uor2n5".
  • Tom Christiansen at Apr 19, 2011 at 5:44 pm
    On Tue, 19 Apr 2011 19:41:38 -0000, vadrer <me@vadrer.org>
    opened up the following rathole the size of an asteroid strike:
    Now, may I please interest you in sorting those names using a proper
    UCA sort instead of a naïve code-point sort? I kinda feel that we
    in Perl should be able to sort Unicode properly. :(
    I am sorry for what may be naive speculation,
    but this UCA sort is just counterintuitive to me.
    Do not be sorry. It is not naïve not to know the UCA.
    The UCA is a multi-level sort. At least four levels are defined:

    1  Primary      consider only alphabetic ordering,
                    so ignore diacritics, case distinctions, and non-letters

    2  Secondary    also consider diacritics,
                    so ignore case distinctions and non-letters

    3  Tertiary     also consider case distinctions,
                    so ignore non-letters

    4  Quaternary   consider other code points for tie-breaking

    Do I understand correctly, that in this sort letters "o" and "ö"
    are in the same place?
    Not exactly. They differ by their diacritic. That means that
    at the primary strength, which considers differences between letters
    and ignores diacritics, case, and nonalphabetics, they are the same.

    Here's an example using only English words for simplicity's sake:

                                When Compared at Collation Strength...
                            ______________________________________________
                            Primary       Secondary   Tertiary  Quaternary
    String#1    String#2    (alphabetic)  (accents)   (case)    (etc)
    ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾    ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
    Bob         Alice       DIFFER        --          --        --
    resume      résumé      SAME          DIFFER      --        --
    Bob         bob         SAME          SAME        DIFFER    --
    I'll        ill         SAME          SAME        DIFFER    --
    I'll        Ill         SAME          SAME        SAME      DIFFER
    re-invent   reinvent    SAME          SAME        SAME      DIFFER
    re-invent   reïnvent    SAME          DIFFER      --        --

    (sure hope you're reading this in a fixed-width font! :)
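
    The table can be reproduced with a short sketch using Unicode::Collate,
    whose "level" parameter limits how many collation levels are compared.
    The helper name first_difference is hypothetical, invented for this
    illustration; it simply tries levels 1 through 4 in turn and reports
    the first one at which the two strings stop comparing equal.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ":encoding(UTF-8)");

# One collator per strength: level 1 = primary ... level 4 = quaternary.
my @collators = map { Unicode::Collate->new(level => $_) } 1 .. 4;
my @names     = qw(primary secondary tertiary quaternary);

# Hypothetical helper: return the first level (1..4) at which the two
# strings differ, or 0 if they are equal at every level.
sub first_difference {
    my ($x, $y) = @_;
    for my $i (0 .. $#collators) {
        return $i + 1 if $collators[$i]->ne($x, $y);
    }
    return 0;
}

for my $pair (["Bob", "Alice"], ["resume", "résumé"], ["Bob", "bob"],
              ["I'll", "Ill"],  ["re-invent", "reinvent"]) {
    my $level = first_difference(@$pair);
    printf "%-10s %-10s first differ at the %s level\n",
        @$pair, $names[$level - 1];
}
```

    Building the four collators once, rather than inside the helper, keeps
    repeated comparisons cheap.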

    There are further subtleties. One is that casing isn't always what you
    think it is. Here we show that the old-style "long s", "ſ", is just a
    case variation of the regular "s". This is like there being two
    different lowercase Greek sigmas but only one in uppercase: σ, ς,
    and Σ are all the same letter apart from case, just as are ſ, s,
    and S.

    Similarly with the German Eszett "ß" counting as "SS" when uppercased.
    This is a more interesting story, because you no longer have the same
    number of code points. When case folding produces the same number of
    code points, it's called "simple case folding" in Unicode parlance, and
    when it can produce a different number of code points, it's called "full
    case folding". Perl uses full case folding, but everybody else uses
    only simple case folding.

    Watch:

                            Primary       Secondary   Tertiary  Quaternary
    String#1    String#2    (alphabetic)  (accents)   (case)    (etc)
    ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾    ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
    castles     caſtles     SAME          SAME        DIFFER    --
    σίγμας      ΣΊΓΜΑΣ      SAME          SAME        DIFFER    --
    tschüß      TSCHÜSS     SAME          SAME        DIFFER    --

    That's just the way (full-, not simple-) Unicode casing works.
    You get that when you compare strings case-insensitively:

    use utf8;
    use v5.014;
    say "castles" =~ /caſtles/i ? "SAME" : "DIFFERENT";
    say "σίγμας" =~ /ΣΊΓΜΑΣ/i ? "SAME" : "DIFFERENT";
    say "tschüß" =~ /TSCHÜSS/i ? "SAME" : "DIFFERENT";

    produces

    SAME
    SAME
    SAME

    Really nifty, eh?
    In languages that I am aware "ö" is different from "o", and usually in
    the very end of alphabet - such in Estonian language, but also in
    other languages, I believe,

    ..... Andreas König
    ..... Vadim Konovalov


    these two should be in different order, because "o" is
    sooner in alphabet.
    Now that is a **very** interesting issue.

    The barebones UCA sort that I ran those names through is by design
    one that works in a way that is not specific to any particular locale.
    I did this because the names are of varying nationality, so it did not
    seem to me to be appropriate to choose one nationality over another.

    That means to use the default UCA sort.

    However, because the document is in the English language, there is also a
    fairly strong argument to be made that it should be sorted using what we
    sometimes call an English phonebook sort. That's the one used for sorting
    surnames, and you will find it used in English-language book shops. And
    if I had, things would have been different. For example, an English
    phonebook sort counts surnames starting with "Mc-" and "Mac-" (provided
    they are used as patronymics; "Macaroni" and "Macedonia" don't count!)
    as being completely interchangeable, and it sorts them in front of all
    other M- words. That would give this for surnames:

    Grant McLean
    Tye McQueen
    Lukas Mai
    Max Maischein
    Walt Mankowski
    Dagfinn Ilmari Mannsåker
    Paul Marquess
    Peter Martini
    James Mastros
    Abhijit Menon-Sen
    David Mitchell
    Tatsuhiko Miyagawa
    Richard Möhn
    Ben Morrow
    Steffen Müller

    But I did not do that. There are other locale issues that I
    address below.
    In languages that I am aware "ö" is different from "o", and usually in
    the very end of alphabet - such in Estonian language, but also in
    other languages, I believe,

    ..... Andreas König
    ..... Vadim Konovalov


    these two should be in different order, because "o" is
    sooner in alphabet.
    Ah, but in *whose* alphabet, one must ask?

    This varies considerably. In an English-language alphabet, or indeed
    in the default UCA, "ö" and "o" are the same letter, and thus occupy
    precisely the same spot in the alphabet. That means in these two
    surnames:

    Vadim Konovalov
    Andreas König

    the first letter that differs is not the second but rather the fourth
    one, where the "i" of König precedes the "o" of Konovalov. In a
    multi-level sort, you do not consider higher-level differences unless
    there are no lower-level differences. Only if the words had no
    differences AT ANY
    POSITION in the primary strength would you go on and use the secondary
    strength, which is one that does indeed consider diacritics.

    But now let us consider national sorts. When sorting German-language
    names, such as for a German-language phonebook, umlauted vowels sort
    just like they were spelled with the base vowel plus the letter "e".
    That means that König and Koenig would test equal in that national sort.
    I did not do that, though.
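
    As a rough sketch of the difference, Unicode::Collate::Locale can be
    asked for the German phonebook tailoring by name ("de__phonebook", the
    same locale name used with ucsort earlier in the thread). The word list
    and variable names here are illustrative only. The last line checks
    whether "König" and "Koenig" compare equal at primary strength under
    that tailoring, which is what the paragraph above describes.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode(STDOUT, ":encoding(UTF-8)");

# Illustrative subset of the word list used earlier in the thread.
my @words = qw(coon coöperate cooperite road roäc roam);

my $default   = Unicode::Collate->new;
my $phonebook = Unicode::Collate::Locale->new(locale => 'de__phonebook');

my @default_order   = $default->sort(@words);
my @phonebook_order = $phonebook->sort(@words);

print "default UCA:  @default_order\n";
print "de phonebook: @phonebook_order\n";

# Compare only base letters (level 1) under the phonebook tailoring,
# where an umlauted vowel is treated like the vowel followed by "e".
my $primary = Unicode::Collate::Locale->new(
    locale => 'de__phonebook',
    level  => 1,
);
print "König vs Koenig at primary strength: ",
    ($primary->eq("König", "Koenig") ? "SAME" : "DIFFER"), "\n";
```

    Only the tailoring changes between the two sorts; the input list and
    the rest of the code stay the same, which is the point of a tailorable
    collation.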

    Let us consider Estonian, per your example. It is true that
    in an Estonian locale, you would get a different ordering. Here's
    a localized diff:

    kmx
    -Andreas König
    Vadim Konovalov
    +Andreas König
    David Leadbeater
    Moritz Lenz

    and also

    David Mitchell
    Tatsuhiko Miyagawa
    -Richard Möhn
    Ben Morrow
    +Richard Möhn
    Steffen Müller

    but it goes rather beyond that.

    Michael Stevens
    Gene Sullivan
    +Ruslan Zakirov
    +Zefram
    Rainer Tammer
    Leon Timmermans

    So apparently Estonian sorts Z after S and before T. Curious.

    And it gets better:

    Reini Urban
    Alex Vandiver
    -Jesse Vincent
    Casey West
    David Wheeler
    Frank Wiegand
    Chris 'BinGOs' Williams
    Karl Williamson
    +Jesse Vincent
    Michael Witten
    -Ruslan Zakirov
    -Zefram
    Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯

    It appears that like some Swedish sorts, Estonian counts
    V and W as equals. In fact, if we were sorting on first names,
    Estonian would demand these changes:

    @@ -130,6 +130,8 @@
    Steven Schubiger
    Steve Peters
    Sullivan Beck
    +Zefram
    +Zsbán Ambrus
    Tatsuhiko Miyagawa
    Tim Bunce
    Todd Rinaldo
    @@ -138,10 +140,8 @@
    Tony Cook
    Tye McQueen
    Vadim Konovalov
    +Walt Mankowski
    Vernon Lyon
    Vincent Pit
    -Walt Mankowski
    Wolfram Humann
    Yves Orton
    -Zefram
    -Zsbán Ambrus

    Isn't that curious? That messing around with W and V also happens in the
    Swedish locale, here by last names:

    Reini Urban
    Alex Vandiver
    -Jesse Vincent
    Casey West
    David Wheeler
    Frank Wiegand
    Chris 'BinGOs' Williams
    Karl Williamson
    +Jesse Vincent
    Michael Witten

    and here by first names:

    Tye McQueen
    Vadim Konovalov
    +Walt Mankowski
    Vernon Lyon
    Vincent Pit
    -Walt Mankowski
    Wolfram Humann
    Yves Orton

    And Finnish appears to do the same thing as Swedish does.

    Here's another issue. I mentioned that Iberian surnames should sort
    first on the first surname before the second surname. Another matter
    is that in Spanish, the letter "ñ" is not an "n" with a tilde. It is
    a distinct letter that sorts after "n" and before "o". And in a
    traditional Spanish sort, you must also consider that "ch" and "ll"
    are digraphs sorting after "c" and "l" respectively. That leads to
    this diff:

    Nuno Carvalho
    -Tom Christiansen
    -chromatic
    -Father Chrysostomos
    Alexandr Ciornii
    Nicholas Clark
    Nick Cleaton
    Tony Cook
    Aaron Crane
    Jim Cromie
    +Tom Christiansen
    +chromatic
    +Father Chrysostomos
    Dan Dascalescu
    Alex Davies

    And when sorted by first names, it would cause this switch:

    Moritz Lenz
    -Nicholas Clark
    Nick Cleaton
    Nick Johnston
    Nicolas Kaiser
    +Nicholas Clark
    Niko Tyni
    Noirin Shirley

    You'd have to do the same for Welsh, which does the same thing.
    Welsh also counts "ph" as its own letter. That would therefore
    lead to this rearrangement for Welsh if we're considering last names:

    John Peacock
    Steve Peters
    -Brian Phillips
    Vincent Pit
    Ali Polatel
    Joshua Pritikin
    +Brian Phillips
    Hongwen Qiu
    Florian Ragwitz

    But this one for Welsh if we are considering first names:

    @@ -93,10 +93,10 @@
    Michael Witten
    Mike Kelly
    Moritz Lenz
    -Nicholas Clark
    Nick Cleaton
    Nick Johnston
    Nicolas Kaiser
    +Nicholas Clark
    Niko Tyni
    Noirin Shirley
    Nuno Carvalho
    @@ -107,8 +107,8 @@
    Peter J. Holzer
    Peter John Acklam
    Peter Martini
    -Philippe Bruhat (BooK)
    Piotr Fusik
    +Philippe Bruhat (BooK)
    Rafael Garcia-Suarez
    Rainer Tammer
    Reini Urban

    Let's see, I also noticed this name in the list:

    Dagfinn Ilmari Mannsåker

    What shall we do with the "å"? In for example the Danish
    alphabet, the final letters are y z æ ø å. However when
    sorting names, the "å" is treated as "aa" -- and vice versa!

    That means that Gisle Aas goes after Ruslan Zakirov in Danish:

    Ruslan Zakirov
    Zefram
    +Gisle Aas

    And "Æ" is considered a separate letter rather than the same
    as an "ae" contraction. Sorting by first names in Danish then
    requires these changes:

    @@ -1,7 +1,5 @@
    -Aaron Crane
    Abhijit Menon-Sen
    Abigail
    -Ævar Arnfjörð Bjarmason
    Alastair Douglas
    Alexander Alekseev
    Alexander Hartmaier
    @@ -145,3 +143,5 @@
    Yves Orton
    Zefram
    Zsbán Ambrus
    +Ævar Arnfjörð Bjarmason
    +Aaron Crane

    I submit to you that such a sort of names in an English-language
    document would not work. It would be considered an error.

    Do you see how complicated this is? I urge you to examine
    the DUCET modifications in

    blead/lib/Unicode/Collate/Locale

    These are hardly exhaustive (for example, it's missing en__phonebook),
    but should give you an idea of the scope of the problem:

    af.pl        es_trad.pl  hy.pl   nb.pl   sq.pl   zh.pl
    ar.pl        et.pl       ig.pl   nn.pl   sv.pl   zh_big5.pl
    az.pl        fi.pl       is.pl   nso.pl  sw.pl   zh_gb.pl
    ca.pl        fil.pl      ja.pl   om.pl   tn.pl   zh_pin.pl
    cs.pl        fo.pl       kk.pl   pl.pl   to.pl   zh_strk.pl
    cy.pl        fr.pl       kl.pl   ro.pl   tr.pl
    da.pl        ha.pl       ko.pl   ru.pl   uk.pl
    de_phone.pl  haw.pl      lt.pl   se.pl   vi.pl
    eo.pl        hr.pl       lv.pl   sk.pl   wo.pl
    es.pl        hu.pl       mt.pl   sl.pl   yo.pl

    Do you see now? This brings us back to your last statement:
    these two should be in different order, because "o" is
    sooner in alphabet.
    And I must again ask: in **WHOSE** alphabet??

    It is my considered and studied--but not unchangeable--opinion that I have
    in choosing the default UCA done **the only reasonable thing possible**.

    If you, or anyone else for that matter, should happen to believe otherwise,
    please carefully explain which sort you think we should be using, and why.

    But until and unless I hear otherwise, I shall continue to believe that the
    default UCA works best for us.

    --tom
  • Ævar Arnfjörð Bjarmason at Apr 19, 2011 at 6:39 pm

    On Tue, Apr 19, 2011 at 17:21, Tom Christiansen wrote:

    But even still, don't both those second two columns look a whole lot
    better than the first one does?
    Last-first presumes that people have significant last names, I don't.

    Anyway, this entire thing is silly as far as I'm concerned, let's just
    use code point order.

    UCA sorting leads to its own silliness, like sorting "Æ" according to
    the rules of languages where it's an AE-like letter, which it's not in
    the language of the only committer we have whose name starts with an
    "Æ" (or a non-ASCII letter for that matter).
  • Tom Christiansen at Apr 19, 2011 at 6:42 pm

    Anyway, this entire thing is silly as far as I'm concerned, let's just
    use code point order.
    UCA sorting leads to its own silliness,
    I disagree in the strongest possible terms.

    --tom
  • Tom Christiansen at Apr 19, 2011 at 7:18 pm
    Now that I'm done with the patch, I have more time to disagree
    more strongly.
    Anyway, this entire thing is silly as far as I'm concerned,
    That's merely your opinion.
    let's just use code point order.
    Code point order is bitwise order, and it is inarguably complete idiocy
    for sorting Unicode *text*. The same letters are not arranged in
    bitwise-ascending order, so it will always look like an idiot sorted it.
    That's why the UCA exists, and why it is so rich. *NEVER* sort text in
    bitwise order. It is embarrassing, like people who sort titles without
    trimming the leading articles. Text is not bits. Text has other properties
    that childlike approaches to bit counting will never reveal.
    UCA sorting leads to its own silliness, like sorting "Æ" according to
    the rules of languages where it's an AE-like letter, which it's not in
    the language of the only committer we have whose name starts with an
    "Æ" (or a non-ASCII letter for that matter).
    And just as soon as we are posting a version of perldelta that is
    written in Icelandic, we will use an Icelandic locale to do the
    sorting of names.

    The text is in the English language, and in the English language, an
    "Æ" sorts as though it were the "AE" contraction. To do otherwise
    is to violate the principles of English text.

    But shall we arrange the names per an English phonebook sort? I
    have not done that, but only because there is no en__phonebook locale.
    I have therefore used the default UCA. You may try to convince me that
    something else makes sense, but I warn you that I have thought about
    this matter really a great deal, so if you have not yourself done
    so then I suspect you will not have at your disposal any arguments
    that will stand up to the sort of sincere scrutiny to which I have
    myself put the matter.

    The default UCA is the minimum point of departure for the sorting
    of text. Bits are irrelevant, stupid, and harmful.

    One may certainly wish to do something *more* than just the UCA.
    That's why so many locales exist.

    --tom

    UCA Slogan of The Year: "Bits are for twits!"
  • Ævar Arnfjörð Bjarmason at Apr 19, 2011 at 7:29 pm

    On Tue, Apr 19, 2011 at 21:17, Tom Christiansen wrote:
    Anyway, this entire thing is silly as far as I'm concerned,
    That's merely your opinion.
    I don't mean that proper sorting is silly in general, but that it's
    silly that the "blead is CLOSED for releng" thread has been turned
    into a wall of text discussing some minor detail in the perldelta.

    Nobody but people with sorting OCD care about this, everyone else just
    goes "oh, a lot of people contribute to perl" or "look, my name is in
    there woo".

    I'm not against applying the patch. Actually I'm now strongly in favor
    of it, if only because it'll end this discussion and we can focus on
    things actually relevant to getting 5.14 out.
  • Zefram at Apr 19, 2011 at 4:29 pm

    Tom Christiansen wrote:
    This line has had its individual UTF-8 bytes each re-encoded as UTF-8:
    [funny foreign characters]
    I'm pretty certain that that is supposed to be:
    [funny foreign characters]

    *I'm* pretty certain that it's supposed to be:

    D. Hedden, Jesse Vincent, Jim Cromie, Jirka HruE<0x161>ka, John Peacock,

    so that the *file* encoding stops annoying us all. (Punt character
    encoding as far up the protocol stack as you can.) Also, Jirka's entry
    in AUTHORS has character encoding inconsistent with all of the other
    non-ASCII entries, which seems quite likely to be the cause of the
    manglement in perldelta.pod.

    -zefram

Discussion Overview
group: perl5-porters
categories: perl
posted: Apr 19, '11 at 2:33p
active: Apr 20, '11 at 12:53p
posts: 32
users: 9
website: perl.org