FAQ
In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/79f120c89a6e123770f65fd49893ef0b379cd922?hp=ad01aea0cf123e30f64d2b7f183dfa28c87af2ef>

- Log -----------------------------------------------------------------
commit 79f120c89a6e123770f65fd49893ef0b379cd922
Author: Karl Williamson <khw@cpan.org>
Date: Tue Apr 12 11:21:40 2016 -0600

     Change calculation of locale collation coefficients

     Every time a new collation locale is set, two coefficients are calculated
     that are used in predicting how much space is needed in the
     transformation of a string by strxfrm(). The transformed string is
     roughly linear with the the length of the input string, so we are
     calcaulating 'm' and 'b' such that

         transformed_length = m * input_length + b

     Space is allocated based on this prediction. If it is too small, the
     strxfrm() will fail, and we will have to increase the allotted amount
     and try again. It's better to get the prediction right to avoid
     multiple, expensive strxfrm() calls.

     Prior to this commit, the calculation was not rigorous, and failed on
     some platforms that don't have a fully conforming strxfrm().

     This commit changes to not panic if a locale has an apparent defective
     collation, but instead silently change to use C-locale collation. It
     could be argued that a warning should additionally be raised.

     This commit fixes [perl #121734].

M locale.c
M pod/perldelta.pod
M pod/perllocale.pod

commit c664130fefeaef47ddc7dcbf7ec1830d04af8ea7
Author: Karl Williamson <khw@cpan.org>
Date: Mon Apr 11 19:11:07 2016 -0600

     locale.c: Change algorithm for strxfrm() trials

     It's kind of guess work deciding how big a buffer to give to strxfrm().
     If you give it too small a one, it will fail. Prior to this commit, the
     buffer size was doubled and then strxfrm() was called again, looping
     until it worked, or we used too much memory.

     Each time a new locale is made, we try to minimize the necessity of
     doing this by calculating numbers 'm' and 'b' that can be plugged into
     the equation

         mx + b

     where 'x' is the size of the string passed to strxfrm(). strxfrm() is
     roughly linear with respect to its input's length, so this generally
     works without us having to do many loops to get a large enough size.

     But on many systems, strxfrm(), in failing, returns how much space you
     should have given it. On such systems, we can just use that number on
     the 2nd try and not have to keep guessing. This commit changes to do
     that.

     But on other systems this doesn't work. So the original method is
     retained if we determine that there are problems with strxfrm(), either
     from previous experience, or because using the size returned from the
     first trial didn't work

M embedvar.h
M intrpvar.h
M locale.c

commit 3c5f993ee8b2fd0912839a82e5d7d8c871a363ea
Author: Karl Williamson <khw@cpan.org>
Date: Sat Apr 9 20:40:48 2016 -0600

     locale.c: Free over-allocated space early

     We may over malloc some space in buffers to strxfrm(). This frees it
     now instead of waiting for the whole block to be freed sometime later.
     This can be a significant amount of memory if the input string to
     strxfrm() is long.

M locale.c

commit 4ebeff162503a0f3c404305458e4730ff29f1dea
Author: Karl Williamson <khw@cpan.org>
Date: Sat Apr 9 20:36:01 2016 -0600

     locale.c: White-space only

     Outdent and reflow because the previous commit removed an enclosing
     block.

M locale.c

commit 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e
Author: Karl Williamson <khw@cpan.org>
Date: Sat Apr 9 15:52:05 2016 -0600

     Change mem_collxfrm() algorithm for embedded NULs

     One of the problems in implementing Perl is that the C library routines
     forbid embedded NUL characters, which Perl accepts. This is true for
     the case of strxfrm() which handles collation under locale.

     The best solution as far as functionality goes, would be for Perl to
     write its own strxfrm replacement which would handle the specific needs
     of Perl. But that is not going to happen because of the huge complexity
     in handling it across many platforms. We would have to know the
     location and format of the locale definition files for every such
     platform. Some might follow POSIX guidelines, some might not.

     strxfrm creates a transformation of its input into a new string
     consisting of weight bytes. In the typical but general case, a 3
     character NUL-terminated input string 'A B C 00' (spaces added for
     readability) gets transformed into something like:
         A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
     where the superscripted characters are weights for the corresponding
     input characters. Superscript 1 represents (essentially) the primary
     sorting key; 2, the secondary, etc, for as many levels as the locale
     definition gives. The 01 byte is likely to be the separator between
     levels, but not necessarily, and there could be some other mechanisms
     used on various platforms.

     To handle embedded NULs, the simplest thing would be to just remove them
     before passing in to strxfrm(). Then they would be entirely ignored,
     which might not be what you want. You might want them to have some
     weight at the tertiary level, for example. It also causes problems
     because strxfrm is very context sensitive. The locale definition can
     define weights for specific sequences of any length (and the weights can
     be multi-byte), and by removing a NUL, two characters now become
     adjacent that weren't in the input, and they could now form one of those
     special sequences and thus throw things off.

     Another way to handle NULs, that seemingly ignores them, but actually
     doesn't, is the mechanism in use prior to this commit. The input string
     is split at the NULs, and the substrings are independently passed to
     strxfrm, and the results concatenated together. This doesn't work
     either. In our example 'A B C 00', suppose B is a NUL, and should have
     some weight at the tertiary level. What we want is:
         A¹ C¹ 01 A² C² 01 A³ B³ C³ 00

     But that's not at all what you get. Instead it is:
         A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
     The primary weight of C comes immediately after the teriary weight of A,
     but more importantly, a NUL, instead of being ignored at the primary
     levels, is significant at all levels, so that "a\0c" would sort before
     "ab".

     Still another possibility is to replace the NUL with some other
     character before passing it to strxfrm. That was my original plan, to
     replace each NUL with the character that this code determines has the
     lowest collation order for the current locale. On strings that don't
     contain that character, the results would be as good as it gets for that
     locale. That character is likely to be ignored at higher weight levels,
     but have some small non-ignored weight at the lowest ones. And
     hopefully the character would rarely be encountered in practice. When
     it does happen, it and NUL would sort identically; hardly the end of the
     world. If the entire strings sorted identically, the NUL-containing one
     would come out before the other one, since the original Perl strings are
     used as a tie breaker. However, testing showed a problem with this. If
     that other character is part of a sequence that has special weighting,
     the results won't be correct. With gcc, U+00B4 ACUTE ACCENT is the
     lowest collating character in many UTF-8 locales. It combines in
     Romanian and Vietnamese with some other characters to change weights,
     and hence changing NULs into U+B4 screws things up.

     What I finally have come to is to do is a modification of this final
     approach, where the possible NUL replacements are limited to just
     characters that are controls in the locale. NULs are replaced by the
     lowest collating control. It would really be a defective locale if this
     control combined with some other character to form a special sequence.
     Often the character will be a 01, START OF HEADING. In the very
     unlikely case that there are absolutely no controls in the locale, 01 is
     used, because we have to replace it with something.

     The code added by this commit is mostly utf8-ready. A few commits from
     now will make Perl properly work with UTF-8 (if the platform supports
     it). But until that time, this isn't a full implementation; it only
     looks for the lowest-sorting control that is invariant, where the
     the UTF8ness doesn't matter. The added tests are marked as TODO until
     then.

M embed.fnc
M embedvar.h
M intrpvar.h
M lib/locale.t
M locale.c
M pod/perldelta.pod
M pod/perllocale.pod
M proto.h
M t/porting/libperl.t

commit 59c018b996263ec705a1e7182f7fa996b72207da
Author: Karl Williamson <khw@cpan.org>
Date: Tue May 17 21:53:53 2016 -0600

     locale.c: Add, move, clarify comments

     This moves a large block of comments to before a block, outdents it, and
     adds to it, plus adding another comment

M locale.c

commit 165a1c52807daa7ad3ecc83f0811047937088904
Author: Karl Williamson <khw@cpan.org>
Date: Mon May 16 15:19:14 2016 -0600

     Keep track of if collation locale is UTF-8 or not

     This will be used in future commits

M embedvar.h
M intrpvar.h
M locale.c
M sv.c

commit 00bf60caa5125511dc13041a21f3d1cf2abff837
Author: Karl Williamson <khw@cpan.org>
Date: Mon May 16 15:15:26 2016 -0600

     locale.c: Don't use special locale collation for C locale

     We can skip all the locale collation calculations if the locale we are
     in is C or POSIX.

M locale.c

commit 4e615abd31bcd0bb8f321ae4687e5aef1a8aa391
Author: Karl Williamson <khw@cpan.org>
Date: Sat May 21 11:35:10 2016 -0600

     perllocale: Document NUL collation handling

     And add a TODO test, because this shortly will be improved upon

M lib/locale.t
M pod/perllocale.pod

commit f17bc913886cd59ab68fde7bec9131dbb90186a0
Author: Karl Williamson <khw@cpan.org>
Date: Fri May 13 11:51:55 2016 -0600

     lib/locale.t: Don't calculate value unless needed

M lib/locale.t
-----------------------------------------------------------------------

Summary of changes:
  embed.fnc | 2 +-
  embedvar.h | 3 +
  intrpvar.h | 4 +
  lib/locale.t | 31 +++-
  locale.c | 396 ++++++++++++++++++++++++++++++++++++++++++----------
  pod/perldelta.pod | 11 +-
  pod/perllocale.pod | 14 ++
  proto.h | 4 +-
  sv.c | 1 +
  t/porting/libperl.t | 7 +
  10 files changed, 395 insertions(+), 78 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index bf3b8c5..85166eb 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -909,7 +909,7 @@ Ap |I32 * |markstack_grow
  #if defined(USE_LOCALE_COLLATE)
  p |int |magic_setcollxfrm|NN SV* sv|NN MAGIC* mg
  : Defined in locale.c, used only in sv.c
-p |char* |mem_collxfrm |NN const char* s|STRLEN len|NN STRLEN* xlen
+p |char* |mem_collxfrm |NN const char* input_string|STRLEN len|NN STRLEN* xlen
  #endif
  Afpd |SV* |mess |NN const char* pat|...
  Apd |SV* |mess_sv |NN SV* basemsg|bool consume
diff --git a/embedvar.h b/embedvar.h
index 7e551be..6738368 100644
--- a/embedvar.h
+++ b/embedvar.h
@@ -168,6 +168,7 @@
  #define PL_in_clean_objs (vTHX->Iin_clean_objs)
  #define PL_in_eval (vTHX->Iin_eval)
  #define PL_in_load_module (vTHX->Iin_load_module)
+#define PL_in_utf8_COLLATE_locale (vTHX->Iin_utf8_COLLATE_locale)
  #define PL_in_utf8_CTYPE_locale (vTHX->Iin_utf8_CTYPE_locale)
  #define PL_incgv (vTHX->Iincgv)
  #define PL_initav (vTHX->Iinitav)
@@ -308,6 +309,8 @@
  #define PL_stderrgv (vTHX->Istderrgv)
  #define PL_stdingv (vTHX->Istdingv)
  #define PL_strtab (vTHX->Istrtab)
+#define PL_strxfrm_is_behaved (vTHX->Istrxfrm_is_behaved)
+#define PL_strxfrm_min_char (vTHX->Istrxfrm_min_char)
  #define PL_sub_generation (vTHX->Isub_generation)
  #define PL_subline (vTHX->Isubline)
  #define PL_subname (vTHX->Isubname)
diff --git a/intrpvar.h b/intrpvar.h
index 50a9ee0..f540a9d 100644
--- a/intrpvar.h
+++ b/intrpvar.h
@@ -245,6 +245,7 @@ PERLVAR(I, exit_flags, U8) /* was exit() unexpected, etc. */

  PERLVAR(I, utf8locale, bool) /* utf8 locale detected */
  PERLVAR(I, in_utf8_CTYPE_locale, bool)
+PERLVAR(I, in_utf8_COLLATE_locale, bool)
  #ifdef USE_LOCALE_CTYPE
      PERLVAR(I, warn_locale, SV *)
  #endif
@@ -563,6 +564,9 @@ PERLVAR(I, collation_name, char *) /* Name of current collation */
  PERLVAR(I, collxfrm_base, Size_t) /* Basic overhead in *xfrm() */
  PERLVARI(I, collxfrm_mult,Size_t, 2) /* Expansion factor in *xfrm() */
  PERLVARI(I, collation_ix, U32, 0) /* Collation generation index */
+PERLVARA(I, strxfrm_min_char, 3, char)
+PERLVARI(I, strxfrm_is_behaved, bool, TRUE)
+ /* Assume until proven otherwise that it works */
  PERLVARI(I, collation_standard, bool, TRUE)
       /* Assume simple collation */
  #endif /* USE_LOCALE_COLLATE */
diff --git a/lib/locale.t b/lib/locale.t
index dc31b46..ce0c987 100644
--- a/lib/locale.t
+++ b/lib/locale.t
@@ -900,12 +900,12 @@ sub disp_str ($) {

  sub report_result {
      my ($Locale, $i, $pass_fail, $message) = @_;
- $message //= "";
- $message = " ($message)" if $message;
      if ($pass_fail) {
   push @{$Okay{$i}}, $Locale;
      }
      else {
+ $message //= "";
+ $message = " ($message)" if $message;
   $Known_bad_locale{$i}{$Locale} = 1 if exists $known_bad_locales{$^O}
                                           && $Locale =~ $known_bad_locales{$^O};
   $Problem{$i}{$Locale} = 1;
@@ -1735,6 +1735,33 @@ foreach my $Locale (@Locale) {
                  last;
              }
          }
+
+ use locale;
+
+ ++$locales_test_number;
+ $test_names{$locales_test_number}
+ = 'Skip in locales where \001 has primary sorting weight; '
+ . 'otherwise verify that \0 doesn\'t have primary sorting weight';
+ if ("a\001c" lt "ab") {
+ report_result($Locale, $locales_test_number, 1);
+ }
+ else {
+ my $ok = "ab" lt "a\0c";
+ report_result($Locale, $locales_test_number, $ok);
+ }
+
+ ++$locales_test_number;
+ $test_names{$locales_test_number}
+ = 'TODO Verify that strings with embedded NUL collate';
+ my $ok = "a\0a\0a" lt "a\001a\001a";
+ report_result($Locale, $locales_test_number, $ok);
+
+ ++$locales_test_number;
+ $test_names{$locales_test_number}
+ = 'TODO Verify that strings with embedded NUL and '
+ . 'extra trailing NUL collate';
+ $ok = "a\0a\0" lt "a\001a\001";
+ report_result($Locale, $locales_test_number, $ok);
      }

      my $ok1;
diff --git a/locale.c b/locale.c
index 0bf8057..23c54e6 100644
--- a/locale.c
+++ b/locale.c
@@ -482,8 +482,11 @@ Perl_new_collate(pTHX_ const char *newcoll)
       PL_collation_name = NULL;
   }
   PL_collation_standard = TRUE;
+ is_standard_collation:
   PL_collxfrm_base = 0;
   PL_collxfrm_mult = 2;
+ PL_in_utf8_COLLATE_locale = FALSE;
+ *PL_strxfrm_min_char = '\0';
   return;
      }

@@ -493,46 +496,145 @@ Perl_new_collate(pTHX_ const char *newcoll)
   Safefree(PL_collation_name);
   PL_collation_name = stdize_locale(savepv(newcoll));
   PL_collation_standard = isNAME_C_OR_POSIX(newcoll);
+ if (PL_collation_standard) {
+ goto is_standard_collation;
+ }
+
+ PL_in_utf8_COLLATE_locale = _is_cur_LC_category_utf8(LC_COLLATE);
+ *PL_strxfrm_min_char = '\0';
+
+ /* A locale collation definition includes primary, secondary, tertiary,
+ * etc. weights for each character. To sort, the primary weights are
+ * used, and only if they compare equal, then the secondary weights are
+ * used, and only if they compare equal, then the tertiary, etc.
+ *
+ * strxfrm() works by taking the input string, say ABC, and creating an
+ * output transformed string consisting of first the primary weights,
+ * A¹B¹C¹ followed by the secondary ones, A²B²C²; and then the
+ * tertiary, etc, yielding A¹B¹C¹ A²B²C² A³B³C³ .... Some characters
+ * may not have weights at every level. In our example, let's say B
+ * doesn't have a tertiary weight, and A doesn't have a secondary
+ * weight. The constructed string is then going to be
+ * A¹B¹C¹ B²C² A³C³ ....
+ * This has the desired effect that strcmp() will look at the secondary
+ * or tertiary weights only if the strings compare equal at all higher
+ * priority weights. The spaces shown here, like in
+ * "A¹B¹C¹ * A²B²C² "
+ * are not just for readability. In the general case, these must
+ * actually be bytes, which we will call here 'separator weights'; and
+ * they must be smaller than any other weight value, but since these
+ * are C strings, only the terminating one can be a NUL (some
+ * implementations may include a non-NUL separator weight just before
+ * the NUL). Implementations tend to reserve 01 for the separator
+ * weights. They are needed so that a shorter string's secondary
+ * weights won't be misconstrued as primary weights of a longer string,
+ * etc. By making them smaller than any other weight, the shorter
+ * string will sort first. (Actually, if all secondary weights are
+ * smaller than all primary ones, there is no need for a separator
+ * weight between those two levels, etc.)
+ *
+ * The length of the transformed string is roughly a linear function of
+ * the input string. It's not exactly linear because some characters
+ * don't have weights at all levels. When we call strxfrm() we have to
+ * allocate some memory to hold the transformed string. The
+ * calculations below try to find coefficients 'm' and 'b' for this
+ * locale so that m*x + b equals how much space we need, given the size
+ * of the input string in 'x'. If we calculate too small, we increase
+ * the size as needed, and call strxfrm() again, but it is better to
+ * get it right the first time to avoid wasted expensive string
+ * transformations. */

   {
- /* A locale collation definition includes primary, secondary,
- * tertiary, etc. weights for each character. To sort, the primary
- * weights are used, and only if they compare equal, then the
- * secondary weights are used, and only if they compare equal, then
- * the tertiary, etc. strxfrm() works by taking the input string,
- * say ABC, and creating an output string consisting of first the
- * primary weights, A¹B¹C¹ followed by the secondary ones, A²B²C²;
- * and then the tertiary, etc, yielding A¹B¹C¹A²B²C²A³B³C³....
- * Some characters may not have weights at every level. In our
- * example, let's say B doesn't have a tertiary weight, and A
- * doesn't have a secondary weight. The constructed string is then
- * going to be A¹B¹C¹B²C²A³C³.... This has the desired
- * characteristics that strcmp() will look at the secondary or
- * tertiary weights only if the strings compare equal at all higher
- * priority weights. The length of the transformed string is
- * roughly a linear function of the input string. It's not exactly
- * linear because some characters don't have weights at all levels,
- * and there are some complications, so there is often per-string
- * overhead. When we call strxfrm() we have to allocate some
- * memory to hold the transformed string. The calculations below
- * try to find constants for this locale 'm' and 'b' so that m*x +
- * b equals how much space we need given the size of the input
- * string in 'x'. If we calculate too small, we increase the size
- * as needed, and call strxfrm() again, but it is better to get it
- * right the first time to avoid wasted expensive string
- * transformations. */
- /* 2: at most so many chars ('a', 'b'). */
- /* 50: surely no system expands a char more. */
-#define XFRMBUFSIZE (2 * 50)
- char xbuf[XFRMBUFSIZE];
- const Size_t fa = strxfrm(xbuf, "a", XFRMBUFSIZE);
- const Size_t fb = strxfrm(xbuf, "ab", XFRMBUFSIZE);
- const SSize_t mult = fb - fa;
- if (mult < 1 && !(fa == 0 && fb == 0))
- Perl_croak(aTHX_ "panic: strxfrm() gets absurd - a => %"UVuf", ab => %"UVuf,
- (UV) fa, (UV) fb);
- PL_collxfrm_base = (fa > (Size_t)mult) ? (fa - mult) : 0;
- PL_collxfrm_mult = mult;
+ /* We use the string below to find how long the tranformation of it
+ * is. Almost all locales are supersets of ASCII, or at least the
+ * ASCII letters. We use all of them, half upper half lower,
+ * because if we used fewer, we might hit just the ones that are
+ * outliers in a particular locale. Most of the strings being
+ * collated will contain a preponderance of letters, and even if
+ * they are above-ASCII, they are likely to have the same number of
+ * weight levels as the ASCII ones. It turns out that digits tend
+ * to have fewer levels, and some punctuation has more, but those
+ * are relatively sparse in text, and khw believes this gives a
+ * reasonable result, but it could be changed if experience so
+ * dictates. */
+ const char longer[] = "ABCDEFGHIJKLMnopqrstuvwxyz";
+ char * x_longer; /* Transformed 'longer' */
+ Size_t x_len_longer; /* Length of 'x_longer' */
+
+ char * x_shorter; /* We also transform a substring of 'longer' */
+ Size_t x_len_shorter;
+
+ /* mem_collxfrm() is used get the transformation (though here we
+ * are interested only in its length). It is used because it has
+ * the intelligence to handle all cases, but to work, it needs some
+ * values of 'm' and 'b' to get it started. For the purposes of
+ * this calculation we use a very conservative estimate of 'm' and
+ * 'b'. This assumes a weight can be multiple bytes, enough to
+ * hold any UV on the platform, and there are 5 levels, 4 weight
+ * bytes, and a trailing NUL. */
+ PL_collxfrm_base = 5;
+ PL_collxfrm_mult = 5 * sizeof(UV);
+
+ /* Find out how long the transformation really is */
+ x_longer = mem_collxfrm(longer,
+ sizeof(longer) - 1,
+ &x_len_longer);
+ Safefree(x_longer);
+
+ /* Find out how long the transformation of a substring of 'longer'
+ * is. Together the lengths of these transformations are
+ * sufficient to calculate 'm' and 'b'. The substring is all of
+ * 'longer' except the first character. This minimizes the chances
+ * of being swayed by outliers */
+ x_shorter = mem_collxfrm(longer + 1,
+ sizeof(longer) - 2,
+ &x_len_shorter);
+ Safefree(x_shorter);
+
+ /* If the results are nonsensical for this simple test, the whole
+ * locale definition is suspect. Mark it so that locale collation
+ * is not active at all for it. XXX Should we warn? */
+ if ( x_len_shorter == 0
+ || x_len_longer == 0
+ || x_len_shorter >= x_len_longer)
+ {
+ PL_collxfrm_mult = 0;
+ PL_collxfrm_base = 0;
+ }
+ else {
+ SSize_t base; /* Temporary */
+
+ /* We have both: m * strlen(longer) + b = x_len_longer
+ * m * strlen(shorter) + b = x_len_shorter;
+ * subtracting yields:
+ * m * (strlen(longer) - strlen(shorter))
+ * = x_len_longer - x_len_shorter
+ * But we have set things up so that 'shorter' is 1 byte smaller
+ * than 'longer'. Hence:
+ * m = x_len_longer - x_len_shorter
+ *
+ * But if something went wrong, make sure the multiplier is at
+ * least 1.
+ */
+ if (x_len_longer > x_len_shorter) {
+ PL_collxfrm_mult = (STRLEN) x_len_longer - x_len_shorter;
+ }
+ else {
+ PL_collxfrm_mult = 1;
+ }
+
+ /* mx + b = len
+ * so: b = len - mx
+ * but in case something has gone wrong, make sure it is
+ * non-negative */
+ base = x_len_longer - PL_collxfrm_mult * (sizeof(longer) - 1);
+ if (base < 0) {
+ base = 0;
+ }
+
+ /* Add 1 for the trailing NUL */
+ PL_collxfrm_base = base + 1;
+ }
   }
      }

@@ -1273,16 +1375,145 @@ Perl_init_i18nl10n(pTHX_ int printwarn)
   */

  char *
-Perl_mem_collxfrm(pTHX_ const char *s, STRLEN len, STRLEN *xlen)
+Perl_mem_collxfrm(pTHX_ const char *input_string,
+ STRLEN len,
+ STRLEN *xlen
+ )
  {
- char *xbuf;
- STRLEN xAlloc, xin, xout; /* xalloc is a reserved word in VC */
+ char * s = (char *) input_string;
+ STRLEN s_strlen = strlen(input_string);
+ char *xbuf = NULL;
+ STRLEN xAlloc, xout; /* xalloc is a reserved word in VC */
+ bool first_time = TRUE; /* Cleared after first loop iteration */

      PERL_ARGS_ASSERT_MEM_COLLXFRM;

- /* the first sizeof(collationix) bytes are used by sv_collxfrm(). */
- /* the +1 is for the terminating NUL. */
+ /* If this locale has defective collation, skip */
+ if (PL_collxfrm_base == 0 && PL_collxfrm_mult == 0) {
+ goto bad;
+ }
+
+ /* Replace any embedded NULs with the control that sorts before any others.
+ * This will give as good as possible results on strings that don't
+ * otherwise contain that character, but otherwise there may be
+ * less-than-perfect results with that character and NUL. This is
+ * unavoidable unless we replace strxfrm with our own implementation.
+ *
+ * XXX This code may be overkill. khw wrote it before realizing that if
+ * you change a NUL into some other character, that that may change the
+ * strxfrm results if that character is part of a sequence with other
+ * characters for weight calculations. To minimize the chances of this,
+ * now the replacement is restricted to another control (likely to be
+ * \001). But the full generality has been retained.
+ *
+ * This is one of the few places in the perl core, where we can use
+ * standard functions like strlen() and strcat(). It's because we're
+ * looking for NULs. */
+ if (s_strlen < len) {
+ char * e = s + len;
+ char * sans_nuls;
+ STRLEN cur_min_char_len;
+
+ /* If we don't know what control character sorts lowest for this
+ * locale, find it */
+ if (*PL_strxfrm_min_char == '\0') {
+ int j;
+ char * cur_min_x = NULL; /* Cur cp's xfrm, (except it also
+ includes the collation index
+ prefixed. */
+
+ /* Look through all legal code points (NUL isn't) */
+ for (j = 1; j < 256; j++) {
+ char * x; /* j's xfrm plus collation index */
+ STRLEN x_len; /* length of 'x' */
+ STRLEN trial_len = 1;
+
+ /* Create a 1 byte string of the current code point, but with
+ * room to be 2 bytes */
+ char cur_source[] = { (char) j, '\0' , '\0' };
+
+ if (PL_in_utf8_COLLATE_locale) {
+ if (! isCNTRL_L1(j)) {
+ continue;
+ }
+
+ /* If needs to be 2 bytes, find them */
+ if (! UVCHR_IS_INVARIANT(j)) {
+ continue; /* Can't handle variants yet */
+ }
+ }
+ else if (! isCNTRL_LC(j)) {
+ continue;
+ }
+
+ /* Then transform it */
+ x = mem_collxfrm(cur_source, trial_len, &x_len);
+
+ /* If something went wrong (which it shouldn't), just
+ * ignore this code point */
+ if ( x_len == 0
+ || strlen(x + sizeof(PL_collation_ix)) < x_len)
+ {
+ continue;
+ }
+
+ /* If this character's transformation is lower than
+ * the current lowest, this one becomes the lowest */
+ if ( cur_min_x == NULL
+ || strLT(x + sizeof(PL_collation_ix),
+ cur_min_x + sizeof(PL_collation_ix)))
+ {
+ strcpy(PL_strxfrm_min_char, cur_source);
+ cur_min_x = x;
+ }
+ else {
+ Safefree(x);
+ }
+ } /* end of loop through all bytes */
+
+ /* Unlikely, but possible, if there aren't any controls in the
+ * locale, arbitrarily use \001 */
+ if (cur_min_x == NULL) {
+ STRLEN x_len; /* temporary */
+ cur_min_x = mem_collxfrm("\001", 1, &x_len);
+ /* cur_min_cp was already initialized to 1 */
+ }
+
+ Safefree(cur_min_x);
+ }
+
+ /* The worst case length for the replaced string would be if every
+ * character in it is NUL. Multiply that by the length of each
+ * replacement, and allow for a trailing NUL */
+ cur_min_char_len = strlen(PL_strxfrm_min_char);
+ Newx(sans_nuls, (len * cur_min_char_len) + 1, char);
+ *sans_nuls = '\0';
+
+
+ /* Replace each NUL with the lowest collating control. Loop until have
+ * exhausted all the NULs */
+ while (s + s_strlen < e) {
+ strcat(sans_nuls, s);

+ /* Do the actual replacement */
+ strcat(sans_nuls, PL_strxfrm_min_char);
+
+ /* Move past the input NUL */
+ s += s_strlen + 1;
+ s_strlen = strlen(s);
+ }
+
+ /* And add anything that trails the final NUL */
+ strcat(sans_nuls, s);
+
+ /* Switch so below we transform this modified string */
+ s = sans_nuls;
+ len = strlen(s);
+ }
+
+ /* The first element in the output is the collation id, used by
+ * sv_collxfrm(); then comes the space for the transformed string. The
+ * equation should give us a good estimate as to how much is needed */
      xAlloc = sizeof(PL_collation_ix) + PL_collxfrm_base + (PL_collxfrm_mult * len) + 1;
      Newx(xbuf, xAlloc, char);
      if (UNLIKELY(! xbuf))
@@ -1294,49 +1525,70 @@ Perl_mem_collxfrm(pTHX_ const char *s, STRLEN len, STRLEN *xlen)

      /* Then the transformation of the input. We loop until successful, or we
       * give up */
- for (xin = 0; xin < len; ) {
- Size_t xused;
-
- for (;;) {
- xused = strxfrm(xbuf + xout, s + xin, xAlloc - xout);
-
- /* If the transformed string occupies less space than we told
- * strxfrm() was available, it means it successfully transformed
- * the whole string. */
- if ((STRLEN)xused < xAlloc - xout)
- break;
-
- if (UNLIKELY(xused >= PERL_INT_MAX))
- goto bad;
-
- /* Otherwise it should be that the transformation stopped in the
- * middle because it ran out of space. Malloc more, and try again.
- * */
- xAlloc = (2 * xAlloc) + 1;
- Renew(xbuf, xAlloc, char);
- if (UNLIKELY(! xbuf))
- goto bad;
- }
+ for (;;) {
+ STRLEN xused = strxfrm(xbuf + xout, s, xAlloc - xout);
+
+ /* If the transformed string occupies less space than we told strxfrm()
+ * was available, it means it successfully transformed the whole
+ * string. */
+ if (xused < xAlloc - xout) {
+ xout += xused;
+ break;
+ }
+
+ if (UNLIKELY(xused >= PERL_INT_MAX))
+ goto bad;
+
+ /* A well-behaved strxfrm() returns exactly how much space it needs
+ * (not including the trailing NUL) when it fails due to not enough
+ * space being provided. Assume that this is the case unless it's been
+ * proven otherwise */
+ if (LIKELY(PL_strxfrm_is_behaved) && first_time) {
+ xAlloc = xused + sizeof(PL_collation_ix) + 1;
+ }
+ else { /* Here, either:
+ * 1) The strxfrm() has previously shown bad behavior; or
+ * 2) It isn't the first time through the loop, which means
+ * that the strxfrm() is now showing bad behavior, because
+ * we gave it what it said was needed in the previous
+ * iteration, and it came back saying it needed still more.
+ * (Many versions of cygwin fit this. When the buffer size
+ * isn't sufficient, they return the input size instead of
+ * how much is needed.)
+ * Increase the buffer size by a fixed percentage and try again. */
+ xAlloc = (2 * xAlloc) + 1;
+ PL_strxfrm_is_behaved = FALSE;
+ }
+

- xin += strlen(s + xin) + 1;
- xout += xused;
+ Renew(xbuf, xAlloc, char);
+ if (UNLIKELY(! xbuf))
+ goto bad;

- /* Embedded NULs are understood but silently skipped
- * because they make no sense in locale collation. */
+ first_time = FALSE;
      }

- xbuf[xout] = '\0';
      *xlen = xout - sizeof(PL_collation_ix);
+
+ /* Free up unneeded space; retain ehough for trailing NUL */
+ Renew(xbuf, xout + 1, char);
+
+ if (s != input_string) {
+ Safefree(s);
+ }
+
      return xbuf;

    bad:
      Safefree(xbuf);
+ if (s != input_string) {
+ Safefree(s);
+ }
      *xlen = 0;
      return NULL;
  }

  #endif /* USE_LOCALE_COLLATE */
-
  #ifdef USE_LOCALE

  bool
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 76f4972..d334bb8 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -27,6 +27,14 @@ here, but most should go in the L</Performance Enhancements> section.

  [ List each enhancement as a =head2 entry ]

+=head2 Better locale collation of strings containing embedded C<NUL>
+characters
+
+In locales that have multi-level character weights, these are now
+ignored at the higher priority ones. There are still some gotchas in
+some strings, though. See
+L<perllocale/Collation of strings containing embedded C<NUL> characters>.
+
  =head1 Security

  XXX Any security-related notices go here. In particular, any security
@@ -331,7 +339,8 @@ well.

  =item *

-XXX
+Perl no longer panics when switching into some locales on machines with
+buggy C<strxfrm()> implementations in their libc. [perl #121734]

  =back

diff --git a/pod/perllocale.pod b/pod/perllocale.pod
index 018f916..ddb60f2 100644
--- a/pod/perllocale.pod
+++ b/pod/perllocale.pod
@@ -820,6 +820,9 @@ that a UTF-8 locale likely will just give you machine-native ordering.
  Use L<Unicode::Collate> for the full implementation of the Unicode
  Collation Algorithm.

+If Perl detects that there are problems with the locale collation order,
+it reverts to using non-locale collation rules for that locale.
+
  If you have a single string that you want to check for "equality in
  locale" against several others, you might think you could gain a little
  efficiency by using C<POSIX::strxfrm()> in conjunction with C<eq>:
@@ -1565,6 +1568,17 @@ called, and whatever it does is what you get.

  =head1 BUGS

+=head2 Collation of strings containing embedded C<NUL> characters
+
+C<NUL> characters will sort the same as the lowest collating control
+character does, or to C<"\001"> in the unlikely event that there are no
+control characters at all in the locale. In cases where the strings
+don't contain this non-C<NUL> control, the results will be correct, and
+in many locales, this control, whatever it might be, will rarely be
+encountered. But there are cases where a C<NUL> should sort before this
+control, but doesn't. If two strings do collate identically, the one
+containing the C<NUL> will sort to earlier.
+
  =head2 Broken systems

  In certain systems, the operating system's locale support
diff --git a/proto.h b/proto.h
index d16dd07..0819f26 100644
--- a/proto.h
+++ b/proto.h
@@ -5785,9 +5785,9 @@ STATIC char* S_stdize_locale(pTHX_ char* locs);
  PERL_CALLCONV int Perl_magic_setcollxfrm(pTHX_ SV* sv, MAGIC* mg);
  #define PERL_ARGS_ASSERT_MAGIC_SETCOLLXFRM \
   assert(sv); assert(mg)
-PERL_CALLCONV char* Perl_mem_collxfrm(pTHX_ const char* s, STRLEN len, STRLEN* xlen);
+PERL_CALLCONV char* Perl_mem_collxfrm(pTHX_ const char* input_string, STRLEN len, STRLEN* xlen);
  #define PERL_ARGS_ASSERT_MEM_COLLXFRM \
- assert(s); assert(xlen)
+ assert(input_string); assert(xlen)
  /* PERL_CALLCONV char* sv_collxfrm(pTHX_ SV *const sv, STRLEN *const nxp); */
  PERL_CALLCONV char* Perl_sv_collxfrm_flags(pTHX_ SV *const sv, STRLEN *const nxp, I32 const flags);
  #define PERL_ARGS_ASSERT_SV_COLLXFRM_FLAGS \
diff --git a/sv.c b/sv.c
index b7c5fae..e2288b5 100644
--- a/sv.c
+++ b/sv.c
@@ -14789,6 +14789,7 @@ perl_clone_using(PerlInterpreter *proto_perl, UV flags,
      /* Did the locale setup indicate UTF-8? */
      PL_utf8locale = proto_perl->Iutf8locale;
      PL_in_utf8_CTYPE_locale = proto_perl->Iin_utf8_CTYPE_locale;
+ PL_in_utf8_COLLATE_locale = proto_perl->Iin_utf8_COLLATE_locale;
      /* Unicode features (see perlrun/-C) */
      PL_unicode = proto_perl->Iunicode;

diff --git a/t/porting/libperl.t b/t/porting/libperl.t
index 00f2606..161f716 100644
--- a/t/porting/libperl.t
+++ b/t/porting/libperl.t
@@ -527,6 +527,13 @@ for my $symbol (sort keys %unexpected) {
        SKIP: {
          skip("uses sprintf for Gconvert in sv.o");
        }
+ }
+ elsif ( $symbol eq 'strcat'
+ && @o == 1 && $o[0] eq 'locale.o')
+ {
+ SKIP: {
+ skip("locale.o legitimately uses strcat");
+ }
      } else {
          is(@o, 0, "uses no $symbol (@o)");
      }

--
Perl5 Master Repository

Search Discussions

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupperl5-changes @
categoriesperl
postedMay 24, '16 at 4:26p
activeMay 24, '16 at 4:26p
posts1
users1
websiteperl.org

1 user in discussion

Karl Williamson: 1 post

People

Translate

site design / logo © 2018 Grokbase