I've been looking at using locale and collation now that we have ICU. Please
let me know if you have any comments:

Locale Functions with Unicode

The Unicode design document states that all current functions that can make use
of a locale (such as strtoupper) are not going to be implemented in a
locale-aware way. Although this will work for most situations, it might break BC
in a few. One (popular) example is: ::

setlocale(LC_ALL, 'tr_TR');
echo strtoupper('hans blix'), "\n";

In PHP 4 and 5 this returns (when viewing in iso-8859-9): ::


Whereas in PHP 6 this currently returns: ::


The string returned for PHP 4 and 5 is the correct one for Turkish.
See also note 1.

Locale Dependent Functions
There are other functions that deal with the locale settings, some in a
different way. Below is a list of those functions and how they use the system
locale.

Array Sorting Functions
All the array sorting functions accept a flag "SORT_LOCALE_STRING" that changes
the sorting of array keys/values from a binary comparison to a locale-based
comparison. This uses the function strcoll(), which relies on the system's locale.
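
To make the difference between the two comparison modes concrete, here is a
minimal sketch. It pins LC_COLLATE to "C", an assumption made only so that the
result is the same on every system; in that locale strcoll() compares bytes, so
both flags give the same order. With a real language locale installed, the
second sort may come out differently.

```php
<?php
// Plain ASCII words, so the result does not depend on installed locales.
$words = ['pear', 'Apple', 'banana', 'Cherry'];

// Force the "C" collation locale, where strcoll() compares bytes like strcmp().
setlocale(LC_COLLATE, 'C');

$binary = $words;
sort($binary, SORT_STRING);          // plain byte-wise comparison

$locale = $words;
sort($locale, SORT_LOCALE_STRING);   // comparison through strcoll() and LC_COLLATE

// Under "C" both orders match: uppercase ASCII sorts before lowercase.
echo implode(', ', $binary), "\n";   // Apple, Cherry, banana, pear
echo implode(', ', $locale), "\n";   // Apple, Cherry, banana, pear
```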

String Functions
Uses the system locale to determine which characters make up a word.

strnatcasecmp, strnatcmp
Use the locale to upper and lower case letters, and to determine if
something is a digit or not.

strcmp, strncmp
Currently do not use any locale, but perhaps they could make use of one, e.g.
in the ß vs ss case.

strcasecmp, strncasecmp
Use the system locale to lowercase letters so that they can be matched
case-insensitively. See also note 2.

strtolower, strtoupper
Both make use of locale properties of characters to lower/upper case them.

ucfirst, ucwords
Use character properties to upper and lower case the first letters of

Other Functions
Uses the system locale to return information about this locale.

Uses the system locale to format a number as monetary number.

Problems with System Locales
There are a number of problems with having to rely on the locale information
that is available on different platforms / installations. Locale information:

- can be different for each platform
- might not be available, depending on platform and installation
- does not have a common identifier on different platforms
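
As an illustration of the identifier problem, here is a minimal sketch.
pick_system_locale() is a hypothetical helper, not part of PHP, and the three
spellings of the Turkish locale are examples of what different platforms tend
to use, not an exhaustive or guaranteed list.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): try several platform-specific
 * spellings of a locale and return the one the system accepted, or null.
 */
function pick_system_locale(int $category, array $candidates): ?string
{
    // setlocale() accepts an array and tries each entry in order.
    $result = setlocale($category, $candidates);
    return $result === false ? null : $result;
}

// The same Turkish locale goes by different names on different platforms.
$turkish = pick_system_locale(LC_COLLATE, [
    'tr_TR.UTF-8',          // assumed spelling on many Linux/BSD systems
    'tr_TR',                // shorter POSIX-style spelling
    'Turkish_Turkey.1254',  // assumed Windows-style spelling
]);

echo $turkish === null
    ? "No Turkish locale available on this system\n"
    : "Using system locale: $turkish\n";

// Restore the portable default so later code is unaffected.
setlocale(LC_COLLATE, 'C');
```

Which branch is taken depends entirely on the machine the script runs on,
which is exactly the problem described above.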

ICU Locales and Collators
As ICU provides us with a platform and installation independent way of dealing
with locales and collation rules, we can use this to get rid of the current
dependency on system locales. There are three ways in which we can upgrade our
functions to use ICU locales:

1. We simply make them use the default locale, as set by icu_loc_set_default()
and default collator (as set by a future icu_coll_set_default()).
2. We add a new parameter to the functions specifying which locale to use.
3. We create new functions that are locale and collation dependent (by using
the default locale/collation).

Each of those three options has pros and cons.
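
From the user's point of view, the three options could look roughly like the
sketch below. Everything in it is hypothetical: the option*_ and sketch_*
functions are made-up userland stand-ins, the "default locale" is a plain
static variable instead of ICU, and only the Turkish dotted/dotless i is
handled.

```php
<?php
// Hypothetical userland sketch; none of these functions exist in PHP.

/** Upper-case ASCII letters, plus the Turkish i rules for "tr" locales. */
function sketch_upper(string $s, string $locale): string
{
    if (strncmp($locale, 'tr', 2) === 0) {
        // i -> U+0130 (capital I with dot), U+0131 (dotless i) -> I
        $s = strtr($s, ['i' => "\u{0130}", "\u{0131}" => 'I']);
    }
    return strtr($s, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
}

/** Stand-in for a process-wide default locale; returns the current value. */
function sketch_default_locale(?string $new = null): string
{
    static $current = 'C';
    if ($new !== null) {
        $current = $new;
    }
    return $current;
}

// Option 1: the existing function silently follows the default locale.
function option1_strtoupper(string $s): string
{
    return sketch_upper($s, sketch_default_locale());
}

// Option 2: the existing function grows an explicit locale argument.
function option2_strtoupper(string $s, string $locale = 'C'): string
{
    return sketch_upper($s, $locale);
}

// Option 3: a new, prefixed function follows the default locale,
// while the old function would stay as it is.
function option3_i18n_strtoupper(string $s): string
{
    return sketch_upper($s, sketch_default_locale());
}

sketch_default_locale('tr_TR');
echo option1_strtoupper('hans blix'), "\n";           // Turkish rules, via the default
echo option2_strtoupper('hans blix'), "\n";           // no locale passed: ASCII rules
echo option2_strtoupper('hans blix', 'tr_TR'), "\n";  // Turkish rules, explicit
echo option3_i18n_strtoupper('hans blix'), "\n";      // Turkish rules, via the default
```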

Modifying Functions to Use ICU Locales

Pros:

- No additional programming needed by users as the current functions would "just
work like expected". For people that do not care about locales, nothing will
really change, as the current default locale should be "C" or "POSIX".
- No ugly API for our string handling functions.

Cons:

- It might break BC in some cases.

Adding a New Argument to Functions

Pros:

- Doesn't break BC

Cons:

- Additional work for programmers for every function call.
- Ugly API because of the passing of the locale name.

Create New Functions

Pros:

- Doesn't break BC
- No ugly API

Cons:

- Additional work for programmers as they need to replace the current functions
with the upgraded ones.
- It is crucial that the new functions cannot be disabled, because of
- We need to come up with a good prefix for those.
- The new functions need to work when Unicode semantics are turned off.

Both the first and third options would in my opinion be acceptable, though I
would prefer the first one, as it gives users as little headache as possible
when they start using locales. This approach would work well for the String

For the array sorting functions, I would prefer that the current
"SORT_LOCALE_STRING" simply starts using the ICU collation functionality, as
it's a relatively new flag. Another solution would be to create a new flag for
this, "SORT_ICU_LOCALE_STRING", that makes the sorting functions use the
collation functionality provided by ICU.
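
To show what ICU collation changes compared to a binary sort, here is a small
sketch. It assumes the intl extension and its Collator class are available,
which is not the flag-based API proposed above, and it skips the ICU part if
the extension is not loaded. The "en_US" locale is just an example.

```php
<?php
$words = ['b', 'A', 'a', 'B'];

// Byte-wise order: all uppercase ASCII letters come before lowercase ones.
$binary = $words;
sort($binary, SORT_STRING);
echo implode(' ', $binary), "\n";    // A B a b

// ICU collation through the intl extension's Collator class, if loaded.
if (class_exists('Collator')) {
    $collator = new Collator('en_US');
    $icu = $words;
    $collator->sort($icu);           // sorts in place using ICU collation rules
    echo implode(' ', $icu), "\n";   // a A b B
} else {
    echo "intl extension not available; skipping ICU sort\n";
}
```

The ICU result does not depend on which system locales happen to be installed,
which is the main attraction over strcoll().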

For the Other Functions we should create a new function to format numbers in a
locale-aware way, as it would be very hard to make the current money_format
compatible with ICU and still offer the full possibilities of ICU's number
formatting functionality.

Other Functions' Implementation
i18n_format_number($number, $type [, $custom_format])
A wrapper around ICU's unum.h C-API
(http://icu.sourceforge.net/apiref/icu4c/unum_8h.html) that allows you to
format numbers in locale specific ways.

i18n_parse_number($number, $type [, $custom_format])
A wrapper around the number parsing routines from unum.h
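
A rough sketch of how these two functions could behave is given below. It is
not the proposed implementation: the sketch_* names are made up, and mapping
the proposal onto the intl extension's NumberFormatter and Locale classes
(with Locale::setDefault() standing in for the ICU default locale) is an
assumption made for illustration only.

```php
<?php
// Hypothetical sketch; the sketch_* functions are not part of PHP.

function sketch_number_formatter(int $type, ?string $custom_format): NumberFormatter
{
    $locale = Locale::getDefault();   // stands in for the ICU default locale
    return $custom_format === null
        ? new NumberFormatter($locale, $type)
        : new NumberFormatter($locale, $type, $custom_format);
}

function sketch_i18n_format_number($number, int $type, ?string $custom_format = null)
{
    return sketch_number_formatter($type, $custom_format)->format($number);
}

function sketch_i18n_parse_number(string $number, int $type, ?string $custom_format = null)
{
    return sketch_number_formatter($type, $custom_format)->parse($number);
}

// Only run the examples when the intl extension is loaded.
if (class_exists('NumberFormatter')) {
    Locale::setDefault('en_US');
    echo sketch_i18n_format_number(1234567.891, NumberFormatter::DECIMAL), "\n";  // 1,234,567.891
    echo sketch_i18n_format_number(1234.5, NumberFormatter::PATTERN_DECIMAL, '#,##0.00'), "\n";  // 1,234.50

    Locale::setDefault('de_DE');
    echo sketch_i18n_format_number(1234567.891, NumberFormatter::DECIMAL), "\n";  // 1.234.567,891
    var_dump(sketch_i18n_parse_number('1.234.567,891', NumberFormatter::DECIMAL));
}
```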

1. For some reason, in PHP 6, the strtoupper() function *does* make use of the
locale though:

By setting the locale with icu_loc_set_default("tr_TR") the PHP 6 example
gives the correct result: ::

echo strtoupper('hans blix'), "\n";

Shows: ::


2. The function zend_u_binary_strncmp doesn't compare anything binary, as it
uses U16_NEXT. Why do we still call it u_binary_strncmp?


Posted to php-internals by Derick Rethans, Aug 31, '05 at 3:00p.


