Hello!

I've been looking at how we can use locales and collation now that we have ICU.
Please let me know if you have any comments:


Locale Functions with Unicode
=============================

Introduction
------------
The Unicode design document states that all current functions that can make use
of a locale (such as strtoupper) are not going to be implemented in a
locale-aware way. Although this will work for most situations, it might break BC
in a few. One (popular) example is: ::

    <?php
    setlocale(LC_ALL, 'tr_TR');
    echo strtoupper('hans blix'), "\n";
    ?>

In PHP 4 and 5 this returns (when viewed as iso-8859-9): ::

    HANS BLİX

Whereas in PHP 6 this currently returns: ::

    HANS BLIX

The string returned for PHP 4 and 5 is the correct one for Turkish.
See also note 1.


Locale Dependent Functions
--------------------------
There are other functions that deal with the locale settings, some in a
different way. Below is a list of these functions and how they use the system
locale.

Array Sorting Functions
~~~~~~~~~~~~~~~~~~~~~~~
All the array sorting functions accept a flag "SORT_LOCALE_STRING" that changes
the comparison of array keys/values from a binary compare to a locale-based
compare. This uses the function strcoll(), which relies on the system's locale.
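
The difference between the two compare modes can be sketched as follows. The
word list and the locale names tried are assumptions made for illustration;
which locale is actually picked depends on what the system has installed.

```php
<?php
// Sample data (an assumption for illustration only).
$words = ['zebra', 'Apfel', 'apple', 'Zug'];

// Binary compare: bytes are compared, so 'Z' (0x5A) sorts before 'a' (0x61).
$binary = $words;
sort($binary, SORT_STRING);

// Locale-based compare: the strings are compared with strcoll(), which
// follows LC_COLLATE. setlocale() tries each name in turn and returns the
// one it managed to set; "C" is always available as a last resort.
$active = setlocale(LC_COLLATE, 'de_DE.UTF-8', 'de_DE', 'C');
$localeAware = $words;
sort($localeAware, SORT_LOCALE_STRING);

echo "active collation locale: ", $active, "\n";
echo "binary: ", implode(', ', $binary), "\n";
echo "locale: ", implode(', ', $localeAware), "\n";
```

The locale-based result is deliberately not shown, as it differs between
systems depending on the collation locale that could be set.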

String Functions
~~~~~~~~~~~~~~~~
str_word_count
    Uses the system locale to determine which characters make up a word.

strnatcasecmp, strnatcmp
    Use the locale to upper and lower case letters, and to determine whether
    something is a digit or not.

strcmp, strncmp
    Currently do not use any locale, but perhaps they could make use of one,
    for example in the ß vs ss case.

strcasecmp, strncasecmp
    Use the system locale to lower case letters so that they can be matched
    case-insensitively. See also note 2.

strtolower, strtoupper
    Both make use of the locale's character properties to lower/upper case
    characters properly.

ucfirst, ucwords
    Use character properties to upper and lower case the first letters of
    words.

Other Functions
~~~~~~~~~~~~~~~
localeconv
    Uses the system locale to return information about this locale.

money_format
    Uses the system locale to format a number as a monetary amount.


Problems with System Locales
----------------------------
There are a number of problems with having to rely on the locale information
that is available on different platforms / installations. Locale information:

- can be different for each platform
- might not be available, depending on platform and installation
- does not have a common identifier on different platforms


ICU Locales and Collators
-------------------------
As ICU provides us with a platform and installation independent way of dealing
with locales and collation rules, we can use this to get rid of the current
dependency on system locales. There are three ways in which we can upgrade our
functions to use ICU locales:

1. We simply make them use the default locale, as set by icu_loc_set_default(),
   and the default collator (as set by a future icu_coll_set_default()).
2. We add a new parameter to the functions specifying which locale to use.
3. We create new functions that are locale and collation dependent (by using
   the default locale/collation).

Each of these three options has pros and cons.
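
To make the three API shapes concrete, here is a rough sketch. Everything in
it is hypothetical: toy_upper() is a stand-in for an ICU-backed routine that
only models the Turkish dotted/dotless i, ToyLocale stands in for the default
set by icu_loc_set_default(), and the opt1_/opt2_/opt3_ names exist only to
tell the options apart. The source file is assumed to be UTF-8.

```php
<?php
// Toy stand-in for ICU upper-casing: only the Turkish i rule is modelled,
// everything else is plain ASCII upper-casing.
function toy_upper(string $s, string $locale): string
{
    if (strncmp($locale, 'tr', 2) === 0) {
        $s = strtr($s, ['i' => 'İ', 'ı' => 'I']);
    }
    return strtoupper($s);
}

// Stand-in for the process-wide default locale.
final class ToyLocale
{
    public static $default = 'C';
}

// Option 1: the existing function itself starts honouring the default locale.
// (Shown under another name, as strtoupper() cannot be redefined in userland.)
function opt1_strtoupper(string $s): string
{
    return toy_upper($s, ToyLocale::$default);
}

// Option 2: the existing function grows an explicit locale parameter.
function opt2_strtoupper(string $s, string $locale = 'C'): string
{
    return toy_upper($s, $locale);
}

// Option 3: a new, separately named function uses the default locale, while
// the original strtoupper() keeps its old behaviour.
function opt3_strtoupper(string $s): string
{
    return toy_upper($s, ToyLocale::$default);
}

ToyLocale::$default = 'tr_TR';
echo opt1_strtoupper('hans blix'), "\n";          // follows the default locale
echo opt2_strtoupper('hans blix', 'tr_TR'), "\n"; // locale passed per call
echo opt3_strtoupper('hans blix'), "\n";          // new name, default locale
```

Options 1 and 3 share the same implementation here; they differ only in
whether existing call sites change behaviour or have to be rewritten to use
the new name.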

Modifying Functions to Use ICU Locales
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pro:

- No additional programming needed by users, as the current functions would
  "just work as expected". For people who do not care about locales, nothing
  will really change, as the current default locale should be "C" or "POSIX".
- No ugly API for our string handling functions.

con:

- It might break BC in some cases.

Adding a New Argument to Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pro:

- Doesn't break BC

con:

- Additional work for programmers for every function call.
- Ugly API because of the passing of the locale name.

Create New Functions
~~~~~~~~~~~~~~~~~~~~
pro:

- Doesn't break BC
- No ugly API

con:

- Additional work for programmers, as they need to replace the current functions
  with the upgraded ones.
- It is crucial that the new functions cannot be disabled, because of
  portability.
- We need to come up with a good prefix for those.
- The new functions need to work when Unicode semantics are turned off.


Discussion
----------
Both the first and third options would in my opinion be acceptable, although I
would prefer the first one, as it gives users as little headache as possible
when they start using locales. This approach would work well for the String
Functions.

For the array sorting functions, I would prefer that the current
"SORT_LOCALE_STRING" simply starts using the ICU collation functionality, as
it's a relatively new flag. Another solution would be to create a new flag for
this, "SORT_ICU_LOCALE_STRING", that makes the sorting functions use the
collation functionality provided by ICU.
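
As a rough idea of what ICU-based collation looks like from PHP, the sketch
below uses the Collator class from PHP's intl extension, which wraps ICU. This
is not the flag-based API proposed above, only an illustration of sorting with
an ICU locale identifier instead of a system locale; the word list is an
assumption and the source file is assumed to be UTF-8.

```php
<?php
// Requires the intl extension. The sample words are an assumption.
$words = ['Zug', 'apple', 'Äpfel', 'zebra'];

// The locale is an ICU identifier, so it is the same on every platform and
// does not depend on which system locales are installed.
$collator = new Collator('de_DE');

$sorted = $words;
$collator->sort($sorted);   // sorts in place using ICU collation rules

echo implode(', ', $sorted), "\n";
```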

For the Other Functions we should create a new function to format numbers in a
locale-aware way, as it would be very hard to make the current money_format
compatible with ICU and still offer the full possibilities of ICU's
number-formatting functionality.


Other Functions' Implementation
-------------------------------
i18n_format_number($number, $type [, $custom_format])
    A wrapper around ICU's unum.h C-API
    (http://icu.sourceforge.net/apiref/icu4c/unum_8h.html) that allows you to
    format numbers in locale specific ways.

i18n_parse_number($number, $type [, $custom_format])
    A wrapper around the number parsing routines from unum.h.

Notes:
------
1. For some reason, the strtoupper() function in PHP 6 *does* make use of the
   locale after all. After setting the locale with
   icu_loc_set_default("tr_TR"), the PHP 6 example gives the correct result: ::

       <?php
       icu_loc_set_default("tr_TR");
       echo strtoupper('hans blix'), "\n";
       ?>

   Shows: ::

       HANS BLİX

2. The function zend_u_binary_strncmp doesn't compare anything binary, as it
   uses U16_NEXT. Why do we still call it u_binary_strncmp?


regards,
Derick

Discussion Overview
group: php-internals
category: php
posted: Aug 31, '05 at 3:00p
active: Aug 31, '05 at 3:00p
posts: 1
users: 1
website: php.net
author: Derick Rethans (1 post)